This page will attempt to answer questions like the following:
- Why is Solr performance so bad?
- Why does Solr take so long to start up?
- Why is SolrCloud acting like my servers are failing when they are fine?
This is an attempt to give basic information only. For a better understanding of the issues involved, read the included links, look for other resources, and ask well thought out questions via Solr support resources.
A major driving factor for Solr performance is RAM. Solr requires sufficient memory for two separate things: One is the Java heap, the other is "free" memory for the OS disk cache.
It is strongly recommended that Solr runs on a 64-bit Java. A 64-bit Java requires a 64-bit operating system, and a 64-bit operating system requires a 64-bit CPU. There's nothing wrong with 32-bit software or hardware, but a 32-bit Java is limited to a 2GB heap, which can result in artificial limitations that don't exist with a larger heap. The Java heap is discussed in a later section of this page.
Because SolrCloud relies heavily on Zookeeper, it can be very unstable if you have underlying performance issues that result in operations taking longer than the zkClientTimeout. Increasing that timeout can help, but addressing the underlying performance issues will yield better results. The default timeout (15 seconds) is quite long and should be more than enough for a well-tuned SolrCloud install.
Zookeeper's design assumes that it has extremely fast read and write access to its database. If the Zookeeper database is stored on the same disks that hold the Solr data, any performance problems with Solr will delay Zookeeper's access to its own database. This can lead to a performance death spiral where each ZK timeout results in recovery operations which cause further timeouts.
"Extremely fast" reads and writes mean that the OS must be able to cache the database in its disk cache. If the disk cache is too small, the OS will have to read the disk in order to get zookeeper data. Disks are slow, and when there is a lot of I/O because Solr is having performance issues, a Zookeeper read or write may get buried in the I/O scheduler queue and take even longer to complete. We strongly recommend storing the Zookeeper data on separate physical disks from the Solr data. Having dedicated machines for all three ZK nodes is even better, but not a requirement.
OS Disk Cache
For index updates, Solr relies on fast bulk reads and writes. For search, fast random reads are essential. The best way to satisfy these requirements is to ensure that a large disk cache is available. Visit Uwe's blog entry for some good Lucene/Solr specific information. You can also utilize Solid State Drives to speed up Solr, but be aware that this is not a complete replacement for OS disk cache. See the SSD section later in this document for more details.
In a nutshell, you want to have enough memory available in the OS disk cache so that the important parts of your index, or ideally your entire index, will fit into the cache. Let's say that you have a Solr index size of 8GB. If your OS, Solr's Java heap, and all other running programs require 4GB of memory, then an ideal memory size for that server is at least 12GB. You might be able to make it work with 8GB total memory (leaving 4GB for disk cache), but that also might NOT be enough.
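The sizing arithmetic in the paragraph above can be sketched as a small calculation. This is only an illustration of the rule of thumb; the function names are invented for this example, and real requirements depend on your schema, index contents, and queries.

```python
def ideal_total_ram_gb(index_size_gb, other_memory_gb):
    """RAM needed so that the entire index fits in the OS disk cache.

    other_memory_gb is everything that is NOT disk cache: the OS itself,
    Solr's Java heap, and all other running programs.
    """
    return index_size_gb + other_memory_gb


def disk_cache_gb(total_ram_gb, other_memory_gb):
    """Memory left over for the OS disk cache (never negative)."""
    return max(total_ram_gb - other_memory_gb, 0)


def cached_fraction(index_size_gb, total_ram_gb, other_memory_gb):
    """Fraction of the index that can sit in the disk cache, capped at 1.0."""
    return min(disk_cache_gb(total_ram_gb, other_memory_gb) / index_size_gb, 1.0)


# The example from the text: 8GB index, 4GB for OS + heap + other programs.
print(ideal_total_ram_gb(8, 4))   # 12
print(disk_cache_gb(8, 4))        # 4
print(cached_fraction(8, 8, 4))   # 0.5
```

With 8GB of total memory only half of the index can be cached at once, which is why that configuration might not be enough.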
The exact minimum requirements are highly variable and depend on things like your schema, your index contents, and your queries. If your index has a lot of stored fields, those requirements would be on the smaller end of the scale. If you have very little stored data, you would want to be on the higher end of the scale. The size of stored data doesn't affect search speed very much, though it might affect how long it takes to retrieve the search results once the required documents have been determined.
The Java heap is the memory that Solr requires in order to actually run. Certain things will require a lot of heap memory. The following list is incomplete and in no particular order:
- A large index.
- Frequent updates.
- Super large documents.
- Extensive use of faceting with the default facet.method value.
- Using a lot of different sort parameters.
- Very large Solr caches.
- A large RAMBufferSizeMB.
- Use of Lucene's RAMDirectoryFactory.
How much heap space do I need?
The short version: This is one of those questions that has no generic answer. You want a heap that's large enough so that you don't have OutOfMemory (OOM) errors and problems with constant garbage collection, but small enough that you're not wasting memory or running into huge garbage collection pauses.
The long version: You'll have to experiment. The Java Development Kit (JDK) comes with two GUI tools (jconsole and jvisualvm) that you can connect to the running instance of Solr to see how much heap gets used over time. For longer-term JVM heap, memory spaces, and garbage collection monitoring, you can use tools like SPM. The post on JVM Memory Pool Monitoring shows what to look for in memory pool reports to avoid OOME.
The chart in this jconsole example shows a typical sawtooth pattern - memory usage climbs to a peak, then garbage collection frees up some memory. Figuring out how many collections is too many will depend on your query/update volume. One possible rule of thumb: Look at the number of queries per second Solr is seeing. If the number of garbage collections per minute exceeds that value, your heap MIGHT be too small.
If you let your Solr server run with a high query and update load, the low points in the sawtooth pattern will represent the absolute minimum required memory. Try setting your max heap between 125% and 150% of this value, then repeat the monitoring to see if the low points in the sawtooth pattern are noticeably higher than they were before, or if the garbage collections are happening very frequently. If they are, repeat the test with a higher max heap.
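The procedure above can be sketched in a few lines, assuming you have exported a series of heap-usage samples (in MB) from your monitoring tool. The function names and the sample numbers are invented for illustration. Using the highest low point as the baseline is a conservative assumption of this sketch, not something the procedure prescribes.

```python
def sawtooth_low_points(heap_used_mb):
    """Return samples that are local minima: lower than the previous
    sample and not higher than the next one (the bottom of each 'tooth')."""
    lows = []
    for prev, cur, nxt in zip(heap_used_mb, heap_used_mb[1:], heap_used_mb[2:]):
        if cur < prev and cur <= nxt:
            lows.append(cur)
    return lows


def suggested_max_heap_mb(heap_used_mb):
    """Return (125%, 150%) of the highest low point in the samples."""
    lows = sawtooth_low_points(heap_used_mb)
    if not lows:
        raise ValueError("no garbage collection low points found in the samples")
    baseline = max(lows)  # assumption: take the most demanding low point
    return baseline * 1.25, baseline * 1.5


def heap_might_be_too_small(collections_per_minute, queries_per_second):
    """Rule of thumb from the text: more collections per minute than
    queries per second MIGHT mean the heap is too small."""
    return collections_per_minute > queries_per_second


# Hypothetical samples, not real measurements.
samples = [300, 700, 1100, 420, 800, 1200, 450, 900, 1300, 440, 850]
print(sawtooth_low_points(samples))     # [420, 450, 440]
print(suggested_max_heap_mb(samples))   # (562.5, 675.0)
print(heap_might_be_too_small(30, 50))  # False
```

The result is only a starting point for the next round of monitoring, as described above.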
Reducing heap requirements
Here is an incomplete list, in no particular order, of how to reduce heap requirements, based on the list above for things that require a lot of heap:
- Take a large index and make it distributed - shard your index onto multiple servers.
- One very easy way to do this is to switch to SolrCloud.
- This doesn't actually reduce the overall heap requirement for a large index, but spreads it across multiple servers.
- Don't store all your fields, especially the really big ones.
- Instead, have your application retrieve detail data from the original data source, not Solr.
- Note that doing this will mean that you cannot use Atomic Updates.
- Use facet.method=enum for your facets.
- Reduce the number of different sort parameters.
- Reduce the size of your Solr caches.
- Reduce RAMBufferSizeMB. The default in recent Solr versions is 100.
- This value can be particularly important if you have a lot of cores, because a buffer will be used for each core.
- Don't use RAMDirectoryFactory - instead, use the default and install enough system RAM so the OS can cache your entire index as discussed above.
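As one example from the list above, facet.method can be set per request as a query parameter. The following is a minimal sketch that only builds the request URL. The host, core name (collection1), and field name (category) are placeholders, not values from this page.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, core, field, method="enum"):
    """Build a select URL that facets on one field with the given facet.method."""
    params = [
        ("q", "*:*"),            # match all documents
        ("rows", "0"),           # only facet counts are wanted, no documents
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", method),
    ]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Placeholder host, core, and field names.
print(facet_query_url("http://localhost:8983/solr", "collection1", "category"))
```

Whether enum actually helps depends on the field. Compare heap usage before and after making the change.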
GC pause problems
When you have a large heap (about 4GB or larger), garbage collection pauses can be a major problem. This is usually caused by occasionally required full garbage collections that must "stop the world" -- pause all program execution to clean up memory. There are two main solutions: One is to use a commercial low-pause JVM like Zing, which does come with a price tag. The other is to tune the free JVM you've already got. GC tuning is an art form, and what works for one person may not work for you.
The G1 garbage collector is often mentioned as an alternative for better GC performance. The G1 collector was introduced in later Java 6 releases as an experimental feature and upgraded to a standard feature in Java 7. Although the G1 collector does offer much faster collections on average, anecdotal evidence suggests that it doesn't really do anything about the occasional very long GC pause. The original author of this page saw *longer* stop-the-world pauses with G1 compared to un-tuned CMS. The average collection time was dramatically improved, but the worst-case collections were longer. If a good set of G1 tuning parameters can be found, it is likely that G1 will become a real contender.
Using the ConcurrentMarkSweep (CMS) collector with tuning parameters seems to be the right choice for Solr. Here are some ideas that hopefully you will find helpful:
Normal Solr operation creates a lot of short-lived objects, so having a young generation (eden) that's larger than the Java default is important. Making eden too large can also be a problem, because Solr uses longer-lived objects too. The old (tenured) generation is also important.
If your max heap is just a little bit too small, you may end up with a slightly different garbage collection problem. This problem is usually worse than the problems associated with a large heap: Every time Solr wants to allocate memory for operation, it has to do a full garbage collection in order to free up enough memory to complete the allocation.
Tools and Garbage Collection
Unless they are caused by the heap being too small, tools like JVisualVM and JConsole will NOT show that you are having problems with GC pauses. You can only see information about totals and averages.
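To see individual pauses rather than totals and averages, one option is to read the GC log yourself. The sketch below assumes the JVM was started with -XX:+PrintGCApplicationStoppedTime and writes lines of the form "Total time for which application threads were stopped: N seconds". The sample lines are invented for illustration, and the exact log format can differ between Java versions.

```python
import re

# Assumed log line format; verify against the output of your own JVM.
STOPPED_RE = re.compile(r"application threads were stopped: (\d+\.\d+) seconds")


def pause_summary(log_lines, threshold_seconds=1.0):
    """Summarize stop-the-world pauses found in an iterable of log lines."""
    pauses = []
    for line in log_lines:
        match = STOPPED_RE.search(line)
        if match:
            pauses.append(float(match.group(1)))
    if not pauses:
        return None
    return {
        "count": len(pauses),
        "average": sum(pauses) / len(pauses),
        "longest": max(pauses),
        "over_threshold": [p for p in pauses if p > threshold_seconds],
    }


# Invented sample lines, not output from a real Solr server.
sample_log = [
    "Total time for which application threads were stopped: 0.0150000 seconds",
    "Total time for which application threads were stopped: 0.0200000 seconds",
    "Total time for which application threads were stopped: 12.5000000 seconds",
    "Total time for which application threads were stopped: 0.0180000 seconds",
]
summary = pause_summary(sample_log)
print(summary["longest"])         # 12.5
print(summary["over_threshold"])  # [12.5]
```

To run this against a real log, pass an open file object instead of the sample list. In a log with thousands of short pauses, the average can look healthy while a single long pause is still enough to cause timeouts.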
The following free tools are good at revealing pause problems. There may be more tools available:
Solid State Disks are amazing. They have high transfer rates and pretty much eliminate the latency problems associated with randomly accessing data. If you put your index on solid state disks, performance will increase. Most of the time the performance increase will be enormous.
Often SSD will be touted as a replacement for RAM used as disk cache. This is both true and untrue. Despite the incredible speed of SSD, RAM (the OS disk cache) is still significantly faster, and RAM still plays a big role in the performance of SSD-based systems. You probably don't need as much RAM with SSD as you do with spinning disks, but you can't eliminate the requirement. With spinning disks you need between 50 and 100 percent of your index size as cache. With SSD, that might be 25 to 50 percent, less if your index is very small.
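The percentages above translate into a rough range of disk cache sizes. This sketch simply applies them to an index size; the function and dictionary names are invented, and the ranges are the rules of thumb from the paragraph above, not measured requirements.

```python
# Rule-of-thumb fractions of index size needed as OS disk cache.
CACHE_FRACTION = {
    "spinning": (0.50, 1.00),
    "ssd": (0.25, 0.50),
}


def disk_cache_range_gb(index_size_gb, storage):
    """Return the (low, high) disk cache estimate in GB for a storage type."""
    low, high = CACHE_FRACTION[storage]
    return index_size_gb * low, index_size_gb * high


# A hypothetical 40GB index.
print(disk_cache_range_gb(40, "spinning"))  # (20.0, 40.0)
print(disk_cache_range_gb(40, "ssd"))       # (10.0, 20.0)
```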
Note that SSDs are still a young technology and that the amount of independent Solr-oriented performance tests is very limited. One such test indicates that a disk cache of only 10% index size might be enough for high search performance with SSDs. See Memory is overrated. Note that if your index has very few stored fields, 10% may not be enough. If you have a lot of (or very large) stored fields, it might be.
One potential problem with SSD is that they require operating system TRIM support for good long-term performance. For single disks, TRIM is usually well supported, but if you want to add any kind of hardware RAID (and most software RAID as well), TRIM support disappears. At the time of this writing, it seems that only Intel supports a solution and that is limited to Windows 7 and RAID 0. One way to make this less of a problem with Solr is to put your OS and Solr itself on a RAID of regular disks, and put your index data on a lone SSD. If the SSD fails, your redundant server(s) will still be there to handle requests.
Although there could be other causes, the most common reason for slow startup is the updateLog feature introduced in Solr 4.0. The problem is not with the feature itself, but depending on how other parts of Solr are configured and used when the feature is turned on, the transaction log can grow out of control.
The updateLog feature adds a transaction log for all updates. When used correctly, the transaction log is a good thing, and it is required for SolrCloud. Solr 4.0 also introduced the concept of a soft commit.
If you send a large number of document updates to your index without doing any commits at all, or only doing soft commits, the transaction log will get very large. When Solr starts up, the entire transaction log is replayed to ensure that index updates are not lost. With very large logs, this goes very slowly. Large logs can also be caused by a large import using the DataImportHandler, which optionally does a hard commit at the end.
To fix the slow startup, you need to keep your transaction log size down. The only way to do this is by sending frequent hard commits. A hard commit closes the current transaction log and starts a new one. Solr only keeps a few of these logs, so by frequently creating new ones, the total transaction log size will be small. Replaying small transaction logs goes quickly.
Turning on autoCommit in your solrconfig.xml update handler definition is the solution:
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <updateLog />
</updateHandler>
One reason that people will send a large number of updates without doing any commits is that they don't want their deletes or updates to be visible until they are all completed. This requirement is maintained by the openSearcher=false setting in the above config. If you use this option, you will need to send an explicit hard or soft commit to make the changes visible.
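An explicit commit can be sent as a request to the update handler once all updates are complete. The sketch below builds such a request with the commit or softCommit parameter. The host and core name are placeholders, and send_commit is a hypothetical helper that needs a running Solr server, so it is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def commit_url(base_url, core, soft=False):
    """Build an update URL that asks Solr for a hard or soft commit."""
    param = "softCommit" if soft else "commit"
    return f"{base_url}/{core}/update?{urlencode({param: 'true'})}"


def send_commit(base_url, core, soft=False, timeout=30):
    """Send the commit request and return the HTTP status code."""
    with urlopen(commit_url(base_url, core, soft), timeout=timeout) as response:
        return response.status


# Placeholder host and core name.
print(commit_url("http://localhost:8983/solr", "collection1"))
print(commit_url("http://localhost:8983/solr", "collection1", soft=True))
```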
You'll want to adjust the maxDocs and maxTime parameters in your autoCommit configuration to fit your requirements. The values provided (25000 docs or five minutes) are good general-purpose defaults, but they may require adjustment in situations with a very high or very low update volume.
The major causes of slow commit times include:
- Large autowarmCount values on Solr caches.
- Extremely frequent commits.
- Not enough RAM, discussed above.
If you have large autowarmCount values on your Solr caches, it can take a very long time to do that cache warming. The filterCache is particularly slow to warm. The solution is to reduce the autowarmCount, reduce the complexity of your queries, or both.
If you commit very frequently, you may send a new commit before the previous commit is finished. If you have cache warming enabled as just discussed, this is more of a problem. If you have a high maxWarmingSearchers in your solrconfig.xml, you can end up with a lot of new searchers warming at the same time, which is very I/O intensive, so the problem compounds itself.
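The way frequent commits and slow warming compound can be illustrated with a simplified model. It assumes commits arrive at a fixed interval, every commit opens a new searcher, and warming always takes the same amount of time. Real servers are less regular, and the function names are invented for this sketch.

```python
import math


def peak_warming_searchers(warm_seconds, commit_interval_seconds):
    """Estimate how many searchers are warming at once in the simplified model."""
    if commit_interval_seconds <= 0:
        raise ValueError("commit interval must be positive")
    # Each commit starts one warming searcher; they overlap when warming
    # takes longer than the time between commits.
    return max(1, math.ceil(warm_seconds / commit_interval_seconds))


def exceeds_limit(warm_seconds, commit_interval_seconds, max_warming_searchers):
    """True if the estimated overlap is larger than maxWarmingSearchers."""
    peak = peak_warming_searchers(warm_seconds, commit_interval_seconds)
    return peak > max_warming_searchers


# Hypothetical numbers: warming takes 10 seconds.
print(peak_warming_searchers(10, 2))   # 5  (commit every 2 seconds)
print(peak_warming_searchers(10, 60))  # 1  (commit once a minute)
print(exceeds_limit(10, 2, 2))         # True
```

In this model, either shortening the warming time or spacing out the commits brings the overlap back down, which matches the solutions described above.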