Here are some things to try to speed up indexing in your Lucene application. Short code sketches for several of these tips follow the list.

  • Make sure you are using the latest version of Lucene.

  • Open a single writer and re-use it for the duration of your indexing session.

  • Flush by RAM usage instead of document count.

    Call writer.ramSizeInBytes() after every added doc, then call flush() when the writer is using too much RAM. This is especially helpful if you have small docs or highly variable doc sizes. You first need to set maxBufferedDocs large enough to prevent the writer from flushing based on document count, but don't set it too large or you may hit [http://issues.apache.org/jira/browse/LUCENE-845 LUCENE-845]; somewhere around 2-3X your "typical" flush count should be OK (see the first sketch after this list).

  • Use as much RAM as you can afford.

    More RAM before flushing means Lucene writes larger segments to begin with, which means less merging later. Testing in [http://issues.apache.org/jira/browse/LUCENE-843 LUCENE-843] found that around 48 MB is the sweet spot for that content set, but your application could have a different sweet spot (see the RAM buffer sketch below).

  • Increase mergeFactor, but not too much. A larger mergeFactor defers merging of segments until later, which speeds up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking on the hard drive (see the mergeFactor sketch below).

  • Turn off compound file format.

    Building the compound file format takes time during indexing (7-33% in testing for [http://issues.apache.org/jira/browse/LUCENE-888 LUCENE-888]). However, note that turning it off will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large (see the sketch below).

  • Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields); a sketch follows the list.

  • Turn off any features you are not actually using. If you are storing fields but not using them at query time, don't store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance (sketched below).

  • Use a faster analyzer.

    Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time-consuming. If you can get by with a simpler analyzer, then try it (see the analyzer sketch below).

  • Speed up document construction. Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time-consuming.

  • Don't optimize unless you really need to (for faster searching).

  • Use multiple threads with one IndexWriter. Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point (see the threading sketch below).

  • Index into separate indices, then merge. If you have a very large amount of content to index, you can break it into N "silos", index each silo on a separate machine, then use writer.addIndexesNoOptimize() to merge them all into one final index (see the merge sketch below).

  • Use a faster machine, especially a faster IO system.

  • Run a Java profiler.

    If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called [http://www.khelekore.org/jmp JMP]. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.
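
Below are short code sketches for some of the tips above. They assume the Lucene 2.x-era Java API and omit imports; dir stands for your target Directory, and nextDocument() is a hypothetical helper returning the next Document (or null when there are none left). First, flushing by RAM usage instead of document count:

{{{
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
writer.setMaxBufferedDocs(30000);        // ~2-3X your typical per-flush doc count,
                                         // so doc count rarely triggers the flush
long ramBudgetBytes = 32L * 1024 * 1024; // example budget: 32 MB

Document doc;
while ((doc = nextDocument()) != null) {
  writer.addDocument(doc);
  if (writer.ramSizeInBytes() > ramBudgetBytes) {
    writer.flush();                      // write the buffered docs to a new segment
  }
}
writer.close();
}}}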
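
For the "use as much RAM as you can afford" tip: if your Lucene version includes the [http://issues.apache.org/jira/browse/LUCENE-843 LUCENE-843] changes, the writer can flush by RAM usage on its own; the 48 MB value below is only the sweet spot found for that issue's test content.

{{{
writer.setRAMBufferSizeMB(48);   // let the writer flush itself once ~48 MB is buffered
}}}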
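
For mergeFactor, the value below is only an example; the default is 10, and very large values cost file descriptors and disk seeks as noted above.

{{{
writer.setMergeFactor(30);   // defer merging somewhat; don't go overboard
}}}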
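
Turning off the compound file format is a single writer setting; remember it raises the number of open files for both indexing and searching.

{{{
writer.setUseCompoundFile(false);   // skip building compound (.cfs) files during indexing
}}}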
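
A sketch of aggregating small text fields into a single indexed "contents" field while still storing the originals (field names and values are illustrative):

{{{
Document doc = new Document();
doc.add(new Field("title",  title,  Field.Store.YES, Field.Index.NO));
doc.add(new Field("author", author, Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", title + " " + author + " " + body,
                  Field.Store.NO, Field.Index.TOKENIZED));
}}}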
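
A sketch of turning off features for a field you only search on; setOmitNorms() may not exist under that name in every release, so check your version's Field/Fieldable API:

{{{
Field body = new Field("body", text, Field.Store.NO,
                       Field.Index.TOKENIZED, Field.TermVector.NO);
body.setOmitNorms(true);   // only if you don't need length/boost normalization for this field
doc.add(body);
}}}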
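
Swapping in a simpler analyzer is a one-line change; whether WhitespaceAnalyzer (or another simple analyzer) is acceptable depends entirely on your search requirements.

{{{
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);   // instead of StandardAnalyzer
}}}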
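
A threading sketch: addDocument() may be called concurrently on one IndexWriter. Assume this runs in a method that declares the checked exceptions, that nextDocument() is thread-safe, and that the thread count of 4 is only a starting point.

{{{
final IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
ExecutorService pool = Executors.newFixedThreadPool(4);
for (int i = 0; i < 4; i++) {
  pool.execute(new Runnable() {
    public void run() {
      try {
        Document doc;
        while ((doc = nextDocument()) != null) {
          writer.addDocument(doc);
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);   // wait for the worker threads to finish
writer.close();
}}}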
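
A merge sketch for the "silos" approach; the paths are illustrative, and each silo directory is assumed to already hold an index built on its own machine:

{{{
Directory[] silos = new Directory[] {
  FSDirectory.getDirectory("/index/silo1"),
  FSDirectory.getDirectory("/index/silo2"),
};
IndexWriter merged = new IndexWriter(FSDirectory.getDirectory("/index/final"),
                                     new StandardAnalyzer(), true);
merged.addIndexesNoOptimize(silos);   // merge all silos into the final index
merged.close();
}}}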
