Near Realtime Search

Near realtime search (NRT) in Lucene refers to features added to IndexWriter in Lucene 2.9 that make updates efficiently searchable, ideally within milliseconds after an update completes.

One goal of the near realtime search design is to make NRT as transparent as possible to the user. Another is to minimize the latency between making an update and being able to search it. Lucene now offers a unified API: one calls getReader, and any updates are immediately searchable.

At this point NRT is a workaround for the latency of fsync on most operating systems: for updates, instead of syncing what's in RAM to disk, we perform merges and keep deletes in RAM until a maximum byte size is reached or the user calls commit. In the future, NRT may search the RAM buffer directly, without first encoding a full Lucene segment. Prior to 2.9, loading field caches was also an IO latency issue on the search side.
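The "maximum bytes size" threshold mentioned above corresponds to IndexWriter's RAM buffer size, which can be tuned explicitly. A minimal sketch (the 32 MB value is an arbitrary example, not a recommendation):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class RamBufferConfig {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(new RAMDirectory(),
        new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    // Buffer up to 32 MB of added documents and deletes in RAM before
    // flushing a new segment; a larger buffer means fewer flushes.
    writer.setRAMBufferSizeMB(32.0);
    double mb = writer.getRAMBufferSizeMB();
    writer.close();
  }
}
```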

IndexWriter manages the subreaders internally, so there is no need to call reopen; getReader may be used instead. A benefit of this design is the efficiency of deletes: they are not written to disk until commit is called, and are instead carried over in the RAM segment reader held by the IndexWriter. Here we leverage the IndexReader clone method, which keeps references to deletes and norms via a copy-by-reference mechanism. Otherwise we could be calling fsync just to update one document.
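A small sketch of the behavior described above: a delete buffered in RAM is visible to a reader obtained from getReader, with no commit (and hence no fsync) in between. The field names and values are illustrative only.

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class NrtDeleteExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);

    IndexReader r1 = writer.getReader(); // sees the added doc, no commit
    int before = r1.numDocs();

    writer.deleteDocuments(new Term("id", "1")); // delete buffered in RAM
    IndexReader r2 = writer.getReader(); // sees the delete, still no commit
    int after = r2.numDocs();

    r1.close();
    r2.close();
    writer.close();
  }
}
```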

NRT adds an internal RAM directory (LUCENE-1313) to IndexWriter, where documents are flushed before being merged to disk. This technique decreases the turnaround time for making updates searchable when calling getReader.

Sample code:

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document(); // create a document
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc); // add the document
IndexReader reader = writer.getReader(); // get a reader with the new doc
Document addedDoc = reader.document(0); // retrieve it without a commit

Internals

IO Cache

Large merges can evict existing segments from the IO cache. A query that was fast may suddenly become slow due to the latency of accessing the hard drive. One way to address this is to implement a JNI-based Directory that calls fadvise or madvise. These advise calls would allow the segment merger to tell the OS not to load the segments being merged into the IO cache.

NearRealtimeSearch (last edited 2009-09-30 23:53:04 by JasonRutherglen)