Synopsis

The KinoSearch merge model is an algorithm for merging inverted documents, originally developed by Marvin Humphrey for an early version of the search engine library KinoSearch. It differs from the Java Lucene 2.0 merge model in two main ways.

The algorithm is particularly well-suited for use with interpreted-language implementations of Lucene, so Lucy will use it.

Algorithm

initialize a segment
create a sort pool (an object which performs 
  external sorting)

for each document {
  invert document
  write stored fields and term vectors
  add serialized postings to sort pool  
}

for each merge-worthy segment {
  write stored fields and term vectors to current 
    segment, skipping deletions
  add serialized postings to sort pool, mapping to 
    new doc numbers
}

sort the sort pool
write the term dictionary, iterating through the 
  items in the sort pool

commit changes

Serialized Posting Format

  1. Field name encoded as UTF-8
  2. null byte
  3. Term text as UTF-8
  4. null byte
  5. document number encoded as big-endian 32-bit integer
  6. auxiliary data (not used for sorting): positions, start offsets,
    • end offsets, field number, term text length

External Sorter

KinoSearch's external sorter implements an N-way merge sort. The C function memcmp is used to determine sort order.

The external sorter, in the form of a "sortex" object, feeds serialized postings into a memory cache. The memory consumed by the cache is tracked, and whenever it exceeds a user-settable threshold, the cache is sorted and written to disk, producing a "run".

When the sorter is told to "sort" its entire sort pool, it sorts the cache and flushes to disk one last time, then enters "fetch" mode. A small read buffer is created for each run, and items from those buffers are merged into a single, sorted stream.

Rationale

The KinoSearch merge model offers several advantages over the Lucene 2.0 merge model:

1. Adherence to the principle of MinimizingObjectOverhead

2. Stored fields and TermVectors are only written once, minimizing disk i/o

3. Lower RAM requirements

4. RAM ceiling set in bytes rather than documents

5. Single commit per indexing session

KinoSearchMergeModel (last edited 2009-09-20 21:59:25 by localhost)