Lucene Papers

To understand the fundamental ideas behind Lucene, you should first get familiar with InformationRetrieval. This page tries to collect links to resources that explain some advanced topics.

Storage

Postings list encoding

In addition to VInt encoding, Lucene supports (or plans to support) other postings list encoding formats (FOR-delta, PFOR-delta, Simple9, ...).

The Pulsing codec

An optimized codec for fields that have lots of rare terms.

DocumentsWritersPerThread

Improved concurrency of index updates.

Query execution

Terms dictionary

In addition to its binary-search based terms dictionary, Lucene has a "block tree" terms dictionary, inspired of burst tries.

NumericRangeQuery

Lucene has an optimized range query implementation for numeric types:

BKD trees

BKD trees have been implemented to support geo capabilities in Lucene and have superseded NumericRangeQuery for one-dimensional data.

Automaton-based fuzzy query

Lucene 4.0 supports an improved fuzzy query implementation that is based on Levenshtein automata.

Scoring models

In addition to its default TF-IDF scoring algorithm, Lucene supports other scoring models such as Okapi BM25 and models based on language models.

Incorporating non-textual signals into the final score

The below paper describes implementation ideas behind Lucene's FeatureField to fold non-textual static signals like pagerank, url length, etc. into the final score.

Block Max WAND

Block MAX WAND is an iteration over WAND that helps efficiently skip scoring non-relevant documents.

Misc

FST compression

Lucene uses FSTs a lot, so their in-memory size is important.

Twitter Earlybird

Modifications that Twitter made to Lucene to support lock-free updates and efficient early query termination for time-based relevance.

  • No labels