Lucene Papers

To understand the fundamental ideas behind Lucene, you should first get familiar with InformationRetrieval. This page tries to collect links to resources that explain some advanced topics.

Storage

Postings list encoding

In addition to VInt encoding, Lucene supports (or plans to support) other postings list encoding formats (FOR-delta, PFOR-delta, Simple9, ...).

The Pulsing codec

An optimized codec for fields that have lots of rare terms.

DocumentsWritersPerThread

Improved concurrency of index updates.

Query execution

Terms dictionary

In addition to its binary-search based terms dictionary, Lucene has a "block tree" terms dictionary, inspired of burst tries.

NumericRangeQuery

Lucene has an optimized range query implementation for numeric types:

Automaton-based fuzzy query

Lucene 4.0 supports an improved fuzzy query implementation that is based on Levenshtein automata.

Scoring models

In addition to its default TF-IDF scoring algorithm, Lucene supports other scoring models such as Okapi BM25 and models based on language models.

Misc

FST compression

Lucene uses FSTs a lot, so their in-memory size is important.

Twitter Earlybird

Modifications that Twitter made to Lucene to support lock-free updates and efficient early query termination for time-based relevance.

LucenePapers (last edited 2012-06-24 21:42:58 by jpountz)