
...

Questions should only be added to this Wiki page when they already have an answer that can be added at the same time.


Lucene FAQ

General

How do I start using Lucene?

Lucene has no external dependencies, so just add lucene-core-x.y-dev.jar to your development environment's classpath. After that, you can create an index and run searches against it; a minimal round trip is sketched below.
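
A sketch against a recent Lucene 9.x (class locations have moved between major versions, so e.g. StandardAnalyzer may live in a separate analysis module on some releases; the field and term values here are illustrative):

No Format

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class HelloLucene {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory(); // in-memory index, just for the demo

    // Index one document with a single analyzed text field.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("body", "Lucene is a search library", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Open a reader and search for a single term.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits = searcher.search(new TermQuery(new Term("body", "search")), 10);
      System.out.println("hits: " + hits.totalHits);
    }
  }
}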

If you think Lucene is too low-level for you, you might want to consider using Solr, which usually requires less Java programming.

...

What Java version is required to run Lucene?

See Lucene System Requirements for the most recent Lucene versions:

  • Lucene 9 requires Java 11
  • Lucene 8.8.2 requires Java 8 or greater
  • Lucene 7.7.3 requires Java 8 or greater
  • Lucene >= 1.9 requires Java 1.4

...

  • Lucene 1.4 will run with JDK 1.3 and up but requires at least JDK 1.4 to compile.

Will Lucene work with my Java application?

...

How can I get the latest and greatest development code?

See the SourceRepository page for the Lucene dev source code.

Where can I get the javadocs for the org.apache.lucene classes?

...

  • Always make sure that you explicitly close all file handles you open, especially in case of errors. Use a try/catch/finally block to open the files, i.e. open them in the try block and close them in the finally block. Remember that Java doesn't have destructors, so don't close file handles in a finalize method – this method is not guaranteed to be executed. See the sketch after this list.
  • Use the compound file format (it's activated by default starting with Lucene 1.4) by calling IndexWriter's setUseCompoundFile(true).
  • Don't set IndexWriter's mergeFactor to large values. Large values speed up indexing but increase the number of files that need to be opened simultaneously.
  • Make sure you only open one IndexSearcher, and share it among all of the threads that are doing searches – this is safe, and it will minimize the number of files that are open concurrently.
  • Try to increase the number of files that can be opened simultaneously. On Linux using bash this can be done by calling ulimit -n <number>.
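
For instance, the close-in-finally pattern from the first item looks like this (a sketch using the old IndexReader.open(String) API of the Lucene 1.x/2.x era; on current versions you would use try-with-resources on a DirectoryReader instead, and the index path is hypothetical):

No Format

IndexReader reader = null;
try
{
    reader = IndexReader.open("/path/to/index");  // hypothetical index location
    // ... use the reader ...
}
finally
{
    if (reader != null)
        reader.close();   // runs even if an exception was thrown above
}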

When I compile Lucene x.y.z from source, the version number in the jar file name and MANIFEST.MF is different. What's up with that?

...

How do I contribute an improvement?

Please follow all of these steps to submit a Lucene patch.

Why hasn't patch FOO been committed?

...

If you are looking at example code (in an article or book perhaps) and just need to understand how the example would change to work with 2.0 (without needing to actually compile it) you can review the javadocs for Lucene 1.9 and lookup any methods used in the examples that are no longer part of Lucene. The 1.9 javadocs will have a clear deprecation message explaining how to get the same effect using the 2.x methods.

How is Lucene's indexing and search performance measured?

Check Lucene bench: https://home.apache.org/~mikemccand/lucenebench/

I am having a performance issue. How do I ask for help on the java-user@lucene.apache.org mailing list?

  1. Make sure you have looked through the BasicsOfPerformance
  2. Describe your problem, giving details about how you are using Lucene
  3. What version of Lucene are you using? What JDK? Can you upgrade to the latest?
  4. Make sure it truly is a Lucene problem. That is, isolate the problem and/or profile your application.
  5. Search the java-user and java-dev mailing lists; see http://lucene.apache.org/java/docs/mailinglists.html

What does l.a.o and o.a.l.xxxx stand for?

...

  • The desired term is in a field that was not defined as 'indexed'. Re-index the document and make the field indexed.
  • The term is in a field that was not tokenized during indexing and therefore, the entire content of the field was considered as a single term. Re-index the documents and make sure the field is tokenized.
  • The field specified in the query simply does not exist. You won't get an error message in this case, you'll just get no matches.
  • The field specified in the query has wrong case. Field names are case sensitive.
  • The term you are searching is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the StopFilter, a search for the word 'the' will always fail (i.e. produce no hits).
  • You are using different analyzers (or the same analyzer but with different stop words) for indexing and searching and as a result, the same term is transformed differently during indexing and searching.
  • The analyzer you are using is case sensitive (e.g. it does not use the LowerCaseFilter) and the term in the query has different case than the term in the document.
  • The documents you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int).
  • Make sure to open a new IndexSearcher after adding documents. An IndexSearcher will only see the documents that were in the index when it was opened.
  • If you are using the QueryParser, it may not be parsing your BooleanQuerySyntax the way you think it is.
  • Span and phrase queries won't work if omitTf() has been called for a field since that causes positional information about tokens to not be saved in the index. Span queries & phrase queries require the positional information in order to work.

If none of the possible causes above apply to your case, this will help you to debug the problem:

  • Use the Query's toString() method to see how it actually got parsed.
  • Use Luke to browse your index: on the "Documents" tab, navigate to a document, then use the "Reconstruct & Edit" to look at how the fields have been stored ("Stored original" tab) and indexed ("Tokenized" tab)

Why am I getting a TooManyClauses exception?

The following types of queries are expanded by Lucene before it does the search: RangeQuery, PrefixQuery, WildcardQuery, FuzzyQuery. For example, if the indexed documents contain the terms "car" and "cars" the query "ca*" will be expanded to "car OR cars" before the search takes place. The number of these terms is limited to 1024 by default. Here are a few different approaches that can be used to avoid the TooManyClauses exception:

  • Use a filter to replace the part of the query that causes the exception. For example, a RangeFilter can replace a RangeQuery on date fields and it will never throw the TooManyClauses exception – You can even use ConstantScoreRangeQuery to execute your RangeFilter as a Query. Note that filters are slower than queries when used for the first time, so you should cache them using CachingWrapperFilter. Using Filters in place of Queries generated by QueryParser can be achieved by subclassing QueryParser and overriding the appropriate function to return a ConstantScore version of your Query.
  • Increase the number of terms using BooleanQuery.setMaxClauseCount(). Note that this will increase the memory requirements for searches that expand to many terms. To deactivate any limits, use BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE).
  • A specific solution that can work on very precise fields is to reduce the precision of the data in order to reduce the number of terms in the index. For example, the DateField class uses millisecond resolution, which is often not required. Instead you can save your dates in the "yyyymmddHHMM" format, maybe even without hours and minutes if you don't need them (this was simplified in Lucene 1.9 thanks to the new DateTools class); see the sketch after this list.
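
As an illustration of the last item, a sketch against the classic Lucene 2.x field API (the field construction details differ in later versions, and the field name is illustrative):

No Format

import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// "20070312" instead of a millisecond timestamp: far fewer unique terms,
// so range queries over this field expand to far fewer clauses.
String day = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
doc.add(new Field("date", day, Field.Store.YES, Field.Index.NOT_ANALYZED));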

How can I search over multiple fields?

...

This is not supported by QueryParser, but you could extend the QueryParser to build a MultiPhraseQuery in those cases.
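
For ordinary (non-phrase) queries, by contrast, MultiFieldQueryParser expands the query across several fields for you. A rough sketch (assuming the modern queryparser module constructor; older versions also take a Version argument, and the field names are illustrative):

No Format

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

String[] fields = { "title", "body" };
MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer());
// Produces roughly: (title:apache body:apache) (title:lucene body:lucene)
Query query = parser.parse("apache lucene");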

Is the QueryParser thread-safe?

No, it's not.

How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this?

...

Then grab the last Term in TermDocs that this method returns.

Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other?

MultiSearcher searches indices sequentially. Use ParallelMultiSearcher as a searcher that performs multiple searches in parallel. Please note that there's a known bug in Lucene < 1.9 in the MultiSearcher's result ranking.

...

No, not by default. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. These queries are case-insensitive anyway because QueryParser makes them lowercase. This behavior can be changed using the setLowercaseExpandedTerms(boolean) method.
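
A sketch of that setter in use (against the classic QueryParser of roughly the Lucene 6.x era; the setter was removed in later releases, and the field name and analyzer are illustrative):

No Format

import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

QueryParser parser = new QueryParser("body", analyzer);
parser.setLowercaseExpandedTerms(false);  // keep the original case of wildcard/prefix/fuzzy terms
Query query = parser.parse("Dog*");       // now searches for "Dog*", not "dog*"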

Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes?

According to the Javadoc for IndexReader maxDoc() method "returns one greater than the largest possible document number".

...

Also consider using a JSP tag for caching; see http://www.opensymphony.com/oscache/ for one tag library that's easy and works well.

Is the IndexSearcher thread-safe?

Yes, IndexSearcher is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory.
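
If the index changes over time, the SearcherManager utility (available since Lucene 3.5) takes care of sharing and reopening a single searcher safely across threads. A sketch, with directory standing in for your index Directory:

No Format

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

SearcherManager manager = new SearcherManager(directory, null); // null = default SearcherFactory

// In each search thread:
IndexSearcher searcher = manager.acquire();
try {
  // ... run searches with this searcher ...
} finally {
  manager.release(searcher); // never close it yourself; the manager reference-counts it
}

// Periodically, e.g. after index updates:
manager.maybeRefresh();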

...

The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.

No Format

TermEnum terms = null;
try
{
    terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
    while (terms.term() != null && "FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...

        if (!terms.next())
            break;
    }
}
finally
{
    if (terms != null)
        terms.close();
}
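
On recent Lucene versions the TermEnum API above no longer exists; the equivalent per-field enumeration looks roughly like this (a sketch assuming Lucene 8+, where MultiTerms replaced the older MultiFields helper):

No Format

import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

Terms terms = MultiTerms.getTerms(indexReader, "FIELD-NAME-HERE");
if (terms != null) {
  TermsEnum it = terms.iterator();
  BytesRef term;
  while ((term = it.next()) != null) {
    // ... collect term.utf8ToString() ...
  }
}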

...

  • Use QueryFilter with the previous query as the filter. Doug Cutting recommends against this, because a QueryFilter does not affect ranking.
  • Combine the previous query with the current query using BooleanQuery, using the previous query as required.

The BooleanQuery approach is the recommended one.
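
A sketch of the BooleanQuery approach, using the builder API of Lucene 5+ (previousQuery and currentQuery stand in for the two queries from the text; older versions let you add clauses to a mutable BooleanQuery directly):

No Format

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Both the previous query and the refining query must match.
Query combined = new BooleanQuery.Builder()
    .add(previousQuery, BooleanClause.Occur.MUST)
    .add(currentQuery, BooleanClause.Occur.MUST)
    .build();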

...

One Explanation...

No Format

  > Does anyone have an example of limiting results returned based on a
  > score threshold? For example if I'm only interested in documents with
  > a score > 0.05.

I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches).  The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.

...

See ImproveSearchingSpeed.


Does Lucene support Approximate Nearest Neighbor (ANN) / k-Nearest Neighbors (k-NN) search?

Yes, Lucene 9 and greater support ANN / kNN search.
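
A rough sketch of the vector search API (assuming the Lucene 9.4+ class names; earlier 9.x releases used KnnVectorField and KnnVectorQuery, and the field name and vectors are illustrative):

No Format

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;

// Index side: store an embedding vector alongside the document.
Document doc = new Document();
doc.add(new KnnFloatVectorField("embedding", new float[] { 0.1f, 0.7f, 0.2f }));

// Search side: retrieve the 10 approximate nearest neighbours of a query vector.
KnnFloatVectorQuery query = new KnnFloatVectorQuery("embedding", new float[] { 0.1f, 0.6f, 0.3f }, 10);
TopDocs hits = searcher.search(query, 10);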

Does Lucene support auto-suggest / autocomplete?

Yes, see https://lucene.apache.org/core/9_4_2/suggest/index.html

What is the default relevance / similarity implementation of Lucene?

The default implementation to determine the relevance of documents for a particular query is based on BM25.

Also see IndexSearcher#getDefaultSimilarity()
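
If you want a different ranking model, you can replace the default on the searcher. A sketch (reader stands in for an open IndexReader):

No Format

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;

IndexSearcher searcher = new IndexSearcher(reader); // uses BM25Similarity by default
// Tune BM25's k1/b parameters, or fall back to the classic TF-IDF model:
searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
// searcher.setSimilarity(new ClassicSimilarity()); // the pre-6.0 default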

Indexing

Can I use Lucene to crawl my site or other sites on the Internet?

No. Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching and does both well. However, several crawlers are available which you could use: see the list of Open Source Crawlers in Java. regain is an Open Source tool that crawls web sites, stores them in a Lucene index and offers a search web interface. Also see Nutch for a powerful Lucene-based search engine.

...

What is the difference between Stored, Tokenized, Indexed, and Vectored?

  • Stored = as-is value stored in the Lucene index
  • Tokenized = field is analyzed using the specified Analyzer - the tokens emitted are indexed
  • Indexed = the text (either as-is with keyword fields, or the tokens from tokenized fields) is made searchable (aka inverted)
  • Vectored = term frequency per document is stored in the index in an easily retrievable fashion.

What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document?

No, there will be multiple copies of the same document in the index.
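
If you want add-or-replace semantics instead, IndexWriter's updateDocument atomically deletes any documents matching the given term and adds the new one. A sketch, assuming a unique "id" field:

No Format

import org.apache.lucene.index.Term;

// Deletes all documents whose "id" field equals "42", then adds newDoc.
writer.updateDocument(new Term("id", "42"), newDoc);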

How do I delete documents from the index?

IndexWriter allows you to delete by Term or by Query. The deletes are buffered and then periodically flushed to the index, and made visible once commit() or close() is called.

IndexReader can also delete documents, by Term or document number, but you must close any open IndexWriter before using IndexReader to make changes (and, vice/versa). IndexReader also buffers the deletions and does not write changes to the index until close() is called, but if you use that same IndexReader for searching, the buffered deletions will immediately take effect. Unlike IndexWriter's delete methods, IndexReader's methods return the number of documents that were deleted.

Generally it's best to use IndexWriter for deletions, unless 1) you must delete by document number, 2) you need your searches to immediately reflect the deletions or 3) you must know how many documents were deleted for a given deleteDocuments invocation.


If you would like to delete documents by document number, IndexWriter provides tryDeleteDocument. Note however that this method only succeeds if the segment where the doc ID belongs has not been merged away. It is generally preferred to use a primary key field that holds a unique ID string for each document; then you can delete a single document by creating the Term containing the ID and passing it to IndexWriter's deleteDocuments(Term) method.

Once a document is deleted it will not appear in TermDocs nor TermPositions enumerations, nor in any search results, and attempts to load the document will result in an exception. The presence of this document may still be reflected in the docFreq statistics, and thus alter search scores, though this will be corrected eventually as segments containing deletions are merged.
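
For example, deleting by Term through IndexWriter looks like this (a sketch, again assuming a unique "id" field; writer is an open IndexWriter):

No Format

import org.apache.lucene.index.Term;

writer.deleteDocuments(new Term("id", "42")); // buffered
writer.commit();                              // now visible to newly opened readers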

...

Also be careful with Fields that are not tokenized (like Keywords). During indexing, the Analyzer won't be called for these fields, but for a search, the QueryParser can't know this and will pass all search strings through the selected Analyzer. Usually searches for Keywords are constructed in code, but during development it can be handy to use general-purpose tools (e.g. Luke) to examine your index. Those tools won't know which fields are tokenized either. In the contrib/analyzers area there's a KeywordTokenizer with an example KeywordAnalyzer for cases like this.

...

No, creating the IndexWriter with "true" should remove all old files in the old index (actually, with Lucene < 1.9 it removes all files in the index directory, whether or not they belong to Lucene).

When I upgrade Lucene, for example from 8.8.2 to 9.0.0, do I have to reindex?

Not necessarily: you can add a version-specific lucene-backward-codecs library. For example, lucene-backward-codecs-9.0.0.jar enables Lucene 9 to read an index created by Lucene 8.

Upgrading the index is still recommended, because a Lucene 8.x index may not be readable by an eventual Lucene 10 release.

Certain changes always require reindexing, though; also see https://solr.apache.org/guide/8_0/reindexing.html

How can I index and search digits and other non-alphabetic characters?

The components responsible for this are the various Analyzers. Make sure you use the appropriate analyzer. For example, StandardAnalyzer does not remove numbers, but it removes most punctuation.


Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread-safe?

Yes, the IndexWriter.addIndexes(Directory[]) method is thread-safe (it is a synchronized method). IndexWriter in general is thread-safe, i.e. you should use the same IndexWriter object from all of your threads. Actually it's impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception while trying to create the lock file.

...

Here is an example:

No Format

public class MyAnalyzer extends ReusableAnalyzerBase {
  private Version matchVersion;

  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Split on whitespace, lowercase each token, then keep only tokens
    // that are at least 3 characters long.
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream sink = new LowerCaseFilter(matchVersion, source);
    sink = new LengthFilter(sink, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, sink);
  }
}

...

If you want your custom token modification to come after the filters that Lucene's StandardAnalyzer class would normally call, do the following:

No Format

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // First reproduce the chain StandardAnalyzer would normally build...
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream sink = new StandardFilter(matchVersion, source);
    sink = new LowerCaseFilter(matchVersion, sink);
    sink = new StopFilter(matchVersion, sink,
                          StopAnalyzer.ENGLISH_STOP_WORDS_SET, false);
    // ...then append the custom filters so they see fully normalized tokens.
    sink = new CaseNumberFilter(sink);
    sink = new NameFilter(sink);
    return new TokenStreamComponents(source, sink);
  }

...

Lucene only uses Java strings, so you normally do not need to care about this. Just remember that you may need to specify an encoding when you read in external strings from e.g. a file (otherwise the system's default encoding will be used). If you really need to recode a String (e.g. one that was decoded with the wrong charset) you can use this hack:

No Format

// Repairs a string whose bytes were mis-decoded as ISO-8859-1 but are really UTF-8.
String newStr = new String(someString.getBytes("ISO-8859-1"), "UTF-8");

...

In order to index XML documents you need to first parse them to extract the text that you want to index. Have a look at Tika, the content analysis toolkit. See also the article Parsing, indexing, and searching XML with Digester and Lucene.

How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?

Have a look at Tika, the content analysis toolkit.

...

How can I index PDF documents?

...

Note that the article uses an older version of Apache Lucene. For parsing the Java source files and extracting that information, the ASTParser of the Eclipse Java development tools is used.


What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?


When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

...