
...

How do I contribute an improvement?

Please follow all of these steps to submit a Lucene patch.

Why hasn't patch FOO been committed?

...

If you are looking at example code (in an article or book, perhaps) and just need to understand how the example would change to work with 2.0 (without needing to actually compile it), you can review the javadocs for Lucene 1.9 and look up any methods used in the examples that are no longer part of Lucene. The 1.9 javadocs will have a clear deprecation message explaining how to get the same effect using the 2.x methods.

How is Lucene's indexing and search performance measured?

Check Lucene bench: https://home.apache.org/~mikemccand/lucenebench/

I am having a performance issue. How do I ask for help on the java-user@lucene.apache.org mailing list?

...

The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.

No Format

// open the enumeration outside the try block so it is in scope for the finally clause
TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
try
{
    while ("FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...

        if (!terms.next())
            break;
    }
}
finally
{
    terms.close();
}
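Note the snippet above uses the pre-4.0 TermEnum API. On recent Lucene versions (8+) the equivalent is roughly the following sketch, assuming indexReader is an open IndexReader:

No Format

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// terms within a field are enumerated in sorted order
Terms terms = MultiTerms.getTerms(indexReader, "FIELD-NAME-HERE");
if (terms != null) {
    TermsEnum it = terms.iterator();
    for (BytesRef term = it.next(); term != null; term = it.next()) {
        // ... collect term.utf8ToString() ...
    }
}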

...

One Explanation...

No Format

  > Does anyone have an example of limiting results returned based on a
  > score threshold? For example if I'm only interested in documents with
  > a score > 0.05.

I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches).  The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.
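If you still want a relative cutoff, one approach consistent with the advice above is to threshold on the ratio of each hit's score to the top score, rather than on absolute scores. A sketch (searcher and query are assumed to exist; the 0.5 ratio and the 100-hit window are arbitrary illustrative values):

No Format

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

TopDocs topDocs = searcher.search(query, 100);
if (topDocs.scoreDocs.length > 0) {
    float best = topDocs.scoreDocs[0].score;
    for (ScoreDoc sd : topDocs.scoreDocs) {
        // keep only hits scoring at least half as well as the best hit
        if (sd.score / best >= 0.5f) {
            // ... keep this hit ...
        }
    }
}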

...

See ImproveSearchingSpeed.

Does Lucene support Approximate Nearest Neighbor (ANN) / k-Nearest Neighbors (k-NN) search?

Yes, Lucene 9 and greater support ANN / kNN search.
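A minimal sketch using the float-vector API available since Lucene 9.4 (the field name, vectors, and in-memory directory are illustrative; real vectors would come from an embedding model):

No Format

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
    Document doc = new Document();
    // index a per-document float vector for kNN search
    doc.add(new KnnFloatVectorField("embedding",
        new float[] {0.1f, 0.2f, 0.3f}, VectorSimilarityFunction.COSINE));
    writer.addDocument(doc);
}
try (DirectoryReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    // retrieve the 10 approximate nearest neighbors of the query vector
    TopDocs hits = searcher.search(
        new KnnFloatVectorQuery("embedding", new float[] {0.1f, 0.2f, 0.3f}, 10), 10);
}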

Does Lucene support auto-suggest / autocomplete?

Yes, see https://lucene.apache.org/core/9_4_2/suggest/index.html

What is the default relevance / similarity implementation of Lucene?

The default implementation for determining the relevance of documents for a particular query is based on BM25.

Also see IndexSearcher#getDefaultSimilarity()
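For illustration, a short sketch (current API; assumes searcher is an open IndexSearcher) showing the default similarity and how to swap in a differently parameterized BM25:

No Format

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;

Similarity def = IndexSearcher.getDefaultSimilarity();  // a BM25Similarity instance
// k1 and b are BM25's free parameters; 1.2 and 0.75 are the defaults
searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));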

Indexing

Can I use Lucene to crawl my site or other sites on the Internet?

No. Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching, and does that very well. However, several crawlers are available which you could use: see the list of Open Source Crawlers in Java. regain is an Open Source tool that crawls web sites, stores them in a Lucene index and offers a search web interface. Also see Nutch for a powerful Lucene-based search engine.

...

What is the difference between Stored, Tokenized, Indexed, and Vector?

  • Stored = as-is value stored in the Lucene index
  • Tokenized = field is analyzed using the specified Analyzer - the tokens emitted are indexed
  • Indexed = the text (either as-is with keyword fields, or the tokens from tokenized fields) is made searchable (aka inverted)
  • Vectored = term frequency per document is stored in the index in an easily retrievable fashion.
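In the current field API these options are expressed through the field classes and FieldType. A sketch (field names and values are illustrative):

No Format

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
// stored as-is and indexed without tokenization (keyword-style)
doc.add(new StringField("id", "doc-42", Field.Store.YES));
// analyzed/tokenized and indexed, but not stored
doc.add(new TextField("body", "some text to analyze", Field.Store.NO));
// tokenized and indexed, with term vectors stored per document
FieldType vectored = new FieldType(TextField.TYPE_NOT_STORED);
vectored.setStoreTermVectors(true);
doc.add(new Field("body_tv", "some text to analyze", vectored));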

What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document?

No, there will be multiple copies of the same document in the index.
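If you want add-or-replace semantics instead, IndexWriter.updateDocument(Term, Document) first deletes any document containing the given term and then adds the new one, atomically. A sketch, assuming writer is an open IndexWriter and "id" is an illustrative unique-key field:

No Format

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;

Document doc = new Document();
doc.add(new StringField("id", "doc-42", Field.Store.YES));
// deletes any existing document whose "id" field matches, then adds the new document
writer.updateDocument(new Term("id", "doc-42"), doc);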

How do I delete documents from the index?

IndexWriter allows you to delete by Term or by Query. The deletes are buffered and then periodically flushed to the index, and made visible once commit() or close() is called.

IndexReader can also delete documents, by Term or document number, but you must close any open IndexWriter before using IndexReader to make changes (and, vice/versa). IndexReader also buffers the deletions and does not write changes to the index until close() is called, but if you use that same IndexReader for searching, the buffered deletions will immediately take effect. Unlike IndexWriter's delete methods, IndexReader's methods return the number of documents that were deleted.

Generally it's best to use IndexWriter for deletions, unless 1) you must delete by document number, 2) you need your searches to immediately reflect the deletions or 3) you must know how many documents were deleted for a given deleteDocuments invocation.

If you would like to delete documents by document number, IndexWriter provides tryDeleteDocument. Note however that this method only succeeds if the segment containing that document has not been merged away. It is generally preferred to use a primary key field that holds a unique ID string for each document, and to delete a single document by creating the Term containing that ID and passing it to IndexWriter's deleteDocuments(Term) method.

Once a document is deleted it will not appear in TermDocs nor TermPositions enumerations, nor in any search results. Attempts to load the document will result in an exception. The presence of this document may still be reflected in the docFreq statistics, and thus alter search scores, though this will be corrected eventually as segments containing deletions are merged.
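For illustration, a minimal delete-by-Term sketch, again assuming writer is an open IndexWriter and "id" is an illustrative unique-key field:

No Format

import org.apache.lucene.index.Term;

// buffered; the delete becomes visible to new readers once commit() or close() is called
writer.deleteDocuments(new Term("id", "doc-42"));
writer.commit();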

...

No, creating the IndexWriter with "true" should remove all old files in the old index (actually with Lucene < 1.9 it removes all files in the index directory, whether or not they belong to Lucene).

When I upgrade Lucene, for example from 8.8.2 to 9.0.0, do I have to reindex?

Not necessarily: you can add a version-specific lucene-backward-codecs library, for example lucene-backward-codecs-9.0.0.jar will enable Lucene 9 to read an index created by the previous major version, Lucene 8.

It is nevertheless recommended to reindex, because a Lucene 8.x index may not be readable by an eventual Lucene 10 release.

Certain changes always require reindexing, though; also see https://solr.apache.org/guide/8_0/reindexing.html

How can I index and search digits and other non-alphabetic characters?

The components responsible for this are the various Analyzers. Make sure you use the appropriate analyzer. For example, StandardAnalyzer does not remove numbers, but it removes most punctuation.
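A quick way to check what a given analyzer does with digits and punctuation is to iterate the tokens it emits. A sketch with StandardAnalyzer (current API; the field name and sample text are arbitrary):

No Format

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (Analyzer analyzer = new StandardAnalyzer();
     TokenStream ts = analyzer.tokenStream("f", "model X-1000 costs 42.50")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString());  // digits are kept; most punctuation is dropped
    }
    ts.end();
}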


Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread safe?

Yes, the IndexWriter.addIndexes(Directory[]) method is thread safe (it is a synchronized method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. Actually it's impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception when trying to create the lock file.


When is it possible for document IDs to change?

...

Here is an example:

No Format

public class MyAnalyzer extends ReusableAnalyzerBase {
  private final Version matchVersion;

  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // split on whitespace, lowercase each token, then drop tokens shorter than 3 characters
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream sink = new LowerCaseFilter(matchVersion, source);
    sink = new LengthFilter(sink, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, sink);
  }
}

...

If you want your custom token modification to come after the filters that Lucene's StandardAnalyzer class would normally call, do the following:

No Format

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream sink = new StandardFilter(matchVersion, source);
    sink = new LowerCaseFilter(matchVersion, sink);
    sink = new StopFilter(matchVersion, sink,
                          StopAnalyzer.ENGLISH_STOP_WORDS_SET, false);
    // the custom filters are applied last, after the standard chain
    sink = new CaseNumberFilter(sink);
    sink = new NameFilter(sink);
    return new TokenStreamComponents(source, sink);
  }

...

Lucene only uses Java strings, so you normally do not need to care about this. Just remember that you may need to specify an encoding when you read in external strings from e.g. a file (otherwise the system's default encoding will be used). If you really need to repair a string that was decoded with the wrong encoding, you can use this hack:

No Format

// illustrative: recovers a string whose UTF-8 bytes were mistakenly decoded as ISO-8859-1;
// always name both charsets explicitly rather than relying on the platform default
String newStr = new String(someString.getBytes("ISO-8859-1"), "UTF-8");

...

In order to index XML documents you need to first parse them to extract the text that you want to index. Have a look at Tika, the content analysis toolkit. See also the article Parsing, indexing, and searching XML with Digester and Lucene.
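With Tika, text extraction can be as simple as the following sketch using Tika's facade class (the file name is illustrative):

No Format

import java.io.File;
import org.apache.tika.Tika;

Tika tika = new Tika();
// detects the format and extracts plain text suitable for indexing
String text = tika.parseToString(new File("document.xml"));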

How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft Word, Excel, PowerPoint, Visio, etc?

...

Note that the article uses an older version of Apache Lucene. For parsing the Java source files and extracting that information, the ASTParser of the Eclipse Java development tools is used.


What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?


When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.
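For reference, a minimal sketch of the Directory-based variant with the current API (paths are illustrative):

No Format

import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

try (Directory target = FSDirectory.open(Paths.get("/path/to/target"));
     Directory src1 = FSDirectory.open(Paths.get("/path/to/src1"));
     Directory src2 = FSDirectory.open(Paths.get("/path/to/src2"));
     IndexWriter writer = new IndexWriter(target, new IndexWriterConfig())) {
    // copies the segments of the source indexes into the target index
    writer.addIndexes(src1, src2);
}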

...