
...

If you are looking at example code (in an article or book, perhaps) and only need to understand how it would change to work with 2.0, without actually compiling it, you can review the Javadocs for Lucene 1.9 and look up any methods used in the example that are no longer part of Lucene. The 1.9 Javadocs have a clear deprecation message for each such method, explaining how to get the same effect with the 2.x methods.

How is Lucene's indexing and search performance measured?

Check Lucene bench: https://home.apache.org/~mikemccand/lucenebench/

I am having a performance issue. How do I ask for help on the java-user@lucene.apache.org mailing list?

...

The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.


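// Declared outside the try block so that it is still in scope in finally.
TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
try
{
    // terms.term() is null once the enumeration is exhausted.
    while (terms.term() != null && "FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...

        if (!terms.next())
            break;
    }
}
finally
{
    terms.close();
}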
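
The reason this works is the sort order, which can be illustrated without Lucene at all. The following is only a sketch in plain Java: `SortedTermsSketch`, `TermKey` and `textsForField` are hypothetical names invented here, and a `TreeSet` stands in for the index's term dictionary. It seeks to the first entry of a field and stops as soon as the field changes, just as the enumeration above does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a TreeSet stands in for the sorted term dictionary.
final class SortedTermsSketch {

    // Hypothetical (field, text) pair ordered by field first, then by text.
    static final class TermKey implements Comparable<TermKey> {
        final String field;
        final String text;

        TermKey(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(TermKey other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    // Seeks to (field, "") and collects texts until the field changes.
    static List<String> textsForField(TreeSet<TermKey> terms, String field) {
        List<String> texts = new ArrayList<>();
        for (TermKey term : terms.tailSet(new TermKey(field, ""), true)) {
            if (!term.field.equals(field)) {
                break; // all terms of one field are adjacent, so we are done
            }
            texts.add(term.text);
        }
        return texts;
    }
}
```

Because entries of one field are contiguous, the loop touches only that field's terms plus at most one term of the following field.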

...

One Explanation...


  > Does anyone have an example of limiting results returned based on a
  > score threshold? For example if I'm only interested in documents with
  > a score > 0.05.

I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches).  The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.
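
To make the "ratio to the highest score" idea concrete, here is a minimal sketch in plain Java. `RelativeScores` and `relativeToTop` are hypothetical names, not part of Lucene, and the input is assumed to be the raw scores of one result set. As the explanation above says, the ratios are only comparable within a single search and say nothing about the overall quality of the result set.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: rescales raw scores relative to the top hit.
final class RelativeScores {

    // Returns rawScores[i] / max(rawScores). If no score is positive,
    // the scores are copied unchanged to avoid dividing by zero.
    static float[] relativeToTop(float[] rawScores) {
        float top = 0f;
        for (float score : rawScores) {
            top = Math.max(top, score);
        }
        float[] relative = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            relative[i] = top > 0f ? rawScores[i] / top : rawScores[i];
        }
        return relative;
    }

    public static void main(String[] args) {
        float[] raw = {4.0f, 2.0f, 1.0f};
        System.out.println(Arrays.toString(relativeToTop(raw))); // prints [1.0, 0.5, 0.25]
    }
}
```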

...

Does Lucene support auto-suggest / autocomplete?

Yes, see https://lucene.apache.org/core/9_14_02/suggest/index.html

What is the default relevance / similarity implementation of Lucene?

The default implementation to determine the relevance of documents for a particular query is based on BM25.

Also see IndexSearcher#getDefaultSimilarity()
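
For readers unfamiliar with BM25, the following is a rough sketch of the textbook formula for a single query term, written in plain Java. `Bm25Sketch` and its methods are hypothetical names; the constants k1 = 1.2 and b = 0.75 match the defaults of Lucene's BM25Similarity, but Lucene's own implementation may differ in details such as how document lengths are encoded and which constant factors are applied, so do not expect identical numbers.

```java
// Textbook BM25 for one term; an illustration, not Lucene's actual code.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of length normalization

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score contribution of one term in one document.
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        // Documents longer than average are penalized, shorter ones boosted.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }
}
```

The score grows with term frequency but saturates, and for the same term frequency a longer document scores lower than a shorter one.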

Indexing

Can I use Lucene to crawl my site or other sites on the Internet?

...

Here is an example:


// Imports assume the Lucene 3.x package layout.
import java.io.Reader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends ReusableAnalyzerBase {
  private final Version matchVersion;

  public MyAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Split on whitespace, lowercase each token, then drop tokens shorter than 3 characters.
    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    TokenStream sink = new LowerCaseFilter(matchVersion, source);
    sink = new LengthFilter(sink, 3, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, sink);
  }
}

...

If you want your custom token modification to come after the filters that Lucene's StandardAnalyzer class would normally apply, do the following:


  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // The same chain that StandardAnalyzer builds ...
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream sink = new StandardFilter(matchVersion, source);
    sink = new LowerCaseFilter(matchVersion, sink);
    sink = new StopFilter(matchVersion, sink,
                          StopAnalyzer.ENGLISH_STOP_WORDS_SET, false);
    // ... followed by your own filters.
    sink = new CaseNumberFilter(sink);
    sink = new NameFilter(sink);
    return new TokenStreamComponents(source, sink);
  }

...

Lucene works only with Java strings, so you normally do not need to care about this. Just remember that you may need to specify an encoding when you read in external strings, e.g. from a file (otherwise the system's default encoding will be used). If you really need to recode a String, you can use this hack:


String newStr = new String(someString.getBytes("UTF-8"));
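
In most cases it is cleaner to name the encoding at the point where the external bytes are read, so that no recoding is needed afterwards. A minimal sketch in plain Java follows; `ExplicitCharsetRead` and `readAll` are hypothetical names, and the input is assumed to be UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a byte stream with an explicitly named charset
// instead of relying on the system's default encoding.
final class ExplicitCharsetRead {

    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the e: 5 bytes in UTF-8.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(text.length()); // prints 4
    }
}
```

For a file, the same helper could be handed a `FileInputStream`; the resulting String can then be passed to Lucene as is.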

...