Differences between revisions 8 and 9
Revision 8 as of 2007-07-30 17:44:22
Size: 6111
Comment:
Revision 9 as of 2009-09-20 21:48:05
Size: 6143
Editor: localhost
Comment: converted to 1.6 markup
Deletions are marked like this. Additions are marked like this.
Line 9: Line 9:
If it's not in the index, you can't query it. [http://www.getopt.org/luke/ Luke] If it's not in the index, you can't query it. [[http://www.getopt.org/luke/|Luke]]
Line 14: Line 14:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html Analyzer] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html|Analyzer]]
Line 18: Line 18:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html QueryParser] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html|QueryParser]]
Line 23: Line 23:
behavior. See: ["LuceneFAQ"], "Why am I getting no hits / incorrect hits?". behavior. See: [[LuceneFAQ]], "Why am I getting no hits / incorrect hits?".
Line 30: Line 30:
Rule of thumb: an average page of English text contains about 250 words. (Source: [http://answers.google.com/answers/threadview?id=608972 Google Answers].) This means only about 40 pages are indexed by default. If any of your documents are longer than this (and you want them indexed in full), you should raise the limit with [http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int) IndexWriter.setMaxFieldLength()]. Rule of thumb: an average page of English text contains about 250 words. (Source: [[http://answers.google.com/answers/threadview?id=608972|Google Answers]].) This means only about 40 pages are indexed by default. If any of your documents are longer than this (and you want them indexed in full), you should raise the limit with [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)|IndexWriter.setMaxFieldLength()]].
Line 34: Line 34:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html StandardAnalyzer] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html|StandardAnalyzer]]
Line 53: Line 53:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html IndexReader] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html|IndexReader]]
Line 55: Line 55:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/IndexSearcher.html IndexSearcher] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/IndexSearcher.html|IndexSearcher]]
Line 60: Line 60:
See also: ["LuceneFAQ"], "Is the IndexSearcher thread-safe?". See also: [[LuceneFAQ]], "Is the IndexSearcher thread-safe?".
Line 68: Line 68:
See: ["LuceneFAQ"], "How do I implement paging, i.e. showing result from 1-10, 11-20 etc?". See: [[LuceneFAQ]], "How do I implement paging, i.e. showing result from 1-10, 11-20 etc?".
Line 74: Line 74:
But if you really need to do this, see: ["LuceneFAQ"], "Can I cache search results with Lucene?".
See also: ["LuceneFAQ"], "Can Lucene do a "search within search", so that the second search is constrained by the results of the first query?".
But if you really need to do this, see: [[LuceneFAQ]], "Can I cache search results with Lucene?".
See also: [[LuceneFAQ]], "Can Lucene do a "search within search", so that the second search is constrained by the results of the first query?".
Line 80: Line 80:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RangeQuery.html RangeQuery] expands every term in the range to a boolean expression, [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RangeQuery.html|RangeQuery]] expands every term in the range to a boolean expression,
Line 84: Line 84:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RangeFilter.html RangeFilter] doesn't suffer from this limitation. [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/RangeFilter.html|RangeFilter]] doesn't suffer from this limitation.
Line 86: Line 86:
See: ["LuceneFAQ"], "Why am I getting a TooManyClauses exception?". See: [[LuceneFAQ]], "Why am I getting a TooManyClauses exception?".
Line 93: Line 93:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html QueryParser] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html|QueryParser]]
Line 95: Line 95:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Query.html Query] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Query.html|Query]]
Line 97: Line 97:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html QueryParser]. [[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html|QueryParser]].
Line 101: Line 101:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Query.html Query] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Query.html|Query]]
Line 108: Line 108:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html QueryParser] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html|QueryParser]]
Line 118: Line 118:
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html HitCollector] [[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html|HitCollector]]
Line 127: Line 127:
See: ["LuceneFAQ"], "Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet" or "fragment") to display along with the search result?". See: [[LuceneFAQ]], "Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet" or "fragment") to display along with the search result?".

A list of classic Lucene mistakes. While these solutions are all documented in one place or another, they won't suffer from additional repetition. See also HowTo for more targeted tips.

Indexing

Luke is your friend

Luke is an invaluable tool to learn what actually went in your index. If it's not in the index, you can't query it. Luke

Use the same analyzer for indexing and querying

Make sure you use the same Analyzer class when building your index and later querying against that index. Analysis dictates what goes into your index and how. The QueryParser needs this information to generate the proper queries.

Dissimilar or incompatible analyzers lead to mysterious search behavior. See: LuceneFAQ, "Why am I getting no hits / incorrect hits?".

Documents are truncated by default

The indexer by default truncates documents to IndexWriter.DEFAULT_MAX_FIELD_LENGTH or 10,000 terms in Lucene 2.0.

Rule of thumb: an average page of English text contains about 250 words. (Source: Google Answers.) This means only about 40 pages are indexed by default. If any of your documents are longer than this (and you want them indexed in full), you should raise the limit with IndexWriter.setMaxFieldLength().

Stopwords are removed

StandardAnalyzer (the most commonly recommended analyzer) does not index "stopwords". Stopwords are common English words such "the", "a", etc. -- the default list is StopAnalyzer.ENGLISH_STOP_WORDS. These words are completely ignored and cannot be searched for, at all. This means that even phrase queries aren't exact. E.g. the phrase query "to be or not to be" finds nothing at all.

Performance

Keep the index open

Especially in a Web application, where keeping state between requests requires additional work, it is tempting to open and close the index on every request. Unfortunately, this leads to very poor performance. At first this might work with small indexes or beefy hardware, but performance problems soon crop up -- e.g. large garbage collections.

You should keep the index open as long as possible. Both IndexReader and IndexSearcher are thread-safe and don't require additional synchronization. One could cache the index searcher e.g. in the application context.

See also: LuceneFAQ, "Is the IndexSearcher thread-safe?".

No need to cache search results

Lucene is amazingly fast at searching. Rather than caching hits and paging through them, merely re-executing the query is often fast enough.

See: LuceneFAQ, "How do I implement paging, i.e. showing result from 1-10, 11-20 etc?".

When doing sub-searches (searching the results of a previous search), it's easier (and fast enough) to add the first query as a conditional to the second.

But if you really need to do this, see: LuceneFAQ, "Can I cache search results with Lucene?". See also: LuceneFAQ, "Can Lucene do a "search within search", so that the second search is constrained by the results of the first query?".

Use RangeFilter instead of RangeQuery

RangeQuery expands every term in the range to a boolean expression, and easily blows past the built-in BooleanQuery.maxClauseCount limit (Lucene 2.0 defaults to about 1000).

RangeFilter doesn't suffer from this limitation.

See: LuceneFAQ, "Why am I getting a TooManyClauses exception?".

Querying

Edit the query rather than the string

Parsing free text is a surprisingly hard problem at which QueryParser does a pretty good job. Rather than editing the query string, change the Query objects returned by the QueryParser.

E.g. when offering additional search options in a Web form, it's easier and safer to combine them with the parsed Query object rather doing text manipulations on the query string.

Lucene is not a true boolean system

Or: apple AND banana OR orange doesn't work. Surprising at first. QueryParser does its best to translate from a boolean syntax to Lucene's own set-oriented queries, but it falls short. Either use parentheses everywhere or try to design your user interface accordingly.

See BooleanQuerySyntax.

Iterating over all hits takes a long time

This is by design. Try using a HitCollector instead if you need access to all the hits for a search.

Highlighting search results

The Lucene-contributed highlighter does the best it can, but it doesn't have the information it really needs to correctly highlight the search results.

See: LuceneFAQ, "Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet" or "fragment") to display along with the search result?".

LuceneCaveats (last edited 2009-09-20 21:48:05 by localhost)