Differences between revisions 158 and 159
Revision 158 as of 2015-10-26 19:29:50
Size: 52264
Comment: Link javadoc
Revision 159 as of 2016-08-11 21:54:08
Size: 24008
Comment: Remove content & add links to Ref Guide
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
{{{#!wiki important
This page exists for the Solr Community to share Tips, Tricks, and Advice about [[https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters|Analyzers, Tokenizers and Filters]].
 
Reference material previously located on this page has been migrated to the [[https://cwiki.apache.org/solr/|Official Solr Reference Guide]]. If you need help, please consult the Reference Guide for the version of Solr you are using. The sections below will point to corresponding sections of the Reference Guide for each specific feature.
 
If you'd like to share information about how you use this feature, please [[FrontPage#How_to_edit_this_Wiki|add it to this page]].
/* cwikimigrated */
}}}
Line 2: Line 11:
When a document is indexed, its individual fields are subject to the analyzing and tokenizing filters that can transform and normalize the data in the fields: for example, removing blank spaces, removing HTML code, stemming, or replacing a particular character with another. You may need to do some of the above or similar operations at indexing time as well as at query time. For example, you might perform a [[http://en.wikipedia.org/wiki/Soundex|Soundex]] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.

The lists below provide an overview of '''''some''''' of the more heavily used Tokenizers and !TokenFilters provided by Solr "out of the box" along with tips/examples of using them. '''This list should by no means be considered the "complete" list of all Analysis classes available in Solr!''' In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a [[SolrPlugins|Plugin]].

Analyzers, per field type, are configured in the [[SchemaXml|Solr Schema]].

For a more complete list of what Tokenizers and !TokenFilters come out of the box, please consult the [[http://lucene.apache.org/core/4_7_0/index.html|Lucene javadocs]], [[http://lucene.apache.org/solr/4_7_0/|Solr javadocs]], and [[http://www.solr-start.com/info/analyzers/|Automatically generated list at solr-start.com]]. Please look at analyzer-*. There are quite a few. If you have any tips/tricks you'd like to mention about using any of these classes, please add them below.

For information about some language-specific Tokenizers and !TokenFilters available in Solr, please consult LanguageAnalysis.

For a complete list of what Tokenizers and !TokenFilters come out of the box, please consult the [[http://lucene.apache.org/core/6_2_0/index.html|Lucene javadocs]], [[http://lucene.apache.org/solr/6_2_0/|Solr javadocs]], and [[http://www.solr-start.com/info/analyzers/|Automatically generated list at solr-start.com]]. Please look at analyzer-*. There are quite a few. If you have any tips/tricks you'd like to mention about using any of these classes, please add them below.
Line 17: Line 19:
Try searches for "analyzer", "token", and "stemming". (TODO: update for LIA 2.)
Line 23: Line 23:
There are different types of [[http://en.wikipedia.org/wiki/Stemming|stemming]] strategies. The ones supported by Lucene/Solr all use stemming by reduction, and must thus be applied on both index- and query side:

 * [[http://tartarus.org/~martin/PorterStemmer/|Porter]] is a transforming algorithm for English language that reduces any of the forms of a word such as "walks, walking, walked", to its elemental root e.g., "walk". Porter is rules based and does not need a dictionary.
 * [[AnalyzersTokenizersTokenFilters/Kstem|KStem]], a less aggressive alternative to Porter for the English language.
 * [[LanguageAnalysis#Notes_about_solr.SnowballPorterFilterFactory|Snowball]] provides stemming for several languages, including two implementations of the Porter algorithm. Snowball is a small string processing language designed for creating stemming algorithms.
 * [[Hunspell]] provides stemming for all languages that have !OpenOffice spellcheck dictionaries. Being dictionary based, it requires high quality and well maintained dictionaries to work well for stemming - in which case it may give more precise stemming than the Snowball algorithms.

A related technology to stemming is [[http://en.wikipedia.org/wiki/Lemmatisation|lemmatization]], which allows for "stemming" by expansion, taking a root word and 'expanding' it to all of its various forms. Lemmatization can be used ''either'' at insertion time ''or'' at query time. Lucene/Solr does not have built-in support for lemmatization but it can be simulated by using your own dictionaries and the [[#SynonymFilter|SynonymFilterFactory]].
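
A minimal sketch of simulated lemmatization, assuming a hand-built dictionary file (`lemmas.txt` is a hypothetical name) that lists each group of word forms on one line:

{{{
<fieldType name="text_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand each word to all of its listed forms at index time -->
    <filter class="solr.SynonymFilterFactory" synonyms="lemmas.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
}}}
where `lemmas.txt` contains lines such as `run, runs, running, ran`.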

See LanguageAnalysis for details about stemming for various languages.

<!> [[Solr4.3]]

A repeated question is "how can I have the original term contribute more to the score than the stemmed version?" In Solr 4.3, the KeywordRepeatFilterFactory has been added to support this. This filter emits two tokens for each input token, one of which is marked with the Keyword attribute. Stemmers that respect the keyword attribute will pass such tokens through unchanged, so the effect of this filter is to index both the original word and the stemmed version. The four stemmers listed above all respect the keyword attribute.

For terms that are not changed by stemming, this will result in duplicate, identical tokens in the document. This can be alleviated by adding the RemoveDuplicatesTokenFilterFactory.

{{{
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.KeywordRepeatFilterFactory"/>
   <filter class="solr.PorterStemFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>
}}}

Individual Solr stemmers are documented in the Solr Reference Guide section [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions|Filter Descriptions]].
Line 51: Line 27:
Analyzers are components that pre-process input text at index time and/or at search time. It's important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.

On wildcard and fuzzy searches, no text analysis is performed on the search word.

Most Solr users define custom Analyzers for their text field types consisting of zero or more Char Filter Factories, one Tokenizer Factory, and zero or more Token Filter Factories; but it is also possible to configure a field type to use a concrete Analyzer implementation.

The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase when a configuration based analyzer is used.
Analyzers are documented in the Solr Reference Guide section [[https://cwiki.apache.org/confluence/display/solr/Analyzers|Analyzers]].
Line 60: Line 30:
<!> [[Solr1.4]]

A Char Filter is a component that pre-processes input characters (consuming and producing a character stream) and can add, change, or remove characters while preserving character position information.

Char Filters can be chained.
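
For example, a sketch chaining two Char Filters ahead of the Tokenizer (the same combination appears in the schema example later on this page); HTML is stripped first, then accented characters are mapped:

{{{
<analyzer>
  <!-- char filters run in the order listed, before the tokenizer -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
}}}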
!CharFilters are documented in the Solr Reference Guide section [[https://cwiki.apache.org/confluence/display/solr/CharFilterFactories|CharFilterFactories]].
Line 67: Line 33:
A Tokenizer splits a stream of characters (from each individual field value) into a series of tokens.

There can be only one Tokenizer in each Analyzer.
Tokenizers are documented in the Solr Reference Guide section [[https://cwiki.apache.org/confluence/display/solr/Tokenizers|Tokenizers]].
Line 72: Line 36:
Tokens produced by the Tokenizer are passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream. Token Filters are documented in the Solr Reference Guide section [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions|Filter Descriptions]].
Line 75: Line 39:
A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only field types of `solr.TextField` will have Analyzers explicitly specified in the schema):

 1. Specifying the '''class name''' of an Analyzer — This can be anything extending org.apache.lucene.analysis.Analyzer which has either a default constructor, or a single argument constructor taking a Lucene "Version" object <<BR>> Example: <<BR>>
 {{{
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
}}}
 1. Specifying a '''!TokenizerFactory''' followed by a list of optional !TokenFilterFactories that are applied in the listed order. Factories that can create the tokenizers or token filters are used to prepare configuration for the tokenizer or filter and avoid the overhead of creation via reflection. <<BR>> Example: <<BR>>
 {{{
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
}}}

Any Analyzer, !CharFilterFactory, !TokenizerFactory, or !TokenFilterFactory may be specified using its full class name with package -- just make sure they are in Solr's classpath when you start your appserver. Classes in the `org.apache.solr.analysis.*` package can be referenced using the short alias `solr.*`.
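
For example, these two declarations are equivalent:

{{{
<!-- short alias -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- fully qualified class name -->
<filter class="org.apache.solr.analysis.LowerCaseFilterFactory"/>
}}}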
Line 108: Line 49:
As of Solr 4.0 !BaseTokenFilterFactory has been renamed to !TokenFilterFactory and moved to the package `org.apache.lucene.analysis.util`.
Line 111: Line 51:
There are several pairs of !CharFilters and !TokenFilters that have related (ie: !MappingCharFilter and !ASCIIFoldingFilter) or nearly identical functionality (ie: !PatternReplaceCharFilterFactory and !PatternReplaceFilterFactory) and it may not always be obvious which is the best choice.
Line 121: Line 61:
<!> [[Solr1.4]]
Line 124: Line 63:
Creates `org.apache.lucene.analysis.MappingCharFilter`. Documentation at [[https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.MappingCharFilterFactory|MappingCharFilterFactory]].
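
A minimal sketch of usage; each line of the mapping file is a `"source" => "target"` rule (the rules shown are illustrative):

{{{
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
}}}
with mapping file rules such as:
{{{
# map accented characters to unaccented equivalents
"á" => "a"
"é" => "e"
}}}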
Line 127: Line 66:
Creates `org.apache.solr.analysis.PatternReplaceCharFilter`. Applies a regex pattern to the string in the character stream, replacing matching occurrences with the specified replacement string.

Example:

{{{
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-z])" replacement=""/>
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory|PatternReplaceCharFilterFactory]].
Line 135: Line 69:
Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML from the input stream and passes the result to either another `CharFilter` or the `Tokenizer`. Like other CharFilters, it's specified using a <charFilter> tag, and must come before the <tokenizer>. An example:

{{{
<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
}}}
HTML stripping features:

 * The input need not be an HTML document as only constructs that look like HTML will be removed.
 * Removes HTML/XML tags while keeping the content
  * Attributes within tags are also removed, and attribute quoting is optional.
 * Removes XML processing instructions: {{{<?foo bar?>}}}
 * Removes XML comments
 * Removes XML elements starting with {{{<!}}} and ending with {{{>}}}
 * Removes contents of {{{<script>}}} and {{{<style>}}} elements.
  * Handles XML comments inside these elements (normal comment processing won't always work)
  * Replaces numeric character entity references like {{{&#65;}}} or {{{&#x7f;}}}
   * The terminating '`;`' is optional if the entity reference is followed by whitespace.
  * Replaces all [[http://www.w3.org/TR/REC-html40/sgml/entities.html|named character entity references]].
   * {{{&nbsp;}}} is replaced with a space instead of the non-breaking space character {{{\u00A0}}}
   * terminating '`;`' is mandatory to avoid false matches on something like "`Alpha&Omega Corp`"

HTML stripping examples:
||{{{my <a href="www.foo.bar">link</a> }}} ||`my link ` ||
||{{{<br>hello<!--comment--> }}} ||`hello ` ||
||{{{hello<script><!-- f('<!--internal--></script>'); --></script> }}} ||`hello ` ||
||{{{if a<b then print a; }}} ||`if a<b then print a; ` ||
||{{{hello <td height=22 nowrap align="left"> }}} ||`hello ` ||
||{{{a<b &#65; Alpha&Omega O}}} ||`a<b A Alpha&Omega O ` ||
||{{{M&eacute;xico}}} ||`México` ||


Documentation at [[https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.HTMLStripCharFilterFactory|HTMLStripCharFilterFactory]].
Line 175: Line 75:
Creates `org.apache.lucene.analysis.core.KeywordTokenizer`.

Treats the entire field as a single token, regardless of its content.

 . Example: `"http://example.com/I-am+example?Text=-Hello" ==> "http://example.com/I-am+example?Text=-Hello"`
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-KeywordTokenizer|Keyword Tokenizer]].
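
A sketch of a field type for exact-match identifiers built on this tokenizer (the field type name is illustrative):

{{{
<fieldType name="exact_id" class="solr.TextField">
  <analyzer>
    <!-- the entire field value becomes a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
}}}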
Line 182: Line 78:
Creates `org.apache.lucene.analysis.LetterTokenizer`.

Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.

 . Example: `"I can't" ==> "I", "can", "t"`

<<Anchor(WhitespaceTokenizer)>>
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-LetterTokenizer|Letter Tokenizer]].
Line 191: Line 81:
Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.

Creates tokens of characters separated by splitting on whitespace.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-WhiteSpaceTokenizer|White Space Tokenizer]].
Line 196: Line 84:
Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.

Creates tokens by lowercasing all letters and dropping non-letters.

 . Example: `"I can't" ==> "i", "can", "t"`
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-LowerCaseTokenizer|Lower Case Tokenizer]].
Line 203: Line 87:
Line 205: Line 88:
Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are aware of the same token types. There aren't any filters that use !StandardTokenizer's types.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-StandardTokenizer|Standard Tokenizer]].
Line 213: Line 95:
||'''arg''' ||'''default value''' ||'''note''' ||
||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than this are silently ignored. ||
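
For example, to accept longer tokens (a sketch; the value shown is illustrative):

{{{
<tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="511"/>
}}}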



Line 220: Line 96:
Line 222: Line 97:
<!> [[Solr3.1]]

Creates `org.apache.lucene.analysis.standard.ClassicTokenizer`.

This tokenizer preserves !StandardTokenizer's behavior pre-Solr 3.1: A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are aware of the same token types. The !ClassicFilter (formerly known as !StandardFilter) is currently the only Lucene filter that utilizes these token types.

Some token types are number, alphanumeric, email, acronym, URL, etc.

 . Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`

||'''arg''' ||'''default value''' ||'''note''' ||
||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ClassicTokenizer|Classic Tokenizer]].
Line 237: Line 100:
Line 239: Line 101:
<!> [[Solr3.1]]

Creates `org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer`.

Like !StandardTokenizer, this tokenizer implements the word boundary rules from [[http://unicode.org/reports/tr29/#Word_Boundaries|Unicode standard annex UAX#29]]. In addition, this tokenizer recognizes: full URLs using the `file://`, `http(s)://`, and `ftp://` schemes; hostnames with a registered TLD (top level domain, e.g. ".com"); IPv4 and IPv6 addresses; and e-mail addresses.

In addition to the token types output by !StandardTokenizer from [[Solr3.1]] onward, UAX29URLEmailTokenizer can also output `<URL>` and `<EMAIL>` token types.

 . Example: `"Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"`
 . `==> ALPHANUM:"Visit", URL:"http://accarol.com/contact.htm?from=external&a=10", ALPHANUM:"or", ALPHANUM:"e-mail" EMAIL:"bob.cratchet@accarol.com"`

||'''arg''' ||'''default value''' ||'''note''' ||
||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-UAX29URLEmailTokenizer|UAX29 URL Email Tokenizer]].
Line 255: Line 104:
Breaks text at the specified regular expression pattern.

For example, suppose you have a list of terms delimited by a semicolon and zero or more spaces: `mice; kittens; dogs`.

{{{
   <fieldType name="semicolonDelimited" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*" />
      </analyzer>
   </fieldType>
}}}
See the [[https://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html|javadoc]] for details.

=== solr.PathHierarchyTokenizerFactory ===
<!> [[Solr3.1]] Outputs file path hierarchies as synonyms.
||'''Input String''' ||'''Output Tokens''' ||'''Position Inc''' ||
||/usr/local/apache ||/usr<<BR>>/usr/local<<BR>>/usr/local/apache ||1<<BR>>0<<BR>>0 ||
||c:\usr\local\apache<<BR>>(w/ delimiter="\" replace="/") ||c:<<BR>>c:/usr<<BR>>c:/usr/local<<BR>>c:/usr/local/apache ||1<<BR>>0<<BR>>0<<BR>>0 ||




{{{
  <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
    </analyzer>
  </fieldType>
}}}
From Solr 3.2, it also supports ''reverse'' and ''skip'' parameters.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-RegularExpressionPatternTokenizer|Regular Expression Pattern Tokenizer]].
Line 287: Line 107:
<!> [[Solr3.1]] Uses [[http://site.icu-project.org/|ICU]]'s text bounds capabilities to tokenize text.

This tokenizer first identifies the writing system "Script" for runs of text within the document. Then, it tokenizes the text according to rules or dictionaries depending upon the writing system. For example, if it encounters Thai, it will apply dictionary-based segmentation to split the Thai text (Thai uses no spaces between words).
||'''Input String''' ||'''Output Tokens''' ||'''Script Attribute''' ||
||Testing บริษัทชื่อ נאסק"ר ||Testing<<BR>>บริษัท<<BR>>ชื่อ<<BR>>נאסק"ר ||Latin<<BR>>Thai<<BR>>Thai<<BR>>Hebrew ||




{{{
    <fieldType name="text_icu" class="solr.TextField" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
      </analyzer>
    </fieldType>
}}}
Note: to use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer|ICU Tokenizer]].
Line 306: Line 110:
Overall documented at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions|Filter Descriptions]].
Line 307: Line 113:
Line 309: Line 114:
Creates `org.apache.lucene.analysis.standard.ClassicFilter`.

Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens produced by !ClassicTokenizer or equivalent.

 . Example of !ClassicTokenizer followed by !ClassicFilter:
  . `"I.B.M. cat's can't" ==> "IBM", "cat", "can't"`
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ClassicFilter|Classic Filter]].
Line 317: Line 117:
Line 326: Line 125:
Line 328: Line 126:
Creates `org.apache.lucene.analysis.LowerCaseFilter`.

Lowercases the letters in each token. Leaves non-letter tokens alone.

 . Example: `"I.B.M.", "Solr" ==> "i.b.m.", "solr"`.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-LowerCaseFilter|Lower Case Filter]].
Line 335: Line 129:
Line 337: Line 130:
Creates `org.apache.lucene.analysis.core.TypeTokenFilter`.

Blacklists or whitelists a specified list of token types; tokens may have "type" metadata associated with them. For example, the UAX29URLEmailTokenizer emits "<URL>" and "<EMAIL>" typed tokens, as well as other types. To pull out only e-mail addresses from text as tokens, this definition will do the trick:

{{{
    <fieldType name="emails" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
        <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="true"/>
      </analyzer>
    </fieldType>
}}}
where the email_type.txt file contains just `<EMAIL>`. `useWhitelist` defaults to false, in which case the filter operates as a blacklist.
Documented at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TypeTokenFilter|Type Token Filter]].
Line 352: Line 133:
Line 354: Line 134:
<!> [[Solr1.2]]

Creates `org.apache.solr.analysis.TrimFilter`.

Trims whitespace at either end of a token.

 . Example: `" Kittens! ", "Duck" ==> "Kittens!", "Duck"`.

Optionally, the "updateOffsets" attribute will update the start and end position offsets. (/!\ removed in Solr 4.4)
Documented at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TrimFilter|Trim Filter]].
Line 365: Line 137:
Line 379: Line 150:
Line 398: Line 168:
Like the !PatternReplaceCharFilterFactory, but operates post-tokenization. See "When to use a Char Filter vs. a Token Filter" above. Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PatternReplaceFilter|Pattern Replace Filter]].
Line 401: Line 171:
Line 403: Line 172:
Creates `org.apache.lucene.analysis.StopFilter`.

Discards common words.

A customized stop word list may be specified with the "words" attribute in the schema. Optionally, the "ignoreCase" attribute may be used to ignore the case of tokens when comparing to the stopword list.

{{{
<fieldtype name="teststop" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>
}}}
The default English stop words are in the following list. This is the format of the file referenced by the "words" attribute for this filter:

{{{
#Standard english stop words taken from Lucene's StopAnalyzer
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
s
such
t
that
the
their
then
there
these
they
this
to
was
will
with
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-StopFilter|Stop Filter]].
Line 458: Line 175:
Line 460: Line 176:
Creates `org.apache.solr.analysis.CommonGramsFilter`. <!> [[Solr1.4]]

Makes shingles (i.e. the_cat) by combining common tokens (usually the same as the stop words list) and regular tokens. !CommonGramsFilter is useful for issuing phrase queries (i.e. "the cat") that contain stop words. Normally phrases containing stop words would not match their intended target; instead, the query "the cat" would match all documents containing "cat", which can be undesirable behavior. Phrase query slop (e.g. "the cat"~2) will not function as intended because common grams are indexed as shingled tokens that are adjacent to each other (i.e. the_cat is indexed as a single term). The !CommonGramsQueryFilter converts the phrase query "the cat" into the single term query the_cat.

A customized common word list may be specified with the "words" attribute in the schema. Optionally, the "ignoreCase" attribute may be used to ignore the case of tokens when comparing to the common words list.

{{{
<fieldtype name="testcommongrams" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-CommonGramsFilter|Common Grams Filter]].
Line 475: Line 179:
Line 477: Line 180:
Creates `org.apache.solr.analysis.EdgeNGramTokenFilter`.

By default, creates n-grams from the beginning edge of an input token.

For example, with minGramSize="2" and a sufficiently large maxGramSize, the string value '''Nigerian''' gets broken down into the following terms:

Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"

By default, minGramSize is 1, maxGramSize is 1 and side is "front". You can also set side to "back" to generate the ngrams from right to left.

minGramSize - the minimum number of characters to start with. For example, minGramSize=4 would mean that a word like '''Apache''' produces the 3 output tokens "Apac", "Apach", and "Apache".
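
A sketch of a declaration combining these parameters (values are illustrative; the same pattern appears in the Edge N-Gram example later on this page):

{{{
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
}}}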
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter|Edge N-Gram Filter]].
Line 502: Line 195:
Line 503: Line 197:
Line 505: Line 198:
Creates `org.apache.solr.analysis.KeepWordFilter`. <!> [[Solr1.3]]

Keeps only words that are on a list. This is the inverse behavior of !StopFilterFactory. The word file format is identical.

{{{
<fieldtype name="testkeep" class="solr.TextField">
   <analyzer>
     <!-- an analyzer needs a tokenizer; whitespace is used here for illustration -->
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
   </analyzer>
</fieldtype>
}}}
<<Anchor(LengthFilter)>>

=== solr.LengthFilterFactory ===
Creates `solr.LengthFilter`.

Filters out those tokens '''not''' having length min through max inclusive.

{{{
<fieldtype name="lengthfilt" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="5" />
  </analyzer>
</fieldtype>
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-KeepWordFilter|Keep Word Filter]].
Line 532: Line 201:
Line 534: Line 202:
Creates `solr.analysis.WordDelimiterFilter`.

Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:

 * split on intra-word delimiters (all non alpha-numeric characters).
  * `"Wi-Fi" -> "Wi", "Fi"`
 * split on case transitions (can be turned off - see splitOnCaseChange parameter)
  * `"PowerShot" -> "Power", "Shot"`
 * split on letter-number transitions (can be turned off - see splitOnNumerics parameter)
  * `"SD500" -> "SD", "500"`
 * leading and trailing intra-word delimiters on each subword are ignored
  * `"//hello---there, 'dude'" -> "hello", "there", "dude"`
 * trailing "'s" are removed for each subword (can be turned off - see stemEnglishPossessive parameter)
  * `"O'Neil's" -> "O", "Neil"`
   * Note: this step isn't performed in a separate filter because of possible subword combinations.

Splitting is affected by the following parameters:

 * '''splitOnCaseChange="1"''' causes lowercase => uppercase transitions to generate a new part [Solr 1.3]:
  * `"PowerShot" => "Power" "Shot"`
  * `"TransAM" => "Trans" "AM"`
  * default is true ("1"); set to 0 to turn off
 * '''splitOnNumerics="1"''' causes alphabet => number transitions to generate a new part [Solr 1.3]:
  * `"j2se" => "j" "2" "se"`
  * default is true ("1"); set to 0 to turn off
 * '''stemEnglishPossessive="1"''' causes trailing "'s" to be removed for each subword.
  * `"Doug's" => "Doug"`
  * default is true ("1"); set to 0 to turn off

Note that this is the default behaviour in all released versions of Solr.

There are also a number of parameters that affect what tokens are present in the final output and if subwords are combined:

 * '''generateWordParts="1"''' causes parts of words to be generated:
  * `"PowerShot" => "Power" "Shot"` (if `splitOnCaseChange=1`)
  * `"Power-Shot" => "Power" "Shot"`
  * default is 1
 * '''generateNumberParts="1"''' causes number subwords to be generated:
  * `"500-42" => "500" "42"`
  * default is 1
 * '''catenateWords="1"''' causes maximum runs of word parts to be catenated:
  * `"wi-fi" => "wifi"`
  * default is 0
 * '''catenateNumbers="1"''' causes maximum runs of number parts to be catenated:
  * `"500-42" => "50042"`
  * default is 0
 * '''catenateAll="1"''' causes all subword parts to be catenated:
  * `"wi-fi-4000" => "wifi4000"`
  * default is 0
 * '''preserveOriginal="1"''' causes the original token to be indexed without modifications (in addition to the tokens produced due to other options)
  * default is 0
 * '''protected="protwords.txt"''' specifies a text file containing a list of words that should be protected and passed through unchanged.
  * default is empty (no protected words)
 * '''types="wdfftypes.txt"''' allows customized tokenization for this filter. The file should exist in the solr/conf directory, and entries are of the form (without quotes) "% => ALPHA" or "\u002C => DIGIT". Allowable types are: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM. [Solr3.1]
  * See SOLR-2059.

These parameters may be combined in any way.

 * Example of generateWordParts="1" and catenateWords="1":
  * `"PowerShot" -> 0:"Power", 1:"Shot" 1:"PowerShot"` <<BR>> (where 0,1,1 are token positions)
  * `"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"`
  * `"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"`

One use for !WordDelimiterFilter is to help match words with [[SolrRelevancyCookbook#IntraWordDelimiters|different delimiters]]. One way of doing so is to specify `generateWordParts="1" catenateWords="1"` in the analyzer used for indexing, and `generateWordParts="1"` in the analyzer used for querying. Given that the current !StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as !WhitespaceTokenizer).
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter|Word Delimiter Filter]].

Line 634: Line 241:
Creates `SynonymFilter`.

Matches strings of tokens and replaces them with other strings of tokens.

 * The '''synonyms''' parameter names an external file defining the synonyms.
 * If '''ignoreCase''' is true, matching will lowercase before checking equality.
 * If '''expand''' is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.
 * The optional '''tokenizerFactory''' parameter names a tokenizer factory class to analyze synonyms (see https://issues.apache.org/jira/browse/SOLR-319), which can help with the synonym+stemming problem described in http://search-lucene.com/m/hg9ri2mDvGk1 .

Example usage in schema:

{{{
    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>
}}}
Synonym file format:

{{{
# blank lines and lines starting with pound are comments.

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
#no explicit mapping. In this case the mapping behavior will
#be taken from the expand parameter in the schema. This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymFilter|Synonym Filter]].
Line 699: Line 257:
Line 701: Line 258:
Creates `org.apache.solr.analysis.RemoveDuplicatesTokenFilter`.

Filters out any tokens which are at the same logical position in the token stream as a previous token with the same text. This can arise in a number of situations depending on what the "upstream" token filters are -- notably when stemming synonyms with similar roots. It is useful to remove the duplicates to prevent `idf` inflation at index time, or `tf` inflation (in a !MultiPhraseQuery) at query time.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-RemoveDuplicatesTokenFilter|Remove Duplicates Token Filter]].
Line 706: Line 261:
Line 708: Line 262:
Creates `org.apache.lucene.analysis.ISOLatin1AccentFilter`.
Line 713: Line 265:
Line 715: Line 266:
Creates `org.apache.lucene.analysis.ASCIIFoldingFilter`.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

{{{
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
}}}
See the [[http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/ASCIIFoldingFilterFactory.html|ASCIIFoldingFilter Javadocs]] for more details.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ASCIIFoldingFilter|ASCII Folding Filter]].
Line 725: Line 269:
Line 727: Line 270:
<!> [[Solr1.2]]

Creates `org.apache.solr.analysis.PhoneticFilter`.

Uses [[http://jakarta.apache.org/commons/codec/|commons codec]] to generate phonetically similar tokens. This currently supports [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/package-summary.html|five methods]].
||'''arg''' ||'''value''' ||
||encoder ||one of: [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/DoubleMetaphone.html|DoubleMetaphone]], [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/Metaphone.html|Metaphone]], [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/Soundex.html|Soundex]], [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/RefinedSoundex.html|RefinedSoundex]], [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/Caverphone.html|Caverphone]] <!> [[Solr3.1]] ||
||inject ||true/false -- true will add tokens to the stream, false will replace the existing token ||
||maxCodeLength ||integer -- sets the maximum length of the code to be generated. Supported only for Metaphone and !DoubleMetaphone encodings ||




{{{
  <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter|Phonetic Filter]].


=== solr.DoubleMetaphoneFilterFactory ===
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-DoubleMetaphoneFilter|Double Metaphone Filter]].
Line 744: Line 277:

=== solr.DoubleMetaphoneFilterFactory ===
Creates `org.apache.solr.analysis.DoubleMetaphoneFilter`.

The advantage over PhoneticFilter is that this supports secondary Metaphone tokens.

{{{
  <filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/>
}}}
Line 754: Line 278:
<!> [[https://wiki.apache.org/solr/Solr3.6|Solr3.6]]

Creates `org.apache.solr.analysis.BeiderMorsePhoneticFilter`.

Uses [[http://jakarta.apache.org/commons/codec/|commons codec]] to generate phonetically similar tokens that are optimized for surnames that sound alike but have different spellings.
This is especially useful for Central European and Eastern European surnames. For example, one can use this filter factory to find documents that contain the surname "Kracovsky" when the original search term was "Crakowski", or vice versa. For more information, check out the paper about Beider-Morse Phonetic Matching (BMPM) at http://stevemorse.org/phonetics/bmpm.htm.

{{{
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-Beider-MorseFilter|Beider-Morse Filter]].

Line 764: Line 284:
Line 766: Line 285:
<!> [[Solr1.3]]

Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html|org.apache.lucene.analysis.shingle.ShingleFilter]].

A !ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
||'''arg''' ||'''default value''' ||'''note''' ||
||maxShingleSize ||2 || ||
||minShingleSize ||2 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-1740|SOLR-1740]] ||
||outputUnigrams ||true || ||
||outputUnigramsIfNoShingles ||false || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-744|SOLR-744]] ||
||tokenSeparator ||" " || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-1740|SOLR-1740]] ||




{{{
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
}}}
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ShingleFilter|Shingle Filter]].
Line 789: Line 290:
'''This filter was deprecated and removed from Lucene in 5.0'''
Line 800: Line 303:

Line 805: Line 306:
Line 819: Line 321:
Line 843: Line 346:
Line 846: Line 350:
<!> [[Solr1.4]]

A filter that reverses tokens to provide faster leading wildcard and prefix queries. Add this filter to the index analyzer, but not the query analyzer. The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*). To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text.

See the [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/ReversedWildcardFilterFactory.html|javadoc]] for more details, or the [[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup|example schema]].
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ReversedWildcardFilter|Reversed Wildcard Filter]].

Line 853: Line 356:
Line 855: Line 357:
<!> [[Solr3.1]]

A filter that lets one specify:

 1. A system collator associated with a locale, or
 1. A collator based on custom rules

This can be used for changing sort order for non-English languages as well as to modify the collation sequence for certain languages. You must use the same !CollationKeyFilter at both index-time and query-time for correct results. Also, the JVM vendor and version (including patch version) of the slave should be exactly the same as the master (or indexer) for consistent results.
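
A sketch of a field type for locale-aware sorting (the locale and strength values are illustrative):

{{{
<fieldType name="sort_en" class="solr.TextField">
  <analyzer>
    <!-- one token per field value, replaced by its collation key -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory" language="en" strength="primary"/>
  </analyzer>
</fieldType>
}}}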

Also see

 1. [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/CollationKeyFilterFactory.html|Javadocs]]
 1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/package-summary.html|Lucene 2.9.1 contrib-collation documentation]]
 1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/CollationKeyFilter.html|Lucene's CollationKeyFilter javadocs]]
 1. UnicodeCollation
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-CollationKeyFilter|Collation Key Filter]]. Also discussed in the section [[https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-UnicodeCollation|Unicode Collation]].
Line 872: Line 360:
<!> [[Solr3.1]] See [[https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-UnicodeCollation|Unicode Collation]].
Line 889: Line 377:
<!> [[Solr3.1]]

This filter normalizes text to a [[http://unicode.org/reports/tr15/|Unicode Normalization Form]].

{{{
    <fieldType name="normalized" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
      </analyzer>
    </fieldType>
}}}
These are the supported normalization forms:

{{{
NFC: name="nfc" mode="compose"
NFD: name="nfc" mode="decompose"
NFKC: name="nfkc" mode="compose"
NFKD: name="nfkc" mode="decompose"
NFKC_Casefold: name="nfkc_cf" mode="compose"
}}}
NFKC_Casefold (nfkc_cf) means applying the Unicode Case-Folding algorithm in conjunction with NFKC normalization. Unicode Case-Folding is more than lowercasing, e.g. it handles cases like ß/SS. Behind the scenes this is its own form (nfkc_cf), but both algorithms have been recursively computed across all of Unicode offline, so that it's an efficient single-pass algorithm. For practical purposes this means you can use this factory with nfkc_cf as a better substitute for the combined behavior of LowerCaseFilter and NFKC normalization.

If you want to do more advanced normalization (e.g. apply a filter to work only on a subset of Unicode), see the javadocs.

Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ICUNormalizer2Filter|ICU Normalizer 2 Filter]].
Line 917: Line 380:
<!> [[Solr3.1]]

This filter is a custom unicode normalization form that applies the foldings specified in [[http://www.unicode.org/reports/tr30/tr30-4.html|UTR#30]] in addition to NFKC_Casefold.

{{{
    <fieldType name="folded" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>
}}}
This means NFKC normalization, Unicode case folding, and search term folding (removing accents, etc.) have been recursively computed across all of Unicode offline, so that it's an efficient single pass through the string. For practical purposes this means you can use this factory as a better substitute for the combined behavior of ASCIIFoldingFilter, LowerCaseFilter, and ICUNormalizer2Filter.

Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ICUFoldingFilter|ICU Folding Filter]].
Line 934: Line 383:
<!> [[Solr3.1]]

This filter applies [[http://userguide.icu-project.org/transforms/general|ICU Transforms]] to text.

Currently the filter only supports System transforms (or compounds consisting of them); custom rulesets are not yet supported.

{{{
    <fieldType name="transformed" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      </analyzer>
    </fieldType>
}}}
You can see a list of the supported System transforms by going to [[http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html|this link]], clicking the drop-down, and scrolling down to System.

Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
Documentation at [[https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ICUTransformFilter|ICU Transform Filter]].

This page exists for the Solr Community to share Tips, Tricks, and Advice about Analyzers, Tokenizers and Filters.

Reference material previously located on this page has been migrated to the Official Solr Reference Guide. If you need help, please consult the Reference Guide for the version of Solr you are using. The sections below will point to corresponding sections of the Reference Guide for each specific feature.

If you'd like to share information about how you use this feature, please add it to this page.

Analyzers, Tokenizers, and Token Filters

For a complete list of what Tokenizers and TokenFilters come out of the box, please consult the Lucene javadocs, Solr javadocs, and Automatically generated list at solr-start.com. Please look at analyzer-*. There are quite a few. If you have any tips/tricks you'd like to mention about using any of these classes, please add them below.

Note: For a good background on Lucene Analysis, it's recommended that you read the following sections in Lucene In Action:

  • 1.5.3 : Analyzer
  • Chapter 4.0 through 4.7 at least

Contents

  1. Analyzers, Tokenizers, and Token Filters
  2. High Level Concepts
    1. Stemming
    2. Analyzers
      1. Char Filters
      2. Tokenizers
      3. Token Filters
      4. Specifying an Analyzer in the schema
    3. When To use a CharFilter vs a TokenFilter
  3. Notes On Specific Factories
    1. CharFilterFactories
      1. solr.MappingCharFilterFactory
      2. solr.PatternReplaceCharFilterFactory
      3. solr.HTMLStripCharFilterFactory
    2. TokenizerFactories
      1. solr.KeywordTokenizerFactory
      2. solr.LetterTokenizerFactory
      3. solr.WhitespaceTokenizerFactory
      4. solr.LowerCaseTokenizerFactory
      5. solr.StandardTokenizerFactory
      6. solr.ClassicTokenizerFactory
      7. solr.UAX29URLEmailTokenizerFactory
      8. solr.PatternTokenizerFactory
      9. solr.ICUTokenizerFactory
    3. TokenFilterFactories
      1. solr.ClassicFilterFactory
      2. solr.ApostropheFilterFactory
      3. solr.LowerCaseFilterFactory
      4. solr.TypeTokenFilterFactory
      5. solr.TrimFilterFactory
      6. solr.TruncateTokenFilterFactory
      7. solr.PatternCaptureGroupFilterFactory
      8. solr.PatternReplaceFilterFactory
      9. solr.StopFilterFactory
      10. solr.CommonGramsFilterFactory
      11. solr.EdgeNGramFilterFactory
      12. solr.KeepWordFilterFactory
      13. solr.WordDelimiterFilterFactory
      14. solr.SynonymFilterFactory
      15. solr.RemoveDuplicatesTokenFilterFactory
      16. solr.ISOLatin1AccentFilterFactory
      17. solr.ASCIIFoldingFilterFactory
      18. solr.PhoneticFilterFactory
      19. solr.DoubleMetaphoneFilterFactory
      20. solr.BeiderMorseFilterFactory
      21. solr.ShingleFilterFactory
      22. solr.PositionFilterFactory
      23. solr.ReversedWildcardFilterFactory
      24. solr.CollationKeyFilterFactory
      25. solr.ICUCollationKeyFilterFactory
      26. solr.ICUNormalizer2FilterFactory
      27. solr.ICUFoldingFilterFactory
      28. solr.ICUTransformFilterFactory

High Level Concepts

Stemming

Individual Solr stemmers are documented in the Solr Reference Guide section Filter Descriptions.

Analyzers

Analyzers are documented in the Solr Reference Guide section Analyzers.

Char Filters

CharFilters are documented in the Solr Reference Guide section CharFilterFactories.

Tokenizers

Tokenizers are documented in the Solr Reference Guide section Tokenizers.

Token Filters

Token Filters are documented in the Solr Reference Guide section Filter Descriptions.

Specifying an Analyzer in the schema

If you want to use custom CharFilters, Tokenizers or TokenFilters, you'll need to write a very simple factory that subclasses BaseTokenizerFactory or BaseTokenFilterFactory, something like this...

import org.apache.lucene.analysis.TokenStream;
// in Solr 4.0+ the base class is TokenFilterFactory in org.apache.lucene.analysis.util
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class MyCustomFilterFactory extends BaseTokenFilterFactory {
  // wrap the incoming token stream in the custom filter
  public TokenStream create(TokenStream input) {
    return new MyCustomFilter(input);
  }
}
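
Once the compiled factory is on Solr's classpath, it can be referenced from the schema by its fully qualified class name (the package name below is hypothetical):

<filter class="com.example.analysis.MyCustomFilterFactory"/>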

When To use a CharFilter vs a TokenFilter

There are several pairs of CharFilters and TokenFilters that have related (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical functionality (ie: PatternReplaceCharFilterFactory and PatternReplaceFilterFactory) and it may not always be obvious which is the best choice.

The ultimate decision depends largely on what Tokenizer you are using, and whether you need to "outsmart" it by preprocessing the stream of characters.

For example, maybe you have a tokenizer such as StandardTokenizer and you are pretty happy with how it works overall, but you want to customize how some specific characters behave.

In such a situation you could modify the rules and re-build your own tokenizer with javacc, but perhaps it's easier to simply map some of the characters before tokenization with a CharFilter.
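
For example, a sketch of that CharFilter approach, assuming a hypothetical mapping file mapping-custom.txt: map the troublesome characters to something the tokenizer already splits on, before tokenization happens.

<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-custom.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

A mapping rule such as "_" => " " would then make underscores behave like whitespace without rebuilding the tokenizer.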

Notes On Specific Factories

CharFilterFactories

solr.MappingCharFilterFactory

Documentation at MappingCharFilterFactory.

solr.PatternReplaceCharFilterFactory

Documentation at PatternReplaceCharFilterFactory.

solr.HTMLStripCharFilterFactory

Documentation at HTMLStripCharFilterFactory.

TokenizerFactories

Solr provides the following TokenizerFactories (Tokenizers and TokenFilters):

solr.KeywordTokenizerFactory

Documentation at Keyword Tokenizer.

solr.LetterTokenizerFactory

Documentation at Letter Tokenizer.

solr.WhitespaceTokenizerFactory

Documentation at White Space Tokenizer.

solr.LowerCaseTokenizerFactory

Documentation at Lower Case Tokenizer.

solr.StandardTokenizerFactory

Documentation at Standard Tokenizer.

Solr Version | Behavior
pre-3.1 | Some token types are number, alphanumeric, email, acronym, URL, etc.
        | Example: "I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"
3.1 and later | Word boundary rules from Unicode standard annex UAX#29.
        | Token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>
        | Example: "I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM:"8.5", ALPHANUM:"can't"

solr.ClassicTokenizerFactory

Documentation at Classic Tokenizer.

solr.UAX29URLEmailTokenizerFactory

Documentation at UAX29 URL Email Tokenizer.

solr.PatternTokenizerFactory

Documentation at Regular Expression Pattern Tokenizer.

solr.ICUTokenizerFactory

Documentation at ICU Tokenizer.

TokenFilterFactories

Overall documented at Filter Descriptions.

solr.ClassicFilterFactory

Documentation at Classic Filter.

solr.ApostropheFilterFactory

Creates org.apache.lucene.analysis.tr.ApostropheFilter.

Strips all characters after an apostrophe (including the apostrophe itself).

  • Example: "Türkiye'de", "2003'te" ==> "Türkiye", "2003".

solr.LowerCaseFilterFactory

Documentation at Lower Case Filter.

solr.TypeTokenFilterFactory

Documented at Type Token Filter.

solr.TrimFilterFactory

Documented at Trim Filter.

solr.TruncateTokenFilterFactory

<!> Solr4.8

Creates org.apache.lucene.analysis.miscellaneous.TruncateTokenFilter.

A token filter that truncates terms to a specific length.

<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
  • Example: "abcdefg", "1234567" ==> "abcde", "12345".

solr.PatternCaptureGroupFilterFactory

<!> Solr4.4

Emits tokens for each capture group in a regular expression.

For example, the following definition will tokenize the input text of "http://www.foo.com/index" into "http://www.foo.com" and "www.foo.com".

   <fieldType name="url_base" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.PatternCaptureGroupFilterFactory" pattern="(https?://([a-zA-Z\-_0-9.]+))" preserve_original="false"/>
     </analyzer>
   </fieldType>

If none of the patterns match, or if preserve_original is true, the original token will also be emitted.

solr.PatternReplaceFilterFactory

Documentation at Pattern Replace Filter.

solr.StopFilterFactory

Documentation at Stop Filter.

solr.CommonGramsFilterFactory

Documentation at Common Grams Filter.

solr.EdgeNGramFilterFactory

Documentation at Edge N-Gram Filter.

This FilterFactory is very useful in matching prefix substrings (or suffix substrings if side="back") of particular terms in the index during query time. Edge n-gram analysis can be performed at either index or query time (or both), but typically it is more useful, as shown in this example, to generate the n-grams at index time with all of the n-grams indexed at the same position. At query time the query term can be matched directly without any n-gram analysis. Unlike wildcards, n-gram query terms can be used within quoted phrases.

<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>

solr.KeepWordFilterFactory

Documentation at Keep Word Filter.

solr.WordDelimiterFilterFactory

Documentation at Word Delimiter Filter.

One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer).

    <fieldtype name="subword" class="solr.TextField">
      <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="1"
                />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                preserveOriginal="1"
                />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldtype>

In some cases you might want to adjust how WordDelimiterFilter splits on a per-character basis. To do this, you can supply a configuration file via the "types" attribute that specifies custom character categories; an example file is available in the Solr source repository. This is especially useful for enabling hashtag or currency searches, as sketched below.
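
A sketch of such a types file (the file name wdfftypes.txt and the exact mappings are illustrative assumptions, not a shipped example; it would be referenced from the filter as types="wdfftypes.txt"):

  # wdfftypes.txt (hypothetical name): custom character categories
  # for WordDelimiterFilterFactory. Lines starting with '#' are
  # comments, so '#' itself must be written as a unicode escape.

  # treat '#' and '@' as letters so "#hashtag" and "@mention" are not split
  \u0023 => ALPHA
  @ => ALPHA

  # treat the currency sign as a digit so "$500" stays together
  $ => DIGIT

Valid types are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM.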

solr.SynonymFilterFactory

Documentation at Synonym Filter.

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (e.g. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because two potential issues can arise at query time:

  1. The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.

  2. Phrase searching (i.e. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.

Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:

  • An index with a "text" field, which at query time uses the SynonymFilter with the synonym "TV, Television" and expand="true"

  • Many thousands of documents containing the term "text:TV"
  • A few hundred documents containing the term "text:Television"

A query for text:TV will expand into (text:TV text:Television), and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than comparable documents that match "TV" -- which may be somewhat counterintuitive to the client. Index-time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
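
A sketch of the index-time expansion recommended above (the field type name and synonyms file name are illustrative; synonyms.txt would contain a line such as "TV, Television"):

   <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- expand="true" indexes every synonym, equalizing idf across the variants -->
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <!-- no SynonymFilter at query time, avoiding the issues described above -->
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>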

solr.RemoveDuplicatesTokenFilterFactory

Documentation at Remove Duplicates Token Filter.

solr.ISOLatin1AccentFilterFactory

Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. This filter was deprecated in Solr 3.x and removed entirely in 4.x; use solr.ASCIIFoldingFilterFactory instead.

solr.ASCIIFoldingFilterFactory

Documentation at ASCII Folding Filter.

solr.PhoneticFilterFactory

Documentation at Phonetic Filter.

solr.DoubleMetaphoneFilterFactory

Documentation at Double Metaphone Filter.

solr.BeiderMorseFilterFactory

Documentation at Beider-Morse Filter.

This is especially useful for Central European and Eastern European surnames. For example, one can use this filter factory to find documents that contain the surname "Kracovsky" when the original search term was "Crakowski", or vice versa. For more information, check out the paper about Beider-Morse Phonetic Matching (BMPM) at http://stevemorse.org/phonetics/bmpm.htm.
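
A minimal configuration sketch (the field type name is illustrative; the attribute values shown are the commonly used ones described in the Reference Guide):

   <fieldType name="text_bm" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <!-- APPROX rules give broad "sounds-like" matching; languageSet="auto" guesses the language -->
       <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
     </analyzer>
   </fieldType>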

solr.ShingleFilterFactory

Documentation at Shingle Filter.

solr.PositionFilterFactory

This filter was deprecated and removed from Lucene in 5.0.

<!> Solr1.4

Creates org.apache.lucene.analysis.position.PositionFilter.

A PositionFilter manipulates the position of tokens in the stream.

Sets the positionIncrement of all tokens to the configured "positionIncrement" value, except for the first token, which retains its original positionIncrement.

  arg                  value
  positionIncrement    default 0

  <filter class="solr.PositionFilterFactory" />

PositionFilter can be used with a query analyzer to prevent expensive Phrase and MultiPhraseQueries. When QueryParser parses a query, it first divides the text on whitespace, and then analyzes each whitespace token. Some TokenStreams, such as StandardTokenizer or WordDelimiterFilter, may divide one of these whitespace-separated tokens into multiple tokens.

The QueryParser will turn "multiple tokens" into a Phrase or MultiPhraseQuery, but "multiple tokens at the same position with only a position count of 1" is treated as a special case. You can use PositionFilter at the end of your query analyzer to force all tokens after the first one to have a position increment of zero, which triggers this special case.

For example, by default a query of "Wi-Fi" with StandardTokenizer will create a PhraseQuery:

field:"Wi Fi"

If you instead wrap the StandardTokenizer with PositionFilter, the "Fi" will have a position increment of zero, creating a BooleanQuery:

field:Wi field:Fi
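
A minimal sketch of such a query-side analyzer (the field type name is illustrative):

   <fieldType name="text_nophrase" class="solr.TextField">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <!-- gives every token after the first a position increment of zero,
            so the QueryParser builds a BooleanQuery instead of a PhraseQuery -->
       <filter class="solr.PositionFilterFactory"/>
     </analyzer>
   </fieldType>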

Another example is when exact-match hits are wanted for _any_ shingle within the query. (This was done at http://sesam.no to replace three proprietary 'FAST Query-Matching servers' with two open-sourced Solr indexes; background reading is available in sesat and on the mailing list.) The requirement was that all words and shingles in the query be placed at the same position, so that all shingles are treated as synonyms of each other.

With only the ShingleFilter, the generated shingles are synonyms only of the first term in each shingle group. For example, the query "abcd efgh ijkl" results in a query like:

  • ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" efgh ijkl") ("ijkl")

where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym of "efgh".

ShingleFilter does not offer a way to alter this behaviour.

Using the PositionFilter in combination makes it possible to make all shingles synonyms of each other. Such a configuration could look like:

   <fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
        <filter class="solr.PositionFilterFactory" />
      </analyzer>
    </fieldType>

solr.ReversedWildcardFilterFactory

Documentation at Reversed Wildcard Filter.

Add this filter to the index analyzer, but not the query analyzer. The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*). To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text.
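
A sketch of such a configuration (the field type name is illustrative; the attribute values follow those used in Solr's example schema):

   <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <!-- index reversed tokens alongside the originals so leading wildcards can be rewritten -->
       <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
               maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>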

solr.CollationKeyFilterFactory

Documentation at Collation Key Filter. Also discussed in the section Unicode Collation.

solr.ICUCollationKeyFilterFactory

See Unicode Collation.

This filter works like CollationKeyFilterFactory, except it uses ICU for collation. This makes smaller and faster sort keys, and it supports more locales. See UnicodeCollation for more information; the same concepts apply.

The only configuration difference is that locales should be specified to this filter with RFC 3066 locale IDs.

    <fieldType name="icu_sort_en" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ICUCollationKeyFilterFactory" locale="en" strength="primary"/>
      </analyzer>
    </fieldType>

Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.

solr.ICUNormalizer2FilterFactory

Documentation at ICU Normalizer 2 Filter.

solr.ICUFoldingFilterFactory

Documentation at ICU Folding Filter.

solr.ICUTransformFilterFactory

Documentation at ICU Transform Filter.
