Analyzers, Tokenizers, and Token Filters
Overview
When a document is indexed, its individual fields are subject to analyzing and tokenizing filters that can transform and normalize the data in the fields: for example, removing blank spaces, removing HTML code, stemming, or replacing a particular character with another. You may need to perform some of these or similar operations at query time as well as at indexing time. For example, you might perform a Soundex transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.
The lists below provide an overview of some of the more heavily used Tokenizers and TokenFilters provided by Solr "out of the box", along with tips and examples for using them. This list should by no means be considered the complete list of all Analysis classes available in Solr. In addition to new classes being added on an ongoing basis, you can load your own custom Analysis code as a Plugin.
For a more complete list of the Tokenizers and TokenFilters that come out of the box, please consult the javadocs for the analysis package. If you have any tips or tricks you'd like to mention about using any of these classes, please add them below.
Note: For a good background on Lucene Analysis, it's recommended that you read the following sections in Lucene In Action:
- 1.5.3 : Analyzer
- Chapter 4.0 through 4.7 at least
Try searches for "analyzer", "token", and "stemming".
- Analyzers, Tokenizers, and Token Filters
- Overview
- Stemming
- Analyzers
- Tokens and Token Filters
- Specifying an Analyzer in the schema
- TokenizerFactories
- TokenFilterFactories
- solr.StandardFilterFactory
- solr.LowerCaseFilterFactory
- solr.TrimFilterFactory
- solr.StopFilterFactory
- solr.KeepWordFilterFactory
- solr.LengthFilterFactory
- solr.PorterStemFilterFactory
- solr.EnglishPorterFilterFactory
- solr.SnowballPorterFilterFactory
- solr.WordDelimiterFilterFactory
- solr.SynonymFilterFactory
- solr.RemoveDuplicatesTokenFilterFactory
- solr.ISOLatin1AccentFilterFactory
- solr.PhoneticFilterFactory
Stemming
There are two types of stemming strategies:
Porter or Reduction stemming: A transforming algorithm that reduces any of the forms of a word, such as "runs, running, ran", to its elemental root, e.g. "run". Porter stemming must be performed both at insertion time and at query time.
Expansion stemming: Takes a root word and 'expands' it to all of its various forms. It can be used either at insertion time or at query time.
Analyzers
Analyzers are components that pre-process input text at index time and/or at search time. It's important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.
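The lowercasing example above can be made concrete with a small sketch. This is a toy inverted index written for illustration only; the class AnalysisMismatchDemo and its methods are hypothetical and are not part of Lucene or Solr. It indexes text with lowercasing and then looks up a term with and without the same step.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (hypothetical, not a Lucene class): analyzed term -> ids of matching documents. */
final class AnalysisMismatchDemo {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index-time analysis: split on whitespace, then lowercase each word. */
    void index(int docId, String text) {
        for (String raw : text.trim().split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    /** Looks up one query term, optionally applying the same lowercasing used at index time. */
    Set<Integer> search(String queryTerm, boolean lowercaseQuery) {
        String term = lowercaseQuery ? queryTerm.toLowerCase(Locale.ROOT) : queryTerm;
        return postings.getOrDefault(term, new HashSet<>());
    }

    public static void main(String[] args) {
        AnalysisMismatchDemo idx = new AnalysisMismatchDemo();
        idx.index(1, "Apache Solr");
        System.out.println(idx.search("Solr", true));  // compatible analysis: document 1 is found
        System.out.println(idx.search("Solr", false)); // incompatible analysis: nothing is found
    }
}
```

The second lookup fails because the index only ever saw the term "solr", so the un-lowercased query term "Solr" has no postings.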
The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to specify a custom Analyzer in the Solr schema.
Tokens and Token Filters
An analyzer splits up a text field into the tokens that the field is indexed by. An Analyzer is normally implemented by creating a Tokenizer that splits up a stream (normally a single field value) into a series of tokens. These tokens are then passed through a series of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
The Solr web admin interface may be used to show the results of text analysis, and even the results after each analysis phase if a custom analyzer is used.
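As a rough illustration of the tokenizer-then-filters flow described in this section, the following sketch chains a whitespace tokenizer with two filters over plain string lists. It is a simplification under stated assumptions: real Lucene tokens also carry offsets, types, and positions, and the class name, method names, and three-word stop list here are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of an analysis chain; not the Lucene Analyzer API. */
final class AnalysisChainSketch {
    /** Tokenizer stage: splits the field value on whitespace. */
    static List<String> tokenize(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        for (String t : fieldValue.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Filter stage that changes tokens: lowercases each one. */
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    /** Filter stage that removes tokens: drops words on a small, made-up stop list. */
    static List<String> removeStopWords(List<String> in) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the", "of"));
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!stop.contains(t)) out.add(t);
        }
        return out;
    }

    /** Runs the tokenizer, then applies each filter in the listed order. */
    static List<String> analyze(String fieldValue) {
        List<UnaryOperator<List<String>>> filters = new ArrayList<>();
        filters.add(AnalysisChainSketch::lowerCase);
        filters.add(AnalysisChainSketch::removeStopWords);
        List<String> tokens = tokenize(fieldValue);
        for (UnaryOperator<List<String>> f : filters) tokens = f.apply(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Search")); // prints [art, search]
    }
}
```

Filter order matters in this sketch: if the stop filter ran before lowercasing, "The" would not match the lowercase stop list and would survive.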
Specifying an Analyzer in the schema
A Solr schema.xml file allows two methods for specifying the way a text field is analyzed (normally only field types of solr.TextField will have Analyzers explicitly specified in the schema):
Specifying the class name of an Analyzer (anything extending org.apache.lucene.analysis.Analyzer).
Example:
<fieldtype name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
Specifying a TokenizerFactory followed by a list of optional TokenFilterFactories that are applied in the listed order. Factories that can create the tokenizers or token filters are used to prepare configuration for the tokenizer or filter and avoid the overhead of creation via reflection.
Example:
<fieldtype name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldtype>
Any Analyzer, TokenizerFactory, or TokenFilterFactory may be specified using its full class name with package -- just make sure they are in Solr's classpath when you start your appserver. Classes in the org.apache.solr.analysis.* package can be referenced using the short alias solr.*.
If you want to use custom Tokenizers or TokenFilters, you'll need to write a very simple factory that subclasses BaseTokenizerFactory or BaseTokenFilterFactory, something like this...
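import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

// MyCustomFilter stands for your own TokenFilter subclass.
public class MyCustomFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new MyCustomFilter(input);
  }
}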
TokenizerFactories
Solr provides the following TokenizerFactories (Tokenizers and TokenFilters):
solr.LetterTokenizerFactory
Creates org.apache.lucene.analysis.LetterTokenizer.
Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
Example: "I can't" ==> "I", "can", "t"
solr.WhitespaceTokenizerFactory
Creates org.apache.lucene.analysis.WhitespaceTokenizer.
Creates tokens of characters separated by splitting on whitespace.
solr.LowerCaseTokenizerFactory
Creates org.apache.lucene.analysis.LowerCaseTokenizer.
Creates tokens by lowercasing all letters and dropping non-letters.
Example: "I can't" ==> "i", "can", "t"
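The two letter-based tokenizers above differ only in lowercasing, which a short sketch can show. This is an approximation of the described behavior rather than the Lucene implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of letter-run tokenizing; not the Lucene LetterTokenizer. */
final class LetterSplitSketch {
    /** Emits maximal runs of letters; every non-letter character acts as a separator and is discarded. */
    static List<String> letterTokens(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                run.append(lowercase ? Character.toLowerCase(c) : c);
            } else if (run.length() > 0) {
                tokens.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) tokens.add(run.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("I can't", false)); // prints [I, can, t]
        System.out.println(letterTokens("I can't", true));  // prints [i, can, t]
    }
}
```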
solr.StandardTokenizerFactory
Creates org.apache.lucene.analysis.standard.StandardTokenizer.
A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The StandardFilter is currently the only Lucene filter that utilizes token types.
Some token types are number, alphanumeric, email, acronym, URL, etc.
Example: "I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"
solr.HTMLStripWhitespaceTokenizerFactory
Strips HTML from the input stream and passes the result to a WhitespaceTokenizer.
HTML stripping features:
The input need not be an HTML document as only constructs that look like HTML will be removed.
Removes HTML/XML tags while keeping the content
Attributes within tags are also removed, and attribute quoting is optional.
Removes XML processing instructions: <?foo bar?>
Removes XML comments
Removes XML elements starting with <! and ending with >
Removes contents of <script> and <style> elements.
Handles XML comments inside these elements (normal comment processing won't always work)
Replaces numeric character entity references like &#65; or &#x7f;
The terminating ';' is optional if the entity reference is followed by whitespace.
Replaces all named character entity references.
&nbsp; is replaced with a space instead of 0xa0
The terminating ';' is mandatory to avoid false matches on something like "Alpha&Omega Corp"
HTML stripping examples:
| input | output |
| my <a href="www.foo.bar">link</a> | my link |
| <?xml?><br>hello<!--comment--> | hello |
| hello<script><-- f('<--internal--></script>'); --></script> | hello |
| if a<b then print a; | if a<b then print a; |
| hello <td height=22 nowrap align="left"> | hello |
| a<b &#65 Alpha&Omega &Omega; | a<b A Alpha&Omega Ω |
solr.HTMLStripStandardTokenizerFactory
Strips HTML from the input stream and passes the result to a StandardTokenizer.
See solr.HTMLStripWhitespaceTokenizerFactory for details on HTML stripping.
TokenFilterFactories
solr.StandardFilterFactory
Creates org.apache.lucene.analysis.standard.StandardFilter.
Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e., those produced by StandardTokenizer or equivalent.
Example of StandardTokenizer followed by StandardFilter:
"I.B.M. cat's can't" ==> "IBM", "cat", "can't"
solr.LowerCaseFilterFactory
Creates org.apache.lucene.analysis.LowerCaseFilter.
Lowercases the letters in each token. Leaves non-letter tokens alone.
Example: "I.B.M.", "Solr" ==> "i.b.m.", "solr".
solr.TrimFilterFactory
Creates org.apache.solr.analysis.TrimFilter.
Trims whitespace at either end of a token.
Example: " Kittens! ", "Duck" ==> "Kittens!", "Duck".
Optionally, the "updateOffsets" attribute will update the start and end position offsets.
solr.StopFilterFactory
Creates org.apache.lucene.analysis.StopFilter.
Discards common words.
The default English stop words are:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
A customized stop word list may be specified with the "words" attribute in the schema. Optionally, the "ignoreCase" attribute may be used to ignore the case of tokens when comparing to the stopword list.
<fieldtype name="teststop" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldtype>
solr.KeepWordFilterFactory
Creates org.apache.solr.analysis.KeepWordFilter.
[Solr 1.3]
Keep words on a list. This is the inverse behavior of StopFilterFactory. The word file format is identical.
<fieldtype name="testkeep" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
</fieldtype>
solr.LengthFilterFactory
Creates solr.LengthFilter.
Filters out those tokens *not* having length min through max inclusive.
<fieldtype name="lengthfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="2" max="5" />
</analyzer>
</fieldtype>
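The inclusive min/max rule is easy to get off by one, so here is a minimal sketch of it over a plain list of strings. The class and method names are hypothetical, and token length is taken to be the Java string length, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of length-based token filtering; not the Solr LengthFilter. */
final class LengthFilterSketch {
    /** Keeps only tokens whose length lies in [min, max], both bounds inclusive. */
    static List<String> keepByLength(List<String> tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() >= min && t.length() <= max) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "to", "solr", "index", "analysis");
        // With min=2 and max=5, as in the schema example above:
        System.out.println(keepByLength(tokens, 2, 5)); // prints [to, solr, index]
    }
}
```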
solr.PorterStemFilterFactory
Creates org.apache.lucene.analysis.PorterStemFilter.
Standard Lucene implementation of the Porter Stemming Algorithm, a normalization process that removes common endings from words.
Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
solr.EnglishPorterFilterFactory
Creates solr.EnglishPorterFilter.
Creates an English Porter2 stemmer from the Java classes generated from a Snowball specification.
A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by the stemmer.
A sample Solr protwords.txt with comments can be found in the Source Repository.
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
</analyzer>
</fieldtype>
Note: Due to performance concerns, this implementation does not utilize org.apache.lucene.analysis.snowball.SnowballFilter, as that class uses Java reflection to stem every word.
solr.SnowballPorterFilterFactory
Creates org.apache.lucene.analysis.SnowballPorterFilter.
Creates a Porter2 stemmer from the Java classes generated from a Snowball specification. The language attribute is used to specify the language of the stemmer.
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German" />
</analyzer>
</fieldtype>
Valid values for the language attribute (creates the snowball stemmer class language + "Stemmer"):
Danish
Dutch
English
Finnish
French
German2
German
Italian
Kp
Lovins
Norwegian
Porter
Portuguese
Russian
Spanish
Swedish
solr.WordDelimiterFilterFactory
Creates solr.analysis.WordDelimiterFilter.
Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
split on intra-word delimiters (by default, all non alpha-numeric characters).
"Wi-Fi" -> "Wi", "Fi"
split on case transitions
"PowerShot" -> "Power", "Shot"
split on letter-number transitions
"SD500" -> "SD", "500"
leading and trailing intra-word delimiters on each subword are ignored
"//hello---there, 'dude'" -> "hello", "there", "dude"
trailing "'s" are removed for each subword
"O'Neil's" -> "O", "Neil"
Note: this step isn't performed in a separate filter because of possible subword combinations.
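The splitting rules listed above can be approximated in a few lines. The sketch below is a simplification for illustration, not the WordDelimiterFilter code: it ignores the configuration parameters and token positions, handles the possessive rule with a regular expression, and uses hypothetical class and method names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified sketch of the subword splitting rules; not the Solr WordDelimiterFilter. */
final class SubwordSplitSketch {
    /** Splits on non-alphanumerics, lower-to-upper case changes, and letter/number boundaries. */
    static List<String> split(String word) {
        // Treat a trailing possessive 's (one not followed by a letter or digit) as a delimiter.
        String s = word.replaceAll("'[sS](?![\\p{L}\\p{N}])", " ");
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetterOrDigit(c)) { // intra-word delimiter: end the current subword
                flush(parts, cur);
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean letterNumber = Character.isLetter(prev) != Character.isLetter(c);
                if (caseChange || letterNumber) flush(parts, cur);
            }
            cur.append(c);
        }
        flush(parts, cur);
        return parts;
    }

    /** Adds the pending subword, if any, to the output and resets the buffer. */
    private static void flush(List<String> parts, StringBuilder cur) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("Wi-Fi"));     // prints [Wi, Fi]
        System.out.println(split("PowerShot")); // prints [Power, Shot]
        System.out.println(split("SD500"));     // prints [SD, 500]
        System.out.println(split("O'Neil's"));  // prints [O, Neil]
    }
}
```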
Splitting is affected by the following parameter:
splitOnCaseChange="1" causes lowercase => uppercase transitions to generate a new part [Solr 1.3]:
"PowerShot" => "Power" "Shot"
"TransAM" => "Trans" "AM"
Note that this is the default behaviour in all released versions of Solr.
There are also a number of parameters that affect what tokens are present in the final output and if subwords are combined:
generateWordParts="1" causes parts of words to be generated:
"PowerShot" => "Power" "Shot" (if splitOnCaseChange=1)
"Power-Shot" => "Power" "Shot"
generateNumberParts="1" causes number subwords to be generated:
"500-42" => "500" "42"
catenateWords="1" causes maximum runs of word parts to be catenated:
"wi-fi" => "wifi"
catenateNumbers="1" causes maximum runs of number parts to be catenated:
"500-42" => "50042"
catenateAll="1" causes all subword parts to be catenated:
"wi-fi-4000" => "wifi4000"
These parameters may be combined in any way.
Example of generateWordParts="1" and catenateWords="1":
"PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot" (where 0,1,1 are token positions)
"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer).
<fieldtype name="subword" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldtype>
solr.SynonymFilterFactory
Creates SynonymFilter.
Matches strings of tokens and replaces them with other strings of tokens.
The synonyms parameter names an external file defining the synonyms.
If ignoreCase is true, matching will lowercase before checking equality.
If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.
Example usage in schema:
<fieldtype name="syn" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldtype>
Synonym file format:
# blank lines and lines starting with pound are comments.

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
#no explicit mapping. In this case the mapping behavior will
#be taken from the expand parameter in the schema. This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
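To make the file format rules easier to follow, here is a rough sketch of a parser that turns such lines into a mapping from each left-hand entry to its replacements. It only illustrates the rules described in the comments of the file format (comments, explicit "=>" mappings, the expand switch, and merging); it is not the SynonymFilterFactory parser, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of synonym rule parsing; not the Solr implementation. */
final class SynonymRulesSketch {
    /** Splits one comma-separated side of a rule into trimmed, non-empty entries. */
    static List<String> entries(String side) {
        List<String> out = new ArrayList<>();
        for (String e : side.split(",")) {
            String t = e.trim();
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Builds a map from each left-hand entry to its merged list of replacements. */
    static Map<String, List<String>> parse(List<String> lines, boolean expand) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        for (String line : lines) {
            String l = line.trim();
            if (l.isEmpty() || l.startsWith("#")) continue; // blank lines and comments
            List<String> lhs;
            List<String> rhs;
            int arrow = l.indexOf("=>");
            if (arrow >= 0) {
                // Explicit mapping: the expand setting is ignored.
                lhs = entries(l.substring(0, arrow));
                rhs = entries(l.substring(arrow + 2));
            } else {
                // Equivalent synonyms: expand to all entries, or reduce to the first one.
                lhs = entries(l);
                if (lhs.isEmpty()) continue;
                rhs = expand ? lhs : lhs.subList(0, 1);
            }
            for (String from : lhs) {
                List<String> targets = rules.computeIfAbsent(from, k -> new ArrayList<>());
                for (String to : rhs) {
                    if (!targets.contains(to)) targets.add(to); // merge repeated entries
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> file = Arrays.asList("# comment", "foo => foo bar", "foo => baz");
        System.out.println(parse(file, true)); // prints {foo=[foo bar, baz]}
    }
}
```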
Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (i.e. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:
The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
Phrase searching (i.e. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television), and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than comparable docs that match "TV", which may be somewhat counterintuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
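The size of the idf gap in this scenario can be estimated with a quick calculation. The sketch below assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)), as used by DefaultSimilarity; the index size and the two document frequencies are made-up numbers chosen only to match "many thousands" and "a few hundred".

```java
/** Hypothetical sketch comparing idf values for a common and a rare synonym. */
final class IdfSketch {
    /** Classic Lucene-style idf (assumed formula): 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 100_000;          // assumed index size
        int tvDocFreq = 20_000;         // assumed: "many thousands" of docs contain TV
        int televisionDocFreq = 300;    // assumed: "a few hundred" contain Television
        System.out.println("idf(TV)         = " + idf(tvDocFreq, numDocs));
        System.out.println("idf(Television) = " + idf(televisionDocFreq, numDocs));
    }
}
```

Because idf grows as docFreq shrinks, the rarer term always gets the larger weight under this formula, which is the effect described above.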
solr.RemoveDuplicatesTokenFilterFactory
Creates org.apache.solr.analysis.RemoveDuplicatesTokenFilter.
Filters out any tokens which are at the same logical position in the tokenstream as a previous token with the same text. This can arise in a number of ways depending on what the "up stream" token filters are, notably when stemming synonyms with similar roots. It is useful to remove the duplicates to prevent idf inflation at index time, or tf inflation (in a MultiPhraseQuery) at query time.
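A minimal sketch of this de-duplication follows. It assumes the usual Lucene convention that a position increment of 0 places a token at the same position as the previous one; the Tok class and the sample tokens are hypothetical and stand in for real Lucene tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of same-position duplicate removal; not the Solr filter itself. */
final class RemoveDuplicatesSketch {
    /** Minimal token: its text plus the position increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Drops a token when an earlier token at the same position already has the same text. */
    static List<Tok> dedupe(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (Tok t : in) {
            if (t.positionIncrement > 0) seenAtPosition.clear(); // stream moved to a new position
            if (seenAtPosition.add(t.text)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up stream: two synonyms at one position that stemmed to the same text.
        List<Tok> stream = Arrays.asList(new Tok("tv", 1), new Tok("televis", 0), new Tok("televis", 0));
        for (Tok t : dedupe(stream)) System.out.println(t.text); // prints tv, then televis
    }
}
```

The same text at a later position is kept, since only repeats within one position would inflate tf or idf.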
solr.ISOLatin1AccentFilterFactory
Creates org.apache.lucene.analysis.ISOLatin1AccentFilter.
Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent.
solr.PhoneticFilterFactory
Creates org.apache.lucene.analysis.PhoneticFilter.
Uses commons codec to generate phonetically similar tokens. This currently supports four methods.
| arg | value |
| encoder | one of: |
| inject | true/false -- true will add tokens to the stream, false will replace the existing token |
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>