SpellCheckComponent

<!> Solr1.3

/!\ :TODO: /!\ HOOK in links to Javadocs.

  1. Introduction
  2. Configuration
  3. Spell Checking Analysis
  4. Request Parameters
    1. spellcheck
    2. q OR spellcheck.q
    3. spellcheck.build
    4. spellcheck.reload
    5. spellcheck.count
    6. spellcheck.onlyMorePopular
    7. spellcheck.extendedResults
    8. spellcheck.collate
    9. spellcheck.dictionary
  5. Use in the Solr Example
    1. Simple results
    2. Extended Results
    3. Collate Results
  6. Implementing a SolrSpellChecker
  7. Implementing a QueryConverter
  8. Building on Commits
  9. Building on Optimize

Introduction

The SpellCheckComponent is designed to provide inline spell checking of queries without having to issue separate requests. Another and possibly clearer way of stating this is that it makes query suggestions (as do well-known web search engines), for example if it thinks the input query might have been misspelled. (Some people tend to think that "spellchecker" is actually a misnomer, and something along the lines of "query suggest" would have been more appropriate.)

For discussion of the development of this feature, see SOLR-572.

The SpellCheckComponent can use the Lucene SpellChecker to give suggestions for given words, or you can implement your own spell checker by extending the SolrSpellChecker abstract base class.

See also SpellCheckerRequestHandler for an alternate, older piece of code to do spell checking.

Configuration

The first step in using SpellCheckComponent is to specify, in solrconfig.xml, the source of the words to be used for suggestions. The words can be loaded from a field in Solr, from text files, or even from fields in arbitrary Lucene indices. A sample configuration for loading words from a field in Solr looks like the following:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  
    <lst name="spellchecker">
      <!-- 
           Optional; required when more than one spellchecker is configured.
           Select a non-default name with spellcheck.dictionary in the request handler.
      -->
      <str name="name">default</str>
      <!-- The classname is optional, defaults to IndexBasedSpellChecker -->
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <!--
           Load tokens from the following field for spell checking;
           the analyzer for the field's type, as defined in schema.xml, is used
      -->
      <str name="field">spell</str>
      <!-- Optional, by default use in-memory index (RAMDirectory) -->
      <str name="spellcheckIndexDir">./spellchecker</str>
      <!-- Set the accuracy (float) to be used for the suggestions. Default is 0.5 -->
      <str name="accuracy">0.7</str>
    </lst>
    <!-- Example of using different distance measure -->
    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">lowerfilt</str>
      <!-- Use a different Distance Measure -->
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <!-- Use a separate directory so this index does not overwrite the default one -->
      <str name="spellcheckIndexDir">./spellchecker2</str>

    </lst>

    <!-- This field type's analyzer is used by the QueryConverter to tokenize the value of the "q" parameter -->
    <str name="queryAnalyzerFieldType">textSpell</str>
</searchComponent>
<!-- 
  The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens.  Uses a simple regular expression
  to strip off field markup, boosts, ranges, etc. but it is not guaranteed to match an exact parse from the query parser.

  Optional, defaults to solr.SpellingQueryConverter
-->
<queryConverter name="queryConverter" class="solr.SpellingQueryConverter"/>

<!--  Add to a RequestHandler 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
NOTE:  YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT.  THIS IS DONE HERE SOLELY FOR 
THE SIMPLICITY OF THE EXAMPLE.  YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
<requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- Optional, must match spell checker's name as defined above, defaults to "default" -->
      <str name="spellcheck.dictionary">default</str>
      <!-- omp = Only More Popular -->
      <str name="spellcheck.onlyMorePopular">false</str>
      <!-- exr = Extended Results -->
      <str name="spellcheck.extendedResults">false</str>
      <!--  The number of suggestions to return -->
      <str name="spellcheck.count">1</str>
    </lst>
<!--  Add to a RequestHandler 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REPEAT NOTE:  YOU LIKELY DO NOT WANT A SEPARATE REQUEST HANDLER FOR THIS COMPONENT.  THIS IS DONE HERE SOLELY FOR 
THE SIMPLICITY OF THE EXAMPLE.  YOU WILL LIKELY WANT TO BIND THE COMPONENT TO THE /select STANDARD REQUEST HANDLER.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-->
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>
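
As the note in the configuration says, in practice the component is usually attached to the standard request handler rather than to a dedicated one. A sketch of that binding, assuming the handler is registered under the name "standard" with default="true" as in the stock example solrconfig.xml (adjust this to match your own handler definition):

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- Any existing defaults for this handler stay as they are -->
  <!-- Run the spellcheck component after the default search components -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With this in place, suggestions are generated only for requests that pass spellcheck=true.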

When adding <str name="field">FieldName</str>, be aware that all fieldType processing is done before the dictionary is created. For more accurate results, avoid a heavily processed field (i.e. one that applies synonyms or stemming). If processing produces many word variations for the field, the dictionary will contain those variations in addition to the valid spell checking data.
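
A lightly analyzed source field along the following lines is one way to follow that advice. This is a sketch for schema.xml that assumes an existing field called name whose content should feed the dictionary. The textSpell type and spell field names match the configuration above, while the choice of tokenizer and filters is illustrative:

```xml
<!-- Minimal analysis: tokenize, lowercase, drop duplicate tokens; no stemming or synonyms -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Field the spellchecker reads its words from -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the (assumed) "name" field into the spell field at index time -->
<copyField source="name" dest="spell"/>
```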

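Multiple "spellchecker" instances can be configured in the same way. The currently available spellchecker implementations are IndexBasedSpellChecker, shown above, and FileBasedSpellChecker, which loads its words from a text file.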
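
A file-based instance could be sketched as follows, assuming a word list named spellings.txt (one word per line) in the conf directory. The instance name, file name, and index directory are illustrative choices rather than required values:

```xml
<lst name="spellchecker">
  <str name="classname">solr.FileBasedSpellChecker</str>
  <!-- Name used to select this instance via spellcheck.dictionary -->
  <str name="name">file</str>
  <!-- Text file holding the dictionary words (assumed file name) -->
  <str name="sourceLocation">spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <!-- Separate index directory for this instance -->
  <str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>
```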

Spell Checking Analysis

SpellCheckingAnalysis - Provides details on how Analysis and Spell Checking work together

Request Parameters

spellcheck

Turn on or off spellcheck suggestions for this request. If true, then spelling suggestions will be generated.

q OR spellcheck.q

The query to spellcheck. If spellcheck.q is defined, it is used; otherwise the original input query (q) is used. The spellcheck.q parameter is intended to be the original query minus any extra markup such as field names and boosts. If the query is taken from the q parameter, the SpellingQueryConverter class is used to parse it into tokens; if spellcheck.q is used, the WhitespaceTokenizer is used instead. The choice of which one to use is up to the application. Essentially, if your application already has a spelling-"ready" version of the query, it is probably better to send spellcheck.q; otherwise, if you just want Solr to do the job, use the q parameter.

Note: The SpellingQueryConverter class does not deal properly with non-ASCII characters. In this case, you must either use spellcheck.q or implement your own QueryConverter.

spellcheck.build

Create the dictionary for use by the SolrSpellChecker. In typical applications, the dictionary must be built before it can be used. However, this is not always necessary, as it is possible to set up the spellchecker with a dictionary that already exists.

spellcheck.reload

Reload the spell checker. The exact behavior depends on the implementation of SolrSpellChecker.reload(), but it usually means reloading the dictionary.

spellcheck.count

The maximum number of suggestions to return.

spellcheck.onlyMorePopular

Only return suggestions that result in more hits for the query than the existing query.

spellcheck.extendedResults

Provide additional information about the suggestion, such as the frequency in the index.

spellcheck.collate

Take the best suggestion for each token (if it exists) and construct a new query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for "jawa" was "java" and for "lording" was "loading", then the resulting collation would be "java class loading". Please note: this only returns a query that can be used; it does not actually run the query.

spellcheck.dictionary

The name of the spellchecker to use. This defaults to "default". Can be used to invoke a specific spellchecker on a per request basis.
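
Apart from being passed on the URL, these parameters can be given as defaults in the request handler configuration. The fragment below is a sketch only: it assumes the jarowinkler spellchecker from the Configuration section is defined, and the count of 5 is an arbitrary illustrative value.

```xml
<lst name="defaults">
  <!-- Generate suggestions for every request to this handler -->
  <str name="spellcheck">true</str>
  <!-- Use the spellchecker named "jarowinkler" instead of "default" -->
  <str name="spellcheck.dictionary">jarowinkler</str>
  <!-- Return at most 5 suggestions -->
  <str name="spellcheck.count">5</str>
  <!-- Also return a rewritten query built from the top suggestions -->
  <str name="spellcheck.collate">true</str>
</lst>
```

spellcheck.build is deliberately left out of the defaults, since rebuilding the dictionary on each request is rarely wanted.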

Use in the Solr Example

The Solr example (in solr/example) comes with a preconfigured SearchComponent and an associated RequestHandler for demonstration purposes. See the example solrconfig.xml (solr/example/solr/conf/solrconfig.xml) for setup parameters.

Simple results

A simple result using the spellcheck.q parameter. Note the spellcheck.build=true parameter, which is needed only once to build the index. It should not be specified for each request.

http://localhost:8983/solr/spellCheckCompRH?q=*:*&spellcheck.q=hell%20ultrashar&spellcheck=true&spellcheck.build=true
<lst name="spellcheck">
        <lst name="suggestions">
                <lst name="hell">
                        <int name="numFound">1</int>
                        <int name="startOffset">0</int>
                        <int name="endOffset">4</int>
                        <arr name="suggestion">
                                <str>dell</str>
                        </arr>
                </lst>
                <lst name="ultrashar">
                        <int name="numFound">1</int>
                        <int name="startOffset">5</int>
                        <int name="endOffset">14</int>
                        <arr name="suggestion">
                                <str>ultrasharp</str>
                        </arr>
                </lst>
        </lst>
</lst>

Extended Results

The spellcheck.extendedResults=true parameter provides the frequency of each original term in the index (origFreq) as well as the frequency of each suggestion in the index (frequency).

http://localhost:8983/solr/spellCheckCompRH?q=*:*&spellcheck.q=hell+ultrashar&spellcheck=true&spellcheck.extendedResults=true
<lst name="spellcheck">
        <lst name="suggestions">
                <lst name="hell">
                        <int name="numFound">1</int>
                        <int name="startOffset">0</int>
                        <int name="endOffset">4</int>
                        <int name="origFreq">0</int>
                        <lst name="suggestion">
                                <int name="frequency">1</int>
                                <str name="word">dell</str>
                        </lst>
                </lst>
                <lst name="ultrashar">
                        <int name="numFound">1</int>
                        <int name="startOffset">5</int>
                        <int name="endOffset">14</int>
                        <int name="origFreq">0</int>
                        <lst name="suggestion">
                                <int name="frequency">1</int>
                                <str name="word">ultrasharp</str>
                        </lst>
                </lst>
                <bool name="correctlySpelled">false</bool>
        </lst>
</lst>

Collate Results

Adding the spellcheck.collate=true parameter returns a query with the misspelled terms replaced by the top suggestions. Note that non-spellcheckable terms, such as those in range queries and prefix queries, are detected and excluded from spellchecking. Such terms are preserved in the collated output so that the collated query can be run again as is.

http://localhost:8983/solr/spellCheckCompRH?q=price:[80 TO 100] hell ultrashar&spellcheck=true&spellcheck.extendedResults=true&spellcheck.collate=true
<lst name="spellcheck">
        <lst name="suggestions">
                <lst name="hell">
                        <int name="numFound">1</int>
                        <int name="startOffset">18</int>
                        <int name="endOffset">22</int>
                        <int name="origFreq">0</int>
                        <lst name="suggestion">
                                <int name="frequency">1</int>
                                <str name="word">dell</str>
                        </lst>
                </lst>
                <lst name="ultrashar">
                        <int name="numFound">1</int>
                        <int name="startOffset">23</int>
                        <int name="endOffset">32</int>
                        <int name="origFreq">0</int>
                        <lst name="suggestion">
                                <int name="frequency">1</int>
                                <str name="word">ultrasharp</str>
                        </lst>
                </lst>
                <bool name="correctlySpelled">false</bool>
                <str name="collation">price:[80 TO 100] dell ultrasharp</str>
        </lst>
</lst>

Implementing a SolrSpellChecker

The SolrSpellChecker class provides an abstract base class for defining common spelling constructs for use in the SpellCheckComponent. Implementing classes need to define the following methods:

  1. reload - How to reload the dictionary/spell checker. This method is called when the application knows there are changes to the dictionary and that they should be loaded.

  2. build - Create the appropriate spelling resources. Also called when the resources need to be rebuilt. Not all implementations need to implement this. For instance, an implementation may always use the same underlying resources, which are immutable. The Lucene IndexBasedSpellChecker, on the other hand, actually creates the appropriate underlying dictionary from the specified index.

  3. getSuggestions(Collection<Token> tokens, IndexReader reader, int count, boolean onlyMorePopular, boolean extendedResults) - The main method called for returning suggestions. See the javadocs for more explanation.

Implementing a QueryConverter

The QueryConverter is an abstract base class defining a method for converting input "raw" queries into a set of tokens for spell checking. It is used to "parse" CommonParams.Q (the input query) and convert it to tokens. It is invoked only for the CommonParams.Q parameter, not for the "spellcheck.q" parameter. Systems that use their own query parser, or that run into issues with the basic implementation, will want to implement their own QueryConverter. Rather than using the provided implementation (SpellingQueryConverter) as is, they should override the appropriate methods of SpellingQueryConverter in their custom QueryConverter and register it in solrconfig.xml with an element of the following form, with the class attribute naming the custom class:

<queryConverter name="queryConverter" class="org.apache.solr.spelling.SpellingQueryConverter"/>

The existing converter uses a relatively simple regular expression to extract the basic query terms from a query and create tokens from them.
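
For a custom converter, the registration might look like the following sketch, where com.example.MyQueryConverter is a hypothetical class name standing in for your own subclass:

```xml
<!-- Replaces the default solr.SpellingQueryConverter; the class name is a placeholder -->
<queryConverter name="queryConverter" class="com.example.MyQueryConverter"/>
```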

Building on Commits

SpellCheckComponent can be configured to automatically (re)build its indices from fields in the Solr index whenever a commit is done. To enable this feature, add the following line to the SpellCheckComponent configuration of each spellchecker to which it should apply:

<str name="buildOnCommit">true</str>

For example:

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
      <str name="buildOnCommit">true</str>
    </lst>

Building on Optimize

<!> Solr1.4

SpellCheckComponent can be configured to automatically (re)build its indices from fields in the Solr index whenever an optimize command is done. To enable this feature, add the following line to your SpellCheckComponent configuration:

<str name="buildOnOptimize">true</str>
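
By analogy with the buildOnCommit example, the setting would sit inside the spellchecker definition. The name, field, and directory below are copied from that example and are placeholders for your own values:

```xml
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">spell</str>
  <str name="spellcheckIndexDir">./spellchecker1</str>
  <!-- Rebuild the spelling index whenever the Solr index is optimized -->
  <str name="buildOnOptimize">true</str>
</lst>
```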

last edited 2009-06-05 08:44:58 by MichaelLudwig