Available since Solr 3.5. To configure the Hunspell stemmer in Solr, download the .dic and .aff files for your language(s), then add the HunspellStemFilterFactory to your analysis chain. The following example uses British English:

 <filter class="solr.HunspellStemFilterFactory"
    dictionary="en_GB.dic"
    affix="en_GB.aff"
    ignoreCase="true" />
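
For context, the filter sits inside the analyzer of a field type in schema.xml. The following is only a sketch: the field type name text_hunspell is hypothetical, and the tokenizer and lowercase filter are one plausible choice rather than a required setup.

 <fieldType name="text_hunspell" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <!-- Split text into word tokens before stemming -->
     <tokenizer class="solr.StandardTokenizerFactory" />
     <!-- Tokens are lowercased here, so ignoreCase="true" is set on the stemmer below -->
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.HunspellStemFilterFactory"
         dictionary="en_GB.dic"
         affix="en_GB.aff"
         ignoreCase="true" />
   </analyzer>
 </fieldType>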

The dictionary parameter optionally takes a comma-separated list of dictionaries, in which case all of them are loaded in the order specified. This lets you maintain your own custom additions without editing the originals. We encourage you to contribute your changes and additions back to the maintainers of the original dictionaries.
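
As an illustration, a configuration with a second dictionary of custom additions might look like the following. The file name custom_en_GB.dic is hypothetical; it stands for a file you would create yourself, in the same format as the original dictionary.

 <filter class="solr.HunspellStemFilterFactory"
     dictionary="en_GB.dic,custom_en_GB.dic"
     affix="en_GB.aff"
     ignoreCase="true" />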

The ignoreCase parameter enables case-insensitive matching against the dictionaries, which can be useful for stemming variations of proper names such as Apache/Apaches. The default value is false.

An example of how Hunspell may be more accurate than the Snowball stemmer, from Norwegian:

              bil (car)    biler (cars)   billig (cheap)   billige           billigere (cheaper)
Snowball      bil          bil            bil (car)        bil               billiger (N/A)
Hunspell      bil          bil            billig           billig            billig (cheap)
                           bile (drive)                    billige (pl)      billige (pl)

Warning: Hunspell's suitability for stemming depends on the quality of the dictionaries and affix files. Always test the available stemmers before deciding which one to use for your language. Another potential disadvantage of a dictionary-based stemmer is that it only works for words listed in the dictionary, so be prepared to invest some time in adding new or domain-specific vocabulary to the dictionaries.
