Solr's Language Detection

Introduction

This feature adds the ability to detect the language of a document before indexing and then make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor, and there are two implementations:

Tika implementation based upon Tika's language detection capabilities, which covers many, but not all, languages. See http://tika.apache.org/0.10/detection.html for more information on the languages supported.
LangDetect implementation based upon http://code.google.com/p/language-detection/ which supports more languages (53) and has some advanced CJK support.

The component also supports automatic renaming of fields according to detected language and other advanced parameters, all explained in the next section.

Configuration

The UpdateRequestProcessor is configured in solrconfig.xml and supports many parameters. All parameters listed may also be overridden on the update request itself. A minimal configuration specifies the input fields for language identification as well as the output field for the detected language code:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
   </lst>
</processor>

Alternatively, to use the LangDetect implementation based on http://code.google.com/p/language-detection/:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
   </lst>
</processor>

NOTE: The processor supports the defaults/appends/invariants concept for its config. However, it is also possible to skip this level and configure the parameters directly underneath the <processor> tag.
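As a sketch of the flatter layout described in the note, the minimal Tika configuration shown earlier could be written without the defaults list. The field names are the same illustrative ones used in that example; treat this as an untested fragment, not a canonical configuration:

```xml
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <!-- Parameters placed directly under <processor>, with no defaults/appends/invariants level -->
   <str name="langid.fl">title,subject,text,keywords</str>
   <str name="langid.langField">language_s</str>
</processor>
```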

The following list describes each configuration parameter and its meaning:

langid

Lets you enable/disable this processor

Value: true/false

Default: true

langid.fl

Specifies the list of field names to take as input for the language detection

Value: Same format as fl, i.e. a comma- or space-delimited list of field names

Default: N/A (This parameter is mandatory)

langid.langField

Specifies the field to output detected language into. The value written is the language code as emitted by Tika or LangDetect.

Value: Name of field

Default: N/A (This parameter is mandatory)

langid.langsField

Specifies the field to output a list of detected languages into. This must be a multiValued String field. If you use langid.map.individual, each detected language will be added to this field.

Value: Name of field

Default: (Empty - Nothing is written by default)

langid.overwrite

Specifies whether the output in langField and langsField shall be overwritten if langField already contains a value. If this parameter is not set and langField contains a value, that value is subject to whitelist filtering and is then copied to langsField, which is overwritten.

Value: true/false

Default: false

langid.threshold

Specifies a threshold between 0 and 1 for how close the language identification match must be before it is accepted. For long texts, a high value such as 0.8 gives the best results; for shorter texts you may need a lower threshold, at the risk of lower-quality detection. Experiment on your data to find a good value.

Value: A float value between 0.0 and 1.0

Default: 0.5
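For instance, an index made up of long documents might raise the threshold as suggested above. The fragment below is only a sketch: the value 0.8 is taken from the description, the field names repeat the earlier minimal example, and writing the value as a <str> element is an assumption based on how the other parameters on this page are written.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <!-- Require a closer match than the default of 0.5; intended for long texts -->
     <str name="langid.threshold">0.8</str>
   </lst>
</processor>
```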

langid.whitelist

Specifies an optional list of language codes that shall be the only allowed outputs from language identification. If any other language is detected, it is not accepted and the fallback language is used instead. This works well in combination with langid.map=true to make sure you only index documents into fields that exist in your schema.

Value: A comma separated list of language codes accepted. Note that these are codes as output from your detector before mapping with langid.map.lcmap

Default: (Empty - all languages are allowed)
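A whitelist combined with field mapping might look like the following sketch. The codes en, fr and de are purely illustrative, and the fragment assumes that your detector emits exactly those codes and that your schema defines the corresponding language-specific fields.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <!-- Only these detector codes are accepted; any other result triggers the fallback -->
     <str name="langid.whitelist">en,fr,de</str>
     <!-- Rename the input fields according to the accepted language -->
     <str name="langid.map">true</str>
   </lst>
</processor>
```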

langid.map

To enable field name mapping, set langid.map=true. It will then map field names for all fields in langid.fl.

If the set of fields to map is different from langid.fl, supply langid.map.fl. Those fields will then be renamed with a language suffix equal to the language detected from the langid.fl fields.

Value: true/false

Default: false

langid.map.fl

Optional list of fields to do field name mapping for. See langid.map

Value: A comma separated list of fields

Default: (Empty - by default all fields in langid.fl will be mapped)

langid.map.keepOrig

If set to true, the mapping operation will leave the original field in place, i.e. it will act as a field copy instead of a move/map.

Value: true/false

Default: false
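The three mapping parameters described so far can be combined as in the sketch below. It assumes, for illustration only, that language is detected from four fields but that only title and text should receive language-specific copies. The exact names of the mapped fields depend on the suffix rule in effect, so none are shown here.

```xml
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <!-- Fields used as input for detection -->
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <!-- Enable mapping, but only for a subset of the detection fields -->
     <str name="langid.map">true</str>
     <str name="langid.map.fl">title,text</str>
     <!-- Keep the original title and text fields, so mapping acts as a copy -->
     <str name="langid.map.keepOrig">true</str>
   </lst>
</processor>
```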

langid.map.individual

If you require detecting languages separately for each field, supply langid.map.individual=true. The supplied fields will then be renamed according to detected language on an individual field basis.

Value: true/false

Default: false

langid.map.individual.fl

If the set of fields to detect individually is different from the already supplied langid.fl or langid.map.fl, supply langid.map.individual.fl. The fields listed in langid.map.individual.fl will then be detected individually, while the rest of the mapping fields will be mapped according to global document language.

Value: A comma separated list of fields

Default: (Empty - by default all fields in langid.fl or langid.map.fl will be mapped)
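As a sketch of per-field detection, the fragment below detects the language of title on its own, while the remaining mapped fields follow the language detected for the document as a whole. The choice of title is an assumption made for illustration, and langid.map is set explicitly only to make the intent visible.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <str name="langid.map">true</str>
     <!-- Detect individually, but only for the fields listed in langid.map.individual.fl -->
     <str name="langid.map.individual">true</str>
     <str name="langid.map.individual.fl">title</str>
   </lst>
</processor>
```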

langid.fallbackFields

If no language is detected with a sufficient score (see langid.threshold), or if the detected language is not in the whitelist (see langid.whitelist), the field(s) listed in langid.fallbackFields are looked up one by one for a language code. If one is found, it is used as the fallback language. If not, the lookup continues with langid.fallback.

Value: A comma separated list of field names in which to look for a language code. A single field name is also allowed.