Using PreAnalyzedField type for integration with external document processing pipelines.

This field type is available since Solr 4.0. See also SOLR-1535, SOLR-4619.

The PreAnalyzedField type provides a way to send serialized token streams to Solr, optionally with an independent stored value for the field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful when a user wants to submit field content that was already processed by an existing external text processing pipeline (e.g. tokenized, annotated, stemmed, expanded with synonyms), while still using all the rich per-token attributes that Lucene's TokenStream provides.

Pluggable serialization

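The serialization format is pluggable through implementations of the PreAnalyzedParser interface. There are two out-of-the-box implementations, JsonPreAnalyzedParser and SimplePreAnalyzedParser, both in the org.apache.solr.schema package.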

Configuration options

There is only one configuration parameter, parserImpl. Its value should be the fully qualified name of a class that implements the PreAnalyzedParser interface (since Solr 4.3 you can also use json or simple as shortcuts for the two included implementations). The default value is org.apache.solr.schema.JsonPreAnalyzedParser (or json).

Here's an example of how to define the type and a field that uses this type in schema.xml:

<types>
  ...
  <fieldType name="preanalyzed" class="solr.PreAnalyzedField" parserImpl="org.apache.solr.schema.JsonPreAnalyzedParser"/>
  ...
</types>
<fields>
  ...
  <field name="pre" type="preanalyzed" indexed="true" stored="true"/>
  ...
</fields>
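
Because Solr 4.3 and later accept short names for the bundled parsers, the same type can presumably be declared more compactly. A minimal sketch, assuming Solr 4.3 or later; the type name is reused from the example above for illustration only:

```xml
<!-- Assumes Solr 4.3+: "json" is the shortcut for
     org.apache.solr.schema.JsonPreAnalyzedParser, which is also the default. -->
<fieldType name="preanalyzed" class="solr.PreAnalyzedField" parserImpl="json"/>
```

On releases before 4.3 the fully qualified class name is still needed.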

And here's an example XML that adds documents with fields of this type:

<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">1</field>
<field name="pre">{"v":"1","str":"document one","tokens":[{"t":"one"},{"t":"two"},{"t":"three","i":100}]}</field>
</doc>
<doc>
<field name="id">2</field>
<field name="pre">{"v":"1","str":"document two","tokens":[{"t":"four"},{"t":"five"},{"t":"six","i":100}]}</field>
</doc>
<doc>
<field name="id">3</field>
<field name="pre">{"v":"1","str":"document three","tokens":[{"t":"seven"},{"t":"eight"},{"t":"nine","i":100}]}</field>
</doc>
</add>
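
In the JSON values above, v is the format version, str is the stored string value, and tokens is the list of tokens, where t is the token text and i its position increment. The sketch below shows how further per-token attributes might be supplied. It assumes the JSON parser's short keys s and e for start and end character offsets; the document id and field content are made up for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">4</field>
<!-- Assumed keys: "s" = start offset, "e" = end offset (character positions in "str") -->
<field name="pre">{"v":"1","str":"hello world","tokens":[{"t":"hello","s":0,"e":5,"i":1},{"t":"world","s":6,"e":11,"i":1}]}</field>
</doc>
</add>
```

Here the offsets point back into the stored value ("hello" spans characters 0 to 5, "world" spans 6 to 11), which is the kind of information features such as highlighting rely on.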

PreAnalyzedField (last edited 2013-04-05 09:55:10 by AndrzejBialecki)