Using PreAnalyzedField type for integration with external document processing pipelines.

This field type is available since Solr 4.0. See also SOLR-1535, SOLR-4619.

The PreAnalyzedField type provides a way to send serialized token streams to Solr, optionally with an independent stored value for the field, and to have this information stored and indexed without any additional text processing applied in Solr. This is useful if a user wants to submit field content that has already been processed by an existing external text processing pipeline (e.g. tokenized, annotated, stemmed, expanded with synonyms, etc.), while still using the rich per-token attributes that Lucene's TokenStream provides.

Pluggable serialization

The serialization format is pluggable through implementations of the PreAnalyzedParser interface. There are two out-of-the-box implementations:

  • JsonPreAnalyzedParser - as the name suggests, it parses content that uses JSON to represent a field's content. This is the default parser, used if the field type is not configured otherwise.

  • SimplePreAnalyzedParser - uses a simple, strict plain-text format, which in some situations may be easier to create than JSON.

Configuration options

There is only one configuration parameter, parserImpl. Its value should be the fully qualified name of a class that implements the PreAnalyzedParser interface. Since Solr 4.3 you can also use json or simple as shortcuts for the two included implementations. The default value of this parameter is org.apache.solr.schema.JsonPreAnalyzedParser (or json).

Here's an example of how to define the type and a field that uses this type in schema.xml:

<types>
  ...
  <fieldType name="preanalyzed" class="solr.PreAnalyzedField" parserImpl="org.apache.solr.schema.JsonPreAnalyzedParser"/>
  ...
</types>
<fields>
  ...
  <field name="pre" type="preanalyzed" indexed="true" stored="true"/>
  ...
</fields>
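
For comparison, the short parser names described under Configuration options could be used in place of the fully qualified class name. The following is only a sketch, assuming Solr 4.3 or later; the type names preanalyzed_json and preanalyzed_simple are hypothetical and chosen purely for illustration:

```xml
<!-- Sketch only: assumes Solr 4.3 or later; the type names are hypothetical. -->
<types>
  <!-- "json" is the shortcut for org.apache.solr.schema.JsonPreAnalyzedParser -->
  <fieldType name="preanalyzed_json" class="solr.PreAnalyzedField" parserImpl="json"/>
  <!-- "simple" is the shortcut for the included SimplePreAnalyzedParser implementation -->
  <fieldType name="preanalyzed_simple" class="solr.PreAnalyzedField" parserImpl="simple"/>
</types>
```

On earlier 4.x releases the shortcuts are not available, so the fully qualified class name would still be needed.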

And here's an example XML update message that adds documents whose pre field carries content in the JSON format:

<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">1</field>
<field name="pre">{"v":"1","str":"document one","tokens":[{"t":"one"},{"t":"two"},{"t":"three","i":100}]}</field>
</doc>
<doc>
<field name="id">2</field>
<field name="pre">{"v":"1","str":"document two","tokens":[{"t":"four"},{"t":"five"},{"t":"six","i":100}]}</field>
</doc>
<doc>
<field name="id">3</field>
<field name="pre">{"v":"1","str":"document three","tokens":[{"t":"seven"},{"t":"eight"},{"t":"nine","i":100}]}</field>
</doc>
</add>

PreAnalyzedField (last edited 2013-04-05 09:55:10 by AndrzejBialecki)