TikaEntityProcessor

Available since Solr 3.1

Simple configuration

<dataConfig>
  <document>
    <dataSource type="BinURLDataSource" name="bin"/>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" dataSource="bin" format="text">
      <!-- Do the appropriate mapping here. meta="true" means the field is a metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately. -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>

attributes

fields

Each field may have an optional attribute meta="true", which means the field is obtained from the metadata of the document. The column value is used as the key into that metadata. For the list of available keys, see the Tika DublinCore and MSOffice metadata definitions.
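
As a small illustration of the difference, the fragment below maps one metadata key and the implicit text field side by side. It is a sketch rather than a complete configuration: it belongs inside a TikaEntityProcessor entity, and the Solr field name "body" is a hypothetical name chosen for this example.

```xml
<!-- meta="true": "Author" is looked up as a key in the document's metadata -->
<field column="Author" meta="true" name="author"/>
<!-- no meta attribute: "text" is the implicit content field emitted by the processor -->
<!-- "body" is a hypothetical Solr field name; use one defined in your schema -->
<field column="text" name="body"/>
```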

DataSource

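Use any DataSource of type DataSource<InputStream>. The built-in ones include BinURLDataSource (used in the examples on this page), BinFileDataSource and BinContentStreamDataSource.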
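
For documents on the local file system, a common arrangement is to wrap the TikaEntityProcessor in a FileListEntityProcessor and read each file through BinFileDataSource, which is also a DataSource<InputStream>. The fragment below is a sketch under assumptions: the directory /path/to/docs, the file name pattern, and the entity names "files" and "tika" are placeholders, and the surrounding dataConfig and document elements are omitted.

```xml
<!-- Binary data source that opens local files as an InputStream -->
<dataSource type="BinFileDataSource" name="bin"/>

<!-- The outer entity only lists files, so it needs no data source of its own -->
<entity name="files" processor="FileListEntityProcessor" baseDir="/path/to/docs" fileName=".*\.pdf" rootEntity="false" dataSource="null">
  <!-- The inner entity parses each listed file; fileAbsolutePath is emitted by the outer entity -->
  <entity name="tika" processor="TikaEntityProcessor" url="${files.fileAbsolutePath}" dataSource="bin" format="text">
    <field column="text"/>
  </entity>
</entity>
```

With rootEntity="false" on the outer entity, each parsed file, rather than the file listing as a whole, should become one Solr document.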

Advanced Parsing

The TikaEntityProcessor can be nested with XPathEntityProcessor to index only selected parts of a document.

Example:

<dataConfig>
  <document>
    <dataSource type="BinURLDataSource" name="bin"/>
    <dataSource type="FieldReaderDataSource" name="fld"/>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" dataSource="bin" format="html" rootEntity="false">
      <!-- Do the appropriate mapping here. meta="true" means the field is a metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately. -->
      <field column="text"/>
      <!-- The inner entity reads the HTML held in the 'text' field through the FieldReaderDataSource -->
      <entity processor="XPathEntityProcessor" dataSource="fld" forEach="/html" dataField="text">
        <field xpath="//div" column="foo"/>
        <field xpath="//h1" column="h1"/>
      </entity>
    </entity>
  </document>
</dataConfig>

TikaEntityProcessor (last edited 2011-04-11 13:35:22 by KojiSekiguchi)