TikaEntityProcessor
Simple configuration
<dataConfig>
  <dataSource type="BinURLDataSource" name="bin"/>
  <document>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" dataSource="bin" format="text">
      <!-- Map fields as appropriate. meta="true" marks a metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it as appropriate. -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
Attributes
- url : (required) The URL of the source document. How it is interpreted depends on the DataSource in use.
- tikaConfig : (optional) The Tika configuration file. If omitted, the default configuration is used. A relative path is resolved against the conf directory.
- format : (optional) The output format. Allowed values are text|xml|html|none; the default is 'text'. Whatever the format, the body is emitted as a field called 'text'; only the form of its content differs. Use 'none' to skip parsing the body, in which case only metadata is emitted.
- parser : (optional) The default is org.apache.tika.parser.AutoDetectParser. Provide the fully qualified name of a class that implements org.apache.tika.parser.Parser.
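As an illustration of the format and parser attributes, the following sketch extracts metadata only. It reuses the placeholder URL variable and the field names from the simple configuration above. The parser attribute is set to its documented default purely to show where a custom class name would go, and tikaConfig is omitted so that the default Tika configuration applies. The dataSource element is placed directly under dataConfig on the assumption that this is where DIH looks for it.

```xml
<dataConfig>
  <dataSource type="BinURLDataSource" name="bin"/>
  <document>
    <!-- format="none": the body is not parsed, so no 'text' content is produced -->
    <!-- parser: shown with the default value; replace with the FQN of your own Parser implementation -->
    <entity processor="TikaEntityProcessor" url="${some.var.goes.here}" dataSource="bin"
            format="none" parser="org.apache.tika.parser.AutoDetectParser">
      <!-- Only metadata fields are mapped; there is no body to map -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
    </entity>
  </document>
</dataConfig>
```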
Fields
Each field may carry the optional attribute meta="true", which means the field's value is read from the document's metadata. The column value is used as the metadata key. The available keys are listed in Tika's DublinCore and MSOffice metadata definitions.
DataSource
Use any DataSource of type DataSource<InputStream>. The built-in ones are:
- BinURLDataSource : use for both HTTP and files
- BinContentStreamDataSource : use for uploaded content
- BinFileDataSource : use for reading from the file system
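A common way to use BinFileDataSource is to let an outer entity list files and feed each path to TikaEntityProcessor. The sketch below assumes the DIH FileListEntityProcessor, which this page does not otherwise cover. The directory /path/to/docs, the file name pattern, and the entity names "files" and "tika" are hypothetical placeholders.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity (assumed FileListEntityProcessor): lists matching files, needs no data source -->
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="/path/to/docs" fileName=".*\.pdf" recursive="true" rootEntity="false">
      <!-- Inner entity: reads each listed file through the binary file data source -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="title" meta="true" name="docTitle"/>
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Setting rootEntity="false" on the outer entity makes each row of the inner entity, rather than each file listing row, produce a document.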
Advanced Parsing
TikaEntityProcessor can be nested with XPathEntityProcessor to index only selected parts of a document.
Example:
<dataConfig>
  <dataSource type="BinURLDataSource" name="bin"/>
  <dataSource type="FieldReaderDataSource" name="fld"/>
  <document>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" dataSource="bin" format="html" rootEntity="false">
      <!-- Map fields as appropriate. meta="true" marks a metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it as appropriate. -->
      <field column="text"/>
      <!-- Reads the HTML in the parent's 'text' field through the FieldReaderDataSource -->
      <entity processor="XPathEntityProcessor" forEach="/html" dataSource="fld" dataField="text">
        <field xpath="//div" column="foo"/>
        <field xpath="//h1" column="h1"/>
      </entity>
    </entity>
  </document>
</dataConfig>