a.k.a. the "Solr Cell" project!

Introduction

Available as of Solr 1.4.

Users commonly need to ingest binary and/or structured documents such as Word and other Office files, PDF, and other proprietary formats. The Apache Tika project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.

Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr, which then extracts text from them and indexes it.

Concepts

Before getting started, there are a few concepts that are helpful to understand.

Getting Started with the Solr Example

Start the Solr example server:

cd example
java -jar start.jar

In a separate window go to the docs/ directory (which contains some nice example docs), or the site directory if you built Solr from source, and send Solr a file via HTTP POST:

cd docs
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"

Now, you should be able to execute a query and find that document (open the following link in your browser): http://localhost:8983/solr/select?q=tutorial

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true' -F "myfile=@tutorial.html"

And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
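The two requests above differ only in their query parameters. As a minimal sketch, assuming only the Python standard library, the URL can be assembled from literal.* and fmap.* parameters. The helper name build_extract_url is hypothetical and is not part of Solr; it only builds the URL string and does not contact the server.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, literals=None, fmap=None, **params):
    """Assemble an /update/extract URL (hypothetical helper, not a Solr API)."""
    query = []
    # literal.<fieldname>=<value> adds your own field values to the document
    for field, value in (literals or {}).items():
        query.append(("literal." + field, value))
    # fmap.<source>=<target> renames a generated field
    for source, target in (fmap or {}).items():
        query.append(("fmap." + source, target))
    # any remaining parameters (uprefix, commit, ...) are passed through as-is
    for name, value in params.items():
        query.append((name, value))
    return base_url + "?" + urlencode(query)


url = build_extract_url(
    "http://localhost:8983/solr/update/extract",
    literals={"id": "doc1"},
    fmap={"content": "attr_content"},
    uprefix="attr_",
    commit="true",
)
print(url)
```

The resulting string can be passed to curl in place of the hand-written URL.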

Input Parameters

If extractOnly is true, additional input parameters:

Order of field operations

  1. Fields are generated by Tika or passed in as literals via literal.fieldname=value.
  2. If lowernames==true, fields are mapped to lower case.
  3. Mapping rules fmap.source=target are applied.
  4. If uprefix is specified, any unknown field names are prefixed with that value; otherwise, if defaultField is specified, unknown fields are copied to that field.
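The order above can be modelled in a few lines of Python. This is only an illustrative sketch, not Solr's implementation: the function name resolve_field_names and the schema_fields set are hypothetical, "unknown" is approximated as "not in an exact set of names" (a real schema may also match dynamic fields), and lowernames is modelled as plain lower-casing, as described in step 2.

```python
def resolve_field_names(fields, schema_fields, lowernames=False,
                        fmap=None, uprefix=None, default_field=None):
    """Simulate the field-name steps for already generated fields (step 1)."""
    fmap = fmap or {}
    resolved = {}
    for name, value in fields.items():
        # Step 2: optional lower-casing of the field name
        if lowernames:
            name = name.lower()
        # Step 3: apply fmap.source=target mapping rules
        name = fmap.get(name, name)
        # Step 4: handle names the (hypothetical) schema does not know
        if name not in schema_fields:
            if uprefix is not None:
                name = uprefix + name
            elif default_field is not None:
                name = default_field
        # Several source fields may end up in the same target field
        resolved.setdefault(name, []).append(value)
    return resolved


# Step 1 result: fields from Tika plus a literal id (sample values)
fields = {"Content-Type": "text/html", "content": "Solr tutorial", "id": "doc1"}

with_prefix = resolve_field_names(
    fields, schema_fields={"id", "text", "attr_content"},
    lowernames=True, fmap={"content": "attr_content"}, uprefix="attr_")

with_default = resolve_field_names(
    fields, schema_fields={"id", "text"}, default_field="text")

print(with_prefix)
print(with_default)
```

Note that uprefix takes precedence: defaultField is consulted only when no uprefix is given.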

Configuration

If you are not working from the supplied example/solr directory, you must copy all libraries from example/solr/libs into a libs directory within your own solr directory. The ExtractingRequestHandler is not incorporated into the Solr war file; you have to install it separately.

Example config:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a Tika configuration file. See the Tika docs for details. -->
  <str name="tika.config">/my/path/to/tika.config</str>
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS for default date formats. -->
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
</requestHandler>

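In the defaults section, we are mapping Tika's Last-Modified Metadata attribute to a field named last_modified. We are also telling it to ignore undeclared fields by prefixing their names with ignored_ (this relies on the schema sending such fields to a field that discards them). Both are defaults, so they can be overridden by parameters on the request.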
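The relationship between the handler's defaults and the parameters sent with a request can be pictured as a dictionary merge in which the request wins. This is a simplified model written for illustration, not Solr code; the function name effective_params is hypothetical.

```python
# Values from the <lst name="defaults"> section of the example config
handler_defaults = {
    "fmap.Last-Modified": "last_modified",
    "uprefix": "ignored_",
}


def effective_params(defaults, request_params):
    """Return the parameters in effect: defaults, overridden by the request."""
    merged = dict(defaults)        # start from the configured defaults
    merged.update(request_params)  # request parameters take precedence
    return merged


# A request that overrides uprefix and adds a literal id
params = effective_params(handler_defaults,
                          {"uprefix": "attr_", "literal.id": "doc1"})
print(params)
```

Here uprefix becomes attr_ for this request only, while the Last-Modified mapping from the defaults still applies.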

The tika.config entry points to a file containing a Tika configuration. You would only need this if you have customized your own Tika configuration. The Tika config contains info about parsers, mime types, etc.

If you are submitting very large documents, you may also need to adjust the multipartUploadLimitInKB attribute as follows. The enableRemoteStreaming attribute is not used by the ExtractingRequestHandler.

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" />
    ....
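As a rough way to judge whether a file fits under the configured limit, the value can be converted to bytes. This sketch assumes that one KB is counted as 1024 bytes and ignores the small overhead that multipart encoding adds to the request; the function name fits_upload_limit is hypothetical.

```python
def fits_upload_limit(size_bytes, limit_kb=20480):
    """Rough check of a file size against multipartUploadLimitInKB.

    Assumes 1 KB = 1024 bytes and ignores multipart encoding overhead.
    """
    return size_bytes <= limit_kb * 1024


# 20480 KB corresponds to 20 MB under the assumption above
print(fits_upload_limit(5 * 1024 * 1024))   # a 5 MB file
print(fits_upload_limit(25 * 1024 * 1024))  # a 25 MB file
```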

Lastly, date.formats lets you specify java.text.SimpleDateFormat patterns used to convert extracted input into a Date. Solr comes configured with the following date formats (see the DateUtil class in Solr):

yyyy-MM-dd'T'HH:mm:ss'Z'
yyyy-MM-dd'T'HH:mm:ss
yyyy-MM-dd
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd HH:mm:ss
EEE MMM d hh:mm:ss z yyyy
EEE, dd MMM yyyy HH:mm:ss zzz
EEEE, dd-MMM-yy HH:mm:ss zzz
EEE MMM d HH:mm:ss yyyy
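To illustrate how a list of candidate formats can be used, the sketch below tries each pattern in turn and returns the first successful parse. It is an analogy in Python, not Solr's DateUtil: the strptime patterns are hand-translated equivalents of four of the SimpleDateFormat patterns above, the try-in-order behavior is an assumption, and the function name parse_extracted_date is hypothetical.

```python
from datetime import datetime

# Hand-translated strptime equivalents of some of the patterns listed above
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%SZ",  # yyyy-MM-dd'T'HH:mm:ss'Z'
    "%Y-%m-%dT%H:%M:%S",   # yyyy-MM-dd'T'HH:mm:ss
    "%Y-%m-%d",            # yyyy-MM-dd
    "%Y-%m-%d %H:%M:%S",   # yyyy-MM-dd HH:mm:ss
]


def parse_extracted_date(text, formats=CANDIDATE_FORMATS):
    """Return a naive datetime for the first format that matches, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None


print(parse_extracted_date("2009-10-27T23:37:05Z"))
print(parse_extracted_date("2009-10-27"))
print(parse_extracted_date("not a date"))
```

Adding an entry to date.formats corresponds, in this model, to adding one more pattern to the list.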

MultiCore config

Metadata

As has been implied up to now, Tika produces Metadata about the document. Metadata often contains things like the author of the file or the number of pages, etc. The Metadata produced depends on the type of document submitted. For instance, PDFs have different metadata from Word docs.

In addition to Tika's metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):

It is highly recommended that you try the extractOnly option to see what values actually get set for these.

Examples

Mapping and Capture

Capture <div> tags separately, and then map that field to a dynamic field named foo_t.

curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F "tutorial=@tutorial.pdf"

Mapping, Capture and Boost

Capture <div> tags separately, and then map that field to a dynamic field named foo_t. Boost foo_t by 3.

curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "tutorial=@tutorial.pdf"

Literals

To add your own metadata, pass a literal.fieldname=value parameter along with the file:

curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah"  -F "tutorial=@tutorial.pdf"

XPath

Restrict the XHTML returned by Tika by passing in an XPath expression:

curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "tutorial=@tutorial.pdf"

Extract Only

curl "http://localhost:8983/solr/update/extract?&extractOnly=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'

Because the output includes XML generated by Tika (which is then further escaped by Solr's XML response format), using a different output format such as ruby makes it easier to read:

curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'

See TikaExtractOnlyExampleOutput.

Sending documents to Solr

// TODO: describe the different ways to send the documents to solr (POST body, form encoded, remoteStreaming)

Additional Resources

* Lucid Imagination article
* Supported document formats via Tika

What's in a Name

Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction". This then led to an "acronym": Solr CEL, which then got mashed to: Solr Cell. Hence, the project name is "Solr Cell". It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell

ExtractingRequestHandler (last edited 2009-10-27 23:37:05 by PeterWolanin)