a.k.a. the "Solr Cell" project!

Introduction

Since Solr 1.4.

A common need of users is the ability to ingest binary and/or structured documents such as Microsoft Office documents (for example Word), PDF, and other proprietary formats. The Apache Tika project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.

Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr and have Solr extract their text and index it.

Concepts

Before getting started, there are a few concepts that are helpful to understand.

Getting Started with the Solr Example

Start the Solr example server:

cd example
java -jar start.jar

In a separate window, go to the docs/ directory (which contains some example documents), or to the site/ directory if you built Solr from source, and send Solr a file via HTTP POST:

cd site/html
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"

Now, you should be able to execute a query and find that document (open the following link in your browser): http://localhost:8983/solr/select?q=tutorial

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@tutorial.html"

And then query via http://localhost:8983/solr/select?q=attr_content:tutorial

Input Parameters

If extractOnly is true, additional input parameters:

Order of field operations

  1. Fields are generated by Tika or passed in as literals via literal.fieldname=value. Before Solr 4.0, or if literalsOverride=false, literals are appended as additional values to the Tika-generated field.
  2. If lowernames=true, field names are mapped to lower case.
  3. Mapping rules of the form fmap.source=target are applied.
  4. If uprefix is specified, any unknown field names are prefixed with that value; otherwise, if defaultField is specified, unknown fields are copied to that field.
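
The steps above can be modelled in a few lines of Java. This is only an illustrative sketch under stated assumptions, not Solr's implementation: the class FieldOrderSketch, the method resolveName and the knownFields set (standing in for the fields declared in the schema) are hypothetical names, and lowernames is modelled as plain lower-casing, as described in step 2.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative model of the documented ordering; NOT Solr's actual code.
final class FieldOrderSketch {

    /**
     * Resolves the final Solr field name for one field produced in step 1.
     * knownFields is a hypothetical stand-in for the fields declared in the schema.
     */
    static String resolveName(String name, boolean lowernames, Map<String, String> fmap,
                              Set<String> knownFields, String uprefix, String defaultField) {
        // Step 2: lowernames=true maps the field name to lower case.
        if (lowernames) {
            name = name.toLowerCase(Locale.ROOT);
        }
        // Step 3: a mapping rule fmap.source=target renames the field.
        name = fmap.getOrDefault(name, name);
        // Step 4: unknown fields get the uprefix; otherwise they go to defaultField.
        if (!knownFields.contains(name)) {
            if (uprefix != null) {
                name = uprefix + name;
            } else if (defaultField != null) {
                name = defaultField;
            }
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> fmap = Map.of("content", "text"); // fmap.content=text
        Set<String> known = Set.of("id", "text");
        System.out.println(resolveName("Content", true, fmap, known, "attr_", null)); // text
        System.out.println(resolveName("Author", true, fmap, known, "attr_", null));  // attr_author
        System.out.println(resolveName("Author", true, fmap, known, null, "text"));   // text
    }
}
```

Note that in this model the fmap lookup happens after lower-casing, so with lowernames=true the mapping source has to be given in lower case.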

Configuration

The ExtractingRequestHandler is not incorporated into the Solr WAR file. It is provided as a plugin (see SolrPlugins), and you have to load it (and its dependencies) explicitly.

Example configuration for loading plugin and dependencies:

  <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

Example configuration for the Handler:

  <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <str name="uprefix">ignored_</str>
    </lst>
    <!-- Optional. Specify a path to a Tika configuration file. See the Tika docs for details. -->
    <str name="tika.config">/my/path/to/tika.config</str>
    <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS for default date formats. -->
    <lst name="date.formats">
      <str>yyyy-MM-dd</str>
    </lst>
  </requestHandler>

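In the defaults section, we are mapping Tika's Last-Modified metadata attribute to a field named last_modified. We are also prefixing undeclared fields with ignored_, which causes them to be ignored when the schema maps ignored_* to an ignored field type. Because these values are set as defaults, they can all be overridden by request parameters.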

The tika.config entry points to a file containing a Tika configuration. You would only need this if you have customized your own Tika configuration. The Tika config contains info about parsers, mime types, etc.

You may also need to adjust the multipartUploadLimitInKB attribute as follows if you are submitting very large documents.

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="{true|false}" multipartUploadLimitInKB="2048000" />
    ....

To use remote streaming, you must first enable it: see ContentStream for more info, or just set enableRemoteStreaming=true in the snippet above. For example:

 curl "http://localhost:8983/solr/update/extract?stream.file=/path/to/file/StatesLeftToVisit.doc&stream.contentType=application/msword&literal.id=states.doc"

Lastly, date.formats lets you specify one or more java.text.SimpleDateFormat patterns used to convert extracted input into a Date. Solr comes configured with the following date formats (see the DateUtil class in Solr):

yyyy-MM-dd'T'HH:mm:ss'Z'
yyyy-MM-dd'T'HH:mm:ss
yyyy-MM-dd
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd HH:mm:ss
EEE MMM d hh:mm:ss z yyyy
EEE, dd MMM yyyy HH:mm:ss zzz
EEEE, dd-MMM-yy HH:mm:ss zzz
EEE MMM d HH:mm:ss yyyy
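
To illustrate how a list of patterns like this can be applied, here is a minimal sketch that tries each pattern in order and returns the first successful parse. It is not Solr's DateUtil code: the class DateFormatsSketch is hypothetical, it uses only the numeric patterns from the list, and the choices to parse as UTC, to use Locale.ENGLISH, to disable lenient parsing and to require that the whole input is consumed are assumptions made for the example.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Sketch of "try each configured format in turn"; NOT Solr's DateUtil.
final class DateFormatsSketch {

    // A subset of the patterns listed above, in the same order.
    static final List<String> FORMATS = List.of(
            "yyyy-MM-dd'T'HH:mm:ss'Z'",
            "yyyy-MM-dd'T'HH:mm:ss",
            "yyyy-MM-dd",
            "yyyy-MM-dd hh:mm:ss",
            "yyyy-MM-dd HH:mm:ss");

    /** Returns the first successful parse, or null if no pattern matches. */
    static Date parse(String value) {
        for (String pattern : FORMATS) {
            SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
            fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: treat input as UTC
            fmt.setLenient(false);                        // assumption: reject out-of-range values
            ParsePosition pos = new ParsePosition(0);
            Date parsed = fmt.parse(value, pos);
            // Accept the result only if the pattern consumed the whole input.
            if (parsed != null && pos.getIndex() == value.length()) {
                return parsed;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("2013-05-08T19:10:53Z").toInstant()); // 2013-05-08T19:10:53Z
        System.out.println(parse("2013-05-08").toInstant());           // 2013-05-08T00:00:00Z
        System.out.println(parse("not a date"));                       // null
    }
}
```

The full-consumption check matters because SimpleDateFormat is otherwise satisfied with a matching prefix, so a short pattern such as yyyy-MM-dd could silently drop a trailing time of day.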

MultiCore config

Metadata

As has been implied up to now, Tika produces Metadata about the document. Metadata often contains things like the author of the file or the number of pages, etc. The Metadata produced depends on the type of document submitted. For instance, PDFs have different metadata from Word docs.

In addition to Tika's metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):

It is highly recommended that you use the extractOnly option to see which values actually get set for these.

Encrypted files

Since Solr 4.0: by supplying a password, either in the resource.password request parameter or in a passwordsFile file, you can have the ExtractingRequestHandler decrypt encrypted files and index their content. A passwordsFile must use the following format: one rule per line, where each rule is a file name regular expression followed by "=" followed by the password in clear text (so this file should have strict access restrictions).

# This is a comment
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
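
To make the rule format concrete, the following sketch parses such lines and looks up the password for a file name. It is a hypothetical reading of the format, not the code Solr uses: the class PasswordRulesSketch is invented for illustration, and trimming whitespace around "=", splitting at the first "=", requiring the regular expression to match the whole file name, and letting the first matching rule win are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical parser for the rule format shown above; NOT Solr's implementation.
final class PasswordRulesSketch {

    /** Parses "regex = password" lines, skipping blank lines and # comments. */
    static Map<Pattern, String> parse(List<String> lines) {
        Map<Pattern, String> rules = new LinkedHashMap<>(); // keeps the file order
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int eq = line.indexOf('='); // assumption: split at the first '='
            if (eq < 0) {
                continue; // not a rule
            }
            String regex = line.substring(0, eq).trim();
            String password = line.substring(eq + 1).trim();
            rules.put(Pattern.compile(regex), password);
        }
        return rules;
    }

    /** Returns the password of the first rule whose regex matches the file name, or null. */
    static String lookup(Map<Pattern, String> rules, String fileName) {
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(fileName).matches()) {
                return rule.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Pattern, String> rules = parse(List.of(
                "# This is a comment",
                "myFileName = myPassword",
                ".*\\.docx$ = myWordPassword",
                ".*\\.pdf$ = myPdfPassword"));
        System.out.println(lookup(rules, "report.docx")); // myWordPassword
        System.out.println(lookup(rules, "myFileName"));  // myPassword
        System.out.println(lookup(rules, "notes.txt"));   // null
    }
}
```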

Examples

Mapping and Capture

Capture <div> tags separately, and then map that field to a dynamic field named foo_txt.

 curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_txt&capture=div"  -F "tutorial=@tutorial.pdf"

Mapping, Capture and Boost

Capture <div> tags separately, and then map that field to a dynamic field named foo_txt. Boost foo_txt by 3.

curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" -F "tutorial=@tutorial.pdf"

Literals

To add in your own metadata, pass in the literal parameter along with the file:

curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah"  -F "tutorial=@tutorial.pdf"

XPath

Restrict the XHTML returned by Tika by passing in an XPath expression:

curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()"  -F "tutorial=@tutorial.pdf"
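
Parameter values such as the XPath expression contain characters that have a special meaning in URLs. When the request URL is assembled in code rather than typed into curl, it may be safer to percent-encode each name and value. The sketch below shows one way to do that with the JDK only; the class ExtractUrlSketch and its build method are hypothetical helpers, and a plain Map is used for brevity, so repeated parameters (for example multi-valued literals) are not covered.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that assembles an /update/extract URL with encoded parameters.
final class ExtractUrlSketch {

    /** Appends the parameters to handlerUrl as an encoded query string. */
    static String build(String handlerUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            query.add(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        return handlerUrl + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("literal.id", "doc5");
        params.put("xpath", "/xhtml:html/xhtml:body/xhtml:div/descendant:node()");
        // The slashes, colons and parentheses in the xpath value are percent-encoded.
        System.out.println(build("http://localhost:8983/solr/update/extract", params));
    }
}
```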

Extract Only

curl "http://localhost:8983/solr/update/extract?extractOnly=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'

The output includes XML generated by Tika, which is then escaped again by Solr's XML response format. Using a different output format such as json or ruby improves readability:

curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=ruby&indent=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'

See TikaExtractOnlyExampleOutput.

Password protected

curl "http://localhost:8983/solr/collection1/update/extract?commit=true&literal.id=123&resource.password=mypassword" \
     -H "Content-Type: application/pdf" --data-binary @my-encrypted-file.pdf

Sending documents to Solr

The ExtractingRequestHandler can process any document sent as a ContentStream ...

Example...

curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"  --data-binary @tutorial.html  -H 'Content-type:text/html'

NOTE: this streams the file as the raw body of the POST, so Solr receives no information about the name of the file.
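
The same kind of raw-body POST can be built from Java without SolrJ, using the JDK's java.net.http API (Java 11 or later). The sketch below only constructs the request; the class RawPostSketch and the method buildRequest are hypothetical names, and actually sending the request assumes a Solr instance listening at the URL used in the examples above. If the file name matters, the resource.name request parameter can be added to the URL to pass it along.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only equivalent of the curl --data-binary example above.
final class RawPostSketch {

    /** Builds a POST whose body is the raw document bytes. */
    static HttpRequest buildRequest(String url, String contentType, byte[] body) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of tutorial.html.
        byte[] html = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildRequest(
                "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text",
                "text/html", html);
        System.out.println(request.method() + " " + request.uri());
        // To send it (requires a running Solr):
        // java.net.http.HttpClient.newHttpClient()
        //         .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```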

SimplePostTool (post.jar)

The simple post tool post.jar, which ships with Solr in the example/exampledocs folder, can post a file to the ExtractingRequestHandler:

java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=doc5 -Dtype=text/html -jar post.jar tutorial.html

Since Solr 4.0, post.jar also has an auto mode that guesses the content type for you and sets a default ID and filename when sending to Solr. In addition, a recursive option lets you post a whole directory tree automatically:

java -Dauto -jar post.jar tutorial.html
java -Dauto -Drecursive -jar post.jar .

NOTE: The post.jar utility is not meant for production use, but as a convenience tool for experimenting with Solr. It is written as a single .java file (see SVN) without dependencies, so it deliberately does not use SolrJ.

SolrJ

Use the ContentStreamUpdateRequest (see ContentStreamUpdateRequestExample for a full example):

   // "server" is an existing SolrServer instance.
   ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
   up.addFile(new File("mailing_lists.pdf"));
   up.setParam("literal.id", "mailing_lists.pdf");
   // Commit as part of the request, waiting for the flush and for a new searcher.
   up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
   NamedList<Object> result = server.request(up);
   assertNotNull("Couldn't upload mailing_lists.pdf", result);
   QueryResponse rsp = server.query(new SolrQuery("*:*"));
   Assert.assertEquals(1, rsp.getResults().getNumFound());

If you want to set a multiValued field, use the ModifiableSolrParams class like this:

   // "values" holds the values for the field; "up" is the ContentStreamUpdateRequest.
   ModifiableSolrParams p = new ModifiableSolrParams();
   for (String value : values) {
       p.add(ExtractingParams.LITERALS_PREFIX + "field", value);
   }
   up.setParams(p);

You could also set all of the other literals and parameters in this class, and then use the setParams method to apply the changes to your content stream.

Extending the ExtractingRequestHandler

If you wish to supply your own ContentHandler for Solr to use, you can extend the ExtractingRequestHandler and override the createFactory() method. This factory is responsible for constructing the SolrContentHandler that interacts with Tika.

Committer Notes

Upgrading Tika

Additional Resources

What's in a Name

Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction". This then led to an "acronym": Solr CEL, which then got mashed to: Solr Cell. Hence, the project name is "Solr Cell". It's also appropriate because a solar cell's job is to convert the raw energy of the Sun to electricity, and this contrib module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell

ExtractingRequestHandler (last edited 2013-05-08 19:10:53 by YonikSeeley)