Differences between revisions 1 and 2
Revision 1 as of 2007-09-04 20:18:42
Size: 3686
Editor: 171
Comment:
Revision 2 as of 2007-09-04 20:41:22
Size: 4144
Editor: 171
Comment: tweaking the example...

Updating a Solr Index with Rich Documents such as PDF and MS Office

Solr has an extensible DocumentHandler architecture that allows you to feed it XML and CSV documents. There is now a patch file available as part of [https://issues.apache.org/jira/browse/SOLR-284 SOLR-284] that adds support for parsing rich binary formats.

This page talks about how to get started using this patch. If you like it, please [https://issues.apache.org/jira/secure/ViewVoters!default.jspa?id=12372848 vote] for it on the JIRA issue tracker so we can get it added to the Solr codebase!


Requirements

Solr 1.2 is the first version with CSV support for updates. The CSV request handler needs to be configured in solrconfig.xml. This should already be present in the example solrconfig.xml:

  <!-- CSV update handler, loaded on demand -->
  <requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
  </requestHandler>

How to Install

1) You need a couple of patch files and zips of source and test code, all attached to the JIRA issue at https://issues.apache.org/jira/browse/SOLR-284.

2) Download the libs.zip, rich.patch, test-files.zip, source.zip, and test.zip files.

3) Unzip libs.zip into SOLR_HOME/lib. These are the JARs required for parsing the rich documents with PDFBox and POI.

4) Unzip test-files.zip into SOLR_HOME/src/test/test-files/. These are the test files used by the included unit tests and by the examples below.

5) Apply the rich.patch to your source. Rich.patch has tweaks that add the solr.RichDocumentRequestHandler to your solrconfig.xml files.

6) Copy the contents of source.zip into SOLR_HOME/src/java/org/apache/solr/handler

7) Copy the contents of test.zip into SOLR_HOME/src/test/org/apache/solr/handler

8) Run ant test to verify everything is working!
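Steps 3 to 8 can be sketched as a small shell script. This is only a dry-run sketch under several assumptions that the page does not state: the five attachments sit in the current directory, SOLR_HOME is the root of the Solr source tree, rich.patch applies with -p0, and the zips contain no top-level folder. The function name and the RUN and ATTACH_DIR variables are made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch of install steps 3-8 (assumptions are listed in the text above).
SOLR_HOME="${SOLR_HOME:-$PWD}"
ATTACH_DIR="${ATTACH_DIR:-$PWD}"
# RUN defaults to "echo", so the commands are printed, not executed.
# Set RUN="" in the environment to really run them.
RUN="${RUN-echo}"

install_rich_handler() {
    # 3) parser libraries (PDFBox, POI)
    $RUN unzip -o "$ATTACH_DIR/libs.zip" -d "$SOLR_HOME/lib" || return 1
    # 4) test documents; the examples on this page read simple.pdf from
    #    src/test/test-files, so that directory is assumed here
    $RUN unzip -o "$ATTACH_DIR/test-files.zip" -d "$SOLR_HOME/src/test/test-files" || return 1
    # 5) patch the source tree; the -p0 strip level is a guess
    $RUN patch -p0 -d "$SOLR_HOME" -i "$ATTACH_DIR/rich.patch" || return 1
    # 6) handler sources
    $RUN unzip -o "$ATTACH_DIR/source.zip" -d "$SOLR_HOME/src/java/org/apache/solr/handler" || return 1
    # 7) handler tests
    $RUN unzip -o "$ATTACH_DIR/test.zip" -d "$SOLR_HOME/src/test/org/apache/solr/handler" || return 1
    # 8) run the unit tests from the source root
    (cd "$SOLR_HOME" && $RUN ant test)
}

install_rich_handler
```

Review the printed commands first, and adjust the paths or the patch level if your checkout is laid out differently.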

Methods of uploading Binary records

Binary records may be uploaded to Solr by sending the data to the /solr/update/rich URL. All of the normal methods for [SolrContentStreams uploading content] are supported.

Example

These examples assume you have first run ant example and have the example server up and running using java -jar start.jar.

There is a sample PDF file at src/test/test-files/simple.pdf that may be used to add a PDF to the Solr example server.

Example of using HTTP-POST to send the PDF data over the network to the Solr server:

cd src/test/test-files/
curl 'http://localhost:8983/solr/update/rich?stream.type=pdf' --data-binary @simple.pdf -H 'Content-type:text/plain; charset=utf-8'

Having Solr read a file directly from its local disk can be more efficient than sending the file over the network via HTTP. Remote streaming must be enabled for this method to work. Find the following line in solrconfig.xml, change it to enableRemoteStreaming="true", and restart Solr.

  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
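If you prefer to script that change, the following is a minimal sketch. The function name is made up, and the path in the usage note assumes the stock example layout; point it at whichever solrconfig.xml your server actually loads.

```shell
#!/bin/sh
# Flip enableRemoteStreaming from "false" to "true" in a solrconfig.xml.
# Usage: enable_remote_streaming path/to/solrconfig.xml
enable_remote_streaming() {
    conf="$1"
    tmp="$conf.tmp.$$"
    # Keep a backup, then rewrite through a temp file (portable; avoids sed -i).
    cp "$conf" "$conf.bak" &&
    sed 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/' "$conf" > "$tmp" &&
    mv "$tmp" "$conf"
}
```

For example, enable_remote_streaming example/solr/conf/solrconfig.xml, followed by a restart of Solr. The original file is kept next to it with a .bak suffix.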

The following request will cause Solr to directly read the input file:

curl 'http://localhost:8983/solr/update/rich?stream.type=pdf&stream.file=src/test/test-files/simple.pdf&id=100&stream.fieldname=name'
#NOTE: The URL is quoted so the shell does not interpret the & characters. The full path, or a path relative to the CWD of the running Solr server, must be used.
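Because a relative stream.file path is resolved against the server's working directory rather than yours, it is usually safer to send an absolute path. The helper below is a sketch that builds such a URL; the function name and the SOLR_URL variable are illustrative, and it assumes the path contains no characters that need URL encoding.

```shell
#!/bin/sh
# Base URL of the example server; override SOLR_URL if yours differs.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# rich_file_url FILE TYPE ID FIELDNAME
# Prints an /update/rich URL whose stream.file is the absolute path of FILE.
rich_file_url() {
    file="$1"; type="$2"; id="$3"; field="$4"
    # Resolve the directory part to an absolute path; fails if it does not exist.
    dir=$(cd "$(dirname "$file")" && pwd) || return 1
    printf '%s/update/rich?stream.type=%s&stream.file=%s/%s&id=%s&stream.fieldname=%s\n' \
        "$SOLR_URL" "$type" "$dir" "$(basename "$file")" "$id" "$field"
}
```

Usage, run on the same machine as Solr: curl "$(rich_file_url src/test/test-files/simple.pdf pdf 100 name)"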

Parameters

Some parameters may be specified on a per-field basis via f.<fieldname>.param=value

fieldnames

Specifies a comma-separated list of field names to use when adding documents to the Solr index. If the CSV input already has a header, the names specified by this parameter will override them.

Example: fieldnames=id,name,category

overwrite

If true (the default), overwrite documents based on the uniqueKey field declared in the solr schema.

commit

Commit changes after all records in this request have been indexed. The default is commit=false to avoid the potential performance impact of frequent commits.
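As a sketch of how these parameters combine with the upload examples, the snippet below joins name=value pairs into a query string and adds commit=true, so that the document becomes searchable without a separate commit request. It assumes the rich handler honors commit as described above; join_params is a made-up helper name.

```shell
#!/bin/sh
# join_params NAME=VALUE...
# Joins its arguments with '&' into a query string (no URL encoding is done).
join_params() {
    query=""
    for pair in "$@"; do
        query="${query:+$query&}$pair"
    done
    printf '%s\n' "$query"
}

# Index the sample PDF and commit in the same request.
url="http://localhost:8983/solr/update/rich?$(join_params stream.type=pdf id=100 stream.fieldname=name commit=true)"
echo "$url"
# To send it from src/test/test-files/:
#   curl "$url" --data-binary @simple.pdf -H 'Content-type:text/plain; charset=utf-8'
```

When indexing many files in a row, leaving commit at its default and committing once at the end avoids the cost of frequent commits.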

Disadvantages

There is no way to provide document or field index-time boosts with the CSV format; however, many indices do not use that feature.

UpdateRichDocuments (last edited 2009-09-20 22:04:42 by localhost)