UpdateCSV

Updating a Solr Index with CSV

Solr accepts index updates in CSV (Comma Separated Values) format. Different separators and escape mechanisms are configurable, and multi-valued fields are supported.

<!> Solr1.2

  1. Updating a Solr Index with CSV
    1. Requirements
    2. Methods of uploading CSV records
      1. Example
    3. Parameters
      1. separator
      2. header
      3. fieldnames
      4. skip
      5. skipLines
      6. trim
      7. encapsulator
      8. escape
      9. keepEmpty
      10. map
      11. split
      12. overwrite
      13. commit
    4. Disadvantages
    5. Tab-delimited importing

Requirements

<!> Solr1.2 is the first version with CSV support for updates.

The CSV request handler needs to be configured in solrconfig.xml. It should already be present in the example solrconfig.xml:

  <!-- CSV update handler, loaded on demand -->
  <requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
  </requestHandler>

Methods of uploading CSV records

CSV records may be uploaded to Solr by sending the data to the /solr/update/csv URL. All of the normal methods for uploading content are supported.

Example

There is a sample CSV file at example/exampledocs/books.csv that may be used to add documents to the Solr example server.

Example of using HTTP-POST to send the CSV data over the network to the Solr server:

cd example/exampledocs
curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
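
The same POST can be issued from a script. The following is a minimal Python sketch, assuming the example server URL used in the curl command; the helper name build_csv_update_request is hypothetical and not part of Solr. It only builds the request, since sending it requires a running Solr instance.

```python
import urllib.request

# Assumed URL of the example server's CSV handler (from the curl example).
SOLR_CSV_URL = "http://localhost:8983/solr/update/csv"


def build_csv_update_request(csv_bytes, url=SOLR_CSV_URL):
    """Build (but do not send) an HTTP POST carrying raw CSV data."""
    return urllib.request.Request(
        url,
        data=csv_bytes,
        headers={"Content-type": "text/plain; charset=utf-8"},
        method="POST",
    )


# Usage, with a Solr server running and books.csv in the current directory:
#   with open("books.csv", "rb") as f:
#       request = build_csv_update_request(f.read())
#   urllib.request.urlopen(request)
```

Reading the file in binary mode mirrors curl's --data-binary option, which sends the bytes unchanged.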

If the CSV file is already on the Solr server's disk, having Solr read it directly can be more efficient than sending it over the network via HTTP. Remote streaming must be enabled for this method to work. Find the following line in solrconfig.xml, change it to enableRemoteStreaming="true", and restart Solr.

  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />

The following request will cause Solr to directly read the input file:

curl 'http://localhost:8983/solr/update/csv?stream.file=exampledocs/books.csv'
#NOTE: The full path, or a path relative to the CWD of the running Solr server, must be used.
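
When the request URL is assembled in code, the file path should be URL-encoded. A minimal Python sketch, assuming the example server URL; stream_file_url is a hypothetical helper name:

```python
from urllib.parse import urlencode

# Assumed URL of the example server's CSV handler.
SOLR_CSV_URL = "http://localhost:8983/solr/update/csv"


def stream_file_url(path, base_url=SOLR_CSV_URL):
    """Return an update URL asking Solr to read `path` from its own disk."""
    return base_url + "?" + urlencode({"stream.file": path})


print(stream_file_url("exampledocs/books.csv"))
# prints http://localhost:8983/solr/update/csv?stream.file=exampledocs%2Fbooks.csv
```

The slash is percent-encoded as %2F by urlencode; it decodes back to the same path on the server side.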

Parameters

Some parameters may be specified on a per-field basis via f.<fieldname>.param=value.
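
As an illustration of the naming pattern, the Python sketch below mixes a global parameter with a per-field one. The helper field_param and the field name "tags" are assumptions made for the example.

```python
from urllib.parse import urlencode


def field_param(fieldname, param):
    """Return the per-field form of a parameter name, e.g. f.tags.split."""
    return "f.{}.{}".format(fieldname, param)


params = {
    "trim": "true",                             # applies to every field
    field_param("tags", "keepEmpty"): "true",   # applies only to "tags"
}
print(urlencode(params))  # prints trim=true&f.tags.keepEmpty=true
```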

separator

Specifies the character to act as the field separator. Default is separator=,

header

true if the first line of the CSV input contains field or column names. The default is header=true. If the fieldnames parameter is absent, these field names will be used when adding documents to the index.

fieldnames

Specifies a comma-separated list of field names to use when adding documents to the Solr index. If the CSV input already has a header, the names specified by this parameter override it.

Example: fieldnames=id,name,category

skip

A comma-separated list of field names to skip in the input. An alternate way to skip a field is to specify its name as a zero-length string in fieldnames.

Example:

fieldnames=id,name,category&skip=name

skips the name field, and is equivalent to

fieldnames=id,,category
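
The equivalence can be made concrete with a small client-side Python sketch that rewrites a fieldnames value so that skipped fields become empty names. The function blank_skipped is hypothetical and only illustrates the relationship between the two forms; it is not how Solr implements skip.

```python
def blank_skipped(fieldnames, skip):
    """Rewrite a fieldnames value so skipped fields become empty names."""
    skipped = set(skip.split(","))
    names = fieldnames.split(",")
    return ",".join("" if name in skipped else name for name in names)


print(blank_skipped("id,name,category", "name"))  # prints id,,category
```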

skipLines

Specifies the number of lines in the input stream to discard before the CSV data starts (including the header, if present). Default is skipLines=0.

trim

If true, removes leading and trailing whitespace from values. CSV parsing already ignores leading whitespace by default, but there may be trailing whitespace, or leading whitespace that is encapsulated by quotes and thus not removed. This may be specified globally or on a per-field basis. The default is trim=false.

encapsulator

The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. Standard CSV format handles an encapsulator appearing inside an encapsulated value by doubling it.

CSV Example of quotes inside an encapsulated value:

100,"this is a ""quoted"" string inside an encapsulated value"

The default is encapsulator="
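
Most CSV libraries apply this doubling rule automatically when writing. As a sketch, Python's standard csv module (used here only to generate and re-read input, not as a stand-in for Solr's parser) produces the line shown above:

```python
import csv
import io

row = ["100", 'this is a "quoted" string inside an encapsulated value']

# Write one record; quotes inside the value are doubled by the writer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
line = buffer.getvalue()
print(line, end="")

# Reading the line back recovers the original values.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)
```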

escape

<!> Solr1.3 The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless it is also explicitly specified, since most formats use either encapsulation or escaping, not both.

keepEmpty

Keep and index empty (zero length) field values. This may be specified globally, or on a per-field basis. The default is keepEmpty=false.

map

Specifies a mapping between one value and another, written as from:to. The string to the left of the colon is replaced with the string to the right. This parameter can be specified globally or on a per-field basis.

Example: replaces "Absolutely" with "true" in every field

map=Absolutely:true

Example: removes any values of "RemoveMe" in the field "foo"

f.foo.map=RemoveMe:&f.foo.keepEmpty=false
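
The interaction between map and keepEmpty in the last example can be sketched in Python. This is an illustration of the described behavior, not Solr's code: it assumes map matches whole values (as the wording "any values of" suggests), and the function names are hypothetical.

```python
def parse_map(spec):
    """Split a map value such as 'Absolutely:true' into (from, to)."""
    source, _, target = spec.partition(":")
    return source, target


def apply_map(values, spec, keep_empty=False):
    """Replace values equal to the left side, then drop empties unless kept."""
    source, target = parse_map(spec)
    mapped = [target if value == source else value for value in values]
    return [value for value in mapped if keep_empty or value != ""]


print(apply_map(["Absolutely", "No"], "Absolutely:true"))  # ['true', 'No']
print(apply_map(["a", "RemoveMe", "b"], "RemoveMe:"))      # ['a', 'b']
```

Mapping to an empty string only removes the value because keepEmpty=false then discards it; with keepEmpty=true the empty value would be kept.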

split

If true, the field value is split into multiple values by another CSV parser. The CSV parsing rules such as separator and encapsulator may be specified as field parameters.

Example: for the following input

id,tags
101,"movie,spiderman,action"

to index the 3 separate tags into a multi-valued Solr field called "tags", use

f.tags.split=true

Example: for the following input with a space separator and single quote encapsulator for the tags field

id,tags
101,movie 'spider man' action

to index the 3 separate tags into a multi-valued Solr field called "tags", use

f.tags.split=true&f.tags.separator=%20&f.tags.encapsulator='

The target Solr field of any split should be multiValued.
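
To preview how a field value would be split before sending it, the two examples can be approximated with Python's csv module. This is a sketch under the assumption that Python's parser behaves like Solr's for these simple inputs; edge cases may differ, and split_value is a hypothetical helper.

```python
import csv


def split_value(value, separator=",", encapsulator='"'):
    """Split one field value into sub-values using a secondary CSV parse."""
    reader = csv.reader([value], delimiter=separator, quotechar=encapsulator)
    return next(reader)


# Default separator and encapsulator (first example).
print(split_value("movie,spiderman,action"))
# Space separator and single-quote encapsulator (second example).
print(split_value("movie 'spider man' action", separator=" ", encapsulator="'"))
```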

overwrite

If true (the default), overwrite documents based on the uniqueKey field declared in the Solr schema.

commit

Commit changes after all records in this request have been indexed. The default is commit=false to avoid the potential performance impact of frequent commits.

Disadvantages

There is no way to provide document or field index-time boosts with the CSV format. However, many indices do not use that feature.

Tab-delimited importing

Don't let the "CSV" name fool you: this loader can also load tab-delimited files, and can even handle backslash escaping rather than CSV encapsulation.

For example, one can dump a MySQL table to a tab-delimited file with

SELECT * INTO OUTFILE '/tmp/result.text' FROM mytable;

This file could then be imported into Solr by setting the separator to tab (%09) and the escape to backslash (%5c):

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=%5c&stream.file=/tmp/result.text'

<!> Solr1.3 is required to specify an escape.
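
Hand-encoding the tab and backslash is easy to get wrong. As a sketch, the request URL can instead be generated with Python's urlencode, assuming the example server URL and the file path used above:

```python
from urllib.parse import urlencode

# Assumed URL of the example server's CSV handler.
SOLR_CSV_URL = "http://localhost:8983/solr/update/csv"

params = {
    "commit": "true",
    "separator": "\t",                  # tab, encoded as %09
    "escape": "\\",                     # backslash, encoded as %5C
    "stream.file": "/tmp/result.text",
}
url = SOLR_CSV_URL + "?" + urlencode(params)
print(url)
```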

last edited 2008-01-08 15:28:56 by YonikSeeley