solrindex is an alias for org.apache.nutch.indexer.solr.SolrIndexer

This class replaces the legacy approach used before Nutch 1.3, in which Nutch indexed directly to Apache Lucene for subsequent search. We now pass a Solr URL (amongst other arguments) so that data crawled by Nutch is posted to an Apache Solr core and can be searched there.

Note: This class currently issues a single commit for all the reducers in one go. This is subject to change in subsequent versions of Nutch, as a commit can take a lot of resources (cache warming) and it is not always necessary to commit after solrindex, solrdedup or solrclean, especially if they are run immediately one after the other.
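
As a rough sketch of avoiding a redundant commit, the indexing step can be run with -noCommit (see the usage below) and followed by one explicit commit sent to Solr's update handler. The paths crawl/crawldb and crawl/segments and the Solr URL are placeholder assumptions, and the commit request assumes the standard Solr /update handler is enabled on that core.

```shell
# Placeholder values (assumptions); adjust to your own crawl layout.
SOLR_URL="http://localhost:8983/solr"
CRAWLDB="crawl/crawldb"
SEGMENTS_DIR="crawl/segments"

# Standard Solr update handler with an explicit commit request.
COMMIT_URL="$SOLR_URL/update?commit=true"

if [ -x bin/nutch ]; then
  # Index without committing, then commit once at the end.
  bin/nutch solrindex "$SOLR_URL" "$CRAWLDB" -dir "$SEGMENTS_DIR" -noCommit &&
    curl "$COMMIT_URL"
else
  # Dry run when bin/nutch is not available in the current directory.
  echo "would run: bin/nutch solrindex $SOLR_URL $CRAWLDB -dir $SEGMENTS_DIR -noCommit"
  echo "would run: curl $COMMIT_URL"
fi
```

Other Solr-related jobs could be run between the indexing step and the final commit, so that the cache is warmed only once.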

Nutch 1.x


bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]

<solr url>: The HTTP URL of the Solr instance you wish to index data into, e.g. http://localhost:8983/solr

<crawldb>: This argument should be the path to the crawldb directory.

-linkdb <linkdb>: The path to the linkdb directory. This argument is optional; if -linkdb <linkdb> is omitted, the solrindex command will still execute successfully.

<segment> ...: One or more paths to individual segment directories.

-dir <segments>: The path to a parent directory containing several segment directories; use this instead of listing each segment individually.

[-noCommit]: Do not send a commit after indexing the segment(s).

[-deleteGone]: Delete the gone pages and permanent redirects of the input segment(s).

[-filter]: Enable URL filtering.

[-normalize]: Enable URL normalizing.
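
To illustrate how the arguments above fit together, here is a minimal sketch of a Nutch 1.x invocation. The crawl/ paths and the Solr URL are hypothetical placeholders, not values taken from any particular installation; the command is only executed when bin/nutch exists in the current directory and is otherwise just printed.

```shell
# Placeholder values (assumptions); adjust to your own crawl layout.
SOLR_URL="http://localhost:8983/solr"
CRAWLDB="crawl/crawldb"
LINKDB="crawl/linkdb"
SEGMENTS_DIR="crawl/segments"

# Assemble the arguments in the order given by the usage line:
# <solr url> <crawldb> [-linkdb <linkdb>] -dir <segments> [-filter] [-normalize]
set -- bin/nutch solrindex "$SOLR_URL" "$CRAWLDB" \
  -linkdb "$LINKDB" -dir "$SEGMENTS_DIR" -filter -normalize

# Keep a printable copy of the full command line.
CMD_LINE="$*"
echo "$CMD_LINE"

# Run the command only if the Nutch launcher is actually present.
if [ -x bin/nutch ]; then
  "$@"
fi
```

To index only selected segments, the -dir <segments> pair would be replaced by one or more explicit segment paths.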

Nutch 2.x

Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
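
A comparable sketch for Nutch 2.x, following the usage line above. It assumes that bin/nutch solrindex maps to SolrIndexerJob in your 2.x release; the Solr URL and the crawl id "webcrawl" are hypothetical placeholders.

```shell
# Placeholder values (assumptions); adjust to your own setup.
SOLR_URL="http://localhost:8983/solr"
CRAWL_ID="webcrawl"

# <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
# Here -all is chosen rather than a specific batch id.
set -- bin/nutch solrindex "$SOLR_URL" -all -crawlId "$CRAWL_ID"

# Keep a printable copy of the full command line.
CMD_LINE_2X="$*"
echo "$CMD_LINE_2X"

# Run the command only if the Nutch launcher is actually present.
if [ -x bin/nutch ]; then
  "$@"
fi
```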


