Solrindex is an alias for org.apache.nutch.indexer.solr.SolrIndexer
This class replaces the legacy dependency in Nutch <1.3 on indexing to Apache Lucene for subsequent search. We now pass a Solr URL (among other arguments) so that data crawled by Nutch is posted to an Apache Solr core and can be searched there.
Note: This class currently issues a single commit covering all the reducers. This is subject to change in subsequent versions of Nutch, because a commit can consume a lot of resources (cache warming) and is not always necessary after solrindex, solrdedup or solrclean, especially when they are run immediately after one another.
bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
<solr url>: The HTTP URL of the Solr instance you wish to index data into, e.g. http://localhost:8983/solr
<crawldb>: This argument should be the path to the crawldb directory.
-linkdb <linkdb>: The path to the linkdb directory. This argument is optional; if it is omitted, the solrindex command still executes successfully.
<segment> ...: The path(s) to one or more individual segment directories.
-dir <segments>: The path to a parent directory containing segment directories; every segment found in it is indexed.
[-noCommit]: Do not send a commit after indexing the segment(s).
[-deleteGone]: Delete the gone pages and permanent redirects of the input segment(s).
[-filter]: Enable URL filtering.
[-normalize]: Enable URL normalizing.
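As an illustration of how the arguments above fit together, the following is a minimal sketch rather than a definitive recipe. The crawldb, linkdb and segments paths are hypothetical placeholders, and the Solr URL is the example value given above. The script only prints the assembled commands unless it is run from a directory that actually contains bin/nutch.

```shell
#!/bin/sh
# All values below are placeholders for illustration; adjust them to your setup.
SOLR_URL="http://localhost:8983/solr"   # example Solr URL from the text above
CRAWLDB="crawl/crawldb"                 # hypothetical crawldb path
LINKDB="crawl/linkdb"                   # hypothetical linkdb path
SEGMENTS_DIR="crawl/segments"           # hypothetical directory holding the segments

# Index the segments under SEGMENTS_DIR, delete gone pages, and commit at the end.
INDEX_ALL="bin/nutch solrindex $SOLR_URL $CRAWLDB -linkdb $LINKDB -dir $SEGMENTS_DIR -deleteGone"

# Same job without the final commit. The documents only become searchable
# once a commit is sent to Solr by some later step.
INDEX_NO_COMMIT="$INDEX_ALL -noCommit"

# Show what would be executed.
echo "$INDEX_ALL"
echo "$INDEX_NO_COMMIT"

# Execute only when run from a Nutch runtime directory.
if [ -x bin/nutch ]; then
  $INDEX_ALL
fi
```

The -noCommit variant is intended for the situation described in the note above, where several Solr jobs run back to back and a commit after each one would be wasteful.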
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]