solrclean is an alias for org.apache.nutch.indexer.solr.SolrClean.

The class scans a crawldb directory for entries with status DB_GONE (pages that returned HTTP 404) and sends delete requests to Solr for the corresponding documents. Solr then removes those documents, so the index no longer returns pages that have disappeared.
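
The scan-and-delete idea can be sketched in a few lines of Python. This is only an illustration and not the Nutch implementation: the crawldb is modelled as a plain list of (url, status) pairs, the helper names gone_urls and build_delete_message are hypothetical, and it assumes that the Solr document id is the page URL. The delete message uses Solr's XML update syntax.

```python
from xml.sax.saxutils import escape

# Status label used for this sketch; a real crawldb stores status codes
# in its own binary format, not as a Python list.
DB_GONE = "db_gone"


def gone_urls(crawldb_entries):
    """Return the URLs whose status is DB_GONE, in input order."""
    return [url for url, status in crawldb_entries if status == DB_GONE]


def build_delete_message(urls):
    """Build a Solr XML update message that deletes documents by id.

    Assumption: each document's id in Solr is its URL.
    """
    ids = "".join(f"<id>{escape(url)}</id>" for url in urls)
    return f"<delete>{ids}</delete>"


# Hypothetical crawldb contents: one live page and one page that is gone.
entries = [
    ("http://example.com/", "db_fetched"),
    ("http://example.com/old?a=1&b=2", "db_gone"),
]

message = build_delete_message(gone_urls(entries))
print(message)
```

Only the second entry ends up in the delete message, and the URL is XML-escaped before it is embedded. Such a message would be posted to the update handler of the Solr instance, followed by a commit.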

Usage:

bin/nutch solrclean <crawldb> <solrurl>

<crawldb>: The path to a crawldb directory. It is scanned for 404 URLs so that the Solr index can be updated accordingly.

<solrurl>: The URL of the Solr instance from which the 404 pages should be removed, e.g. http://localhost:8983/solr/

bin/nutch solrclean (last edited 2011-07-03 03:53:01 by LewisJohnMcgibbney)