The solrdedup command is an alias for org.apache.nutch.indexer.solr.SolrDeleteDuplicates.

THIS HAS BEEN DEPRECATED IN NUTCH 1.x; see the dedup command instead.

As the name suggests, this is a utility class for deleting duplicate documents from a Solr index.

The algorithm works as follows:

Preparation: Query the Solr server for the number of documents (say, N), then partition N among M map tasks. For example, with two map tasks, the first deals with Solr documents 0 to (N / 2 - 1) and the second deals with documents (N / 2) to (N - 1). This can be thought of as a linearly executing divide-and-conquer algorithm.
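
The partitioning in the preparation step can be illustrated with a minimal sketch. This is not the Nutch implementation: the function name partition_ranges is hypothetical, and giving any remainder to the last task is an assumption made here so that every document index is covered when N is not divisible by M.

```python
def partition_ranges(n_docs, n_tasks):
    """Split document indices 0..n_docs-1 into n_tasks contiguous
    (start, end) ranges, with both bounds inclusive."""
    size = n_docs // n_tasks  # integer division, as in N / 2 above
    ranges = []
    for i in range(n_tasks):
        start = i * size
        # Assumption: the last task absorbs any remainder.
        end = n_docs - 1 if i == n_tasks - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


# N = 10 documents split among M = 2 map tasks
print(partition_ranges(10, 2))  # prints [(0, 4), (5, 9)]
```

With N = 10 and M = 2, the first range ends at N / 2 - 1 = 4 and the second covers 5 to N - 1 = 9, matching the example above.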


Note that, unlike DeleteDuplicates, this class assumes that two documents in a Solr index never have the same URL. It therefore only deals with documents that have different URLs but the same digest.
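
To make "different URLs but the same digest" concrete, here is a small sketch that groups documents by digest and reports the groups containing more than one URL. The field names "url" and "digest", the sample documents, and the function find_duplicate_groups are all hypothetical; the sketch only identifies duplicate candidates and does not model which document the class keeps.

```python
from collections import defaultdict


def find_duplicate_groups(docs):
    """Return {digest: [urls]} for every digest shared by more than one document."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["digest"]].append(doc["url"])
    # Only digests seen on two or more documents are duplicates.
    return {digest: urls for digest, urls in groups.items() if len(urls) > 1}


# Hypothetical sample documents
docs = [
    {"url": "http://example.com/a", "digest": "abc"},
    {"url": "http://example.com/b", "digest": "abc"},
    {"url": "http://example.com/c", "digest": "def"},
]

print(find_duplicate_groups(docs))
```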


bin/nutch solrdedup <solr url>

<solr url>: All of the hard work is encapsulated within the class, so the only parameter we pass is the Solr URL, e.g. http://localhost:8983/solr/


