Differences between revisions 4 and 5
Revision 4 as of 2009-09-20 23:09:32
Size: 357
Editor: localhost
Comment: converted to 1.6 markup
Revision 5 as of 2014-05-16 11:02:53
Size: 377
Editor: JulienNioche


dedup is an alias for org.apache.nutch.indexer.DeleteDuplicates

Deletes duplicate documents from a set of Lucene indexes. Two documents are considered duplicates if they have either the same content (compared by MD5 hash) or the same URL.
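The duplicate rule described above can be illustrated with a minimal Python sketch. This is not the DeleteDuplicates implementation: the function name find_duplicates, the (url, content) tuple layout, and the choice to keep the first-seen document are all assumptions made for illustration, and Nutch may pick the surviving document differently.

```python
import hashlib


def find_duplicates(docs):
    """Return the indices of documents that duplicate an earlier one.

    docs is a list of (url, content) tuples (a hypothetical layout).
    A document counts as a duplicate if its URL or the MD5 hash of its
    content was already seen on a document that was kept.
    """
    seen_urls = set()
    seen_hashes = set()
    duplicates = []
    for index, (url, content) in enumerate(docs):
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            # Assumption: the first-seen document is the one that is kept.
            duplicates.append(index)
        else:
            seen_urls.add(url)
            seen_hashes.add(digest)
    return duplicates


docs = [
    ("http://example.com/a", "hello world"),
    ("http://example.com/b", "hello world"),   # same content as index 0
    ("http://example.com/a", "changed text"),  # same URL as index 0
    ("http://example.com/c", "unique page"),
]
print(find_duplicates(docs))  # prints [1, 2]
```

Only documents that are kept contribute their URL and hash to the seen sets, so a document is always compared against survivors rather than against other duplicates.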

Usage: bin/nutch org.apache.nutch.indexer.DeleteDuplicates (-local | -ndfs <namenode:port>) [-workingdir <workingdir>] <segmentsDir>


bin/nutch_dedup (last edited 2014-05-16 11:02:53 by JulienNioche)