Differences between revisions 4 and 5
Revision 4 as of 2006-01-09 22:48:31
Size: 2614
Editor: 70-58-85-223
Comment: fixed classpath to org.apache
Revision 5 as of 2009-09-20 23:10:12
Size: 2614
Editor: localhost
Comment: converted to 1.6 markup

prune is an alias for org.apache.nutch.tools.PruneIndexTool

This tool prunes unwanted content from existing Nutch indexes. The main method accepts a list of segment directories (containing indexes). Any document that matches at least one query from a list of Lucene queries is removed from these indexes. The queries are read from a file, which is defined in the standard config file or explicitly overridden on the command line. Segments should already be indexed; segments that are missing indexes are skipped.
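The selection rule described above (a document is removed if any query matches it, and segments without an index are skipped) can be modelled with a small, self-contained sketch. This is not the PruneIndexTool code: the class name, the use of plain URL strings as documents, and the use of predicates in place of Lucene queries are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of the pruning rule; not the real PruneIndexTool.
final class PruneSemanticsSketch {

    // segments: segment name -> indexed document URLs, or null when the
    // segment has no index yet. Returns the number of removed documents.
    static int prune(Map<String, List<String>> segments,
                     List<Predicate<String>> queries) {
        int removed = 0;
        for (Map.Entry<String, List<String>> segment : segments.entrySet()) {
            List<String> index = segment.getValue();
            if (index == null) {
                continue; // segment is missing its index: skip it
            }
            int before = index.size();
            // A document is pruned as soon as any one query matches it.
            index.removeIf(doc -> queries.stream().anyMatch(q -> q.test(doc)));
            removed += before - index.size();
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> segments = new LinkedHashMap<>();
        segments.put("segment-1", new ArrayList<>(Arrays.asList(
                "http://example.com/spam/1", "http://example.com/docs/1")));
        segments.put("segment-2", null); // not indexed yet

        List<Predicate<String>> queries = new ArrayList<>();
        queries.add(url -> url.contains("/spam/"));

        System.out.println("removed: " + prune(segments, queries));
        System.out.println("remaining: " + segments.get("segment-1"));
    }
}
```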

NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so knowledge of the available Lucene document fields is required. This can be obtained by reading the sources of the index-basic and index-more plugins, or by using tools like Luke. A WhitespaceAnalyzer is used during query parsing; this choice was made to minimize the side effects of the Analyzer on the final set of query terms. You can use the org.apache.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax. If an additional level of control is required, an instance of PruneChecker can be provided to check each document before it is deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided: PrintFieldsChecker prints the values of selected index fields, and StoreUrlsChecker stores the URLs of deleted documents to a file. Either of them can be activated by providing the respective command-line option.
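To make the AND-ed checker chain concrete, here is a minimal sketch of the veto logic. The DocChecker interface and its allowsDeletion method are hypothetical stand-ins, not the actual PruneChecker signature, and documents are modelled as simple field maps rather than Lucene documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a checker; NOT the real PruneChecker interface.
interface DocChecker {
    boolean allowsDeletion(Map<String, String> doc);
}

final class CheckerChainSketch {

    // The document is deleted only if every checker agrees (logical AND).
    // The first checker that returns false vetoes the deletion.
    static boolean shouldDelete(Map<String, String> doc, List<DocChecker> checkers) {
        for (DocChecker checker : checkers) {
            if (!checker.allowsDeletion(doc)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/spam/1");
        doc.put("title", "Unwanted page");

        // Prints selected fields and never vetoes, in the spirit of PrintFieldsChecker.
        DocChecker printFields = d -> {
            System.out.println(d.get("url") + " | " + d.get("title"));
            return true;
        };
        // Vetoes deletion of anything under a (made-up) protected host.
        DocChecker protectHost = d -> !d.get("url").startsWith("http://keep.example.org/");

        System.out.println(shouldDelete(doc, Arrays.asList(printFields, protectHost)));
    }
}
```

With an empty chain the sketch deletes every matching document, which is the natural reading of "AND over no checkers"; whether the real tool behaves the same way should be checked against its source.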

Typical Usage: bin/nutch org.apache.nutch.tools.PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title
This command only prints the fields of matching documents; nothing is deleted.

Typical Usage: bin/nutch org.apache.nutch.tools.PruneIndexTool index_dir -queries queries.txt
This command actually removes all matching entries, according to the queries read from the queries.txt file.

NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular, it does NOT remove the pages and links from WebDB. This means that unwanted URLs may reappear when new segments are created. To prevent this, use your own org.apache.nutch.net.URLFilter, or PruneDBTool (under construction...).
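As a rough illustration of the filtering idea, the sketch below rejects URLs that match a list of blocked patterns. It is deliberately standalone and does not implement the Nutch URLFilter interface. It assumes the common filter contract of returning the URL to accept it and null to reject it, which should be verified against the URLFilter Javadoc for your Nutch version. The class name and the patterns are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone filter; it only mirrors the assumed contract
// "return the URL to keep it, return null to drop it".
final class BlockingUrlFilterSketch {

    private final List<Pattern> blocked = new ArrayList<>();

    BlockingUrlFilterSketch(List<String> blockedRegexes) {
        for (String regex : blockedRegexes) {
            blocked.add(Pattern.compile(regex));
        }
    }

    // Returns null when any blocked pattern occurs in the URL, else the URL itself.
    String filter(String url) {
        for (Pattern pattern : blocked) {
            if (pattern.matcher(url).find()) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        BlockingUrlFilterSketch filter = new BlockingUrlFilterSketch(
                Arrays.asList("^http://example\\.com/spam/", "\\.exe$"));
        System.out.println(filter.filter("http://example.com/spam/1"));
        System.out.println(filter.filter("http://example.com/docs/1"));
    }
}
```

Keeping the blocked patterns in step with the pruning queries is left to the operator; the two are expressed in different syntaxes (regular expressions here, Lucene queries for pruning).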

NOTE 3: This tool uses a low-level Lucene interface to collect all matching documents. For large indexes and broad queries this may result in high memory consumption. If you encounter OutOfMemoryError exceptions, try narrowing your queries or increasing the JVM heap size.


bin/nutch_prune (last edited 2009-09-20 23:10:12 by localhost)