net.nutch.db.WebDBInjector
bin/nutch inject <db> (-urlfile <url_file> | -dmozfile <dmoz_file>) \[-subset <subsetDenominator>\] \[-includeAdultMaterial\] \[-skew skew\] \[-noDmozDesc\]
-urlfile <url_file>: Injects URLs from a text file. Use a file with one URL per line.
-dmozfile <dmoz_file>: Injects the URLs from a dmoz content file. You can download the current content file from dmoz.org.
-subset <subsetDenominator>: Use this option to inject only one out of every <subsetDenominator> URLs. Injecting and fetching every URL in the Open Directory means fetching over 4 million URLs, so for testing you may want to start with fewer. For example, -subset 4000 injects one out of every 4000 URLs, which would be around 1000 URLs. A random subset is selected: repeated calls with the same value will inject different URLs.
-includeAdultMaterial: By default, URLs from the adult part of the Open Directory are not included. Specify this option to include them.
-skew skew: The seed for the randomization used by -subset. For debugging.
-noDmozDesc: If specified, the Open Directory description is not used as a link to the page.
The score of new pages added by the injector. 2.0 by default.
The number of days to wait after an injected page is fetched before it is fetched again. 30 by default.
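The one-in-N selection and the debugging seed described above can be illustrated with a small sketch. This is not Nutch's own code (Nutch is written in Java, and its actual selection method is not shown on this page). The function name `sample_urls` and its parameters are hypothetical, and Python is used only for brevity. It assumes each URL is kept independently with probability 1/N, so the result is approximately, not exactly, one in N.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_urls(urls: Iterable[str], subset_denominator: int,
                seed: Optional[int] = None) -> Iterator[str]:
    """Yield roughly one out of every `subset_denominator` URLs.

    Hypothetical illustration only, not the Nutch implementation.
    """
    if subset_denominator < 1:
        raise ValueError("subset_denominator must be at least 1")
    # With seed=None the generator is seeded differently on each run,
    # so repeated calls pick different URLs. A fixed seed makes the
    # selection reproducible, which is useful for debugging.
    rng = random.Random(seed)
    for url in urls:
        # Keep each URL independently with probability 1/subset_denominator.
        if rng.randrange(subset_denominator) == 0:
            yield url


if __name__ == "__main__":
    # Made-up stand-in for the URLs of a dmoz content file.
    all_urls = [f"http://example.com/page{i}" for i in range(40_000)]
    picked = list(sample_urls(all_urls, 40, seed=7))
    print(f"kept {len(picked)} of {len(all_urls)} URLs")
```

Passing the same `seed` twice returns the same subset, while omitting it gives a different subset on each call, which mirrors the behaviour the -subset and -skew descriptions state.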
– MatthiasJaekle - 13 Mar 2004