bin/nutch inject

called java class

net.nutch.db.WebDBInjector

command line options

bin/nutch inject <db> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc]

-urlfile <url_file>

Injects urls from a text file. Use a file with one url per line.
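As a minimal sketch of the -urlfile form, the following creates a seed file with one url per line and passes it to the injector. The database directory name `db`, the seed file location, and the example.com urls are placeholders chosen for illustration, not values from this page. The call is guarded so the snippet only invokes the injector when run from a Nutch installation directory.

```shell
# Write a seed file with one url per line (placeholder urls).
URL_FILE="${TMPDIR:-/tmp}/nutch-seed-urls.txt"
cat > "$URL_FILE" <<'EOF'
http://www.example.com/
http://www.example.org/
http://www.example.net/
EOF

# Inject the seed urls into a database directory named "db" (assumed name).
# Only call the launcher if it exists in the current directory.
if [ -x bin/nutch ]; then
  bin/nutch inject db -urlfile "$URL_FILE"
else
  echo "bin/nutch not found; would run: bin/nutch inject db -urlfile $URL_FILE"
fi
```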

-dmozfile <dmoz_file>

Injects the urls from a dmoz content file. You can download the current content file from dmoz.org.

-subset <subsetDenominator>

Use this option to inject only one out of every <subsetDenominator> urls. Injecting and fetching every url in the Open Directory means fetching over 4 million urls, so for testing you may want to start with fewer. For example, -subset 4000 injects one out of every 4000 urls, which would be around 1000 urls. The subset is selected at random: repeated calls with the same value will inject different urls.
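The arithmetic behind that estimate can be sketched as follows. The total of 4 million is the rough figure quoted above. The database name `db` and the content file name `content.rdf.u8` are assumptions for illustration. The command is only assembled and printed here, not run.

```shell
# Rough size of the Open Directory, as quoted in the text above.
TOTAL_URLS=4000000
# Keep one out of every SUBSET urls.
SUBSET=4000

# Expected number of injected urls (integer division).
EXPECTED=$((TOTAL_URLS / SUBSET))
echo "expect roughly $EXPECTED urls to be injected"

# Assemble the matching command line ("db" and the file name are assumed).
CMD="bin/nutch inject db -dmozfile content.rdf.u8 -subset $SUBSET"
echo "$CMD"
```

Because the selection is random, the actual count will vary around this estimate from run to run.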

-includeAdultMaterial

By default, urls from the adult section of the Open Directory are not included. Specify this option to include them.

-skew skew

The seed for the randomization used by subsetDenominator. For debugging.

-noDmozDesc

If specified, the Open Directory description is not used as a link to the page.

config file options

db.score.injected

The score of new pages added by the injector. 2.0 by default.

db.default.fetch.interval

The number of days to wait after an injected page has been fetched before it is fetched again. 30 by default.
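A minimal sketch of how these two properties might be set explicitly, written here to a scratch file rather than into a real installation. The `<nutch-conf>` root element, the `<property>` layout, and the output path are assumptions; compare them with the default configuration file shipped with your Nutch version before copying anything. The values shown are just the defaults listed above.

```shell
# Write an example site configuration to a scratch location (assumed path).
SITE_CONF="${TMPDIR:-/tmp}/nutch-site-example.xml"
cat > "$SITE_CONF" <<'EOF'
<?xml version="1.0"?>
<!-- Root element and property layout are assumed, not taken from this page. -->
<nutch-conf>
  <property>
    <name>db.score.injected</name>
    <value>2.0</value>
  </property>
  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
  </property>
</nutch-conf>
EOF

# Show which properties the file overrides.
grep '<name>' "$SITE_CONF"
```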

MatthiasJaekle - 13 Mar 2004
