"inject" is an alias for "org.apache.nutch.crawl.Injector"
Injects new URLs into the Crawl Database
Usage
nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Injector <crawldb> <urldir>
<crawldb>: Path to the Crawl Database directory.
<urldir>: Path to the directory containing flat text files of URLs.
Configuration Files
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Configuration Values
The following properties directly affect how the Injector injects URLs.
db.default.fetch.interval -- Sets the time in days between fetches. Default: 30.0f.
db.score.injected -- Sets the score assigned to each injected URL. Default: 1.0f.
urlnormalizer.class -- Name of the class that normalizes injected URLs. Default: org.apache.nutch.net.BasicUrlNormalizer.
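As an illustration of how these properties might be overridden, the sketch below writes a nutch-site.xml fragment to a scratch directory. It assumes the Hadoop-style configuration layout (a <configuration> root holding <property> entries); check the nutch-default.xml shipped with your build for the exact format. The values 7 and 2.0 and the conf_dir variable are hypothetical, chosen only for the example.

```shell
#!/bin/sh
# Write the sketch to a scratch directory (hypothetical location) so that
# a live conf/ directory is never overwritten by accident.
conf_dir=$(mktemp -d)

# Assumed layout: <configuration> root with <property> entries.
# The values below are illustrative, not recommendations.
cat > "$conf_dir/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
  </property>
  <property>
    <name>db.score.injected</name>
    <value>2.0</value>
  </property>
</configuration>
EOF

# Count the properties that were written, as a quick sanity check.
override_count=$(grep -c '<name>' "$conf_dir/nutch-site.xml")
echo "wrote $override_count overrides to $conf_dir/nutch-site.xml"
```

After reviewing the generated file, its property entries can be merged into the nutch-site.xml that the installation actually reads.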
Other Files
- None.
Caveats and Notes
<urldir> may contain one or more flat text files. Each file should list one URL per line; these URLs are injected into the Crawl Database.
Example:
nutch-0.8-dev/bin/nutch inject /path/to/crawldb /path/to/url/dir

Files:
/path/to/url/dir/nutch.txt
/path/to/url/dir/hadoop.txt
/path/to/url/dir/wikis.txt

nutch.txt contents:
http://lucene.apache.org/nutch/
http://lucene.apache.org/nutch/tutorial.html

hadoop.txt contents:
http://lucene.apache.org/hadoop/
http://lucene.apache.org/hadoop/docs/api/

wikis.txt contents:
http://wiki.apache.org/hadoop/
http://wiki.apache.org/nutch/
http://wiki.apache.org/lucene/
In this case the Injector would inject seven URLs (two from nutch.txt, two from hadoop.txt and three from wikis.txt) into the Crawl Database located at /path/to/crawldb.
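The seed directory from this example could be prepared along the following lines. This is a minimal sketch: it builds the three files in a scratch directory (the seed_dir variable is hypothetical, standing in for /path/to/url/dir) and counts the non-empty lines, which should match the number of URLs handed to the Injector. The inject command itself is left commented out because it needs a working Nutch installation.

```shell
#!/bin/sh
# Scratch directory standing in for /path/to/url/dir (hypothetical location).
seed_dir=$(mktemp -d)

# One URL per line, as the Injector expects.
cat > "$seed_dir/nutch.txt" <<'EOF'
http://lucene.apache.org/nutch/
http://lucene.apache.org/nutch/tutorial.html
EOF

cat > "$seed_dir/hadoop.txt" <<'EOF'
http://lucene.apache.org/hadoop/
http://lucene.apache.org/hadoop/docs/api/
EOF

cat > "$seed_dir/wikis.txt" <<'EOF'
http://wiki.apache.org/hadoop/
http://wiki.apache.org/nutch/
http://wiki.apache.org/lucene/
EOF

# Count non-empty lines across all seed files.
url_count=$(cat "$seed_dir"/*.txt | grep -c .)
echo "$url_count URLs ready to inject from $seed_dir"

# With Nutch installed, the directory would then be injected with:
# nutch-0.8-dev/bin/nutch inject /path/to/crawldb "$seed_dir"
```

Counting lines before injecting is a cheap way to catch an empty or truncated seed file.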