Differences between revisions 6 and 7
Revision 6 as of 2006-03-07 21:50:44
Size: 1760
Editor: JeffRitchie
Comment: Added example fixed some text
Revision 7 as of 2009-09-20 23:09:41
Size: 1760
Editor: localhost
Comment: converted to 1.6 markup

"inject" is an alias for "org.apache.nutch.crawl.Injector"

Injects new URLs into the Crawl Database

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Injector <crawldb> <urldir>

    • <crawldb>: Path to the Crawl Database directory.
    • <urldir>: Path to the directory containing flat text URL files.

Configuration Files

  • hadoop-default.xml
  • hadoop-site.xml
  • nutch-default.xml
  • nutch-site.xml

Configuration Values

The following properties directly affect how the Injector injects URLs.

  • db.default.fetch.interval -- Sets the time in days between fetches. Default: 30.0f.
  • db.score.injected -- Sets the default score of the URL. Default: 1.0f.
  • urlnormalizer.class -- Name of the class that normalizes injected URLs. Default: org.apache.nutch.net.BasicUrlNormalizer.
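
As a rough sketch of how these properties might be overridden, the snippet below writes a small override file in the usual Hadoop-style <configuration>/<property> layout. The values 7 and 2.0 are illustrative only, and the temporary file is a stand-in: in a real checkout the properties would normally go into nutch-site.xml, whose exact location depends on your installation.

```shell
# Write a sample override file to a temporary location (hypothetical;
# in a real setup these properties would be placed in nutch-site.xml).
site_file="$(mktemp)"

cat > "$site_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative value: refetch every 7 days instead of the default 30 -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
  </property>
  <!-- Illustrative value: give injected URLs a higher starting score -->
  <property>
    <name>db.score.injected</name>
    <value>2.0</value>
  </property>
</configuration>
EOF

echo "Wrote overrides to $site_file"
```

Properties left out of the override file keep the defaults listed above.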

Other Files

  • None.

Caveats and Notes

  • <urldir> may contain one or more flat text URL files. Each file should list one URL per line; every listed URL is injected into the Crawl Database.

Example:

nutch-0.8-dev/bin/nutch inject /path/to/crawldb /path/to/url/dir

Files:
/path/to/url/dir/nutch.txt
/path/to/url/dir/hadoop.txt
/path/to/url/dir/wikis.txt

nutch.txt contents:
http://lucene.apache.org/nutch/
http://lucene.apache.org/nutch/tutorial.html

hadoop.txt contents:
http://lucene.apache.org/hadoop/
http://lucene.apache.org/hadoop/docs/api/

wikis.txt contents:
http://wiki.apache.org/hadoop/
http://wiki.apache.org/nutch/
http://wiki.apache.org/lucene/

In this case, the Injector would inject seven URLs (two from nutch.txt, two from hadoop.txt, and three from wikis.txt) into the Crawl Database located at /path/to/crawldb.
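
The example layout can be reproduced with a short script, shown here as a minimal sketch. It builds the three seed files in a temporary directory (a stand-in for /path/to/url/dir) and counts the URLs that would be handed to the Injector. The inject command itself is left commented out because it needs a working Nutch checkout and a real crawldb path.

```shell
# Create a throwaway seed directory (stand-in for /path/to/url/dir).
urldir="$(mktemp -d)"

# One URL per line, matching the example files above.
printf '%s\n' \
  'http://lucene.apache.org/nutch/' \
  'http://lucene.apache.org/nutch/tutorial.html' > "$urldir/nutch.txt"

printf '%s\n' \
  'http://lucene.apache.org/hadoop/' \
  'http://lucene.apache.org/hadoop/docs/api/' > "$urldir/hadoop.txt"

printf '%s\n' \
  'http://wiki.apache.org/hadoop/' \
  'http://wiki.apache.org/nutch/' \
  'http://wiki.apache.org/lucene/' > "$urldir/wikis.txt"

# Count non-empty lines across all seed files; each line is one URL.
count="$(cat "$urldir"/*.txt | grep -c .)"
echo "URLs to inject: $count"

# With a Nutch checkout available, the seeds could then be injected:
# nutch-0.8-dev/bin/nutch inject /path/to/crawldb "$urldir"
```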


nutch-0.8-dev/bin/nutch_inject (last edited 2009-09-20 23:09:41 by localhost)