hostinject is an alias for org.apache.nutch.crawl.HostInjectorJob
This class takes a flat file of hosts and adds them to the list of seeds to be crawled. It is useful for bootstrapping the system. The hosts file contains one host per line, optionally followed by custom metadata. Metadata entries are separated by tabs, and each metadata key is separated from its value by '='. N.B. Is the metadata functionality supported yet?
Note that some metadata keys are reserved:
nutch.score: sets a custom score for a specific URL
nutch.fetchInterval: sets a custom fetch interval (in seconds) for a specific URL
e.g. http://www.xyz.org/ nutch.score=10 nutch.fetchInterval=2592000 userType=open_source
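The line format described above can be illustrated with a small parsing sketch. This is not the HostInjectorJob implementation; it is a minimal Python illustration that assumes fields are tab-separated and that each metadata field has the form key=value. The names parse_host_line and RESERVED_KEYS are hypothetical and exist only for this example.

```python
# Reserved metadata keys, as listed in this page.
RESERVED_KEYS = {"nutch.score", "nutch.fetchInterval"}


def parse_host_line(line):
    """Split one seed line into (host, reserved metadata, custom metadata).

    Hypothetical helper for illustration only; it mirrors the documented
    format, not the actual Nutch parsing code.
    """
    fields = line.rstrip("\n").split("\t")
    host = fields[0].strip()
    reserved, custom = {}, {}
    for field in fields[1:]:
        if "=" not in field:
            # Assumption of this sketch: silently skip malformed fields.
            continue
        key, value = field.split("=", 1)
        key, value = key.strip(), value.strip()
        target = reserved if key in RESERVED_KEYS else custom
        target[key] = value
    return host, reserved, custom


# The example line from this page, with explicit tab separators.
example = (
    "http://www.xyz.org/\tnutch.score=10"
    "\tnutch.fetchInterval=2592000\tuserType=open_source"
)
host, reserved, custom = parse_host_line(example)
print(host)
print(reserved)
print(custom)
```

In this sketch the reserved keys end up in one dictionary and any other key, such as userType, ends up in the custom metadata dictionary. Values are kept as strings; converting nutch.score or nutch.fetchInterval to numbers is left out.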
bin/nutch hostinject <host_dir>
<host_dir>: The directory containing the seed list (the 'flat file' referred to above), usually a text file with one host per line.