This is the official Nutch FAQ.
Questions that are not answered in the FAQ or in the documentation should be posted to the appropriate mailing list.
Please stick to technical issues on the discussion forum and mailing lists. Keep in mind that these are public, so do not include any confidential information in your questions!
You should also read the Mailing Lists Developer Resource (http://www.apache.org/dev/#mail) before participating in the discussion forum and mailing lists.
NOTE:
Please do NOT submit bugs, patches, or feature requests to the mailing lists. Refer instead to Commiter's_Rules and HowToContribute areas of the Nutch wiki.
There's a user, developer, commits and agents lists, all available at http://nutch.apache.org/mailing_lists.html.
Please visit our "webmaster info page"
We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results.
That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.
Distributed crawling can save download bandwidth, but, in the long run, the savings is not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching.
Search engines work hard to construct ranking algorithms that are immune to manipulation. Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines, and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms.
With an open-source search engine, this will still happen, just out in the open. This is analagous to encryption and virus protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open source search engine has the potential to better resist manipulation of its rankings.
Nutch 1.0 requires Java 6 and up.
nutch-default.xml is the out of the box configuration for Nutch, and most configurations can (and should unless you know what your doing) stay as per. nutch-site.xml is where you make the changes that override the default settings.
Install ANT and call 'ant' on the command line from the directory containing the Nutch source code. Note : this won't work for the binary release for obvious reasons.
Nutch uses ANT+IVY to compile the code and manage the dependencies (see above). There are instructions on how to get Nutch working with Eclipse on \[http://wiki.apache.org/nutch/RunNutchInEclipse\] but the easiest way of doing is to use ANT for compiling and rely on Eclipse just for visualising the code. You can also debug with Eclipse using the remote debugging and setting e.g. "export NUTCH_OPTS=-Xdebug -agentlib:jdwp=transport=dt_socket,server=y,address=8000" prior to calling the nutch script in /runtime/local/bin. |
Urls which are already in the database, won't be injected.
In short yes, however this is disabled by default (justification follows shortly). To enable this simply configure the following in nutch-site.xml before initiating the fecth process.
<property> <name>fetcher.parse</name> <value>true</value> <description>If true, fetcher will parse content. Default is false, which means that a separate parsing step is required after fetching is finished.</description> </property> |
N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.
In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.
Please have a look on PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but adding a list with thousands of regular expressions would slow down your system excessively.
Alternatively, you can set db.ignore.external.links to "true", and inject seeds from the domains you wish to crawl (these seeds must link to all pages you wish to crawl, directly or indirectly). Doing this will let the crawl go through only these domains without leaving to start crawling external links. Unfortunately there is no way to record external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log.
Well, you can not. However, you have two choices to proceed:
% touch /index/segments/2005somesegment/fetcher.done % bin/nutch updatedb /crawl/db/ /crawl/segments/2005somesegment/ % bin/nutch generate /crawl/db/ /crawl/segments/2005somesegment/ % bin/nutch fetch /crawl/segments/2005somesegment |
This is dependent on your particular set-up; unless one understands system/network environment variables it is impossible to accurately measure thread performance. The Nutch de-facto is an excellent start point.
The reason for that is that when a page is fetched, it is timestamped in the webdb. So basiclly if its time is not up it will not be included in a fetchlist. So for example if you generated a fetchlist and you deleted the segment dir created. calling generate again will generate an empty fetchlist. So, two choices:
1) Change your system date to be 30 days from today (if you haven't changed the default settings) and re-run bin/nutch generate
2) Call bin/nutch generate with the -adddays 30 (if you haven't changed the default settings) to make generate think the time has come... After generate you can call bin/nutch fetch.
See the HttpAuthenticationSchemes wiki page.
A possible reason is that by default the 'partition.url.mode' is set to 'byHost', which is a reasonable setting, because in the url-subsets for the fetcher threads in different map steps, you want to have disjoint subsets to avoid that urls are loaded twice from different machines.
Secondly the default setting for 'generate.max.count' could also be set to -1. This means the more urls you collect, especially from the same host, the more urls of the same host will be in the same fetcher map job!
Because there is also a policy setting (please do this at home!!) to wait for a delay of 30 secs. between calls to the same server, all maps which contains urls to the same server are slowing down. Therefore the resulting reduce step will only be done when all fetcher maps are done, which is a bottleneck in the overall processing step.
The following settings may solve your problem:
Map tasks should be splitted according to the host:
<property> <name>partition.url.mode</name> <value>byHost</value> <description>Determines how to partition URLs. Default value is 'byHost', also takes 'byDomain' or 'byIP'. </description> </property> |
Don't insert in a single fetch list more than 10000 entries!
<property> <name>generate.max.count</name> <value>10000</value> <description>The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode. </description> </property> |
Wait time between two fetches to the same server.
<property> <name>fetcher.max.crawl.delay</name> <value>10</value> <description> If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report. If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be. </description> </property> |
While fetching is in progress, the fetcher job will log such statement to indicate the progress of the job:
0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queue |
Here is the explanation of each of all the fields:
Fetcher threads try to get a fetch item (url) from a queue of all the fetch items (this queue is actually a queue of queues. For details see \[0\]). If a thread doesn't get a fetch-item, it spinwaits for 500ms before polling the queue again. The 'spinWaiting' count tells us how many threads are in their "spinwaiting" state at a given instance. |
((((float)pages)*10)/elapsed)/10.0 |
(actualPages*10)/10.0 |
(((float)bytes)*8)/1024)/elapsed
and
((float)actualBytes)*8)/1024 |
Nutch maintains a crawldb (and linkdb, for that matter) of the urls it crawled, the fetch status, and the date. This data is maintained beyond fetch so that pages may be re-crawled, after the a re-crawling period. At the same time Solr maintains an inverted index of all the fetched pages. It'd seem more efficient if Nutch relied on the index instead of maintaining its own crawldb, to !store the same url twice? The problem we face here is what Nutch would do if we wished to change the Solr core which to index to?
Whats described above could be done with Nutch 2.0 by adding a SOLR backend to GORA. SOLR would be used to store the webtable and provided that you setup the schema accordingly you could index the appropriate fields for searching. Further to this, because Nutch is a crawler intending to write to more than one search engine. Besides, the crawldb is gone, as a flat file, in trunk (2.0). Also, Solr is really slow when it comes to updating millions of records, the crawldb isn't when split over multiple machines.
For more information see here
Yes. The list of common words is used only when indexing and searching, and not during other steps. So, if you change the list of common words, there is no need to re-fetch the content, you just need to re-create segment indexes to reflect the changes.
The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you have to change config files to get it to crawl your local disk.
<property> <name>plugin.includes</name> <value>protocol-file|...copy original values from nutch-default here...</value> </property> |
where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-ins normally enabled will be enabled, plus the protocol-file plugin.
Now you can invoke the crawler and index all or part of your disk.
By default, Nutch will step into parent directories. You can avoid this by setting the following property to false:
<property> <name>file.crawl.parent</name> <value>false</value> <description>The crawler is not restricted to the directories that you specified in the Urls file but it is jumping into the parent directories as well. For your own crawlings you can change this behavior (set to false) the way that only directories beneath the directories that you specify get crawled.</description> </property> |
Alternatively, you could add a regex URL filter rule, e.g.
+^file:/c:/top/directory/ -. |
When converting file:
URLs from the Java URL class back only one slash remains:
String url = "file:///path/index.html"; java.net.URL u = new java.net.URL(url); url = u.toString(); // url is now file:/path/index.html |
Because such conversions are quite frequent, you better writer URLs (and also URL filter rules, etc.) with a single slash (file:/path/index.html
). Nutch's URL normalizers in the default configuration also normalize file: URLs to have only one slash.
At the current time, Nutch does not have built in support for accessing files over SMB (Windows) shares. This means the only available method is to mount the shares yourself, then index the contents as though they were local directories (see above).
Note that the share mounting method suffers from the following drawbacks:
1) The links generated by Nutch will not work except for queries from local host (end users typically won't have the exact same shares mounted in the exact same way)
2) You are limited to the number of mounted shares your operating system supports. In *nix environments, this is effectively unlimited, but in Windows you may mount 26 (one share or drive per letter in the English alphabet)
3) Documents with links to shares are unlikely to work since they won't link to the share on your machine, but rather to the SMB version.
If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.
Please see the MapReduce page of the Nutch wiki.
% echo localhost >> ~/.slaves % echo somemachin >> ~/.slaves |
% bin/start-all.sh |
% mkdir seeds % echo http://www.cnn.com/ > seeds/urls |
% bin/nutch ndfs -put seeds seeds |
% bin/nutch crawl seeds -depth 3 |
[root@xxxxxx mapred]# bin/nutch ndfs -ls / 050927 160948 parsing file:/mapred/conf/nutch-default.xml 050927 160948 parsing file:/mapred/conf/nutch-site.xml 050927 160948 No FS indicated, using default:localhost:8009 050927 160948 Client connection to 127.0.0.1:8009: starting Found 3 items /user/root/crawl-20050927142856 <dir> /user/root/crawl-20050927144626 <dir> /user/root/seeds <dir> |
[root@xxxxxx mapred]# bin/nutch ndfs -rm /user/root/crawl-20050927144626 050927 161025 parsing file:/mapred/conf/nutch-default.xml 050927 161025 parsing file:/mapred/conf/nutch-site.xml 050927 161025 No FS indicated, using default:localhost:8009 050927 161025 Client connection to 127.0.0.1:8009: starting Deleted /user/root/crawl-20050927144626 |
Scoring is implemented as a filter plugin, i.e. an implementation of the ScoringFilter class. By default, the OPIC Scoring Filter is used. There are also numerous scoring filter properties which can be specified within nutch-site.xml.
In order to be able to find this information you have to modify the standard plugin.includes
property of the nutch configuration file and add the index-more
filter.
<property> <name>plugin.includes</name> <value>...|index-more|...|...</value> ... </property> |
After that, don't forget to crawl again and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength") as you normally do for the title and URL of the hits.
The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).
file: conf/nutch-default.xml
<property> <name>db.max.outlinks.per.page</name> <value>-1</value> <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed. </description> </property> |
see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
Grub has some interesting ideas about building a search engine using distributed computing. And how is that relevant to nutch?