This is the official Nutch FAQ.

Contents

  1. General
    1. House Keeping
    2. Are there any mailing lists available?
    3. How can I stop Nutch from crawling my site?
    4. Will Nutch be a distributed, P2P-based search engine?
    5. Will Nutch use a distributed crawler, like Grub?
    6. Won't open source just make it easier for sites to manipulate rankings?
    7. What Java version is required to run Nutch?
    8. I have two XML files, nutch-default.xml and nutch-site.xml, why?
  2. Compiling Nutch
    1. How do I compile Nutch?
    2. How do I compile Nutch in Eclipse?
  3. Injecting
    1. What happens if I inject urls several times?
  4. Fetching
    1. Can I parse during the fetching process?
    2. Is it possible to fetch only pages from some specific domains?
    3. How can I recover an aborted fetch process?
    4. Who changes the next fetch date?
    5. I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?
    6. How many concurrent threads should I use?
    7. How can I force fetcher to use custom nutch-config?
    8. bin/nutch generate generates empty fetchlist, what can I do?
    9. How can I fetch pages that require Authentication?
    10. Speed of Fetching seems to decrease between crawl iterations... what's wrong?
    11. What do the numbers in the fetcher log indicate?
  5. Updating
    1. Isn't there redundant/wasteful duplication between nutch crawldb and solr index?
  6. Indexing
    1. Is it possible to change the list of common words without crawling everything again?
    2. How do I index my local file system?
    3. Nutch crawling parent directories for file protocol
    4. How do I index remote file shares?
  7. Segment Handling
    1. Do I have to delete old segments after some time?
  8. MapReduce
    1. What is MapReduce?
    2. How to start working with MapReduce?
  9. NDFS
    1. What is it?
    2. How to send commands to NDFS?
  10. Scoring
    1. How can I influence Nutch scoring?
  11. Searching
    1. How can I find out/display the size and mime type of the hits that a search returns?
  12. Crawling
    1. Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
  13. Discussion

General

House Keeping

Questions that are not answered in the FAQ or in the documentation should be posted to the appropriate mailing list.

Please stick to technical issues on the discussion forum and mailing lists. Keep in mind that these are public, so do not include any confidential information in your questions!

You should also read the Mailing Lists Developer Resource (http://www.apache.org/dev/#mail) before participating in the discussion forum and mailing lists.

NOTE: Please do NOT submit bugs, patches, or feature requests to the mailing lists. Refer instead to Commiter's_Rules and HowToContribute areas of the Nutch wiki.

Are there any mailing lists available?

There are user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html.

How can I stop Nutch from crawling my site?

Please visit our "webmaster info page"

Will Nutch be a distributed, P2P-based search engine?

We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results.

That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.

Will Nutch use a distributed crawler, like Grub?

Distributed crawling can save download bandwidth, but, in the long run, the savings are not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching.

Won't open source just make it easier for sites to manipulate rankings?

Search engines work hard to construct ranking algorithms that are immune to manipulation. Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines, and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms.

With an open-source search engine, this will still happen, just out in the open. This is analogous to encryption and virus protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open source search engine has the potential to better resist manipulation of its rankings.

What Java version is required to run Nutch?

Nutch 0.7 runs on Java 1.4 and up. Nutch 1.0 requires Java 6.

I have two XML files, nutch-default.xml and nutch-site.xml, why?

nutch-default.xml is the out-of-the-box configuration for Nutch, and most settings can (and should, unless you know what you're doing) stay as they are. nutch-site.xml is where you make the changes that override the default settings.
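The override relationship between the two files can be pictured with a minimal Python sketch. It is not Nutch's own loading code; the two XML strings are hypothetical, trimmed-down stand-ins for the real files, and the property values in them are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two files; values are illustrative,
# not copied from a real Nutch distribution.
DEFAULT_XML = """
<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>fetcher.parse</name><value>false</value></property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property><name>http.agent.name</name><value>my-test-crawler</value></property>
</configuration>
"""

def read_properties(xml_text):
    """Return {name: value} for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value") or ""
        props[name] = value
    return props

def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values override them."""
    config = read_properties(default_xml)
    config.update(read_properties(site_xml))
    return config

config = effective_config(DEFAULT_XML, SITE_XML)
print(config)
```

Only the property redefined in the site file changes; everything else keeps its default value, which is why nutch-default.xml can be left untouched.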

Compiling Nutch

How do I compile Nutch?

Install ANT and run 'ant' on the command line from the directory containing the Nutch source code. Note: this won't work for the binary release for obvious reasons.

How do I compile Nutch in Eclipse?

Nutch uses ANT+IVY to compile the code and manage the dependencies (see above). There are instructions on how to get Nutch working with Eclipse at http://wiki.apache.org/nutch/RunNutchInEclipse, but the easiest approach is to use ANT for compiling and rely on Eclipse just for visualising the code. You can also debug with Eclipse using remote debugging, by setting e.g. export NUTCH_OPTS="-Xdebug -agentlib:jdwp=transport=dt_socket,server=y,address=8000" prior to calling the nutch script in /runtime/local/bin.

Injecting

What happens if I inject urls several times?

URLs that are already in the database won't be injected again.
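As a rough illustration of that behaviour (a toy model, not the actual injector), the crawl database can be pictured as a dictionary keyed by URL. The function name and the "unfetched" status string are hypothetical.

```python
def inject(crawldb, seed_urls):
    """Add only the seed URLs that the database does not know yet.

    crawldb is a plain dict standing in for the real crawl database;
    existing entries are left untouched.
    """
    added = []
    for url in seed_urls:
        if url not in crawldb:
            crawldb[url] = {"status": "unfetched"}  # hypothetical record
            added.append(url)
    return added

db = {}
first = inject(db, ["http://example.com/", "http://example.org/"])
second = inject(db, ["http://example.com/", "http://example.net/"])
print(first, second)
```

Injecting the same seed list again is therefore harmless: the second call only adds the one URL that was not present before.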

Fetching

Can I parse during the fetching process?

In short, yes; however, this is disabled by default (justification follows shortly). To enable it, simply configure the following in nutch-site.xml before initiating the fetch process.

<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>

N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.

Is it possible to fetch only pages from some specific domains?

Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

Alternatively, you can set db.ignore.external.links to "true", and inject seeds from the domains you wish to crawl (these seeds must link to all pages you wish to crawl, directly or indirectly). The crawl will then stay within these domains and will not follow external links. Unfortunately there is no way to record external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log.

How can I recover an aborted fetch process?

Well, you cannot. However, you have two choices to proceed:

% touch /index/segments/2005somesegment/fetcher.done

% bin/nutch updatedb /crawl/db/ /crawl/segments/2005somesegment/

% bin/nutch generate /crawl/db/ /crawl/segments/2005somesegment/

% bin/nutch fetch /crawl/segments/2005somesegment

Who changes the next fetch date?

I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?

How many concurrent threads should I use?

This depends on your particular set-up; unless you understand your system and network environment, it is impossible to measure thread performance accurately. The Nutch default is an excellent starting point.

How can I force fetcher to use custom nutch-config?

bin/nutch generate generates empty fetchlist, what can I do?

The reason is that when a page is fetched, it is timestamped in the webdb. So basically, if its time is not up, it will not be included in a fetchlist. For example, if you generated a fetchlist and then deleted the segment dir that was created, calling generate again will produce an empty fetchlist. So, two choices:

How can I fetch pages that require Authentication?

See the HttpAuthenticationSchemes wiki page.

Speed of Fetching seems to decrease between crawl iterations... what's wrong?

A possible reason is that 'partition.url.mode' is set to 'byHost' by default. This is a reasonable setting: the URL subsets given to the fetcher threads in different map steps should be disjoint, so that URLs are not loaded twice from different machines.

Secondly, the default setting for 'generate.max.count' may also be -1. This means the more URLs you collect, especially from the same host, the more URLs of the same host will be in the same fetcher map job!

Because there is also a politeness setting (please do this at home!!) that waits 30 seconds between calls to the same server, all maps that contain URLs for the same server slow down. The reduce step can only run once all fetcher maps are done, so this becomes a bottleneck in the overall processing step.
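A back-of-the-envelope calculation shows why one busy host can dominate a fetch. The Python sketch below assumes requests to a single host are strictly serialised with a fixed delay between them and ignores download time; the host names and URL counts are made up for illustration.

```python
def min_fetch_seconds(urls_per_host, delay_seconds):
    """Lower bound on the duration of one fetcher map task.

    Hosts are fetched in parallel, but requests to the same host wait
    delay_seconds between calls, so the busiest host sets the floor.
    """
    busiest = max(urls_per_host.values(), default=0)
    return max(busiest - 1, 0) * delay_seconds

# Hypothetical fetch list: one host contributes most of the URLs.
uncapped = {"big.example": 5000, "small.example": 50}
capped = {"big.example": 1000, "small.example": 50}

print(min_fetch_seconds(uncapped, 30) / 3600)  # hours, uncapped
print(min_fetch_seconds(capped, 30) / 3600)    # hours, capped per host
```

Under these assumptions the uncapped list needs at least 149970 seconds (over 41 hours), while capping the busy host at 1000 URLs brings the floor down to 29970 seconds (a little over 8 hours).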

The following settings may solve your problem:

Map tasks should be split according to the host:

<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>Determines how to partition URLs. Default value is
'byHost',  also takes 'byDomain' or 'byIP'.
  </description>
</property>

Don't put more than 10000 entries in a single fetch list!

<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

Wait time between two fetches to the same server.

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>10</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>

What do the numbers in the fetcher log indicate?

While fetching is in progress, the fetcher job logs statements like the following to indicate its progress:

0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queue

Here is the explanation of each field:

Updating

Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Nutch maintains a crawldb (and linkdb, for that matter) of the URLs it crawled, the fetch status, and the date. This data is maintained beyond the fetch so that pages may be re-crawled after a re-crawling period. At the same time, Solr maintains an inverted index of all the fetched pages. It would seem more efficient if Nutch relied on the index instead of maintaining its own crawldb, so that the same URL is not stored twice. The problem we face here is what Nutch would do if we wished to change the Solr core to index to.

What's described above could be done with Nutch 2.0 by adding a SOLR backend to GORA. SOLR would be used to store the webtable and, provided that you set up the schema accordingly, you could index the appropriate fields for searching. Furthermore, Nutch is a crawler intended to write to more than one search engine. Besides, the crawldb is gone, as a flat file, in trunk (2.0). Also, Solr is really slow when it comes to updating millions of records; the crawldb isn't when split over multiple machines.

For more information see here

Indexing

Is it possible to change the list of common words without crawling everything again?

Yes. The list of common words is used only when indexing and searching, and not during other steps. So, if you change the list of common words, there is no need to re-fetch the content, you just need to re-create segment indexes to reflect the changes.

How do I index my local file system?

The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you have to change config files to get it to crawl your local disk.

    <property>
      <name>plugin.includes</name>
      <value>protocol-file|...copy original values from nutch-default here...</value>
    </property>

where you should copy and paste all values from nutch-default.xml in the plugin.includes setting provided there. This will ensure that all plug-ins normally enabled will be enabled, plus the protocol-file plugin. Make sure that urlfilter-regex is included, or else the urlfilter files will be ignored, leading Nutch to accept all URLs. You need to enable crawl URL filters to prevent Nutch from crawling up the parent directory, see below.

Now you can invoke the crawler and index all or part of your disk.

Nutch crawling parent directories for file protocol

If you find Nutch crawling parent directories when using the file protocol, the following Jira issue may help:

http://issues.apache.org/jira/browse/NUTCH-407

E.g. for urlfilter-regex you could put the following in regex-urlfilter.txt:

+^file:///c:/top/directory/
-.
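To see why these two rules keep the crawl out of parent directories, here is a minimal Python sketch of first-match-wins rule evaluation. It is an approximation under stated assumptions: it uses Python's re module rather than Java regular expressions, assumes a pattern may match anywhere in the URL, and assumes a URL matched by no rule is rejected. The helper names are hypothetical.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: unmatched URLs are rejected

RULES = load_rules("""
+^file:///c:/top/directory/
-.
""")

print(accepts(RULES, "file:///c:/top/directory/docs/a.txt"))
print(accepts(RULES, "file:///c:/top/"))
```

The parent directory file:///c:/top/ does not match the first rule, so it falls through to the catch-all "-." rule and is rejected, while anything below /top/directory/ is accepted.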

Alternatively, you could apply the patch described on this page, which would avoid the hard-wiring of the site-specific /top/directory in your configuration file.

How do I index remote file shares?

At the current time, Nutch does not have built in support for accessing files over SMB (Windows) shares. This means the only available method is to mount the shares yourself, then index the contents as though they were local directories (see above).

Note that the share mounting method suffers from the following drawbacks:

Segment Handling

Do I have to delete old segments after some time?

If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.
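A small Python sketch can list which segments fall outside that window. It assumes segment directories are named with a yyyyMMddHHmmss timestamp and takes the 30-day interval as a parameter; it only reports candidates and deletes nothing. The segment names below are made up.

```python
from datetime import datetime, timedelta

def expired_segments(segment_names, now, fetch_interval_days=30):
    """Return segment names older than the fetch interval.

    Assumes each name is a yyyyMMddHHmmss timestamp; names that do not
    parse are skipped rather than treated as expired.
    """
    cutoff = now - timedelta(days=fetch_interval_days)
    expired = []
    for name in segment_names:
        try:
            created = datetime.strptime(name, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # not a timestamped segment name
        if created < cutoff:
            expired.append(name)
    return expired

names = ["20130501120000", "20130615093000", "not-a-segment"]
print(expired_segments(names, now=datetime(2013, 6, 22)))
```

Before deleting anything reported this way, it is worth confirming that the pages in those segments really have been refetched, as the answer above only says they should have been.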

MapReduce

What is MapReduce?

Please see the MapReduce page of the Nutch wiki.

How to start working with MapReduce?

  % echo localhost >> ~/.slaves
  % echo somemachin >> ~/.slaves

  % bin/start-all.sh

  % mkdir seeds
  % echo http://www.cnn.com/ > seeds/urls

  % bin/nutch ndfs -put seeds seeds

  % bin/nutch crawl seeds -depth 3

NDFS

What is it?

NutchDistributedFileSystem

How to send commands to NDFS?

  [root@xxxxxx mapred]# bin/nutch ndfs -ls /
  050927 160948 parsing file:/mapred/conf/nutch-default.xml
  050927 160948 parsing file:/mapred/conf/nutch-site.xml
  050927 160948 No FS indicated, using default:localhost:8009
  050927 160948 Client connection to 127.0.0.1:8009: starting
  Found 3 items
  /user/root/crawl-20050927142856 <dir>
  /user/root/crawl-20050927144626 <dir>
  /user/root/seeds        <dir>

Scoring

How can I influence Nutch scoring?

Scoring is implemented as a filter plugin, i.e. an implementation of the ScoringFilter class. By default, the OPIC Scoring Filter is used. There are also numerous scoring filter properties which can be specified within nutch-site.xml.

Searching

How can I find out/display the size and mime type of the hits that a search returns?

In order to be able to find this information you have to modify the standard plugin.includes property of the nutch configuration file and add the index-more filter.

<property>
  <name>plugin.includes</name>
  <value>...|index-more|...|...</value>
  ...
</property>

After that, don't forget to crawl again and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength") as you normally do for the title and URL of the hits.

Crawling

Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

By default, the crawl tool processes at most 100 outlinks per page. To overcome this limitation, change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

file: conf/nutch-default.xml

 <property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
 </property>
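The rule stated in the property description can be sketched in a few lines of Python. This is a model of the documented behaviour, not Nutch's parsing code, and the function name is hypothetical.

```python
def limit_outlinks(outlinks, max_outlinks_per_page):
    """Keep at most max_outlinks_per_page links; a negative limit keeps all."""
    if max_outlinks_per_page >= 0:
        return list(outlinks[:max_outlinks_per_page])
    return list(outlinks)

links = [f"http://example.com/page{i}" for i in range(250)]
print(len(limit_outlinks(links, 100)))  # limit of 100
print(len(limit_outlinks(links, -1)))   # unlimited
```

With a limit of 100, the remaining 150 links on this hypothetical page are never seen by the crawler, which is how pages can go missing even though the URL filters are correct.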

see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html

Discussion

Grub has some interesting ideas about building a search engine using distributed computing. And how is that relevant to nutch?



FAQ (last edited 2013-06-22 12:42:11 by TejasPatil)