FAQ

This is the official Nutch FAQ.

  1. Nutch FAQ
    1. General
      1. Are there any mailing lists available?
      2. How can I stop Nutch from crawling my site?
      3. Will Nutch be a distributed, P2P-based search engine?
      4. Will Nutch use a distributed crawler, like Grub?
      5. Won't open source just make it easier for sites to manipulate rankings?
      6. What Java version is required to run Nutch?
      7. Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4
      8. I have two XML files, nutch-default.xml and nutch-site.xml, why?
      9. My system does not find the segments folder. Why? Or: How do I tell the Nutch Servlet where the index files are located?
    2. Injecting
      1. What happens if I inject urls several times?
    3. Fetching
      1. Is it possible to fetch only pages from some specific domains?
      2. How can I recover an aborted fetch process?
      3. Who changes the next fetch date?
      4. I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?
      5. How many concurrent threads should I use?
      6. How can I force fetcher to use custom nutch-config?
      7. bin/nutch generate generates empty fetchlist, what can I do?
      8. While fetching I get UnknownHostException for known hosts
      9. How can I fetch pages that require Authentication?
    4. Updating
    5. Indexing
      1. Is it possible to change the list of common words without crawling everything again?
      2. How do I index my local file system?
      3. Nutch crawling parent directories for file protocol -> misconfigured URLFilters
      4. How do I index remote file shares?
      5. While indexing documents, I get the following error:
    6. Segment Handling
      1. Do I have to delete old segments after some time?
    7. MapReduce
      1. What is MapReduce?
      2. How to start working with MapReduce?
    8. NDFS
      1. What is it?
      2. How to send commands to NDFS?
    9. Searching
      1. Common words are saturating my search results.
      2. How is scoring done in Nutch? (Or, explain the "explain" page?)
      3. How can I influence Nutch scoring?
      4. What is the RSS symbol in search results all about?
      5. How can I find out/display the size and mime type of the hits that a search returns?
    10. Crawling
      1. java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml
      2. Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
    11. Discussion

Nutch FAQ

General

Are there any mailing lists available?

There are user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html.

How can I stop Nutch from crawling my site?

Please visit our "webmaster info page".

Will Nutch be a distributed, P2P-based search engine?

We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results.

That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong.

Will Nutch use a distributed crawler, like Grub?

Distributed crawling can save download bandwidth, but in the long run the savings are not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching.

Won't open source just make it easier for sites to manipulate rankings?

Search engines work hard to construct ranking algorithms that are immune to manipulation. Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines, and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms.

With an open-source search engine, this will still happen, just out in the open. This is analogous to encryption and virus protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open source search engine has the potential to better resist manipulation of its rankings.

What Java version is required to run Nutch?

Nutch 0.7 will run with Java 1.4 and up.

Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4

It seems IPv6 is enabled on your machine.

To solve this problem, add the following Java parameter to the java invocation in bin/nutch:

JAVA_IPV4=-Djava.net.preferIPv4Stack=true

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@"

I have two XML files, nutch-default.xml and nutch-site.xml, why?

nutch-default.xml is the out-of-the-box configuration for Nutch. Most of the configuration can (and should, unless you know what you're doing) stay as it is. nutch-site.xml is where you make the changes that override the default settings. The same goes for the servlet container application.
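
As a minimal sketch of such an override, the nutch-site.xml below changes a single property and leaves everything else to nutch-default.xml. The property db.max.outlinks.per.page is the one discussed in the Crawling section of this FAQ; the value 500 is purely illustrative. The <configuration> root element is an assumption that should hold for Hadoop-based releases (0.8 and later); Nutch 0.7 used <nutch-conf> instead.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides only (sketch).
     Assumption: root element is <configuration> (Nutch 0.8+). -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Illustrative value; takes precedence over nutch-default.xml -->
    <value>500</value>
  </property>
</configuration>
```

Keeping nutch-default.xml untouched means an upgrade can replace the defaults without losing local changes.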

My system does not find the segments folder. Why? Or: How do I tell the Nutch Servlet where the index files are located?

There are at least two choices to do that:

Injecting

What happens if I inject urls several times?

URLs that are already in the database won't be injected.

Fetching

Is it possible to fetch only pages from some specific domains?

Please have a look at PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

Alternatively, you can set db.ignore.external.links to "true" and inject seeds from the domains you wish to crawl (these seeds must link to all pages you wish to crawl, directly or indirectly). The crawl will then stay within these domains and will not follow external links. Unfortunately there is no way to record the external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log.
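
The setting mentioned above could look like this in conf/nutch-site.xml. This is a sketch: only the property name and the value "true" come from the text, and the fragment is assumed to sit inside the file's existing root element.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml (sketch) -->
<property>
  <name>db.ignore.external.links</name>
  <!-- true: outlinks leading to other hosts are ignored, per the answer above -->
  <value>true</value>
</property>
```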

How can I recover an aborted fetch process?

Well, you cannot. However, you have two choices to proceed:

Who changes the next fetch date?

I have a big fetchlist in my segments folder. How can I fetch only some sites at a time?

How many concurrent threads should I use?

This is dependent on your particular setup, but the following works for me:

If you are using a slow internet connection (e.g., DSL), you might be limited to 40 or fewer concurrent fetches.

If you have a fast internet connection (> 10Mb/sec) your bottleneck will definitely be in the machine itself (in fact you will need multiple machines to saturate the data pipe). Empirically I have found that the machine works well up to about 1000-1500 threads.

To get this to work on my Linux box I needed to raise the open-file limit to 65535 (ulimit -n 65535), and I had to make sure that the DNS server could handle the load (we had to speak with our colo to get them to shut off an artificial cap on the DNS servers). Also, in order to get the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise we got a quick start followed by a very long, slow tail of fetching).
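
The thread settings described above might be expressed in conf/nutch-site.xml roughly as follows. The property names fetcher.threads.fetch and fetcher.threads.per.host are assumptions based on the nutch-default.xml of 0.7 to 0.9-era releases; check the nutch-default.xml shipped with your version before relying on them. The values are the ones reported in this answer, not general recommendations.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml (sketch).
     Property names are assumptions; verify them in your nutch-default.xml. -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- Total concurrent fetcher threads; lower end of the 1000-1500 range above -->
  <value>1000</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <!-- Concurrent fetches allowed against a single host, as reported above -->
  <value>100</value>
</property>
```

A per-host value this high puts a heavy load on the sites being crawled, so it is worth lowering when crawling servers you do not control.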

To other users: please add your own experiences here, as mine may be atypical.

How can I force fetcher to use custom nutch-config?

bin/nutch generate generates empty fetchlist, what can I do?

The reason is that when a page is fetched, it is timestamped in the webdb. Basically, if its time is not yet up, it will not be included in a fetchlist. For example, if you generated a fetchlist and then deleted the segment directory that was created, calling generate again will produce an empty fetchlist. So, two choices:

While fetching I get UnknownHostException for known hosts

Make sure your DNS server is working and that it can handle the load of requests.

How can I fetch pages that require Authentication?

See HttpAuthenticationSchemes.

Updating

Indexing

Is it possible to change the list of common words without crawling everything again?

Yes. The list of common words is used only when indexing and searching, and not during other steps. So, if you change the list of common words, there is no need to re-fetch the content; you just need to re-create the segment indexes to reflect the changes.

How do I index my local file system?

The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you have to change config files to get it to crawl your local disk.
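
One such change, sketched under assumptions, is enabling a plugin that handles file: URLs. The plugin name protocol-file is an assumption based on the stock plugin set, and the "..." stands for whatever your existing plugin.includes value already contains; it is not literal syntax. The URL filter configuration may also need adjusting so that file: URLs are accepted.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml (sketch).
     "protocol-file" is an assumed plugin name; "..." is a placeholder
     for the rest of your existing plugin.includes value. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|...</value>
</property>
```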

Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that Mozilla will not load file: URLs from a web page fetched over http. So if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen when you click on results. This default behavior may be disabled through a preference (see security.checkloaduri). IE5 does not have this problem.

Nutch crawling parent directories for file protocol -> misconfigured URLFilters

See http://issues.apache.org/jira/browse/NUTCH-407. For example, with urlfilter-regex you should put the following in regex-urlfilter.txt:

+^file:///c:/top/directory/
-.

How do I index remote file shares?

At the current time, Nutch does not have built-in support for accessing files over SMB (Windows) shares. This means the only available method is to mount the shares yourself, then index the contents as though they were local directories (see above).

Note that the share mounting method suffers from the following drawbacks:

While indexing documents, I get the following error:

050529 011245 fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. Parser can't handle incomplete msword file.

What is happening?
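
The log line suggests that the document was cut off at 65536 bytes before it reached the parser, and a truncated Word file cannot be parsed. A possible fix, sketched under the assumption that your release uses the http.content.limit and file.content.limit properties (verify the names and defaults in your nutch-default.xml), is to raise or disable the limit in conf/nutch-site.xml and fetch the document again:

```xml
<!-- Goes inside the root element of conf/nutch-site.xml (sketch).
     Property names are assumptions; verify them in your nutch-default.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed semantics: a negative value disables truncation -->
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- Same limit for documents read through the file: protocol -->
  <value>-1</value>
</property>
```

Disabling the limit entirely can make very large downloads expensive; a larger positive byte count is a more conservative choice.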

Segment Handling

Do I have to delete old segments after some time?

If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.
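
If you want a different retention window, the interval can be overridden in conf/nutch-site.xml. This is a sketch: the property name is the one given above, the value is assumed to be in days (consistent with the 30-day default stated above), and 14 is only an illustration. Later releases may use a different property name or unit.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml (sketch) -->
<property>
  <name>db.default.fetch.interval</name>
  <!-- Illustrative value, assumed to be in days; the default stated above is 30 -->
  <value>14</value>
</property>
```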

MapReduce

What is MapReduce?

MapReduce

How to start working with MapReduce?


NDFS

What is it?

NutchDistributedFileSystem

How to send commands to NDFS?

Searching

Common words are saturating my search results.

You can tweak your conf/common-terms.utf8 file after creating an index through the following command:

How is scoring done in Nutch? (Or, explain the "explain" page?)

Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the Lucene Similarity Javadoc. Roughly, the score for a particular document in a set of query results, "score(q,d)", is the sum of the scores for each term of the query ("t in q"). A term's score in a document is itself the sum of the term's score against each field that makes up the document ("title" is one field, "url" another; a "document" is a set of "fields"). Per field, the score is the product of the following factors: its "tf" (term frequency in the document); a score factor "idf" (usually derived from the frequency of the term relative to the number of documents in the index); an index-time boost; a normalization of the count of terms found relative to the size of the document ("lengthNorm"); a similar normalization for the term in the query itself ("queryNorm"); and finally, a factor weighting how many of the query's terms a particular document contains. Study the Lucene Javadoc for more detail on each component of the equation and how it affects the overall score.
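
Written out, the factors listed above combine roughly as follows. This is a sketch reconstructed from the description in this answer, not a quotation of the Javadoc: the symbol f_t (the field that term t is matched against) and the name "coord" for the last factor are labels introduced here, and the exact form varies between Lucene versions.

```latex
\[
\mathrm{score}(q,d) \;=\; \sum_{t \in q}
  \mathrm{tf}(t,d)\,\cdot\,\mathrm{idf}(t)\,\cdot\,\mathrm{boost}(f_t,d)\,\cdot\,
  \mathrm{lengthNorm}(f_t,d)\,\cdot\,\mathrm{queryNorm}(q)\,\cdot\,\mathrm{coord}(q,d)
\]
```

Because queryNorm and coord do not depend on the individual term, they scale the whole sum; they change the absolute score but not the relative contribution of each term.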

To interpret the Nutch "explain.jsp" page, keep the Lucene scoring equation cited above in mind. First, notice how the output moves right as it goes from "score total", to "score per query term", to "score per query document field" (a document field is not shown if the term was not found in that field). Next, look at the scoring of a particular field: it comprises a query component and a field component. The query component includes the query-time boost (as opposed to the index-time boost), an "idf" that is the same for the query and field components, and a "queryNorm". The field component is similar ("fieldNorm" is an aggregation of certain components of the Lucene equation).

How can I influence Nutch scoring?

Scoring is implemented as a filter plugin, i.e. an implementation of the ScoringFilter interface. By default, OPICScoringFilter is used.

However, the easiest way to influence scoring is to change the query-time boosts (this requires editing nutch-site.xml and redeploying the WAR file). The default query-time boosts look like this:

  query.url.boost, 4.0f
  query.anchor.boost, 2.0f
  query.title.boost, 1.5f
  query.host.boost, 2.0f
  query.phrase.boost, 1.0f

From the list above, you can see that terms found in a document URL get the highest boost with anchor text next, etc.
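
As a sketch of such a change, the fragment below raises the boost for title matches in conf/nutch-site.xml. The property name comes from the list above; the value 3.0 is purely illustrative, and writing it without the "f" suffix shown in the list is an assumption about how the configuration file expects float values.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml (sketch) -->
<property>
  <name>query.title.boost</name>
  <!-- Illustrative value; the default listed above is 1.5 -->
  <value>3.0</value>
</property>
```

Since these are query-time boosts, a change like this should not require re-indexing, only redeploying the web application as noted above.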

Anchor text makes a large contribution to document score. (You can see the anchor text for a page by browsing to "explain", then editing the URL to put "anchors.jsp" in place of "explain.jsp".)

What is the RSS symbol in search results all about?

Clicking on the RSS symbol sends the current query back to Nutch, to a servlet named OpenSearchServlet. OpenSearchServlet reruns the query and returns the results formatted as RSS (XML) instead. The RSS format is based on OpenSearch RSS 1.0 from a9.com: "OpenSearch RSS 1.0 is an extension to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as outlined by the RSS 2.0 specification" (see also opensearch). Nutch in turn extends OpenSearch. The Nutch extensions are identified by the 'nutch' namespace prefix and add navigation information, the original query, and all fields that are available at search result time, including the Nutch page boost, the name of the segment the page resides in, etc.

Results as RSS (XML) rather than HTML are easier for programmatic clients to parse: such clients will query against OpenSearchServlet rather than search.jsp. Results as XML can also be transformed using XSL stylesheets, the likely direction of UI development going forward.

How can I find out/display the size and mime type of the hits that a search returns?

To make this information available, modify the standard plugin.includes property in the Nutch configuration file and add the index-more filter (together with query-more, as in the snippet that follows).

<property>
  <name>plugin.includes</name>
  <value>...|index-more|...|query-more|...</value>
  ...
</property>

After that, don't forget to crawl again and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength") as you normally do for the title and URL of the hits.

Crawling

java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml

The crawl tool expects as its first parameter the name of the folder that contains the seed URLs file. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seed -dir /user/nutchuser...

Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

By default, the crawl tool processes at most 100 outlinks per page. To lift this limit, change the db.max.outlinks.per.page property to a higher value, or to -1 (unlimited).

file: conf/nutch-default.xml

 <property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
 </property> 

See also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html (tested under Nutch 0.9)

Discussion

Grub has some interesting ideas about building a search engine using distributed computing. And how is that relevant to Nutch?



last edited 2008-03-20 03:25:53 by MarkDeSpain