New in Nutch 1.0-dev

Please note that the nightly version of Apache Nutch now has Solr integration built in, which makes it much easier to get started. Just download a nightly build from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.

Pre Solr Nutch integration

This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :)

I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk.

Prerequisites

Steps

The first step to get started is to download the required software components, namely Apache Solr and Nutch.

1. Download Solr version 1.3.0 or LucidWorks for Solr from the download page

2. Extract Solr package

3. Download Nutch version 1.0 or later (alternatively, download the nightly version of Nutch that contains the required functionality)

4. Extract the Nutch package

tar xzf apache-nutch-1.0.tar.gz

5. Configure Solr

For the sake of simplicity we are going to use the example configuration of Solr as a base.

a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (overwrite the existing file)

We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:

b. Change schema.xml so that the stored attribute of field “content” is true.

<field name="content" type="text" stored="true" indexed="true"/>
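If you would rather script this change than edit the file by hand, something like the following sed call could work. This is only a sketch: it runs against a small stand-in file created on the spot (not the real schema.xml), and it assumes the content field is declared on a single line with stored="false", which may not match your copy of the schema.

```shell
# Stand-in for the copied schema.xml (hypothetical sample, two fields only).
schema=$(mktemp)
cat > "$schema" <<'EOF'
<field name="title" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="false" indexed="true"/>
EOF

# Flip stored="false" to stored="true" only on the line declaring "content".
# Writing to a new file and moving it avoids the non-portable sed -i flag.
sed '/<field name="content"/ s/stored="false"/stored="true"/' "$schema" > "$schema.new" \
    && mv "$schema.new" "$schema"

# Show the resulting declaration.
grep 'name="content"' "$schema"
```

To apply it to the real file, point schema at apache-solr-1.3.0/example/solr/conf/schema.xml instead of the temporary file, and check the result before starting Solr.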

We want to be able to tweak the relevancy of queries easily, so we’ll create a new dismax request handler configuration for our use case:

c. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it

<requestHandler name="/nutch" class="solr.SearchHandler" >

<lst name="defaults">

<str name="defType">dismax</str>

<str name="echoParams">explicit</str>

<float name="tie">0.01</float>

<str name="qf"> content^0.5 anchor^1.0 title^1.2 </str>

<str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>

<str name="fl"> url </str>

<str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>

<int name="ps">100</int>

<bool name="hl">true</bool>

<str name="q.alt">*:*</str>

<str name="hl.fl">title url content</str>

<str name="f.title.hl.fragsize">0</str>

<str name="f.title.hl.alternateField">title</str>

<str name="f.url.hl.fragsize">0</str>

<str name="f.url.hl.alternateField">url</str>

<str name="f.content.hl.fragmenter">regex</str>

</lst>

</requestHandler>

6. Start Solr

cd apache-solr-1.3.0/example

java -jar start.jar

7. Configure Nutch

a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its contents with the following (we specify our crawler name, the active plugins, and limit the maximum URL count for a single host per run to 100):

<?xml version="1.0"?>

<configuration>

<property>

<name>http.agent.name</name>

<value>nutch-solr-integration</value>

</property>

<property>

<name>generate.max.per.host</name>

<value>100</value>

</property>

<property>

<name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

</property>

</configuration>

b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace its contents with the following:

-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# allow urls in lucidimagination.com domain
+^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/

# deny anything else
-.
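Each rule must sit on its own line, and the first rule that matches a URL decides whether it is accepted (+) or rejected (-). To get a feel for that ordering before crawling, here is a rough sketch that emulates it with grep. The url_allowed function is a made-up helper, not part of Nutch, and because grep -E uses POSIX extended regular expressions rather than Java's, the sample rules below are a shortened, simplified version of the ones above.

```shell
# url_allowed RULES_FILE URL
# Hypothetical helper: returns 0 if the first matching rule starts with "+".
url_allowed() {
    rules_file=$1
    url=$2
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        sign=$(printf '%s' "$line" | cut -c1)
        pattern=${line#?}                 # rule without its +/- prefix
        if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
            [ "$sign" = "+" ] && return 0
            return 1
        fi
    done < "$rules_file"
    return 1                              # no rule matched: reject
}

# Simplified sample rules (assumption: adapted for POSIX ERE syntax).
rules=$(mktemp)
cat > "$rules" <<'EOF'
-^(https|telnet|file|ftp|mailto):
# skip some suffixes
-\.(gif|jpg|png|css|js|pdf|zip)$
-[?*!@=]
+^http://([a-zA-Z0-9-]*\.)*lucidimagination\.com/
-.
EOF

for u in "http://www.lucidimagination.com/blog/" \
         "http://www.lucidimagination.com/logo.png" \
         "https://www.lucidimagination.com/"; do
    if url_allowed "$rules" "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

This is only an approximation for experimenting with rule order; the real filter is the one Nutch loads from regex-urlfilter.txt.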

8. Create a seed list (the initial urls to fetch)

mkdir urls

echo "http://www.lucidimagination.com/" > urls/seed.txt

9. Inject seed url(s) to nutch crawldb (execute in nutch directory)

bin/nutch inject crawl/crawldb urls

10. Generate fetch list, fetch and parse content

bin/nutch generate crawl/crawldb crawl/segments

The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:

export SEGMENT=crawl/segments/$(ls -tr crawl/segments|tail -1)
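Segment directories are named after the time they were generated (a yyyyMMddHHmmss timestamp), so the newest one can also be found by sorting on the name instead of the modification time. The sketch below shows the idea in a throwaway directory with two made-up segment names, so it does not touch real crawl data.

```shell
# Demo in a temporary directory; the segment names are invented samples.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p crawl/segments/20090301101010 crawl/segments/20090302111111

# Timestamp names sort chronologically, so the lexically last
# name is the most recently generated segment.
SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
export SEGMENT
echo "$SEGMENT"
```

Sorting by name can be handy if segment directories have been copied or touched since they were created, since that changes their modification times but not their names.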

Now I launch the fetcher that actually goes to get the content:

bin/nutch fetch $SEGMENT -noParsing

Next I parse the content:

bin/nutch parse $SEGMENT

Then I update the Nutch crawldb. The updatedb command will store all new urls discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so the same urls won’t be fetched again and again.

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

A full fetch cycle is now complete. You can repeat step 10 a couple more times to get some more content.
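Repeating the cycle by hand gets tedious, so here is a minimal sketch of a loop that runs the four commands from step 10 a given number of times. The crawl_rounds function is a made-up helper, not something shipped with Nutch. It assumes it is run from the Nutch directory, that the crawl/ layout is the one used above, and it stops at the first command that fails.

```shell
# crawl_rounds NUTCH ROUNDS
# Hypothetical helper: runs generate/fetch/parse/updatedb ROUNDS times.
crawl_rounds() {
    nutch=$1      # path to the nutch launcher, e.g. bin/nutch
    rounds=$2     # number of fetch cycles to run
    i=1
    while [ "$i" -le "$rounds" ]; do
        "$nutch" generate crawl/crawldb crawl/segments || return 1
        # Pick the segment that generate just created (newest by name).
        segment=crawl/segments/$(ls crawl/segments | sort | tail -1)
        "$nutch" fetch "$segment" -noParsing || return 1
        "$nutch" parse "$segment" || return 1
        "$nutch" updatedb crawl/crawldb "$segment" -filter -normalize || return 1
        i=$((i + 1))
    done
}
```

For example, crawl_rounds bin/nutch 3 would run three more cycles after the seed urls have been injected.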

11. Create linkdb

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

12. Finally index all content from all segments to Solr

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Now the indexed content is available through Solr. You can run searches from the Solr admin UI at

http://127.0.0.1:8983/solr/admin

or directly with a URL like

http://127.0.0.1:8983/solr/nutch/?q=solr&version=2.2&start=0&rows=10&indent=on&wt=json
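When trying several queries from the command line, it can help to build that URL from its parts. The sketch below does only that: solr_query_url is a made-up helper, it assumes the /nutch request handler configured in step 5, and it does no URL encoding, so it only suits simple one-word queries.

```shell
# solr_query_url BASE QUERY [ROWS]
# Hypothetical helper: prints a search URL for the /nutch request handler.
solr_query_url() {
    base=$1          # Solr base URL, with or without a trailing slash
    query=$2         # query term (assumed to need no URL encoding)
    rows=${3:-10}    # number of results, defaults to 10
    printf '%s/nutch/?q=%s&version=2.2&start=0&rows=%s&indent=on&wt=json\n' \
        "${base%/}" "$query" "$rows"
}

solr_query_url http://127.0.0.1:8983/solr/ solr
```

With Solr running, the result can be fetched with a tool such as curl, for example curl "$(solr_query_url http://127.0.0.1:8983/solr/ nutch 5)".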

Hi, I too faced problems integrating Solr and Nutch. After some work I found the article below and integrated them successfully. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

RunningNutchAndSolr (last edited 2009-09-20 23:09:56 by localhost)