RunningNutchAndSolr

This is a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some or all of it, but I'm not aware of them. If you have a better idea, by all means please correct or update this. Many thanks to Brian Whitman at Variogr.am and Sami Siren at FooFactory for all the help! You guys saved me a lot of time! :)

I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk.

  1. Check out solr-trunk ( svn co http://svn.apache.org/repos/solr/ solr-trunk )

  2. Check out nutch-trunk ( svn co http://svn.apache.org/repos/nutch/ nutch-trunk )

  3. Go into the solr-trunk and run 'ant dist dist-solrj'

  4. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar from solr-trunk/dist to nutch-trunk/lib

  5. Apply the FooFactory patch to nutch-trunk (cd nutch-trunk; patch -p0 < nutch_solr.patch)

  6. Get the zip file from Variogr.am and unzip it somewhere other than nutch-trunk

  7. Copy ONLY SolrIndexer.java from the unzipped src/java/org/apache/nutch/indexer/ to nutch-trunk/src/java/org/apache/nutch/indexer

  8. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java (somewhere around line 92):

  9. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java, changing the scope of LuceneDocumentWrapper from private to protected (the variogram.com diff quoted under Troubleshooting makes it public instead)

  10. Get the zip file from FooFactory for SOLR-20

  11. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'

  12. Copy solr-client.jar from dist to nutch-trunk/lib

  13. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib

  14. Configure nutch-trunk/conf/nutch-site.xml with *at least* the settings for your site, including a value for the property indexer.solr.url (something like http://localhost:8983/solr/). You should also set http.agent.name, http.agent.description, http.agent.url, and http.agent.email.

  15. Edit nutch-trunk/conf/regex-urlfilter.txt to include a pattern for what to grab (such as +^http://([a-z0-9]*\.)apache.org/)

  16. Configure some URL(s) to crawl (make a nutch-trunk/urls directory containing a text file with just a URL in it, like http://lucene.apache.org/nutch)

  17. Get the Crawl.sh script from FooFactory and copy it to nutch-trunk/bin (editing it if needed for things like topN)

  18. Go into solr-trunk and make an example server instance (run 'ant example')

  19. Copy example off somewhere (like /tmp/mysolr)

  20. Edit mysolr/solr/conf/schema.xml

    • Add the fields that Nutch needs (url, content, segment, digest, host, site, anchor, title, tstamp, text; see the FooFactory article on Nutch + Solr)

    • Change defaultSearchField to 'text'

    • Change defaultOperator to 'AND'

    • Add lines to "copyField" section to copy anchor, title, and content into the text field

  21. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar)

  22. Run a Nutch crawl using the bin/crawl.sh script.
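
As a rough illustration of step 14, a nutch-site.xml might look something like the sketch below. The property names are the ones listed in that step; every value is a placeholder I made up (the agent name, description, URL, and email in particular), so substitute your own.

```xml
<?xml version="1.0"?>
<!-- nutch-trunk/conf/nutch-site.xml: a sketch only; all values are placeholders -->
<configuration>
  <!-- Where the Solr indexer posts documents (example URL from step 14) -->
  <property>
    <name>indexer.solr.url</name>
    <value>http://localhost:8983/solr/</value>
  </property>
  <!-- Crawler identification; the values below are hypothetical -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for a Nutch + Solr setup</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler@example.com</value>
  </property>
</configuration>
```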

If you watch the output from your Solr instance (its logs), you should see a bunch of messages scroll by when Nutch finishes crawling and posts the new documents. If not, then something isn't configured right. I'll try to add more notes here as people have questions/issues.
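
If nothing shows up, the schema from step 20 is one place to look. Here is a sketch of what those schema.xml additions might look like. The field names come from step 20, but the types and the indexed/stored/multiValued flags are my guesses, not taken from the FooFactory article, so check them against it before relying on this. The type names "string" and "text" assume the types defined in the Solr example schema.

```xml
<!-- mysolr/solr/conf/schema.xml: fragments only, not a complete file -->

<!-- Inside <fields>...</fields>: fields Nutch sends (types and flags are assumptions) -->
<field name="url"     type="string" indexed="true"  stored="true"/>
<field name="content" type="text"   indexed="true"  stored="true"/>
<field name="segment" type="string" indexed="false" stored="true"/>
<field name="digest"  type="string" indexed="false" stored="true"/>
<field name="host"    type="string" indexed="true"  stored="false"/>
<field name="site"    type="string" indexed="true"  stored="false"/>
<field name="anchor"  type="text"   indexed="true"  stored="false" multiValued="true"/>
<field name="title"   type="text"   indexed="true"  stored="true"/>
<!-- A numeric or date type may suit tstamp better; string is a placeholder choice -->
<field name="tstamp"  type="string" indexed="false" stored="true"/>
<!-- The example schema may already define a "text" field; if so, do not add a second one -->
<field name="text"    type="text"   indexed="true"  stored="false" multiValued="true"/>

<!-- Search the catch-all field by default, and require all terms to match -->
<defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>

<!-- In the copyField section: feed anchor, title, and content into text -->
<copyField source="anchor"  dest="text"/>
<copyField source="title"   dest="text"/>
<copyField source="content" dest="text"/>
```

Solr only reads schema.xml at startup, so restart the instance from step 21 after editing it.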

Troubleshooting:


ERROR: I did everything, but I got this error. Any ideas?

2008-04-03 15:42:28,009 WARN mapred.LocalJobRunner - job_local_1 java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.ObjectWritable, recieved org.apache.nutch.crawl.NutchWritable

2008-04-03 15:42:28,609 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed!


Sorry, but nothing changed! Same as below.

ERROR: I changed the lines and it worked, but then it gave this error. I tried both private and protected scopes, but nothing changed. I also changed the line Document doc = (Document) ((ObjectWritable) value).get(); to Document doc = (Document) ((NutchWritable) value).get(); and that gave a build error.

2008-04-04 10:41:48,490 WARN mapred.LocalJobRunner - job_local_1

java.lang.ClassCastException: org.apache.nutch.indexer.Indexer$LuceneDocumentWrapper

2008-04-04 10:41:49,085 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed!


It works like a charm, thanks for your help. I had repeated a mistake in nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java. The fix is explained at http://variogram.com/latest/?p=26:

    +++ src/java/org/apache/nutch/indexer/Indexer.java (working copy)
    - private static class LuceneDocumentWrapper implements Writable {
    + public static class LuceneDocumentWrapper implements Writable {

last edited 2008-04-14 18:31:49 by NickTkach