This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some or all of it, but I'm not aware of them. If you have a better idea, by all means please correct or update this. Many thanks to Brian Whitman at Variogr.am and Sami Siren at FooFactory for all the help! You guys saved me a lot of time!
I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm not going to go command by command for now. I'm running and building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and the Nutch trunk code is checked out into nutch-trunk.
1. Check out solr-trunk (svn co http://svn.apache.org/repos/solr/ solr-trunk)
2. Check out nutch-trunk (svn co http://svn.apache.org/repos/nutch/ nutch-trunk)
3. Go into solr-trunk and run 'ant dist dist-solrj'
4. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar from solr-trunk/dist to nutch-trunk/lib
5. Apply the FooFactory patch to nutch-trunk (cd nutch-trunk; patch -p0 < nutch_solr.patch)
6. Get the zip file from Variogr.am and unzip it somewhere other than nutch-trunk
7. Copy ONLY SolrIndexer.java from the unzipped src/java/org/apache/nutch/indexer/ to nutch-trunk/src/java/org/apache/nutch/indexer
8. Edit nutch-trunk/src/java/org/apache/nutch/indexer/SolrIndexer.java (somewhere around line 92):
   a. Replace int res = new SolrIndexer().doMain(NutchConfiguration.create(), args); with int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexer(), args);
   b. Edit the imports to pick up org.apache.hadoop.util.ToolRunner
9. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java, changing the scope of LuceneDocumentWrapper from private to protected
10. Get the zip file from FooFactory for SOLR-20
11. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
12. Copy solr-client.jar from dist to nutch-trunk/lib
13. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
14. Configure nutch-trunk/conf/nutch-site.xml with *at least* the settings for your site, including a value for the property indexer.solr.url (something like http://localhost:8983/solr/). You should also set http.agent.name, http.agent.description, http.agent.url, and http.agent.email.
15. Edit nutch-trunk/conf/regex-urlfilter.txt to include some pattern for what to grab (such as +^http://([a-z0-9]*\.)apache.org/)
16. Configure some url(s) to crawl (make a nutch-trunk/urls directory containing a text file with just a url in it, like http://lucene.apache.org/nutch)
17. Get the crawl.sh script from FooFactory and copy it to nutch-trunk/bin (editing it if needed for things like topN)
18. Go into solr-trunk and make an example server instance (run 'ant example')
19. Copy the example directory off somewhere (like /tmp/mysolr)
20. Edit mysolr/solr/conf/schema.xml:
   a. Add the fields that Nutch needs (url, content, segment, digest, host, site, anchor, title, tstamp, text; see the FooFactory Article on Nutch + Solr)
   b. Change defaultSearchField to 'text'
   c. Change defaultOperator to 'AND'
   d. Add lines to the "copyField" section to copy anchor, title, and content into the text field
21. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar)
22. Run a Nutch crawl using the bin/crawl.sh script.
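The Nutch-side configuration described above (nutch-site.xml, the seed URL, and the filter pattern) could be sketched as a small shell script. This is only an illustration, not the setup I actually used: the NUTCH_HOME default, the file name seed.txt, and every http.agent.* value are made-up placeholders. Only the property names, the Solr URL, the seed URL, and the filter pattern come from the steps. Note that it overwrites any existing nutch-site.xml.

```shell
#!/bin/sh
# Sketch only: write a minimal nutch-site.xml and a seed URL list.
# NUTCH_HOME, seed.txt, and all http.agent.* values are hypothetical placeholders.
set -e

NUTCH_HOME="${NUTCH_HOME:-./nutch-trunk}"
mkdir -p "$NUTCH_HOME/conf" "$NUTCH_HOME/urls"

# Minimal nutch-site.xml (this OVERWRITES an existing file).
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>indexer.solr.url</name>
    <value>http://localhost:8983/solr/</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawl for a Nutch plus Solr setup</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.com</value>
  </property>
</configuration>
EOF

# Seed list: a text file with one URL in it.
echo 'http://lucene.apache.org/nutch' > "$NUTCH_HOME/urls/seed.txt"

# Rough sanity check that the seed passes the filter pattern (without the leading "+").
# grep -E only approximates the Java regex engine that Nutch uses.
if grep -E -q '^http://([a-z0-9]*\.)apache.org/' "$NUTCH_HOME/urls/seed.txt"; then
  echo "seed URL matches the filter pattern"
else
  echo "seed URL does NOT match the filter pattern" >&2
fi
```

The filter pattern itself still has to be added to regex-urlfilter.txt by hand. Rules there are tried in order, so the new line should go above any catch-all rule.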
If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents. If not, then you've got something not configured right. I'll try to add more notes here as people have questions/issues.
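Besides watching the Solr logs, another quick way to confirm that documents arrived is to ask Solr how many it holds. The sketch below assumes the example server is listening on http://localhost:8983/solr/ and that curl is installed. The helper name solr_count_url is made up for this example.

```shell
#!/bin/sh
# Sketch only: build a Solr select URL that matches every document (q=*:*)
# but returns no rows, so the response carries just the total hit count.
# solr_count_url is a hypothetical helper name.
solr_count_url() {
  base="${1%/}"    # strip one trailing slash, if present
  printf '%s/select?q=*:*&rows=0\n' "$base"
}

# Print the URL that would be queried.
solr_count_url "http://localhost:8983/solr/"

# Actual query (needs a running Solr, so it is left commented out):
#   curl "$(solr_count_url http://localhost:8983/solr/)"
```

In the XML that comes back, look at the numFound attribute of the result element. If it is still 0 after a crawl, nothing was posted.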
Troubleshooting:
If you get errors about "Type mismatch in value from map:" (expected ObjectWritable, but received NutchWritable), then you are likely missing the two SolrIndexer.java edits in the steps above (the ToolRunner replacement and its import). Sorry about that, I forgot to include that change at first.
Note: Double-mistaken. I've re-written the order of the steps. Turns out you do need both the Variogram file and the FooFactory files.
When in doubt, look at nutch-trunk/logs/hadoop.log. It frequently shows details about what's gone wrong and can be a big help when you start getting "unexplained" errors.
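Because hadoop.log can get long, a filter along these lines may help surface the relevant entries. It is a rough sketch: the function name show_log_problems, the pattern list (WARN, ERROR, FATAL, Exception), and the 20-line cutoff are my own assumptions, not something Nutch provides.

```shell
#!/bin/sh
# Sketch only: print the most recent suspicious lines from a Nutch/Hadoop log.
# show_log_problems is a hypothetical helper name.
show_log_problems() {
  log="${1:-nutch-trunk/logs/hadoop.log}"
  # -n prefixes each hit with its line number in the log file;
  # tail keeps only the last 20 hits.
  grep -n -E 'WARN|ERROR|FATAL|Exception' "$log" | tail -n 20
}
```

Call it as show_log_problems nutch-trunk/logs/hadoop.log, then open the log at the reported line numbers to read the full stack trace.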
See the original articles: FooFactory Article on Nutch + Solr and Variogr.am Updates to FooFactory Posting.
ERROR: I did everything, but I got this error. Any idea?
2008-04-03 15:42:28,009 WARN mapred.LocalJobRunner - job_local_1
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.ObjectWritable, recieved org.apache.nutch.crawl.NutchWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:369)
    at org.apache.nutch.indexer.Indexer.map(Indexer.java:344)
    at org.apache.nutch.indexer.Indexer.map(Indexer.java:52)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
2008-04-03 15:42:28,609 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
    at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:86)
    at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:111)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93)
Sorry, but nothing changed! Same as below.
ERROR: I changed the lines and it worked, but this time it gave the error below. I tried both private and protected scopes, but nothing changed. I also changed the line Document doc = (Document) ((ObjectWritable) value).get(); to Document doc = (Document) ((NutchWritable) value).get(); but that gave a build error.
2008-04-04 10:41:48,490 WARN mapred.LocalJobRunner - job_local_1
java.lang.ClassCastException: org.apache.nutch.indexer.Indexer$LuceneDocumentWrapper
    at org.apache.nutch.indexer.SolrIndexer$OutputFormat$1.write(SolrIndexer.java:135)
    at org.apache.hadoop.mapred.ReduceTask$2.collect(ReduceTask.java:315)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:275)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
2008-04-04 10:41:49,085 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
    at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:87)
    at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:112)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:94)
It works like a charm, thanks for your help. (I had repeated a mistake in the nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java file. It is explained at http://variogram.com/latest/?p=26:
+++ src/java/org/apache/nutch/indexer/Indexer.java (working copy)
- private static class LuceneDocumentWrapper implements Writable {
+ public static class LuceneDocumentWrapper implements Writable {
)