General

No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml

The crawl tool expects as its first parameter the name of the folder that contains the seed URLs file. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seed -dir /user/nutchuser...

Exception: Invalid argument or cannot assign requested address on Fedora Core 3 or 4

This usually means IPv6 is enabled on your machine.

To solve this problem, add the following Java parameter to the java invocation in bin/nutch:

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@"
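The exec line references $JAVA_IPV4, which must be set before that line runs. A minimal sketch, assuming the variable is not already defined in your copy of bin/nutch and that the standard JVM system property java.net.preferIPv4Stack is the intended setting:

```shell
# Assumption: JAVA_IPV4 is not set elsewhere in bin/nutch.
# java.net.preferIPv4Stack=true tells the JVM to prefer an IPv4-only socket stack.
JAVA_IPV4="-Djava.net.preferIPv4Stack=true"

# Show the flag that the exec line will pass to the JVM.
echo "IPv4 flag: $JAVA_IPV4"
```

Place the assignment above the exec line. The echo is only there to confirm the value and can be removed.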

FileNotFoundException: 1

Running the crawl with -delay 1 fails. crawl.test and its subdirectories are created, ant compiles without problems, ROOT.war is installed and runs, and the urls file exists. Adding ./ or the full path as x below changes nothing. The server runs Squid on port 80 and a real Apache 1.3 on port 81. Catalina is on port 8080 and is up and running.

crawl.test exists:

ls -R crawl.test/
crawl.test/:
. .. db

crawl.test/db:
. .. dbreadlock dbwritelock webdb

crawl.test/db/webdb:
. .. linksByMD5 linksByURL pagesByMD5 pagesByURL

crawl.test/db/webdb/linksByMD5:
. .. data index

crawl.test/db/webdb/linksByURL:
. .. data index

crawl.test/db/webdb/pagesByMD5:
. .. data index

crawl.test/db/webdb/pagesByURL:
. .. data index

It always fails with the above error, while omitting the -delay option seems to work. I tried putting the -delay option in several places in the command; it always fails.

Environment: Nutch 0.7, Apache Tomcat/5.0.19, JDK 1.4.2-b28 (Sun Microsystems Inc.), Linux (SuSE 8.2, 1.5 years old but updated), Linux kernel 2.4.21, i386.

It works without the -delay option, but I can't run it against other sites with no delay. What am I doing wrong?

Fetching Errors

Why do I get error "123456 104934 fetch of http://mydomain/index.html failed with: HTTP Error: 401" when crawling?

/etc/host.conf: line 1: cannot specify more then 4 services

While fetching I get UnknownHostException for known hosts

Make sure your DNS server is working and that it can handle the request load.
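A quick way to check the first point is to resolve a few of the failing host names from the crawl machine itself. The sketch below assumes a Linux system with getent available; check_host is a hypothetical helper name, not part of Nutch.

```shell
# check_host is a hypothetical helper, not part of Nutch.
# It asks the system resolver for the given host name and reports the result.
check_host() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "FAIL $1"
  fi
}

# Example: replace localhost with a host the fetcher reported as unknown.
check_host localhost
```

If hosts resolve here but still fail during a fetch, load on the DNS server is the more likely cause.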

Updating Errors

While updating my DB I get an OutOfMemoryException or a 'too many open files' error.
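For the open-files error, it can help to look at the file descriptor limit of the shell that starts Nutch. A small sketch using the shell's built-in ulimit; whether this limit is actually the cause on a given system is an assumption to verify:

```shell
# Print the current soft limit on open file descriptors for this shell.
# Processes started from this shell (including bin/nutch) inherit it.
open_files_limit=$(ulimit -n)
echo "open file limit: $open_files_limit"
```

A low value here (1024 is a common default) can be raised with ulimit -n before starting the update, subject to the hard limit.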

Indexing Errors

While indexing documents, I get the following error:

050529 011245 fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. Parser can't handle incomplete msword file.

What is happening?
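The log line itself says the fetched content was cut off at 65536 bytes, so the parser received an incomplete file. In Nutch the fetch size limit is normally controlled by the http.content.limit property (default 65536), though that should be checked against your own nutch-default.xml. The sketch below only checks whether a local file exceeds that size; exceeds_limit is a hypothetical helper:

```shell
# LIMIT mirrors the 65536-byte figure from the log line.
LIMIT=65536

# exceeds_limit is a hypothetical helper, not part of Nutch.
# It succeeds (exit status 0) when the file is larger than LIMIT bytes.
exceeds_limit() {
  size=$(( $(wc -c < "$1") ))
  [ "$size" -gt "$LIMIT" ]
}

# Example: build a 70000-byte sample file and test it.
sample=$(mktemp)
head -c 70000 /dev/zero > "$sample"
if exceeds_limit "$sample"; then
  echo "$sample would be truncated"
fi
rm -f "$sample"
```

Documents that exceed the limit will keep failing to parse until the limit is raised in your site configuration.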



Searching Errors

Tomcat reports "root cause: java.lang.OutOfMemoryError" and the search does not find anything.
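An OutOfMemoryError here means the JVM running Tomcat has run out of heap. One common remedy, sketched below, is to raise the maximum heap through the CATALINA_OPTS environment variable that catalina.sh reads; the 512m value is an assumption and should be sized to your index and machine:

```shell
# Append a larger maximum heap to whatever options are already set.
# 512m is an illustrative value, not a recommendation from the Nutch documentation.
CATALINA_OPTS="${CATALINA_OPTS:-} -Xmx512m"
export CATALINA_OPTS
echo "CATALINA_OPTS=$CATALINA_OPTS"
```

Tomcat has to be restarted from the same environment for the setting to take effect.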

Installation Errors

See GettingNutchRunningWithUbuntu for some help.

Nutch on Debian (cont)

What is mentioned here

java.lang.NoClassDefFoundError: org/apache/coyote/http11/Http11Processor$1

can be avoided with permission "*", "read,write,execute,delete";

Unfortunately, the cache anchor option still doesn't work: access denied (java.util.PropertyPermission * read,write)

This happens regardless of adding

permission "*", "read,write,execute,delete";




So if you are then entirely fed up with trying to find out what's wrong, because a bad stack trace plus impenetrable security settings are self-defeating,

you enter permission;

in /etc/tomcat4/policy.d/04webapps.policy

and the thing works ... (but I am not even contemplating what security holes I have opened here :|)

Setup on a SUSE 8.1 system was no problem btw ...

