Monitoring Nutch Crawls

So you got Nutch all configured and turned it loose on your site, but your itchy trigger finger just needs to know how well it's working? Here are a couple ways you can keep an eye on your crawl:

Monitoring network traffic

One way is to watch Nutch suck up your bandwidth as it crawls its way around. If you look at a graph of historical bandwidth usage, you should see it spike up and stay at a fairly consistent plateau, with valleys every so often each time a segment completes (since while Nutch is merging segments it doesn't use any bandwidth).

Some tools for this:

ntop (Linux, Windows) - A nifty program that gives you a Web-based history of your machine's bandwidth usage. You might get lucky and have it install easily... because the website isn't terribly helpful for install help.

Monitoring fetch statistics

Of course, the bandwidth alone doesn't tell the whole story. How many pages are you retrieving? How many failed?

Here's a quick little shell script to do this; I'm sure people can improve on this--edit this page if so!

 
#!/bin/sh
echo "Monitoring nohup.out crawl progress..."
while :
do
  echo "Tried `grep 'fetching' nohup.out | wc -l` pages; `grep 'failed' nohup.out | wc -l` failed."
  sleep 60
done

To run this script:

Save this script as something like monitorCrawl.sh 2. Run your preferred crawl script with nohup, like this: nohup <nutch crawl command or script> & 3. By default, this will output to nohup.out in the working directory. From the same directory, run: sh monitorCrawl.sh

(Alternately, you can process hadoop.log in the logs/ directory by changing the three references to nohup.out to hadoop.log. Be aware, though, that by default hadoop.log only contains activity from today, so your counts will reset to zero each night.)

This will give you minute-by-minute stats on how many pages nutch tried to fetch and how many failed with errors (e.g. 404, server unreachable).

A More expansion of above Script:

Run this script by changing the three export paths in the script...

#!/bin/bash
####################################################################################
#########################Author: Chalavadi Suman Kumar #############################
##########################Email: sumankumar4@gmail.com #############################
####################################################################################

# Usage: sh monitor.sh existing/running logfile outputlogfile errorlogfile
# 'existing' - logfile is not a running log (already existing log)
# 'running'  - logfile is a running log (this is the default option)


#Edit your Log directory names...
export LOGFILE=/home/suman/nutch-0.8/logs/hadoop.log
export SAVEFILE=/home/suman/NutchMonitor/short_hadoop.log
export ERRORFILE=/home/suman/NutchMonitor/logs/hadoop_error.log

#Specify for reading the existing log or running log
#By default it assumes 'running'
case "$1" in
'existing') COMMAND="cat" ;;

'running') COMMAND="tail -f" ;;

*) COMMAND="tail -f" ;;
esac

#Change the LOGFILE,SAVEFILE and ERRORFILE, if provided through command line
#IF '-' is given, take the default values..
if [[ "$2" != "" && "$2" != "-" ]]; then LOGFILE=$2; fi
if [[ "$3" != "" && "$3" != "-" ]]; then SAVEFILE=$3; fi
if [[ "$4" != "" && "$4" != "-" ]]; then ERRORFILE=$4; fi

#Initializing the Variables....
minute=0
fetchcount=0
fetchfailcount=0
indexcount=0
mflcount=0
lasttime=0
newtime=0
totalfetchcount=0
totalindexedcount=0
totalfetchfail=0
totalmflcount=0

#Reads the appended content of the file as the file grows. So each log is processed through the while loop...
$COMMAND $LOGFILE | while read some; do

#to aviod exceptions. They some time start with 'java' and 'at'.
case $some in
[0-9]*)
#Get the time(current minute) in 'newtime' and entire date in 'totaltime'
newtime=`echo $some|perl -ne '@words=split(/\s+/);@temp=split(/:/,$words[1]);print $temp[1]'`
#totaltime=`echo $some|awk '{print $1" "$2}'`
totaltime=`echo $some|perl -ne '@word=split(/\s+/);print "$word[0] $word[1]"'`
;;
*) echo $some >>$ERRORFILE;;
esac

#echo $temp $newtime $totaltime $lasttime

#Pattern matches for important operations and prints with the change in time...
case "$some" in
*indexer.Indexer\ -\ Indexer:\ starting*)
                indexcount=0;mflcount=0;
                ;;

*indexer.Indexer\ -\ Indexer:\ done*)
                if [[ $indexcount -ne 0 || $mlfcount -ne 0 ]]; then
                echo "$totaltime Indexed:$indexcount MaxFieldLengthError:$mflcount TotalPagesIndexed=$totalindexedcount TotalMFLErrors=$totalmflcount" >>$SAVEFILE
                lasttime="$newtime";
                fi
                indexcount=0;mflcount=0;
                ;;

*fetcher.Fetcher\ -\ Fetcher:\ starting*)
                fetchcount=0;fetchfailcount=0;
                ;;

*fetcher.Fetcher\ -\ Fetcher:\ done*)
                if [[ $fetchcount -ne 0 || $fetchfailcount -ne 0 ]]; then
                echo "$totaltime FetchTried:$fetchcount FetchFailed:$fetchfailcount TotalPagesFetched=$totalfetchcount TotalFetchesFailed=$totalfetchfail" >>$SAVEFILE
                lasttime="$newtime";
                fi
                fetchcount=0;fetchfailcount=0;
                ;;

*fetching*)
                let fetchcount=$fetchcount+1;
                let totalfetchcount=$totalfetchcount+1;
                if [ "$newtime" -ne "$lasttime" ]; then
                echo "$totaltime FetchTried:$fetchcount FetchFailed:$fetchfailcount TotalPagesFetched=$totalfetchcount TotalFetchesFailed=$totalfetchfail" >>$SAVEFILE
                fetchcount=0;fetchfailcount=0;
                lasttime="$newtime";
                fi
                ;;

*failed*)
                let fetchfailcount=$fetchfailcount+1;
                let totalfetchfail=$totalfetchfail+1;
                if [ "$newtime" -ne "$lasttime" ]; then
                echo "$totaltime FetchTried:$fetchcount FetchFailed:$fetchfailcount TotalPagesFetched=$totalfetchcount TotalFetchesFailed=$totalfetchfail" >>$SAVEFILE
                fetchcount=0;fetchfailcount=0;
                lasttime="$newtime";
                fi
                ;;

*Indexing\ Filter*);;
*IndexingFilter*);;

*Indexing*)
                let indexcount=$indexcount+1;
                let totalindexedcount=$totalindexedcount+1;
                if [ "$newtime" -ne "$lasttime" ]; then
                echo "$totaltime Indexed:$indexcount MaxFieldLengthError:$mflcount TotalPagesIndexed=$totalindexedcount TotalMFLErrors=$totalmflcount" >>$SAVEFILE
                indexcount=0;mflcount=0;
                lasttime="$newtime";
                fi
                ;;

*maxFieldLength*)
                let mflcount=$mflcount+1;
                let totalmflcount=$totalmflcount+1;
                if [ "$newtime" -ne "$lasttime" ]; then
                echo "$totaltime Indexed:$indexcount MaxFieldLengthError:$mflcount TotalPagesIndexed=$totalindexedcount TotalMFLErrors=$totalmflcount" >>$SAVEFILE
                indexcount=0;mflcount=0;
                lasttime="$newtime";
                fi
                ;;
esac

#lasttime="$newtime";

#print the important operations(stages of crawling).
echo $some | perl -ne 'if(/fetcher.Fetcher\ -\ Fetcher:|indexer.Indexer\ -\ Indexer:|crawl.CrawlDb\ -\ CrawlDb update:|crawl.Generator\ -\ Generator:|crawl.Injector\ -\ Injector:|crawl.LinkDb\ -\ LinkDb:|indexer.DeleteDuplicates\ -\ Dedup:|indexer.IndexMerger|crawl.Crawl\ -\ crawl/g){print}' >>$SAVEFILE

echo $some | perl -ne 'if(/ERROR|WARN|FATAL/g){print}' >>$ERRORFILE

done

Running this script:

Change the exported paths(LOGFILE,SAVEFILE,ERRORFILE) in the script according to your nutch location. 2. Run the script. 3. Monitor your crawl by the following command: tail -f short_hadoop.log

This will give you minute-by-minute stats on how many pages nutch tried to fetch and how many failed with errors, how many pages indexed, how many pages truncated due to maximum field length(MLF) error during indexing. The errors are saved to another file.