How to set up Nutch and Hadoop on Ubuntu 6.06

Prerequisites

To make use of a real distributed file system, Hadoop needs at least two computers; it can also run on a single machine, but then no use is made of the distributed capabilities.

Nutch is written in Java, so the Java compiler and runtime are needed, as well as Ant. Hadoop makes use of SSH clients and servers on all machines. The search web interface needs a servlet container; I used tomcat5.

To be able to log in as root with su, execute the following command and enter the new password for root when prompted.

sudo passwd

Log in as root

su

Enable the universe and multiverse repositories by editing the apt sources.list file.

vi /etc/apt/sources.list 

Or execute the following if you are in the Netherlands and are using Ubuntu 6.06 Dapper.

echo "deb http://nl.archive.ubuntu.com/ubuntu/ dapper universe multiverse" >> /etc/apt/sources.list
echo "deb-src http://nl.archive.ubuntu.com/ubuntu/ dapper universe multiverse" >> /etc/apt/sources.list

Update the apt cache

apt-get update

Install the necessary packages for Nutch (Java and SSH) on all machines

apt-get install sun-java5-jre
apt-get install ssh

update-alternatives --config java
#select /usr/lib/jvm/java-1.5.0-sun/jre/bin/java

And for the search web server

apt-get install apache2
apt-get install sun-java5-jdk
apt-get install tomcat5

Configure tomcat by editing /etc/default/tomcat5

vi /etc/default/tomcat5
# Add JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/

Or execute the following

echo "JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/" >> /etc/default/tomcat5

Build nutch

Download Nutch, which includes Hadoop and Lucene. I used the latest nightly build, which at the time of writing was 2007-02-06 (Nutch nightly).

Unpack the tarball to nutch-nightly and build it with ant.

tar -xvzf nutch-2007-02-06.tar.gz
cd nutch-nightly
mkdir /nutch-build
echo "/nutch-build" >> build.properties
ant package
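
If the ant run succeeds, the packaged build should end up in /nutch-build (assuming build.properties points dist.dir there as above); a quick listing is a cheap sanity check before copying the build to the other machines.

ls /nutch-build
# expect to see at least bin/, conf/, lib/ and the nutch .job and .war files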

Setup

Prepare the machines

Create the nutch user on each machine and create the necessary directories for nutch

ssh root@???

mkdir /nutch-0.9.0
mkdir /nutch-0.9.0/search
mkdir /nutch-0.9.0/filesystem
mkdir /nutch-0.9.0/local
mkdir /nutch-0.9.0/home

groupadd users
useradd -d /nutch-0.9.0/home -g users nutch
passwd nutch

chown -R nutch:users /nutch-0.9.0
exit

Install and configure nutch and hadoop

Install nutch on the namenode (the master) and add the following variables to the hadoop-env.sh shell script.

ssh nutch@???
cp -Rv /nutch-build/* /nutch-0.9.0/search/

echo "export HADOOP_HOME=/nutch-0.9.0/search" >> /nutch-0.9.0/search/conf/hadoop-env.sh
echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> /nutch-0.9.0/search/conf/hadoop-env.sh
echo "export HADOOP_LOG_DIR=/nutch-0.9.0/search/logs" >> /nutch-0.9.0/search/conf/hadoop-env.sh
echo "export HADOOP_SLAVES=/nutch-0.9.0/search/conf/slaves" >> /nutch-0.9.0/search/conf/hadoop-env.sh

exit

Configure SSH

Create SSH keys so that the nutch user can log in over SSH without being prompted for a password.

ssh nutch@???
cd /nutch-0.9.0/home
ssh-keygen -t rsa
#! Use empty responses for each prompt
#  Enter passphrase (empty for no passphrase): 
#  Enter same passphrase again: 
#  Your identification has been saved in /nutch-0.9.0/home/.ssh/id_rsa.
#  Your public key has been saved in /nutch-0.9.0/home/.ssh/id_rsa.pub.
#  The key fingerprint is:
#  a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

Copy the key for the master (the namenode) to the authorized_keys file that will be copied to the other machines (the slaves).

cd /nutch-0.9.0/home/.ssh
cp id_rsa.pub authorized_keys
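
sshd is strict about the permissions of the key files; if passwordless login still asks for a password later on, make sure the .ssh directory and authorized_keys are only accessible by the nutch user. A quick check, assuming the key was generated as above:

chmod 700 /nutch-0.9.0/home/.ssh
chmod 600 /nutch-0.9.0/home/.ssh/authorized_keys
ssh localhost hostname
# should print the hostname without prompting for a password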

Configure Hadoop

Edit the mapred-default.xml configuration file. If it is missing, create it with the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property> 
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>
    This should be a prime number several times larger than the number of
    slave hosts, e.g. for 3 nodes set this to 17
  </description> 
</property> 

<property> 
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>
    This should be a prime number close to a low multiple of the number of
    slave hosts, e.g. for 3 nodes set this to 7
  </description> 
</property> 

</configuration>

Note: do NOT put these properties in hadoop-site.xml. That file (see below) should only contain properties characteristic to the cluster, and not to the job. Misplacing these properties leads to strange and difficult to debug problems - e.g. inability to specify the number of map/reduce tasks programmatically (Generator and Fetcher depend on this).

Edit the hadoop-site.xml configuration file.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>fs.default.name</name>
  <value>???:9000</value>
  <description>
    The name of the default file system. Either the literal string 
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>???:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If 
    "local", then jobs are run in-process as a single map and 
    reduce task.
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>
    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of each task.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
  <description>
    You can specify other Java options for each map or reduce task here,
    but most likely you will want to adjust the heap size.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/nutch-0.9.0/filesystem/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/nutch-0.9.0/filesystem/data</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/nutch-0.9.0/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/nutch-0.9.0/filesystem/mapreduce/local</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

</configuration>

Configure Nutch

Edit the nutch-site.xml file. Take the contents below and fill in the value tags.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>

Edit the crawl-urlfilter.txt file to set the pattern of the URLs that have to be fetched.

cd /nutch-0.9.0/search
vi conf/crawl-urlfilter.txt

change the line that reads:   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read:                      +^http://([a-z0-9]*\.)*org/
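
The same change can also be scripted; this is just a sketch using sed, assuming the stock MY.DOMAIN.NAME line is still present in the file:

sed -i 's|MY\.DOMAIN\.NAME/|org/|' conf/crawl-urlfilter.txt
grep '^+' conf/crawl-urlfilter.txt
# the accept pattern should now end in org/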

Or, if downloading the whole internet is desired, edit the nutch-site.xml file so that it includes the following property.

<property>
  <name>urlfilter.regex.file</name>
  <value>automaton-urlfilter.txt</value>
</property>

Distribute the code and the configuration

Copy the code and the configuration to the slaves

scp -r /nutch-0.9.0/search/* nutch@???:/nutch-0.9.0/search

Copy the keys to the slave machines

scp /nutch-0.9.0/home/.ssh/authorized_keys nutch@???:/nutch-0.9.0/home/.ssh/authorized_keys
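
scp does not necessarily preserve the permissions sshd expects, so after copying it can help to tighten them on each slave as well (??? again stands for a slave host):

ssh nutch@??? "chmod 700 /nutch-0.9.0/home/.ssh; chmod 600 /nutch-0.9.0/home/.ssh/authorized_keys"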

Check if sshd is ready on the machines

ssh ???
hostname

Crawling

To crawl using DFS it is first necessary to format the Hadoop namenode and then start it, as well as all the datanode services.

Format the namenode

bin/hadoop namenode -format

Start all services on all machines.

bin/start-all.sh
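
To verify that the daemons actually came up, jps (part of the Sun JDK, so only available where sun-java5-jdk is installed) should show a NameNode and JobTracker on the master and a DataNode and TaskTracker on each slave; a simple dfs listing is another quick check. If something is missing, look at the log files under /nutch-0.9.0/search/logs.

jps
# e.g. on the master: NameNode, JobTracker (plus DataNode/TaskTracker if it is also a slave)
bin/hadoop dfs -ls /
# should return without errors once the namenode and datanodes are running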

To stop all of the servers use the following command; do not do this now:

bin/stop-all.sh

To start crawling from a few URLs as seeds, make a urls directory and put a seed file with some seed URLs in it. Then put this directory into HDFS; to check that HDFS has stored the directory, use Hadoop's dfs -ls option.

mkdir urls
echo "http://lucene.apache.org" >> urls/seed
bin/hadoop dfs -put urls urls
bin/hadoop dfs -ls urls

Start an initial crawl

bin/nutch crawl urls -dir crawled -depth 3
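
When the crawl finishes, the crawled directory in HDFS should contain the usual Nutch structures; listing it is a quick way to verify that the crawl actually produced data:

bin/hadoop dfs -ls crawled
# expect crawldb, linkdb, segments, indexes and index subdirectories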

On the master node the progress and status can be viewed with a web browser at http://localhost:50030/

Searching

To search the collected web pages, the data that is now in HDFS is best copied to the local filesystem for better performance. If an index becomes too large for one machine to handle, it can be split so that separate machines each handle a part of the index. First we try to perform a search on one machine.

Install nutch for searching

Because searching needs different Nutch settings than crawling, the easiest thing to do is to make a separate folder for the Nutch search part.

ssh root@???
mkdir /nutchsearch-0.9.0
chown nutch:users /nutchsearch-0.9.0
exit

ssh nutch@???
cp -Rv /nutch-build /nutchsearch-0.9.0/search
mkdir /nutchsearch-0.9.0/local

Configure

Edit the nutch-site.xml in the nutch search directory

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/nutchsearch-0.9.0/local/crawled</value>
  </property>

</configuration>

Edit the hadoop-site.xml file and delete all the properties

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Make a local index

Copy the data from dfs to the local filesystem.

bin/hadoop dfs -copyToLocal crawled /nutchsearch-0.9.0/local/
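
After the copy the local directory should mirror what was in HDFS:

ls /nutchsearch-0.9.0/local/crawled
# expect crawldb, index, indexes, linkdb and segments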

Test if all is configured properly

bin/nutch org.apache.nutch.searcher.NutchBean an

The last command should give a number of hits. If the query results in 0 hits, there could be something wrong with the configuration or the index, or there are simply no documents containing the word. Try a few words; if all of them result in 0 hits, most probably the configuration is wrong or the index is corrupt. The configuration problems I came across were pointing to the wrong index directory and unintentionally using Hadoop.
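
To try several words in one go, a small loop like the following can be used (the query words here are only examples):

# run from the /nutchsearch-0.9.0/search directory, like the command above
for word in apache lucene nutch; do
  echo "--- query: $word"
  bin/nutch org.apache.nutch.searcher.NutchBean "$word"
done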

Enable the web search interface

Copy the war file to the tomcat directory

rm -rf /usr/share/tomcat5/webapps/ROOT*
cp /nutchsearch-0.9.0/search/*.war /usr/share/tomcat5/webapps/ROOT.war

Copy the configuration to the tomcat directory

cp /nutchsearch-0.9.0/search/conf/* /usr/share/tomcat5/webapps/ROOT/WEB-INF/classes/

Start tomcat

/usr/share/tomcat5/bin/startup.sh

Open the search page in a web browser: http://localhost:8180/

Distributed searching

Prepare the other machines that are going to host a part of the index.

ssh root@???
mkdir /nutchsearch-0.9.0
mkdir /nutchsearch-0.9.0/search
chown -R nutch:users /nutchsearch-0.9.0
exit

Copy the search install directory to other machines.

scp -r /nutchsearch-0.9.0/search nutch@???:/nutchsearch-0.9.0/search

Configure

Edit the nutch-site.xml so that the searcher.dir property points to the directory containing a search-servers.txt file with a list of IP addresses and ports. Put the IP addresses and ports in a search-servers.txt file in the conf directory:

x.x.x.1 9999
x.x.x.2 9999
x.x.x.3 9999

Edit the nutch-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/nutchsearch-0.9.0/search/conf/</value>
  </property>

</configuration>

Split the index

With the current Nutch and Lucene code there is no function to split one index into smaller indexes. There is some code published on a mailing list that does pretty much what is needed: http://www.nabble.com/Lucene-index-manipulation-tools-tf2781692.html#a7760917

Copy each part of the index to a different machine.

???
scp -r /nutchsearch-0.9.0/local/partX/crawled nutch@???:/nutchsearch-0.9.0/local/

Start the services

Start up the search services on all the machines that have a part of the index.

bin/nutch server 9999 /nutchsearch-0.9.0/local/crawled

Restart the master search node

/usr/share/tomcat5/bin/shutdown.sh
/usr/share/tomcat5/bin/startup.sh

Open the search page in a web browser: http://localhost:8180/

Crawling more pages

To select links from the crawl database and fetch more pages there are a couple of Nutch commands: generate, fetch and updatedb. The following bash script combines these, so that it can be started with just two parameters: the base directory of the data and the number of pages. Save this file as e.g. bin/fetch; if the data is in crawled01, then `bin/fetch crawled01 10000` selects 10000 links from the crawl database and fetches them.

#!/bin/bash
bin/nutch generate $1/crawldb $1/segments -topN $2
segment=`bin/hadoop dfs -ls $1/segments/ | tail -1 | grep -o "[a-zA-Z0-9/]*"`
bin/nutch fetch $segment
bin/nutch updatedb $1/crawldb $segment
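
Assuming the script was saved as bin/fetch as suggested above, make it executable and run it against the data directory:

chmod +x bin/fetch
bin/fetch crawled01 10000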

To build a new index use the following script:

#!/bin/bash
bin/hadoop dfs -rmr $1/indexes
bin/nutch invertlinks $1/linkdb $1/segments/*
bin/nutch index $1/indexes $1/crawldb $1/linkdb $1/segments/*
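
As with the fetch script, this one can be saved under bin/ (the name bin/reindex below is just an example) and run against the same data directory:

chmod +x bin/reindex
bin/reindex crawled01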

Copy the data to the local filesystem as before, and searching can be done on the new data.

Comments

Number of map reduce tasks

I noticed that the number of map and reduce tasks has an impact on the performance of Hadoop. Many times after crawling a lot of pages the nodes reported 'java.lang.OutOfMemoryError: Java heap space' errors; this also happened in the indexing part. Increasing the number of maps solved these problems: with an index of over 200,000 pages I needed 306 maps in total over 3 machines. By setting the mapred.map.tasks property in hadoop-site.xml to 99 (much higher than what is advised in other tutorials and in the hadoop-site.xml file) that problem was solved.

(ab) As noted above, do NOT set the number of map/reduce tasks in hadoop-site.xml, put them in mapred-default.xml instead.

See http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces for more info about the number of map reduce tasks.
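
A sketch of how that could look in mapred-default.xml for a 3-machine cluster (the exact values depend on the hardware and the size of the crawl; 99 maps is what worked for me, 7 reduces follows the rule of thumb given earlier):

<property>
  <name>mapred.map.tasks</name>
  <value>99</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>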

Error when putting a file to DFS

If you get a similar error:

put: java.io.IOException: failed to create file /user/nutch/.test.crc on client 127.0.0.1 because target-length is 0, below MIN_REPLICATION (1) 

it may mean you do not have enough disk space. It happened to me with 90MB of disk space available, using Nutch 0.9/Hadoop 0.12.2. See also the mailing list message.
