How to set up Nutch 0.9.0 and Hadoop 0.12.2 with Lucene 2.1.0 on Debian
This tutorial is intended for:
- Nutch running on multiple machines with MapReduce and Hadoop
- Hadoop DFS on multiple machines
- the Lucene search interface on multiple machines with local search indices
Prerequisites
{noformat}
# Log in as root on the first machine, which will be the master for the code distribution and the Hadoop cluster.
su
# Enable the contrib and non-free package sources in /etc/apt/sources.list
vi /etc/apt/sources.list
# Install Java 5, Apache 2 and Tomcat 5
apt-get update
apt-get install sun-java5-jdk
apt-get install apache2
apt-get install tomcat5
# Configure Tomcat to use the Sun JDK
echo "JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/" >> /etc/default/tomcat5
{noformat}
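Before continuing, it is worth confirming that the JDK landed where the rest of this tutorial expects it, and restarting Tomcat so it picks up the new JAVA_HOME. A minimal sanity check, assuming the Debian package paths used above:
{noformat}
# The Sun JDK should now live under /usr/lib/jvm/java-1.5.0-sun/
/usr/lib/jvm/java-1.5.0-sun/bin/java -version
# Restart Tomcat so it picks up the JAVA_HOME set in /etc/default/tomcat5
/etc/init.d/tomcat5 restart
{noformat}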
Download and build
{noformat}
# Download Nutch 0.9
# Download Ant from Apache; the Ant that comes with Debian seems to be missing something needed for the build
wget ftp://apache.essentkabel.com/apache/lucene/nutch/nutch-0.9.tar.gz
wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.6.5-bin.tar.gz
tar -xzvf nutch-0.9.tar.gz
tar -xzvf apache-ant-1.6.5-bin.tar.gz
# Build Nutch with Apache Ant
cd nutch-0.9
/root/apache-ant-1.6.5/bin/ant package
{noformat}
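The package target leaves a complete runtime tree under build/. A quick check (paths assuming the downloads were unpacked in /root as above) that the build produced what the copy step below expects:
{noformat}
# The copy step in the next block takes everything from this directory;
# it should contain at least bin/, conf/ and lib/
ls /root/nutch-0.9/build/nutch-0.9/
{noformat}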
{noformat}
# Create directories for nutch
mkdir /nutch-0.9
mkdir /nutch-0.9/build
mkdir /nutch-0.9/crawler
mkdir /nutch-0.9/dist
mkdir /nutch-0.9/filesystem
mkdir /nutch-0.9/home
mkdir /nutch-0.9/scripts
mkdir /nutch-0.9/source
mkdir /nutch-0.9/tars
# Create the nutch user and group
groupadd nutch
useradd -d /nutch-0.9/home -g nutch nutch
passwd nutch
# Copy the nutch build dir for the crawler
cp -Rv /root/nutch-0.9/build/nutch-0.9/* /nutch-0.9/crawler/
# Configure the crawler
echo "export HADOOP_HOME=/nutch-0.9/crawler" >> /nutch-0.9/crawler/conf/hadoop-env.sh
echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> /nutch-0.9/crawler/conf/hadoop-env.sh
echo "export HADOOP_LOG_DIR=/nutch-0.9/crawler/logs" >> /nutch-0.9/crawler/conf/hadoop-env.sh
echo "export HADOOP_SLAVES=/nutch-0.9/crawler/conf/slaves" >> /nutch-0.9/crawler/conf/hadoop-env.sh
{noformat}
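HADOOP_SLAVES points the Hadoop start/stop scripts at conf/slaves, which lists one host per line. For the single-machine stage of this tutorial that file only needs to name the master; a sketch, only necessary if the stock file (which normally contains just localhost) has been changed:
{noformat}
# conf/slaves: one hostname per line; the start/stop scripts ssh to each entry
echo "localhost" > /nutch-0.9/crawler/conf/slaves
{noformat}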
Now the configuration files for the Nutch crawler in /nutch-0.9/crawler/conf/ have to be edited or created. These are:
- mapred-default.xml
- hadoop-site.xml
- nutch-site.xml
- crawl-urlfilter.txt
Edit the mapred-default.xml configuration file.
If it is missing, create it with the following content:
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>
This should be a prime number several times larger than the
number of slave hosts, e.g. for 3 nodes set this to 17
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>
This should be a prime number close to a low multiple of the
number of slave hosts, e.g. for 3 nodes set this to 7
</description>
</property>
</configuration>
{noformat}
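As a worked example of those descriptions: for a 3-slave cluster, a prime several times larger than 3 gives mapred.map.tasks = 17, and a prime close to a low multiple of 3 gives mapred.reduce.tasks = 7:
{noformat}
<!-- Example values for a 3-slave cluster, per the descriptions above -->
<property>
  <name>mapred.map.tasks</name>
  <value>17</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>
{noformat}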
Edit hadoop-site.xml. Replace the ??? in the value tags below with the hostname of the master node.
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>???:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>???:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>
The maximum number of tasks that will be run simultaneously by
a task tracker. This should be adjusted according to the heap size
per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx200m</value>
<description>
You can specify other Java options for each map or reduce task here,
but most likely you will want to adjust the heap size.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/nutch-0.9/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/nutch-0.9/filesystem/data</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/nutch-0.9/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/nutch-0.9/filesystem/mapreduce/local</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
{noformat}
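To make the ??? placeholders concrete: assuming a hypothetical master host named devcluster01, the first two properties would read as follows. Every node in the cluster must name the same master here:
{noformat}
<!-- devcluster01 is a hypothetical hostname; substitute your own master -->
<property>
  <name>fs.default.name</name>
  <value>devcluster01:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>devcluster01:9001</value>
</property>
{noformat}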
Edit nutch-site.xml
Take the contents below and fill in the value tags; http.agent.name in particular must not be left empty.
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
{noformat}
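A sketch of filled-in values for a hypothetical organization; every value below is a placeholder to replace with your own details:
{noformat}
<!-- All values are hypothetical examples -->
<property>
  <name>http.agent.name</name>
  <value>ExampleCrawler</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>test crawler of the Example organization</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>info at example dot com</value>
</property>
{noformat}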
Edit crawl-urlfilter.txt
Edit the crawl-urlfilter.txt file to set the pattern of the URLs that are to be fetched.
{noformat}
cd /nutch-0.9/crawler
vi conf/crawl-urlfilter.txt
# change the line that reads: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# to read:                    +^http://([a-z0-9]*\.)*/
{noformat}
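Note that the opened-up pattern above lets the crawler follow links into any domain. To limit the crawl to a single site instead, keep the original form of the line and fill in a concrete domain (example.com is a stand-in):
{noformat}
# accept only URLs within example.com and its subdomains
+^http://([a-z0-9]*\.)*example.com/
{noformat}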
Finishing the installation
{noformat}
# Change the ownership of all files to the nutch user
chown -R nutch:nutch /nutch-0.9
# Log in as nutch
su nutch
# Create SSH keys; these are needed by the Hadoop scripts.
# Accept the default file location and use an empty passphrase so the
# scripts can log in to the nodes unattended.
ssh-keygen -t rsa
cp /nutch-0.9/home/.ssh/id_rsa.pub /nutch-0.9/home/.ssh/authorized_keys
# Format the name node
cd /nutch-0.9/crawler
bin/hadoop namenode -format
{noformat}
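Formatting the name node prepares HDFS, but the DFS and MapReduce daemons must actually be running before files can be put into HDFS in the next step. With the scripts bundled in this Nutch/Hadoop version, starting everything on the hosts listed in conf/slaves looks like this (run as the nutch user from /nutch-0.9/crawler):
{noformat}
# confirm passwordless ssh works first; the start scripts depend on it
ssh localhost echo ok
# start the namenode, datanode(s), jobtracker and tasktracker(s)
bin/start-all.sh
{noformat}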
Start crawling
To start crawling from a few URLs as seeds, create a urls directory containing a seed file with some seed URLs, and put it into HDFS. To check that HDFS has stored the directory, use Hadoop's dfs -ls option.
{noformat}
mkdir urls
echo "http://lucene.apache.org" >> urls/seed
bin/hadoop dfs -put urls urls
bin/hadoop dfs -ls urls
{noformat}
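To double-check that the seed file itself arrived in HDFS intact, it can be read back with dfs -cat:
{noformat}
bin/hadoop dfs -cat urls/seed
{noformat}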
Start an initial crawl
{noformat}
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/
bin/nutch crawl urls -dir crawled -depth 3
{noformat}
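When the crawl finishes, its output lands in the crawled directory in HDFS. A quick way to inspect the result, using Nutch's stock readdb tool for crawl database statistics:
{noformat}
# the crawl output: crawldb, linkdb, segments and indexes
bin/hadoop dfs -ls crawled
# summary statistics of the crawl database (URL counts by status)
bin/nutch readdb crawled/crawldb -stats
{noformat}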
On the master node the progress and status of jobs can be followed with a web browser at http://localhost:50030/ (the MapReduce job tracker's web interface).
Continue with the follow-up pages of this tutorial:
- [Nutch_Hadoop_Lucene_Tutorial_:_Setting_up_the_slave_nodes]
- [Nutch_Hadoop_Lucene_Tutorial_:_Setting_up_the_master_search_node]
- [Nutch_Hadoop_Lucene_Tutorial_:_Setting_up_the_slave_search_nodes]
- [Nutch_Hadoop_Lucene_Tutorial_:_Recrawl]
- [Nutch_Hadoop_Lucene_Tutorial_:_Spliting_up_the_index]