How to set up Nutch 0.9.0 and Hadoop 0.12.2 with Lucene 2.1.0 on Debian

This tutorial is intended for:

  • Nutch running on multiple machines with mapreduce and Hadoop
  • Hadoop dfs on multiple machines
  • Lucene search interface on multiple machines with local search indices


# Log in as root on the first machine, which will be the master for the code distribution and the Hadoop cluster.

# Enable the contrib and non-free package sources in /etc/apt/sources.list
vi /etc/apt/sources.list

# Install java5, apache2 and tomcat5
apt-get update
apt-get install sun-java5-jdk
apt-get install apache2
apt-get install tomcat5

# Configure tomcat
echo "JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/" >> /etc/default/tomcat5

Download and build

# Download nutch-0.9
# Download ant from apache as well; something seems to be missing from the ant that comes with Debian
tar -xzvf nutch-0.9.tar.gz
tar -xzvf apache-ant-1.6.5-bin.tar.gz

# Build nutch with apache ant 
cd nutch-0.9
/root/apache-ant-1.6.5/bin/ant package

Install and configure

# Create directories for nutch
mkdir /nutch-0.9
mkdir /nutch-0.9/build
mkdir /nutch-0.9/crawler
mkdir /nutch-0.9/dist
mkdir /nutch-0.9/filesystem
mkdir /nutch-0.9/home
mkdir /nutch-0.9/scripts
mkdir /nutch-0.9/source
mkdir /nutch-0.9/tars

# Create the nutch user and group
groupadd nutch
useradd -d /nutch-0.9/home -g nutch nutch
passwd nutch

# Copy the nutch build dir for the crawler
cp -Rv /root/nutch-0.9/build/nutch-0.9/* /nutch-0.9/crawler/

# Configure the crawler
echo "export HADOOP_HOME=/nutch-0.9/crawler" >> /nutch-0.9/crawler/conf/
echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> /nutch-0.9/crawler/conf/
echo "export HADOOP_LOG_DIR=/nutch-0.9/crawler/logs" >> /nutch-0.9/crawler/conf/
echo "export HADOOP_SLAVES=/nutch-0.9/crawler/conf/slaves" >> /nutch-0.9/crawler/conf/

Now the configuration files for the Nutch crawler in /nutch-0.9/crawler/conf/ have to be edited or created. These are:

  • mapred-default.xml
  • hadoop-site.xml
  • nutch-site.xml
  • crawl-urlfilter.txt

Edit the mapred-default.xml configuration file.

If it's missing, create it, with the following content:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


    Number of map tasks: this should be a prime number several times larger
    than the number of slave hosts, e.g. for 3 nodes set this to 17.

    Number of reduce tasks: this should be a prime number close to a low
    multiple of the number of slave hosts, e.g. for 3 nodes set this to 7.
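
The two descriptions above refer to the map and reduce task counts. A minimal sketch of what the file might look like for a 3-node cluster follows, written as a shell here-document. The property names mapred.map.tasks and mapred.reduce.tasks and the CONF_DIR variable are assumptions that do not appear in the text above; the values 17 and 7 are the examples from the descriptions.

```shell
# Sketch only: CONF_DIR is a hypothetical variable; on a real install it
# would point at /nutch-0.9/crawler/conf.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

# Property names are assumed; 17 and 7 are the example values for 3 slave nodes.
cat > "$CONF_DIR/mapred-default.xml" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>17</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>

</configuration>
EOF
```

For a different cluster size, the two values would be picked by the same rule of thumb given in the descriptions.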


Edit hadoop-site.xml

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


    The name of the default file system. Either the literal string 
    "local" or a host:port for NDFS.

    The host and port that the MapReduce job tracker runs at. If 
    "local", then jobs are run in-process as a single map and 
    reduce task.

    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of each task.

    You can specify other Java options for each map or reduce task here,
    but most likely you will want to adjust the heap size.
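
The four descriptions above correspond to four properties in hadoop-site.xml. A minimal sketch follows, assuming the property names fs.default.name, mapred.job.tracker, mapred.tasktracker.tasks.maximum and mapred.child.java.opts (inferred from the descriptions, not shown in the text), a hypothetical master host name, hypothetical ports 9000 and 9001, and purely illustrative values for the task limit and heap size.

```shell
# Sketch only: CONF_DIR and MASTER_HOST are hypothetical variables.
CONF_DIR="${CONF_DIR:-./conf}"
MASTER_HOST="${MASTER_HOST:-master.example.com}"
mkdir -p "$CONF_DIR"

# Unquoted here-document so that ${MASTER_HOST} is expanded.
# Ports, task limit and heap size are placeholder values, not recommendations.
cat > "$CONF_DIR/hadoop-site.xml" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>fs.default.name</name>
  <value>${MASTER_HOST}:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>${MASTER_HOST}:9001</value>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>

</configuration>
EOF
```

The task limit and the heap size should be chosen together, since the memory used on a node is roughly the product of the two.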







Edit nutch-site.xml

Edit the nutch-site.xml file. Take the contents below and fill in the value tags.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

</configuration>

Edit crawl-urlfilter.txt

Edit the crawl-urlfilter.txt file to change the pattern of the URLs that are to be fetched.

cd /nutch-0.9/crawler
vi conf/crawl-urlfilter.txt

change the line that reads:   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to read:                      +^http://([a-z0-9]*\.)*/
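
It can be worth testing a filter pattern against a few sample URLs before crawling. The sketch below uses grep -E as a stand-in for Nutch's Java regex matching (an assumption; the two agree on simple patterns like these but are not identical). The url_matches helper, the example.com and example.org URLs and the second pattern are hypothetical. Note that the broadened pattern, as written, still expects a "/" directly after "http://" or after a dot, so an ordinary host such as www.example.com does not pass it; the alternative pattern in the sketch allows a final host label such as "com".

```shell
# Hypothetical helper: succeeds if URL ($2) matches the filter regex ($1).
# grep -E only approximates the Java regex engine that Nutch uses.
url_matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Pattern from the text above, without the leading "+" accept marker.
PATTERN_FROM_TEXT='^http://([a-z0-9]*\.)*/'
# Hypothetical alternative that also accepts a final host label.
PATTERN_ANY_HOST='^http://([a-z0-9-]*\.)*[a-z0-9-]+/'

for url in 'http://www.example.com/' 'http://example.org/index.html'; do
  for pattern in "$PATTERN_FROM_TEXT" "$PATTERN_ANY_HOST"; do
    if url_matches "$pattern" "$url"; then
      echo "match:    $pattern  $url"
    else
      echo "no match: $pattern  $url"
    fi
  done
done
```

In crawl-urlfilter.txt itself, a pattern is prefixed with "+" to accept matching URLs or "-" to reject them.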

Finishing the installation

# Change the ownership of all files to the nutch user
chown -R nutch:nutch /nutch-0.9

# Log in as nutch
su nutch

# Create ssh keys. These are needed by the hadoop scripts.
ssh-keygen -t rsa
cp /nutch-0.9/home/.ssh/ /nutch-0.9/home/.ssh/authorized_keys

# Format the name node
cd /nutch-0.9/crawler
bin/hadoop namenode -format

Start crawling

To start crawling from a few seed URLs, create a urls directory containing a seed file that lists them. Then put this directory into the HDFS. To check that the HDFS has stored the directory, use the dfs -ls option of hadoop.

mkdir urls
echo "" >> urls/seed
bin/hadoop dfs -put urls urls
bin/hadoop dfs -ls urls
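
The seed file is plain text with one URL per line. A small sketch of creating it before the directory is put into the dfs, using placeholder URLs (example.com and example.org are assumptions, to be replaced with real start pages):

```shell
# Placeholder seeds; substitute the real start URLs, one per line.
mkdir -p urls
printf '%s\n' \
  'http://www.example.com/' \
  'http://www.example.org/' > urls/seed

# Quick check: print any line that does not look like an http URL.
if grep -v '^http://' urls/seed; then
  echo "the lines above are not http URLs" >&2
fi
```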

Start an initial crawl

export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/
bin/nutch crawl urls -dir crawled -depth 3

On the master node the progress and status can be viewed with a web browser at http://localhost:50030/

