How to set up Nutch 0.9.0 and Hadoop 0.12.2 with Lucene 2.1.0 on Debian
This tutorial is intended for:
- Nutch running on multiple machines with MapReduce and Hadoop
- Hadoop DFS on multiple machines
- the Lucene search interface on multiple machines with local search indices
Prerequisites
{noformat}
# Log in as root on the first machine, which will be the master for the code distribution and the Hadoop cluster.
su
# Enable the contrib and non-free package sources in /etc/apt/sources.list
vi /etc/apt/sources.list
# Install Java 5, Apache 2 and Tomcat 5
apt-get update
apt-get install sun-java5-jdk
apt-get install apache2
apt-get install tomcat5
# Configure Tomcat to use the Sun JDK
echo "JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/" >> /etc/default/tomcat5
{noformat}
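Before continuing, it is worth confirming that the JDK landed where the rest of this tutorial expects it, and restarting Tomcat so it picks up the new JAVA_HOME. A minimal sanity check, assuming the Debian package paths used above:
{noformat}
# The Sun JDK should now live under /usr/lib/jvm/java-1.5.0-sun/
/usr/lib/jvm/java-1.5.0-sun/bin/java -version
# Restart Tomcat so it picks up the JAVA_HOME set in /etc/default/tomcat5
/etc/init.d/tomcat5 restart
{noformat}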
Download and build
{noformat}
# Download Nutch 0.9
# Download Ant from Apache; the Ant that comes with Debian seems to be missing something needed for the build
wget ftp://apache.essentkabel.com/apache/lucene/nutch/nutch-0.9.tar.gz
wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.6.5-bin.tar.gz
tar -xzvf nutch-0.9.tar.gz
tar -xzvf apache-ant-1.6.5-bin.tar.gz
# Build Nutch with Apache Ant
cd nutch-0.9
/root/apache-ant-1.6.5/bin/ant package
{noformat}
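The package target leaves a complete runtime tree under build/. A quick check (paths assuming the downloads were unpacked in /root as above) that the build produced what the copy step below expects:
{noformat}
# The copy step in the next block takes everything from this directory;
# it should contain at least bin/, conf/ and lib/
ls /root/nutch-0.9/build/nutch-0.9/
{noformat}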
{noformat}
# Create directories for nutch
mkdir /nutch-0.9
mkdir /nutch-0.9/build
mkdir /nutch-0.9/crawler
mkdir /nutch-0.9/dist
mkdir /nutch-0.9/filesystem
mkdir /nutch-0.9/home
mkdir /nutch-0.9/scripts
mkdir /nutch-0.9/source
mkdir /nutch-0.9/tars
# Create the nutch user and group
groupadd nutch
useradd -d /nutch-0.9/home -g nutch nutch
passwd nutch
# Copy the nutch build dir for the crawler
cp -Rv /root/nutch-0.9/build/nutch-0.9/* /nutch-0.9/crawler/
# Configure the crawler
echo "export HADOOP_HOME=/nutch-0.9/crawler" >> /nutch-0.9/crawler/conf/hadoop-env.sh
echo "export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun" >> /nutch-0.9/crawler/conf/hadoop-env.sh
echo "export HADOOP_LOG_DIR=/nutch-0.9/crawler/logs" >> /nutch-0.9/crawler/conf/hadoop-env.sh
echo "export HADOOP_SLAVES=/nutch-0.9/crawler/conf/slaves" >> /nutch-0.9/crawler/conf/hadoop-env.sh
{noformat}
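HADOOP_SLAVES points the Hadoop start/stop scripts at conf/slaves, which lists one host per line. For the single-machine stage of this tutorial that file only needs to name the master; a sketch, only necessary if the stock file (which normally contains just localhost) has been changed:
{noformat}
# conf/slaves: one hostname per line; the start/stop scripts ssh to each entry
echo "localhost" > /nutch-0.9/crawler/conf/slaves
{noformat}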
Now the configuration files for the Nutch crawler in /nutch-0.9/crawler/conf/ have to be edited or created. These are:
- mapred-default.xml
- hadoop-site.xml
- nutch-site.xml
- crawl-urlfilter.txt
Edit the mapred-default.xml configuration file.
If it is missing, create it with the following content:
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>
This should be a prime number several times larger than the
number of slave hosts, e.g. for 3 nodes set this to 17
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>
This should be a prime number close to a low multiple of the
number of slave hosts, e.g. for 3 nodes set this to 7
</description>
</property>
</configuration>
{noformat}
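As a worked example of those descriptions: for a 3-slave cluster, a prime several times larger than 3 gives mapred.map.tasks = 17, and a prime close to a low multiple of 3 gives mapred.reduce.tasks = 7:
{noformat}
<!-- Example values for a 3-slave cluster, per the descriptions above -->
<property>
  <name>mapred.map.tasks</name>
  <value>17</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>
{noformat}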
Edit hadoop-site.xml. Replace the ??? in the value tags below with the hostname of the master node.
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>???:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>???:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>
The maximum number of tasks that will be run simultaneously by
a task tracker. This should be adjusted according to the heap size
per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx200m</value>
<description>
You can specify other Java options for each map or reduce task here,
but most likely you will want to adjust the heap size.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/nutch-0.9/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/nutch-0.9/filesystem/data</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/nutch-0.9/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/nutch-0.9/filesystem/mapreduce/local</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
{noformat}
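To make the ??? placeholders concrete: assuming a hypothetical master host named devcluster01, the first two properties would read as follows. Every node in the cluster must name the same master here:
{noformat}
<!-- devcluster01 is a hypothetical hostname; substitute your own master -->
<property>
  <name>fs.default.name</name>
  <value>devcluster01:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>devcluster01:9001</value>
</property>
{noformat}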
Edit nutch-site.xml
Take the contents below and fill in the value tags; http.agent.name in particular must not be left empty.
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
{noformat}
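A sketch of filled-in values for a hypothetical organization; every value below is a placeholder to replace with your own details:
{noformat}
<!-- All values are hypothetical examples -->
<property>
  <name>http.agent.name</name>
  <value>ExampleCrawler</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>test crawler of the Example organization</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>info at example dot com</value>
</property>
{noformat}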
Edit crawl-urlfilter.txt
Edit the crawl-urlfilter.txt file to set the pattern of the URLs that are to be fetched.
{noformat}
cd /nutch-0.9/crawler
vi conf/crawl-urlfilter.txt
# change the line that reads: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# to read:                    +^http://([a-z0-9]*\.)*/
{noformat}
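Note that the opened-up pattern above lets the crawler follow links into any domain. To limit the crawl to a single site instead, keep the original form of the line and fill in a concrete domain (example.com is a stand-in):
{noformat}
# accept only URLs within example.com and its subdomains
+^http://([a-z0-9]*\.)*example.com/
{noformat}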
Finishing the installation
{noformat}
# Change the ownership of all files to the nutch user
chown -R nutch:nutch /nutch-0.9
# Log in as nutch
su nutch
# Create SSH keys; these are needed by the Hadoop scripts.
# Accept the default file location and use an empty passphrase so the
# scripts can log in to the nodes unattended.
ssh-keygen -t rsa
cp /nutch-0.9/home/.ssh/id_rsa.pub /nutch-0.9/home/.ssh/authorized_keys
# Format the name node
cd /nutch-0.9/crawler
bin/hadoop namenode -format
{noformat}
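Formatting the name node prepares HDFS, but the DFS and MapReduce daemons must actually be running before files can be put into HDFS in the next step. With the scripts bundled in this Nutch/Hadoop version, starting everything on the hosts listed in conf/slaves looks like this (run as the nutch user from /nutch-0.9/crawler):
{noformat}
# confirm passwordless ssh works first; the start scripts depend on it
ssh localhost echo ok
# start the namenode, datanode(s), jobtracker and tasktracker(s)
bin/start-all.sh
{noformat}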
Start crawling
To start crawling from a few URLs as seeds, create a urls directory containing a seed file with some seed URLs, and put it into HDFS. To check that HDFS has stored the directory, use Hadoop's dfs -ls option.
{noformat}
mkdir urls
echo "http://lucene.apache.org" >> urls/seed
bin/hadoop dfs -put urls urls
bin/hadoop dfs -ls urls
{noformat}
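To double-check that the seed file itself arrived in HDFS intact, it can be read back with dfs -cat:
{noformat}
bin/hadoop dfs -cat urls/seed
{noformat}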
Start an initial crawl
{noformat}
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/
bin/nutch crawl urls -dir crawled -depth 3
{noformat}
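When the crawl finishes, its output lands in the crawled directory in HDFS. A quick way to inspect the result, using Nutch's stock readdb tool for crawl database statistics:
{noformat}
# the crawl output: crawldb, linkdb, segments and indexes
bin/hadoop dfs -ls crawled
# summary statistics of the crawl database (URL counts by status)
bin/nutch readdb crawled/crawldb -stats
{noformat}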
On the master node the progress and status of jobs can be followed with a web browser at http://localhost:50030/ (the MapReduce job tracker's web interface).
Continue with the follow-up pages of this tutorial:
- [Nutch_Hadoop_Lucene_Tutorial_:_Setting_up_the_slave_nodes]
- [Nutch_Hadoop_Lucene_Tutorial_:_Setting_up_the_master_search_node]
- [Nutch_Hadoop_Lucene_Tutorial_:_Setting_up_the_slave_search_nodes]
- [Nutch_Hadoop_Lucene_Tutorial_:_Recrawl]
- [Nutch_Hadoop_Lucene_Tutorial_:_Spliting_up_the_index]