...
- Have configured a local Nutch crawler to crawl on a single machine
- Learned how to configure the Nutch runtime, including seed URL lists, URLFilters, etc.
- Have executed a Nutch crawl cycle and viewed the results in the crawl database
- Indexed Nutch crawl records into Apache Solr for full-text search
Any issues with this tutorial should be reported to the Nutch user@ mailing list.
Steps
Note: This tutorial describes the installation and use of Nutch 1.x (e.g. a release cut from the master branch). For a similar Nutch 2.x with HBase tutorial, see Nutch2Tutorial.
...
- Unix environment, or Windows-Cygwin environment
- Java Runtime/Development Environment (JDK 1.8 / Java 8)
- (Source build only) Apache Ant: http://ant.apache.org/
Install Nutch
Option 1: Set up Nutch from a binary distribution
- Download a binary package (`apache-nutch-1.X-bin.zip`) from here.
- Unzip your binary Nutch package. There should be a folder `apache-nutch-1.X`.
- `cd apache-nutch-1.X/`

From now on, we are going to use `${NUTCH_RUNTIME_HOME}` to refer to the current directory (`apache-nutch-1.X/`).
Option 2: Set up Nutch from a source distribution
...
- Download a source package (`apache-nutch-1.X-src.zip`)
- Unzip
- `cd apache-nutch-1.X/`
- Run `ant` in this folder (cf. RunNutchInEclipse)
- Now there is a directory `runtime/local` which contains a ready-to-use Nutch installation.

When the source distribution is used, `${NUTCH_RUNTIME_HOME}` refers to `apache-nutch-1.X/runtime/local/`. Note that:

- config files should be modified in `apache-nutch-1.X/runtime/local/conf/`
- `ant clean` will remove this directory (keep copies of modified config files)
Verify your Nutch installation
- Run `bin/nutch`
- You can confirm a correct installation if you see output similar to the following:

```
Usage: nutch COMMAND where COMMAND is one of:
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  ...
```
...
- If you are seeing "Permission denied", run the following command:

```
chmod +x bin/nutch
```
- Set `JAVA_HOME` if you are seeing `JAVA_HOME not set`. On Mac, you can run the following command or add it to `~/.bashrc`:

```
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.8/Home
# note that the actual path may be different on your system
```
...
- Customize your crawl properties; at a minimum, provide a name for your crawler so that external servers can recognize it
- Set a seed list of URLs to crawl
Customize your crawl properties
- Default crawl properties can be viewed and edited within `conf/nutch-default.xml`; most of these can be used without modification
- The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that overwrite `conf/nutch-default.xml`. The only required modification for this file is to override the `value` field of the `http.agent.name` property, i.e. add your agent name in the `value` field of the `http.agent.name` property in `conf/nutch-site.xml`, for example:
```
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
```
- Ensure that the `plugin.includes` property within `conf/nutch-site.xml` includes the indexer `indexer-solr`
Create a URL seed list
- A URL seed list includes a list of websites, one per line, which Nutch will look to crawl
- The file `conf/regex-urlfilter.txt` provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download
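For example, to limit the crawl to a single domain you could replace the final catch-all accept rule in `conf/regex-urlfilter.txt` with a domain-specific one. The rule below is only an illustration of the `+`/`-` regex format, assuming you want to stay within nutch.apache.org:

```
# accept only pages under nutch.apache.org (illustrative crawl scope)
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
```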
Create a URL seed list
- `mkdir -p urls`
- `cd urls`
- `touch seed.txt` to create a text file `seed.txt` under `urls/` with the following content (one URL per line for each site you want Nutch to crawl).
```
http://nutch.apache.org/
```
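As an alternative to creating and editing the file by hand, the seed list can be written in one step; the second URL below is only an example of listing additional sites:

```shell
# Write the seed list in one step; one URL per line for each site you
# want Nutch to crawl (the second URL is just an example).
mkdir -p urls
printf '%s\n' \
  "http://nutch.apache.org/" \
  "https://tika.apache.org/" > urls/seed.txt
cat urls/seed.txt
```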
...
- The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
- The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
- A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
- `crawl_generate` names a set of URLs to be fetched
- `crawl_fetch` contains the status of fetching each URL
- `content` contains the raw content retrieved from each URL
- `parse_text` contains the parsed text of each URL
- `parse_data` contains outlinks and metadata parsed from each URL
- `crawl_parse` contains the outlink URLs, used to update the crawldb
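To make the layout concrete, here is a throwaway sketch that mocks up the segment structure described above. The timestamp-style name is just an example; real segments are produced by the Nutch generate/fetch/parse tools, not by hand:

```shell
# Mock up the segment layout for illustration only; Nutch creates these
# directories itself during a real crawl cycle.
seg="crawl/segments/20240101000000"   # segment names are timestamps
mkdir -p "$seg/crawl_generate" "$seg/crawl_fetch" "$seg/content" \
         "$seg/parse_text" "$seg/parse_data" "$seg/crawl_parse"
ls "$seg"
```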
Step-by-Step: Seeding the crawldb with a list of URLs
...
- Map: Identity map where keys are digests and values are CrawlDatum records
- Reduce: all but one of the CrawlDatums with the same digest are marked as duplicates. Multiple heuristics are available to choose the item that is kept: the one with the shortest URL, the one fetched most recently, or the one with the highest score.
```
Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]
```
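The `urlLength` criterion, for instance, keeps the shortest URL among records sharing a digest. A toy shell illustration of that selection follows; the URLs are made up, and real deduplication is of course done by `bin/nutch dedup` over the crawldb:

```shell
# Toy illustration of the urlLength heuristic: among URLs assumed to share
# a content digest, keep the shortest; the rest would be marked duplicates.
printf '%s\n' \
  "http://example.org/page?session=abc123" \
  "http://example.org/page" \
  "http://example.org/page/index.html" |
  awk '{ print length($0), $0 }' | sort -n | head -1 | cut -d' ' -f2-
# -> http://example.org/page
```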
...
Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.
| Nutch | Solr |
| --- | --- |
| 1.15 | 7.3.1 |
| 1.14 | 6.6.0 |
| 1.13 | 5.5.0 |
| 1.12 | 5.4.1 |
To install Solr 7.x:
- download the binary package from here
- unzip to `$HOME/apache-solr`; we will now refer to this as `${APACHE_SOLR_HOME}`
- create resources for a new nutch Solr core:

```
mkdir -p ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/_default/* ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
```
- copy the Nutch `schema.xml` into the `conf` directory:

```
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
```
In case of issues launching Solr with this schema, you may try the most recent schema.xml.
- make sure that there is no `managed-schema` "in the way":

```
rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
```
- start the Solr server:

```
${APACHE_SOLR_HOME}/bin/solr start
```
- create the nutch core:

```
${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
```
...
- (Nutch 1.15 and later) edit the file `conf/index-writers.xml`, see IndexWriters
- (until Nutch 1.14) add the core name to the Solr server URL: `-Dsolr.server.url=http://localhost:8983/solr/nutch`
Verify Solr installation
Once Solr has started, you should be able to access the Solr admin console via the following links:
...