...

  • Have a configured local Nutch crawler setup to crawl on one machine
  • Learned how to understand and configure Nutch runtime configuration including seed URL lists, URLFilters, etc.
  • Have executed a Nutch crawl cycle and viewed the results of the Crawl Database
  • Indexed Nutch crawl records into Apache Solr for full text search

Any issues with this tutorial should be reported to the Nutch user@ list.

Table of Contents

Steps

Note

This tutorial describes the installation and use of Nutch 1.x (e.g. a release cut from the master branch). For a similar tutorial covering Nutch 2.x with HBase, see Nutch2Tutorial.

...

  • Unix environment, or Windows-Cygwin environment
  • Java Runtime/Development Environment (JDK 1.8 / Java 8)
  • (Source build only) Apache Ant: http://ant.apache.org/

Install Nutch

Option 1: Setup Nutch from a binary distribution

  • Download a binary package (apache-nutch-1.X-bin.zip) from here.
  • Unzip your binary Nutch package. There should be a folder apache-nutch-1.X.
  • cd apache-nutch-1.X/
    From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory (apache-nutch-1.X/).

Option 2: Set up Nutch from a source distribution

...

  • Download a source package (apache-nutch-1.X-src.zip)
  • Unzip
  • cd apache-nutch-1.X/
  • Run ant in this folder (cf. RunNutchInEclipse)
  • Now there is a directory runtime/local which contains a ready-to-use Nutch installation.
    When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that:
  • config files should be modified in apache-nutch-1.X/runtime/local/conf/
  • ant clean will remove this directory (keep copies of modified config files)
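
The two install options differ only in where ${NUTCH_RUNTIME_HOME} ends up. The following is a minimal sketch of how that choice could be scripted; resolve_nutch_home is a hypothetical helper (not part of Nutch), and it assumes that bin/nutch sits directly under the unpacked folder for a binary package and under runtime/local/ after an ant build.

```shell
# Sketch: print the directory that should serve as NUTCH_RUNTIME_HOME.
# resolve_nutch_home is a hypothetical helper, not part of Nutch.
resolve_nutch_home() {
  base=$1
  if [ -f "$base/runtime/local/bin/nutch" ]; then
    # Source distribution after running ant
    (cd "$base/runtime/local" && pwd)
  elif [ -f "$base/bin/nutch" ]; then
    # Binary distribution
    (cd "$base" && pwd)
  else
    echo "no Nutch installation found under $base" >&2
    return 1
  fi
}

# Example (run from the unpacked apache-nutch-1.X/ directory):
#   NUTCH_RUNTIME_HOME=$(resolve_nutch_home .) && export NUTCH_RUNTIME_HOME
```

Checking runtime/local first means a freshly built source tree is preferred over anything else found in the same folder.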

Verify your Nutch installation

  • Run "bin/nutch". The installation is correct if you see output similar to the following:
No Format
Usage: nutch COMMAND where command is one of:
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
...

...

  • Run the following command if you are seeing "Permission denied":
No Format
chmod +x bin/nutch
  • Set JAVA_HOME if you are seeing "JAVA_HOME not set". On Mac, you can run the following command or add it to ~/.bashrc:
No Format
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.8/Home
# note that the actual path may be different on your system
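
Because the JDK path varies between systems, it can be safer to look it up than to hard-code it. A minimal sketch, assuming the macOS helper /usr/libexec/java_home is available; detect_java_home is a hypothetical function name, and on other systems the sketch simply reports failure so that the path can be set by hand.

```shell
# Sketch: print a usable JAVA_HOME, preferring a value that is already set.
# detect_java_home is a hypothetical helper name.
detect_java_home() {
  if [ -n "${JAVA_HOME:-}" ]; then
    # Respect an existing setting
    printf '%s\n' "$JAVA_HOME"
  elif [ -x /usr/libexec/java_home ]; then
    # macOS: ask the system for the home of an installed Java 8 JDK
    /usr/libexec/java_home -v 1.8
  else
    # No automatic lookup available; set JAVA_HOME manually
    return 1
  fi
}

# Example:
#   JAVA_HOME=$(detect_java_home) && export JAVA_HOME
```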

...

  1. Customize your crawl properties; at a minimum, provide a name for your crawler so that external servers can recognize it
  2. Set a seed list of URLs to crawl

Customize your crawl properties

  • Default crawl properties can be viewed and edited within conf/nutch-default.xml, where most of these can be used without modification
  • The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that override those in conf/nutch-default.xml. The only required modification for this file is to override the value field of the http.agent.name property
    • i.e. add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:
No Format
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
  • Ensure that the plugin.includes property within conf/nutch-site.xml lists indexer-solr as the indexer plugin
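
The property shown above has to sit inside the file's root configuration element. As a rough sketch, a minimal nutch-site.xml could be generated as follows; write_nutch_site is a hypothetical helper, the agent name is only an example, and the sketch does not add a plugin.includes override.

```shell
# Sketch: write a minimal nutch-site.xml containing only http.agent.name.
# write_nutch_site is a hypothetical helper; it does no XML escaping,
# so avoid characters such as & or < in the agent name.
write_nutch_site() {
  conf_dir=$1
  agent_name=$2
  target="$conf_dir/nutch-site.xml"
  # Keep a copy of any existing file rather than silently overwriting it
  if [ -f "$target" ]; then
    cp "$target" "$target.bak"
  fi
  cat > "$target" <<EOF
<?xml version="1.0"?>
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>$agent_name</value>
 </property>
</configuration>
EOF
}

# Example:
#   write_nutch_site "${NUTCH_RUNTIME_HOME}/conf" "My Nutch Spider"
```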

Create a URL seed list

  • A URL seed list is a list of websites, one per line, which Nutch will look to crawl
  • The file conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download
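
To get a feel for how such filter rules behave, the sketch below emulates them with grep. It assumes the rule format described in the comments of the stock regex-urlfilter.txt: each rule is a + (accept) or - (reject) followed by a regular expression, the first matching rule decides, and a URL that matches no rule is ignored. filter_url is a hypothetical helper, and since Nutch uses Java regular expressions, grep -E is only an approximation.

```shell
# Sketch: emulate first-match-wins URL filtering with grep -E.
# filter_url is a hypothetical helper, not a Nutch command.
# Returns 0 if the URL would be accepted, 1 otherwise.
filter_url() {
  rules_file=$1
  url=$2
  while IFS= read -r rule; do
    case $rule in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    sign=${rule%"${rule#?}"}        # first character: + or -
    pattern=${rule#?}               # rest of the line: the regex
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      [ "$sign" = "+" ] && return 0
      return 1
    fi
  done < "$rules_file"
  return 1                          # no rule matched: URL is ignored
}

# Example rules (illustrative only): accept nutch.apache.org and its
# subdomains, reject everything else.
#   +^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
#   -.
```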

Create a URL seed list

  • mkdir -p urls
  • cd urls
  • touch seed.txt to create a text file seed.txt under urls/, then edit it so that it has the following content (one URL per line for each site you want Nutch to crawl).
No Format
http://nutch.apache.org/
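
Since touch only creates an empty file, the steps above can also be collapsed into a short sketch that writes the content directly. It assumes it is run from ${NUTCH_RUNTIME_HOME} and uses the same single seed URL as the example.

```shell
# Sketch: create urls/seed.txt with one seed URL per line.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show the resulting seed list
cat urls/seed.txt
```

Further sites can be added by passing more URLs to printf, each as its own argument.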

...

  1. The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
  2. The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
  3. A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
    • a crawl_generate names a set of URLs to be fetched
    • a crawl_fetch contains the status of fetching each URL
    • a content contains the raw content retrieved from each URL
    • a parse_text contains the parsed text of each URL
    • a parse_data contains outlinks and metadata parsed from each URL
    • a crawl_parse contains the outlink URLs, used to update the crawldb
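
The segment layout listed above can be checked with a few lines of shell. This is a minimal sketch; check_segment is a hypothetical helper rather than a Nutch command. A segment that has not yet been fetched or parsed will legitimately lack some of these subdirectories, so a "missing" line is not necessarily an error.

```shell
# Sketch: report which of the six expected subdirectories a segment lacks.
# check_segment is a hypothetical helper, not a Nutch command.
# Returns 0 when all subdirectories exist, 1 otherwise.
check_segment() {
  segment=$1
  missing=0
  for sub in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
    if [ ! -d "$segment/$sub" ]; then
      echo "missing: $sub"
      missing=1
    fi
  done
  return "$missing"
}

# Example (the segment path is a placeholder):
#   check_segment path/to/segment
```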

Step-by-Step: Seeding the crawldb with a list of URLs

...

  • Map: Identity map where keys are digests and values are CrawlDatum records
  • Reduce: CrawlDatums with the same digest are marked (except one of them) as duplicates. There are multiple heuristics available to choose the item which is not marked as duplicate - the one with the shortest URL, fetched most recently, or with the highest score.
No Format
     Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]
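
The reduce step can be illustrated outside of Nutch with sort and awk. The sketch below is a simplified stand-in, not Nutch's implementation: it assumes tab-separated input of digest, URL and score, keeps the record with the highest score per digest (breaking ties by shortest URL), and marks the rest as duplicates. The fetch-time heuristic is left out, and mark_duplicates is a hypothetical name.

```shell
# Sketch: mark all but one record per digest as duplicate.
# Input lines:  digest<TAB>url<TAB>score
# Output lines: status<TAB>url   (status is "keep" or "duplicate")
mark_duplicates() {
  tab=$(printf '\t')
  # Prefix each record with its sort keys: digest, score, URL length
  awk -F '\t' '{ print $1 "\t" $3 "\t" length($2) "\t" $2 }' |
    # Group by digest; highest score first, then shortest URL
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2nr -k3,3n |
    # The first record of each digest group is kept
    awk -F '\t' '{
      status = ($1 == prev) ? "duplicate" : "keep"
      print status "\t" $4
      prev = $1
    }'
}
```

With the real tool, the order of the heuristics would be chosen through the -compareOrder option shown in the usage line above, for example -compareOrder score,fetchTime,urlLength.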

...

Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.

Nutch   Solr
1.15    7.3.1
1.14    6.6.0
1.13    5.5.0
1.12    5.4.1

To install Solr 7.x:

  • download binary file from here
  • unzip to $HOME/apache-solr; we will refer to this directory as ${APACHE_SOLR_HOME}
  • create resources for a new nutch solr core

    No Format
    mkdir -p ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
    cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/_default/* ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
    


  • copy the nutch schema.xml into the conf directory

    No Format
    cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
    


You may try to use the most recent schema.xml in case of issues launching Solr with this schema.

  • make sure that there is no managed-schema "in the way":

    No Format
    rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
    


  • start the solr server

    No Format
    ${APACHE_SOLR_HOME}/bin/solr start
    


  • create the nutch core

    No Format
    ${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
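
Core creation depends on the two preceding file operations having worked: schema.xml must be present in the configset and managed-schema must be gone. A small pre-flight check, sketched below, can confirm this before running the create command; check_nutch_configset is a hypothetical helper name.

```shell
# Sketch: verify the nutch configset before creating the core.
# check_nutch_configset is a hypothetical helper, not part of Solr or Nutch.
# Returns 0 when the conf directory looks ready, 1 otherwise.
check_nutch_configset() {
  conf_dir=$1
  rc=0
  if [ ! -f "$conf_dir/schema.xml" ]; then
    echo "schema.xml not found in $conf_dir" >&2
    rc=1
  fi
  if [ -e "$conf_dir/managed-schema" ]; then
    echo "managed-schema still present in $conf_dir" >&2
    rc=1
  fi
  return "$rc"
}

# Example:
#   check_nutch_configset "${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf"
```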
    


...

Verify Solr installation

After you have started Solr, you should be able to access the following links in the admin console:

...