Have a configured local Nutch crawler setup to crawl on one machine
Learned how to understand and configure Nutch runtime configuration including seed URL lists, URLFilters, etc.
Have executed a Nutch crawl cycle and viewed the results of the Crawl Database
Indexed Nutch crawl records into Apache Solr for full text search

Any issues with this tutorial should be reported to the Nutch user@ list.

Steps

Note
This tutorial describes the installation and use of Nutch 1.x (e.g. release cut from the master branch). For a similar Nutch 2.x with HBase tutorial, see Nutch2Tutorial.

...

Unix environment, or Windows-Cygwin environment
Java Runtime/Development Environment (JDK 1.8 / Java 8)
(Source build only) Apache Ant: http://ant.apache.org/

Install Nutch

Option 1: Set up Nutch from a binary distribution

Download a binary package (apache-nutch-1.X-bin.zip) from here.
Unzip your binary Nutch package. There should be a folder apache-nutch-1.X.
cd apache-nutch-1.X/
From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory (apache-nutch-1.X/).

Option 2: Set up Nutch from a source distribution

...

Download a source package (apache-nutch-1.X-src.zip)
Unzip
cd apache-nutch-1.X/
Run ant in this folder (cf. RunNutchInEclipse)
Now there is a directory runtime/local which contains a ready-to-use Nutch installation.
When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that:
- config files should be modified in apache-nutch-1.X/runtime/local/conf/
- ant clean will remove this directory (keep copies of modified config files)
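
Because ant clean removes runtime/local, it can help to copy the modified config files elsewhere before rebuilding. The following is a minimal sketch only: backup_conf is a hypothetical helper name, not part of Nutch, and the source and destination paths are whatever you pass in.

```shell
# Hypothetical helper (not part of Nutch): copy a conf directory to a backup location.
backup_conf() {
  src="$1"
  dest="$2"
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  # Copy the directory contents, including hidden files, into the backup directory.
  mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}
```

For example, backup_conf apache-nutch-1.X/runtime/local/conf "$HOME/nutch-conf-backup" could be run before ant clean, where the backup path is an assumption of this sketch.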

Verify your Nutch installation

Run "bin/nutch". You can confirm a correct installation if you see something similar to the following:

No Format

Usage: nutch COMMAND where command is one of:
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
...

...

Run the following command if you are seeing "Permission denied":

No Format
chmod +x bin/nutch

Set JAVA_HOME if you are seeing "JAVA_HOME not set". On Mac, you can run the following command or add it to ~/.bashrc:

No Format
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.8/Home # note that the actual path may be different on your system
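
Before running bin/nutch again, a quick check of the variable can save a round trip. This is a sketch under the assumption that a usable JDK has a bin/java executable under JAVA_HOME; check_java_home is a hypothetical helper name.

```shell
# Hypothetical helper: report whether JAVA_HOME is set and looks like a JDK/JRE home.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "${JAVA_HOME}/bin/java" ]; then
    echo "JAVA_HOME=${JAVA_HOME} does not contain bin/java"
    return 1
  fi
  echo "JAVA_HOME=${JAVA_HOME}"
}

# Print the result without aborting the shell if the check fails.
check_java_home || true
```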

...

Customize your crawl properties, where at a minimum, you provide a name for your crawler for external servers to recognize
Set a seed list of URLs to crawl

Customize your crawl properties

Default crawl properties can be viewed and edited within conf/nutch-default.xml, where most of these can be used without modification.
The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that overwrite conf/nutch-default.xml. The only required modification for this file is to override the value field of the http.agent.name property, i.e. add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:

No Format
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
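
The property has to sit inside the file's configuration root element. As a rough sketch, the whole file could be written as below. CONF_DIR is a hypothetical variable used here so the sketch does not touch a real installation; point it at ${NUTCH_RUNTIME_HOME}/conf to apply it. The agent name is the example value from above.

```shell
# CONF_DIR is an assumption of this sketch; it defaults to a scratch directory.
CONF_DIR="${CONF_DIR:-./conf-sketch}"
mkdir -p "$CONF_DIR"

# Keep a copy of any existing file before overwriting it.
if [ -f "$CONF_DIR/nutch-site.xml" ]; then
  cp "$CONF_DIR/nutch-site.xml" "$CONF_DIR/nutch-site.xml.bak"
fi

# Write a minimal nutch-site.xml that only overrides http.agent.name.
cat > "$CONF_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
```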

Ensure that the plugin.includes property within conf/nutch-site.xml includes the indexer as indexer-solr.

Create a URL seed list

A URL seed list includes a list of websites, one per line, which Nutch will look to crawl.
The file conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download.
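
Rules in conf/regex-urlfilter.txt are regular expressions prefixed with + (accept) or - (reject). A candidate pattern can be tried against sample URLs before it goes into the file. The sketch below uses grep -E as a stand-in for Nutch's Java regex matching, which is an approximation; the pattern, the helper name matches_rule, and the sample URLs are illustrative assumptions.

```shell
# Candidate accept rule; in conf/regex-urlfilter.txt it would be written with a leading "+".
pattern='^https?://([a-z0-9-]+\.)*nutch\.apache\.org/'

# Hypothetical helper: succeeds when the URL matches the candidate pattern.
matches_rule() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in 'http://nutch.apache.org/' 'https://example.com/page'; do
  if matches_rule "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```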

Create a URL seed list

mkdir -p urls
cd urls
touch seed.txt to create a text file seed.txt under urls/, then edit it to add the following content (one URL per line for each site you want Nutch to crawl).

No Format
http://nutch.apache.org/
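
The same steps can also be done non-interactively. This sketch assumes it is run from the directory in which you want urls/ created, and uses the single seed URL shown above.

```shell
# Create urls/seed.txt with one URL per line.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show the result.
cat urls/seed.txt
```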

...

The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
- crawl_generate names a set of URLs to be fetched
- crawl_fetch contains the status of fetching each URL
- content contains the raw content retrieved from each URL
- parse_text contains the parsed text of each URL
- parse_data contains outlinks and metadata parsed from each URL
- crawl_parse contains the outlink URLs, used to update the crawldb
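
To see how far a given segment has progressed, it can be useful to check which of these subdirectories exist. The helper below is a hypothetical sketch, not a Nutch command; check_segment and SEGMENT_DIR are names introduced here for illustration.

```shell
# Hypothetical helper: list which of the six segment subdirectories are present.
check_segment() {
  segment="$1"
  for part in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
    if [ -d "$segment/$part" ]; then
      echo "present $part"
    else
      echo "missing $part"
    fi
  done
}

# SEGMENT_DIR is an assumption; set it to the path of one segment directory.
check_segment "${SEGMENT_DIR:-.}"
```

A segment that has only been generated would be expected to show crawl_generate as present and the rest as missing.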

Step-by-Step: Seeding the crawldb with a list of URLs

...

Map: Identity map where keys are digests and values are CrawlDatum records
Reduce: All but one of the CrawlDatums sharing a digest are marked as duplicates. Several heuristics are available to choose the item that is not marked as duplicate: the one with the shortest URL, the one fetched most recently, or the one with the highest score.

No Format
Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]
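
The idea behind the reduce step can be illustrated outside of Nutch. The sketch below is not Nutch's implementation: it groups made-up records by digest and keeps the one with the highest score, using sort and awk. The record format (digest, URL, score) and the sample data are assumptions for illustration only.

```shell
# Hypothetical sample records: <digest> <url> <score>
records='d1 http://example.com/a 0.5
d1 http://example.com/a?x=1 0.9
d2 http://example.com/b 0.3'

# Sort by digest, then by score descending; the first record per digest is kept,
# every later record with the same digest is marked as a duplicate.
dedup_result=$(printf '%s\n' "$records" \
  | LC_ALL=C sort -k1,1 -k3,3nr \
  | awk '{ status = ($1 == prev) ? "duplicate" : "keep"; prev = $1; print status, $2 }')

printf '%s\n' "$dedup_result"
```

Swapping the sort key would correspond to a different heuristic, for example ordering by URL length instead of score.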

...

Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.

Nutch	Solr
1.15	7.3.1
1.14	6.6.0
1.13	5.5.0
1.12	5.4.1

To install Solr 7.x:

Download the binary file from here.
Unzip to $HOME/apache-solr; we will now refer to this as ${APACHE_SOLR_HOME}.

Create resources for a new Nutch Solr core

No Format
mkdir -p ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/_default/* ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/

Copy the Nutch schema.xml into the conf directory

No Format
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/

You may try to use the most recent schema.xml in case of issues launching Solr with this schema.

Make sure that there is no managed-schema "in the way":

No Format
rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema

Start the Solr server

No Format
${APACHE_SOLR_HOME}/bin/solr start

Create the Nutch core

No Format
${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/

...

(Nutch 1.15 and later) edit the file conf/index-writers.xml, see IndexWriters
(until Nutch 1.14) add the core name to the Solr server URL: -Dsolr.server.url=http://localhost:8983/solr/nutch
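
For the pre-1.15 case, the option value is simply the Solr base URL followed by the core name. A small sketch of how it is put together, assuming the default local host and port from the example above and the core name nutch created earlier; the variable names are hypothetical.

```shell
# Assumed values matching the example above; adjust to your Solr setup.
solr_host="localhost"
solr_port="8983"
solr_core="nutch"

# The core name is appended to the Solr base URL.
solr_url="http://${solr_host}:${solr_port}/solr/${solr_core}"
solr_option="-Dsolr.server.url=${solr_url}"

echo "$solr_option"
```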

Verify Solr installation

After you have started the Solr admin console, you should be able to access the following links:

...
