Please contribute your knowledge about Nutch here!

If you would like to update any content, add your own content, or see something added, please forward your wiki username to the dev [at] nutch.apache.org mailing list (someone will give you permissions).

Or browse the open issues, open a new Jira ticket, or check the Nutch source code on Git.

What is Apache Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing.
Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.

Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.

DownloadingNutch
Current CommandLineOptions: Command line options for 1.X and 2.X
JavaDocs – The JavaDocs for the most recent Nutch-1.X release.
JavaDocs – The JavaDocs for the most recent Nutch-2.X release.

Tutorials

Nutch 1.X tutorial(s)

NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.
QuickStartparseChecker - Quick start tutorial on how to use the ParseChecker tool to quickly scrape a website.
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI - An overview of the entire Nutch 1.X REST API.

Nutch 2.X tutorial(s)

Nutch2Tutorial – How to get Nutch 2.X to use HBase as persistence layer for Gora. This is the primary Nutch 2.X tutorial.
Setting up Nutch 2.x with Cassandra - How to set up and run Nutch 2.x using Cassandra as storage.
How to map your Nutch 2.x Hbase table to Hive - Sample query for Hive mapping.
Accumulo, Nutch, and Gora - A step-by-step tutorial (very old).

General Information

Nutch Website
Features - :TODO: This needs to be completely overhauled to reflect recent Nutch features.
Current Nutch Gotchas
PublicServers running Nutch
Presentations on Nutch
Press Articles
Evaluations of Search Quality
Commercial Support & developers for hire
Mailing Lists
AcademicArticles that deal with Nutch
FAQ
HardwareRequirements
NutchResources
NutchScoring - The whats and wheres of Scoring implementations in Apache Nutch
NutchFileFormats - Provides information on the Nutch file formats

Nutch Development

Becoming a Nutch Developer - Start developing and contributing to Nutch.
PluginCentral – How to write your own plugins and use other people's.
InternalDocumentation – How Nutch works.
Nutch Version Control
UsingGit - A guide to leveraging Git and Nutch. Nutch's source code is no longer managed in Subversion; it is managed in Git.
HowToContribute
Committer's_Rules – Committers should follow these guidelines when deciding which branch to use for committing patches and when to commit.
Release_HOWTO
Apache CMS - How to edit the Nutch website, which is based on the Apache CMS (http://www.apache.org/dev/cms.html).
Image_Search_Design
StrategicGoals
Getting_Started
NutchMeetUps - Records of previous Nutch community meetups, hackathons, barcamps, etc.
Using Nutch as a Maven dependency
GoogleSummerOfCode - An area dedicated to GSoC projects and student/mentor development/documentation sandbox.
AdvancedAjaxInteraction - Discussion centered on enabling Nutch to not only fetch, but also interact with JavaScript
WhiteListRobots - User guide for the new host robots.txt whitelist capability

Nutch 2.x

Nutch2Crawling - A description of the crawling jobs and field to database mappings.
Nutch2Architecture - A high level overview of the new architecture and design
Nutch2Roadmap – Discussions on the architecture and features of Nutch 2.0
Build Nutch 2.0 in Eclipse – How to set up your IDE environment comfortably.
ErrorMessagesInNutch2 – What they mean and suggestions for getting rid of them.
NutchConfigurationFiles-2.x – Configuration files that are specific to Nutch-2.x
Understanding the columns/fields in Nutch 2.0 Webpage - Detailed article
WorkingWithGoraSnapshots - A step-by-step guide to working with Gora development code within your Nutch 2.x deployment
NutchRESTAPI - A UML diagram and overview of the entire Nutch 2.X REST API.

Archive and Old Nutch Versions

Archive and Legacy

How to edit this Wiki

This wiki is a collaborative site; anyone can contribute and share:

Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
Edit any page by pressing "Edit" at the top or the bottom of the page.

There are some conventions used on the Nutch wiki:

:TODO: is used to denote sections that definitely need to be cleaned up.

Some general info on using this Wiki Software:

Create a link to another page with joined capitalized words (like WikiSandBox) or with ["quoted words in brackets"].

To help avoid spam, the Nutch wiki is only editable by known accounts. If you would like to help out with the Nutch wiki, add a new page, or work on an existing one, please first create a wiki account by clicking "Sign Up", or click "Log in" if you already have an account.
