Comment: Update link to nutch website repo

...

Please contribute your knowledge about Nutch here!

*If you would like to update any content, would like to add your own content or would like to see something added then please

Table of Contents

Or browse the open issues, open a new Jira ticket, or check the Nutch source code on git.


What is Apache Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

Tip

Nutch 1.x (ACTIVE): A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing.


Warning

Nutch 2.x (INACTIVE): An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase.


Being pluggable and modular of course has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.

...

You can download Nutch here. For more information about Apache Nutch, please see the Nutch wiki. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

...

Tutorials

Nutch 1.X tutorial(s)

Nutch 2.X tutorial(s)

Other Tutorial(s)

...

  • OverviewDeploymentConfigs (warning) :This entire page requires a complete update to reflect recent Nutch releases: (warning)
  • NutchConfigurationFiles: An overview from Nutch developers.
  • NutchPropertiesCompleteList: A fine-grained account of all Nutch configuration properties.
  • HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
  • NonDefaultIntranetCrawlingOptions - Desirable options to add to your Nutch intranet crawling configuration.
  • OptimizingCrawls - How to optimise your crawling/fetching speed with Nutch.
  • ErrorMessages – What they mean and suggestions for getting rid of them. (warning) :This requires extensive updating to reflect recent Nutch releases. In addition, the legacy indexing and searching material should be archived. (warning)
  • IndexStructure (warning) :This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing: (warning)
  • IndexWriters: How to configure the index writers for the indexing step.
  • Exchanges: How to configure the exchanges for the indexing step.
  • Logging: Details of logging using SLF4J and Log4j 2.
  • Metrics: A narrative on Nutch application metrics. It details which metrics are captured for which Nutch jobs and within which tasks.

General Information

Nutch Development

Nutch 2.x

Pre Nutch 1.3 and Archive

Archive and Old Nutch Versions

How to edit this Wiki

This Wiki is a collaborative site; anyone can contribute and share:

  • Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
  • Edit any page by pressing "Edit" at the top or the bottom of the page.

There are some conventions used on the Nutch wiki:

  • (warning) :TODO: (warning) (written as /!\ :TODO: /!\ in wiki markup) is used to denote sections that definitely need to be cleaned up.

Some general info on using this Wiki Software:

...

Wiki Markup
Create a link to another page with joined capitalized words (like [WikiSandBox]) or with ["quoted words in brackets"].

...

To help avoid spam, the Nutch wiki is only editable by known accounts. If you would like to help out with the Nutch wiki, add a new page, or work on an existing one, please first create a wiki account by clicking on "Sign Up", or click "Log in" if you already have an account.