Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

Tip

Nutch 1.x (ACTIVE): A well matured, production ready crawler. 1.x enables fine grained configuration, relying on Apache Hadoop data structures, which are great for batch

...

processin


Warning

Nutch 2.x (INACTIVE): An emerging alternative taking direct inspiration from 1.x, but which differs in one key area; storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexibile model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase.


Being pluggable and modular of course has it's benefits, Nutch provides extensible interfaces such as Parse, Index and ScoringFilter's for custom implementations e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elastic Search, etc.

...

You can download Nutch here.For more information about Apache Nutch, please see the Nutch wiki.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

...

Nutch 2.X tutorial(s)

Other Tutorial(s)

...