...
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:
Tip |
---|
Nutch 1.x (ACTIVE): A well matured, production ready crawler. 1.x enables fine grained configuration, relying on Apache Hadoop data structures, which are great for batch |
...
processin |
Warning |
---|
Nutch 2.x (INACTIVE): An emerging alternative taking direct inspiration from 1.x, but which differs in one key area; storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexibile model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase. |
Being pluggable and modular of course has it's benefits, Nutch provides extensible interfaces such as Parse, Index and ScoringFilter's for custom implementations e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elastic Search, etc.
...
You can download Nutch here.For more information about Apache Nutch, please see the Nutch wiki.
Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.
...
- NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.
- QuickStartparseChecker - Quick start tutorial on how to use the ParseChecker tool to quickly scrape a website.
- https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI - An overview of the entire Nutch 1.X REST API.
Nutch 2.X tutorial(s)
- Nutch2Tutorial – How to get Nutch 2.X to use HBase as persistence layer for Gora. This is the primary Nutch 2.X tutorial.
- Setting up Nutch 2.x with Cassandra - How to setup and run Nutch 2.x using Cassandra as storage.
- How to map your Nutch 2.x Hbase table to Hive - Sample query for Hive mapping.Accumulo, Nutch, and Gora - A step-by-step tutorial Very Old
Other Tutorial(s)
- Focused Crawling with Nutch using Cosine Similarity, Naive Bayes or the Anthelion mechanisms.
- Hadoop Tutorial Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
- Running Nutch in (pseudo) distributed mode - How to setup and run Nutch in Hadoop pseudo-distributed mode.
- RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
- Intranet Document Search - Index and search Microsoft Office, PDF etc. documents in a file system hierarchy with a Solr backend.
- Recrawling with Nutch - How to re-crawl with Nutch.
- Ajax-Solr Tutorial: Nutch - Quick and easy guide to getting a nice UI on top of your Nutch crawl data.
- AJAX/JavaScript Enabled Parsing with Apache Nutch and Selenium
- SetupProxyForNutch - using Tinyproxy on Ubuntu
- SetupNutchAndTor - Crawling .onion hidden services using Nutch behind Polipo HTTP Proxy
- CloudSearch - Step by step instructions on using Nutch with Cloudsearch, including pseudo distributed mode
- Webcast : running Apache Nutch on Elastic MapReduce
...