...
Tip |
---|
Nutch 1.x (ACTIVE): A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing. |
Warning |
---|
Nutch 2.x (INACTIVE): An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase. |
...
- DownloadingNutch
- Current CommandLineOptions: Command line options for 1.X and 2.X
- JavaDocs – The JavaDocs for the most recent Nutch-1.X release.
- JavaDocs – The JavaDocs for the most recent Nutch-2.X release.
...
- NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.
- QuickStartparseChecker - Quick start tutorial on how to use the ParseChecker tool to quickly scrape a website.
- Nutch 1.X REST API (https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI) - An overview of the entire Nutch 1.X REST API.
- Running Nutch on Tez - Covers using Apache Tez as the YARN execution engine
Other Tutorial(s)
- Focused Crawling with Nutch using Cosine Similarity, Naive Bayes or the Anthelion mechanisms.
- Hadoop Tutorial - Nutch is based on Hadoop, so it helps to have a good understanding of Hadoop.
- Running Nutch in (pseudo) distributed mode - How to set up and run Nutch in Hadoop pseudo-distributed mode.
- RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
- Intranet Document Search - Index and search Microsoft Office, PDF, and other documents in a file system hierarchy with a Solr backend.
- Recrawling with Nutch - How to re-crawl with Nutch.
- Ajax-Solr Tutorial: Nutch - Quick and easy guide to getting a nice UI on top of your Nutch crawl data.
- AJAX/JavaScript Enabled Parsing with Apache Nutch and Selenium
- SetupProxyForNutch - using Tinyproxy on Ubuntu
- SetupNutchAndTor - Crawling .onion hidden services using Nutch behind Polipo HTTP Proxy
- CloudSearch - Step-by-step instructions on using Nutch with CloudSearch, including pseudo-distributed mode
- Webcast: Running Apache Nutch on Elastic MapReduce
...