Versions Compared

Comment: Update link to nutch website repo

...

Tip

Nutch 1.x (ACTIVE): A well-matured, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing.


Warning

Nutch 2.x (INACTIVE): An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object-to-persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase.

...

...

Other Tutorial(s)

...

  • OverviewDeploymentConfigs (warning): This page requires a complete update to reflect recent Nutch releases.
  • NutchConfigurationFiles: An overview from Nutch developers.
  • NutchPropertiesCompleteList: A fine-grained account of all Nutch property configuration.
  • HttpAuthenticationSchemes: How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
  • NonDefaultIntranetCrawlingOptions: Desirable options to add to your Nutch intranet crawling configuration.
  • OptimizingCrawls: How to optimise your crawling/fetching speed with Nutch.
  • ErrorMessages: What they mean and suggestions for getting rid of them. (warning) This page requires extensive updating to reflect recent Nutch releases. In addition, the legacy indexing and searching material should be archived.
  • IndexStructure (warning): This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing.
  • IndexWriters: How to configure the index writers for the indexing step.
  • Exchanges: How to configure the exchanges for the indexing step.
  • Logging: Details of logging using slf4j and log4j2.
  • Metrics: A narrative on Nutch application metrics. It details which metrics are captured for which Nutch Jobs within which Tasks.

General Information

...

...