Introduction

This page provides a narrative on Nutch application metrics. It details which metrics are captured for which Nutch Jobs within which Tasks.

Metrics are important because they provide vital information about any given Nutch (and, by extension, MapReduce) process. They give accurate measurements of how the process is functioning and provide a basis for suggesting improvements.

Metrics provide a data-driven mechanism for intelligence gathering within Nutch operations and administration.

Audience

This page is intended for

users who wish to learn how Nutch Jobs and Tasks are performing, and
developers who wish to extend or customize Nutch metrics.

Related Development Work

N/A

Building Metrics on MapReduce Contexts

As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb and ParseSegment) use MapContexts and ReduceContexts. These Contexts are passed to the Mapper and Reducer during setup and are also used throughout the lifecycle of each Mapper or Reducer task.

Hadoop documentation

The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the role Contexts play in each task lifecycle.

This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext. Specifically, the getCounter(...) methods provide access to Hadoop Counters, which we discuss below.

Hadoop Counters

A Counter is simply a record comprising a name and a value. As one would expect, Counters can be incremented, for example to count the total number of records processed by a task.
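To make the name/value idea concrete, the following is a minimal plain-Java sketch of counters organised into groups. SimpleCounters is a hypothetical stand-in written for illustration only; it is not a Hadoop or Nutch class. In a real job, Counters are obtained from the task Context via getCounter(...) rather than managed by hand.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a set of grouped counters; NOT a Hadoop class. */
class SimpleCounters {
  // group name -> (counter name -> current value)
  private final Map<String, Map<String, AtomicLong>> groups = new TreeMap<>();

  /** Adds amount to the named counter, creating it at zero if needed. */
  void increment(String group, String name, long amount) {
    groups.computeIfAbsent(group, g -> new TreeMap<>())
        .computeIfAbsent(name, n -> new AtomicLong())
        .addAndGet(amount);
  }

  /** Returns the current value, or 0 if the counter was never incremented. */
  long get(String group, String name) {
    Map<String, AtomicLong> counters = groups.get(group);
    if (counters == null) {
      return 0L;
    }
    AtomicLong counter = counters.get(name);
    return counter == null ? 0L : counter.get();
  }

  public static void main(String[] args) {
    SimpleCounters counters = new SimpleCounters();
    counters.increment("injector", "urls_filtered", 1);
    counters.increment("injector", "urls_filtered", 1);
    counters.increment("injector", "urls_injected", 5);
    System.out.println(counters.get("injector", "urls_filtered")); // prints 2
    System.out.println(counters.get("injector", "urls_injected")); // prints 5
  }
}
```

The group and counter names used here ("injector", "urls_filtered", "urls_injected") are taken from the metrics table on this page; the values are made up.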

The following example shows how Counters are used within the Nutch Injector to count the total number of URLs filtered during the Map phase of this job.

Use of Counters in the Nutch Injector

    @Override
    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        // if it's a URL from the seed list
        String url = key.toString().trim();

        // remove empty string or string starting with '#'
        if (url.length() == 0 || url.startsWith("#"))
          return;

        url = filterNormalize(url);
        if (url == null) {
          context.getCounter("injector", "urls_filtered").increment(1);
          // ... (remainder of map() omitted from this excerpt)

Line 14 of the excerpt above increments the urls_filtered counter in the injector counter group by 1.

The end result is that we generate useful, insightful metrics for each mapper and reducer task for any given Nutch Job.
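Hadoop sums the counter values reported by the individual tasks into job-level totals. The sketch below illustrates that aggregation in plain Java, as a rough model only. The CounterAggregation class, the aggregate method and the "group:name" key format are hypothetical and are not part of Hadoop or Nutch; the per-task values are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of per-task counters rolling up into job totals. */
class CounterAggregation {

  /** Sums per-task counter values into totals, keyed by "group:name". */
  static Map<String, Long> aggregate(List<Map<String, Long>> perTaskCounters) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> task : perTaskCounters) {
      // merge() inserts the value, or adds it to the existing total
      task.forEach((key, value) -> totals.merge(key, value, Long::sum));
    }
    return totals;
  }

  public static void main(String[] args) {
    // Invented counter values for two map tasks of one job
    Map<String, Long> task1 =
        Map.of("injector:urls_filtered", 3L, "injector:urls_injected", 10L);
    Map<String, Long> task2 = Map.of("injector:urls_filtered", 2L);

    System.out.println(aggregate(List.of(task1, task2)));
    // prints {injector:urls_filtered=5, injector:urls_injected=10}
  }
}
```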

See below for details on each Nutch metric available.

Metrics Table

The table below provides a canonical, comprehensive collection of Nutch metrics.

Table Ordering Logic

The table is sorted alphabetically

by the Tool/Object column;
by Metric Group, within a given tool;
by Metric Name, within a given metric group.
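For anyone maintaining the table, the three-level ordering described above can be expressed as a chained comparator. This is only a sketch: MetricRow and TABLE_ORDER are hypothetical names, and comparing case-insensitively is an assumption, since the page only says "alphabetically".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for keeping metric table rows in the documented order. */
class MetricRowOrdering {

  /** One row of the table: tool, metric group and metric name. */
  static final class MetricRow {
    final String tool;
    final String group;
    final String name;

    MetricRow(String tool, String group, String name) {
      this.tool = tool;
      this.group = group;
      this.name = name;
    }

    @Override
    public String toString() {
      return tool + " | " + group + " | " + name;
    }
  }

  // Tool first, then metric group, then metric name.
  // Case-insensitive comparison is an assumption of this sketch.
  static final Comparator<MetricRow> TABLE_ORDER =
      Comparator.comparing((MetricRow r) -> r.tool, String.CASE_INSENSITIVE_ORDER)
          .thenComparing(r -> r.group, String.CASE_INSENSITIVE_ORDER)
          .thenComparing(r -> r.name, String.CASE_INSENSITIVE_ORDER);

  public static void main(String[] args) {
    List<MetricRow> rows = new ArrayList<>(List.of(
        new MetricRow("WebGraph", "WebGraph.outlinks", "removed links"),
        new MetricRow("Injector", "injector", "urls_merged"),
        new MetricRow("Injector", "injector", "urls_filtered"),
        new MetricRow("WebGraph", "WebGraph.outlinks", "added links")));

    rows.sort(TABLE_ORDER);
    rows.forEach(System.out::println);
  }
}
```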

Tool/Object	Metric Group	Metric Name	Description	Usage and Comments
CleaningJob	CleaningJobStatus	Deleted documents	The total count of DB_GONE and/or DB_DUPLICATE documents ultimately cleaned (deleted) from the indexer(s).	This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream prior to indexing. Ideally DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place.
CrawlDbFilter	CrawlDB filter	Gone records removed
	CrawlDB filter	Orphan records removed
	CrawlDB filter	URLs filtered
CrawlDbReducer	CrawlDB status	CrawlDatum.getStatusName(CrawlDatum().getStatus())
DeduplicationJob	DeduplicationJobStatus	Documents marked as duplicate
DomainStatistics	N/A	MyCounter.EMPTY_RESULT
	N/A	MyCounter.FETCHED
	N/A	MyCounter.NOT_FETCHED
Fetcher	FetcherStatus	bytes_downloaded
	FetcherStatus	hitByThrougputThreshold
	FetcherStatus	hitByTimeLimit
FetcherThread	FetcherStatus	AboveExceptionThresholdInQueue
	FetcherStatus	FetchItem.notCreated.redirect
	FetcherStatus	outlinks_detected
	FetcherStatus	outlinks_following
	FetcherStatus	ProtocolStatus.getName()
	FetcherStatus	redirect_count_exceeded
	FetcherStatus	redirect_deduplicated
	FetcherStatus	robots_denied
	FetcherStatus	robots_denied_maxcrawldelay
	ParserStatus	ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()]
Generator	Generator	EXPR_REJECTED
	Generator	HOSTS_AFFECTED_PER_HOST_OVERFLOW
	Generator	INTERVAL_REJECTED
	Generator	MALFORMED_URL
	Generator	SCHEDULE_REJECTED
	Generator	SCORE_TOO_LOW
	Generator	STATUS_REJECTED
	Generator	URLS_SKIPPED_PER_HOST_OVERFLOW
IndexerMapReduce	IndexerStatus	deleted (duplicates)
	IndexerStatus	deleted (IndexingFilter)
	IndexerStatus	deleted (gone)
	IndexerStatus	deleted (redirects)
	IndexerStatus	deleted (robots=noindex)
	IndexerStatus	errors (IndexingFilter)
	IndexerStatus	errors (ScoringFilter)
	IndexerStatus	indexed (add/update)
	IndexerStatus	skipped (IndexingFilter)
	IndexerStatus	skipped (not modified)
Injector	injector	urls_filtered
	injector	urls_injected
	injector	urls_merged
	injector	urls_purged_404
	injector	urls_purged_filter
ParseSegment	ParserStatus	ParseStatus.majorCodes[parseStatus.getMajorCode()]
QueueFeeder	FetcherStatus	filtered
(also QueueFeeder)	FetcherStatus	AboveExceptionThresholdInQueue
ResolverThread	UpdateHostDb	checked_hosts
	UpdateHostDb	existing_known_host
	UpdateHostDb	existing_unknown_host
	UpdateHostDb	new_known_host
	UpdateHostDb	new_unknown_host
	UpdateHostDb	purged_unknown_host
	UpdateHostDb	rediscovered_host
	UpdateHostDb	Long.toString(datum.numFailures()) + "_times_failed"
SitemapProcessor	Sitemap	existing_sitemap_entries
	Sitemap	failed_fetches
	Sitemap	filtered_records
	Sitemap	filtered_sitemaps_from_hostname
	Sitemap	new_sitemap_entries
	Sitemap	sitemaps_from_hostname
	Sitemap	sitemap_seeds
UpdateHostDbMapper	UpdateHostDb	filtered_records
UpdateHostDbReducer	UpdateHostDb	total_hosts
(also UpdateHostDbReducer)	UpdateHostDb	skipped_not_eligible
WARCExporter	WARCExporter	exception
	WARCExporter	invalid URI
	WARCExporter	missing content
	WARCExporter	missing metadata
	WARCExporter	omitted empty response
	WARCExporter	records generated
WebGraph	WebGraph.outlinks	added links
(also WebGraph)	WebGraph.outlinks	removed links
