Under construction
Note from Lewis McGibbney: This page is under construction.
Introduction
This page provides a narrative on Nutch application metrics. It details which metrics are captured for which Nutch Jobs, and within which Tasks.
Metrics are important because they provide vital information about any given Nutch (and subsequently MapReduce) process. They give accurate measurements of how the process is functioning and a basis for suggesting improvements.
Metrics provide a data-driven mechanism for intelligence gathering within Nutch operations and administration.
Audience
The page is intended for
- users who wish to learn how Nutch Jobs and Tasks are performing, and
- developers who wish to extend or customize Nutch metrics.
Related Development Work
N/A
Building Metrics on MapReduce Contexts
As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb and ParseSegment) utilize MapContexts and ReduceContexts. These Contexts are passed to the Mapper and Reducer initially during setup, and are also used throughout each Mapper or Reducer task lifecycle.
This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext (see the Hadoop documentation). Specifically, the getCounter(...) methods provide access to Hadoop Counters, which we discuss below.
Hadoop Counters
A Counter is simply a record comprising a name and a value. As one would expect, Counters can be incremented in order to count, for example, how many records in total were processed by the time a task completes.
The Nutch Injector uses Counters in this way to count the total number of URLs filtered during the Map phase of the job. Each time a URL is filtered out, the urls_filtered counter in the injector counter group is incremented by 1.
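The counting pattern described above can be illustrated with a small, self-contained sketch. It is not the Injector source: the SimpleCounters class, the filterNormalize rule and the sample seed lines are hypothetical stand-ins, chosen so that the example runs without Hadoop on the classpath. In a real Mapper the increment would go through the task context, along the lines of context.getCounter("injector", "urls_filtered").increment(1).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InjectorCounterSketch {

  /** Hypothetical in-memory stand-in for the counters reachable from a Hadoop task context. */
  static class SimpleCounters {
    private final Map<String, Long> values = new TreeMap<>();

    /** Adds amount to the counter identified by group and name (starting from 0). */
    void increment(String group, String name, long amount) {
      values.merge(group + "." + name, amount, Long::sum);
    }

    /** Returns the current value, or 0 if the counter was never incremented. */
    long get(String group, String name) {
      return values.getOrDefault(group + "." + name, 0L);
    }
  }

  /** Hypothetical filter rule (an assumption): only http(s) URLs pass, anything else yields null. */
  static String filterNormalize(String url) {
    boolean accepted = url.startsWith("http://") || url.startsWith("https://");
    return accepted ? url : null;
  }

  /** Mimics one map() call per seed line and counts the URLs that the filter rejects. */
  static void map(String line, SimpleCounters counters) {
    String url = line.trim();
    if (url.isEmpty() || url.startsWith("#")) {
      return; // blank lines and comments are skipped, not counted
    }
    if (filterNormalize(url) == null) {
      // Group "injector", counter "urls_filtered", incremented by 1 per rejected URL.
      counters.increment("injector", "urls_filtered", 1);
    }
  }

  public static void main(String[] args) {
    List<String> seeds = Arrays.asList(
        "http://example.org/",
        "ftp://example.org/file",
        "# a comment line",
        "",
        "mailto:someone@example.org");
    SimpleCounters counters = new SimpleCounters();
    for (String seed : seeds) {
      map(seed, counters);
    }
    // prints: injector.urls_filtered = 2
    System.out.println("injector.urls_filtered = " + counters.get("injector", "urls_filtered"));
  }
}
```

Keying each counter by group and name mirrors the two-part identification used in the metrics table further down this page, where every metric belongs to a metric group.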
The end result is that we generate useful, insightful metrics for each mapper and reducer task for any given Nutch Job.
See below for details on each Nutch metric available.
Metrics Table
The table below provides a canonical, comprehensive collection of Nutch metrics.
Table Ordering Logic
The table is arranged
- by Tool column; alphabetically
- by the Metric Group; alphabetically for the given tool
- by Metric Name; alphabetically for the given metric group
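For anyone maintaining the table, the three-level ordering above can be expressed as a chained comparator. The following is only a sketch: the MetricRow class and its field names are hypothetical, and the case-insensitive comparison is an assumption made so that mixed-case metric names sort alphabetically rather than with all uppercase letters first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MetricOrderingSketch {

  /** Hypothetical row type holding the three sort keys of one table row. */
  static final class MetricRow {
    final String tool;
    final String group;
    final String name;

    MetricRow(String tool, String group, String name) {
      this.tool = tool;
      this.group = group;
      this.name = name;
    }

    @Override
    public String toString() {
      return tool + " | " + group + " | " + name;
    }
  }

  /** Orders rows by tool, then metric group, then metric name, ignoring case. */
  static final Comparator<MetricRow> TABLE_ORDER =
      Comparator.comparing((MetricRow r) -> r.tool, String.CASE_INSENSITIVE_ORDER)
          .thenComparing(r -> r.group, String.CASE_INSENSITIVE_ORDER)
          .thenComparing(r -> r.name, String.CASE_INSENSITIVE_ORDER);

  /** Returns a sorted copy and leaves the input list untouched. */
  static List<MetricRow> sorted(List<MetricRow> rows) {
    List<MetricRow> copy = new ArrayList<>(rows);
    copy.sort(TABLE_ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<MetricRow> rows = Arrays.asList(
        new MetricRow("Injector", "injector", "urls_filtered"),
        new MetricRow("Fetcher", "FetcherStatus", "hitByTimeLimit"),
        new MetricRow("Generator", "Generator", "SCORE_TOO_LOW"),
        new MetricRow("Fetcher", "FetcherStatus", "bytes_downloaded"));
    for (MetricRow row : sorted(rows)) {
      System.out.println(row);
    }
  }
}
```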
Tool/Object | Metric Group | Metric Name | Description |
---|---|---|---|
Fetcher | FetcherStatus | bytes_downloaded | |
Fetcher | FetcherStatus | hitByThrougputThreshold | |
Fetcher | FetcherStatus | hitByTimeLimit | |
FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | |
FetcherThread | | FetchItem.notCreated.redirect | |
FetcherThread | | outlinks_detected | |
FetcherThread | | outlinks_following | |
FetcherThread | | ProtocolStatus.getName() | |
FetcherThread | | redirect_count_exceeded | |
FetcherThread | | redirect_deduplicated | |
FetcherThread | FetcherStatus | robots_denied | |
FetcherThread | FetcherStatus | robots_denied_maxcrawldelay | |
FetcherThread | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | |
Generator | Generator | EXPR_REJECTED | |
Generator | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | |
Generator | Generator | INTERVAL_REJECTED | |
Generator | Generator | MALFORMED_URL | |
Generator | Generator | SCHEDULE_REJECTED | |
Generator | Generator | SCORE_TOO_LOW | |
Generator | Generator | STATUS_REJECTED | |
Generator | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | |
Injector | injector | urls_filtered | |
Injector | injector | urls_injected | |
Injector | injector | urls_merged | |
Injector | injector | urls_purged_404 | |
Injector | injector | urls_purged_filter | |

./src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java
./src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
./src/java/org/apache/nutch/tools/warc/WARCExporter.java
./src/java/org/apache/nutch/util/SitemapProcessor.java
./src/java/org/apache/nutch/util/domain/DomainStatistics.java
./src/java/org/apache/nutch/parse/ParseSegment.java
./src/java/org/apache/nutch/fetcher/QueueFeeder.java
./src/java/org/apache/nutch/crawl/CrawlDb.java
./src/java/org/apache/nutch/crawl/CrawlDbReducer.java
./src/java/org/apache/nutch/crawl/DeduplicationJob.java
./src/java/org/apache/nutch/crawl/CrawlDbFilter.java
./src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java
./src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
./src/java/org/apache/nutch/hostdb/ResolverThread.java
./src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
./src/java/org/apache/nutch/indexer/IndexingJob.java
./src/java/org/apache/nutch/indexer/IndexerMapReduce.java
./src/java/org/apache/nutch/indexer/CleaningJob.java