Note from Lewis McGibbney: This page is under construction.
Introduction
This page provides a narrative on Nutch application metrics. It details which metrics are captured for which Nutch Jobs and within which Tasks.
Metrics are important because they tell you vital information about any given Nutch (and hence MapReduce) process. They provide accurate measurements of how the process is functioning and a basis for suggesting improvements.
Metrics provide a data-driven mechanism for intelligence gathering within Nutch operations and administration.
Audience
This page is intended for
- users who wish to learn how Nutch Jobs and Tasks are performing, and
- developers who wish to further extend or customize Nutch metrics.
Related Development Work
See NUTCH-2909 on the ASF JIRA.
Building Metrics on MapReduce Contexts
As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (e.g. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb and ParseSegment) utilize MapContexts and ReduceContexts. These Contexts are passed to the Mapper and Reducer during setup and are also used throughout each Mapper or Reducer task lifecycle.
Note: The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the involvement of Contexts in each task lifecycle.
This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext. Specifically, the getCounter(...) methods facilitate access to Hadoop Counters, which we discuss below.
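To make this concrete, the following is a minimal sketch (the reducer class and the "example"/"keys_reduced" counter are hypothetical, not taken from the Nutch codebase) of a Reducer recording a metric through the Context it receives:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountingReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  public void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    // The Context passed to every reduce() call inherits
    // getCounter(String group, String name) from TaskAttemptContext,
    // so task code can record metrics as a side effect of its normal work.
    context.getCounter("example", "keys_reduced").increment(1);
    context.write(key, new LongWritable(sum));
  }
}
```

Nutch tools follow exactly this pattern, as the Injector example in the next section shows.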
Hadoop Counters
A Counter is simply a record comprising a name and a value. As one would expect, Counters can be incremented, for example to count how many records in total were processed within a task.
The following example shows how Counters are used within the Nutch Injector to count the total number of URLs filtered out during the Map phase of this job.
```java
@Override
public void map(Text key, Writable value, Context context)
    throws IOException, InterruptedException {
  if (value instanceof Text) {
    // if its a url from the seed list
    String url = key.toString().trim();
    // remove empty string or string starting with '#'
    if (url.length() == 0 || url.startsWith("#"))
      return;
    url = filterNormalize(url);
    if (url == null) {
      context.getCounter("injector", "urls_filtered").increment(1);
      // ... (remainder of the map method omitted)
```
The call to context.getCounter("injector", "urls_filtered").increment(1) demonstrates the urls_filtered counter in the injector counter group being incremented by 1.
The end result is that we generate useful, insightful metrics for each mapper and reducer task for any given Nutch Job.
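Once a job has completed, these counters can also be read back programmatically by the submitting driver. The sketch below is illustrative only (the class and method are hypothetical, not part of Nutch); it assumes an org.apache.hadoop.mapreduce.Job that has already run to completion, and looks up the injector/urls_filtered counter shown above.

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

public class CounterReport {

  /**
   * Prints the value of the "injector"/"urls_filtered" counter for a job
   * that has already been submitted and allowed to run to completion,
   * e.g. via job.waitForCompletion(true).
   */
  public static void printUrlsFiltered(Job job) throws IOException {
    Counter urlsFiltered = job.getCounters()
        .findCounter("injector", "urls_filtered");
    System.out.println(urlsFiltered.getDisplayName() + ": "
        + urlsFiltered.getValue());
  }
}
```

The same counters are also reported in the job client log and in the Hadoop web UI once the job finishes, so programmatic access is mainly useful when metrics feed into further automation.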
See below for details on each Nutch metric available.
Metrics Table
The table below provides a canonical, comprehensive collection of Nutch metrics.
Note: The table is arranged by the Tool/Object that generates each metric. As this page is under construction, descriptions and usage comments for many metrics are still to be added.
| Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
|---|---|---|---|---|
| CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
| CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | Purging of DB_GONE records during a CrawlDB update is optional; see the db.update.purge.404 property. |
| | CrawlDB filter | Orphan records removed | The total count of orphaned pages, e.g. a page which no longer has any other pages linking to it, deleted from the CrawlDB during an update. | Purging of orphaned records during a CrawlDB update is optional; see the db.update.purge.orphans property. |
| | CrawlDB filter | URLs filtered | The total count of filtered pages, e.g. pages which didn't pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update. | This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time. POSSIBLE IMPROVEMENT: This metric could be improved if an association could be made between the URL which was filtered and the URLFilter which filtered it. This would facilitate aggregating URL filtering results by URLFilter. |
| CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | With each URL able to have only one state at any given point in time, this metric facilitates aggregated counts of the different types of CrawlDatum states for a given CrawlDB. | The state of any given URL will change as the URL transitions through a crawl cycle. Available URL states are defined in the CrawlDatum, e.g. STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_FETCH_SUCCESS, etc. Practically, CrawlDatum statuses are defined using byte signatures but accessed programmatically using static final constants. This metric can be used to identify the presence of undesired CrawlDatum statuses for given URLs, e.g. STATUS_DB_GONE. Such an event could then trigger a cleaning/pruning operation. |
| DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | The total number of duplicate documents in the CrawlDB. | The process of identifying (near) duplicate documents is of vital importance within the context of a search engine. The precision of any given information retrieval system can be negatively impacted if (near) duplicates are not identified and handled correctly. This does not always mean removing them; for example, (near) duplicates may be important for versioning purposes. In most cases, however, it is preferred to identify and remove (near) duplicate records. The deduplication algorithm in Nutch groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the CrawlDB, which is not necessarily the same as the score indexed). If two (or more) documents have the same score, then the document with the latest timestamp is kept. If the documents have the same timestamp, then the one with the shortest URL is kept. A duplicate record will have a CrawlDatum status of CrawlDatum.STATUS_DB_DUPLICATE. |
| DomainStatistics | N/A | MyCounter.EMPTY_RESULT | The total count of empty (probably problematic) URL records for a given host, domain, suffix or top-level domain. | It is possible that the DomainStatistics tool may identify an empty record for a given URL. This may happen regardless of whether the tool is invoked to retrieve host, domain, suffix or top-level domain statistics. When this discovery event occurs, it is likely that some investigation would take place to understand why. For example, the CrawlDbReader could be invoked with the -url command line argument to further debug/detail what CrawlDatum data exists. The metric names in this group come from an enum; see the counter sketch below the table. |
| | N/A | MyCounter.FETCHED | The total count of fetched URL records for a given host, domain, suffix or top-level domain. | This metric is particularly useful for quickly drilling down through large datasets to determine, for example, how much 'coverage' has been achieved for a given host, domain, suffix or top-level domain. This figure can be compared to a website administrator's total. |
| | N/A | MyCounter.NOT_FETCHED | The total count of unfetched URL records for a given host, domain, suffix or top-level domain. | This metric is particularly useful for quickly drilling down through large datasets to determine, for example, how much 'coverage' still has to be achieved for a given host, domain, suffix or top-level domain. When combined with the fetched figure and compared to a website administrator's total, it can provide useful insight. |
| Fetcher | FetcherStatus | bytes_downloaded | The total bytes of fetched data acquired across the Fetcher Mapper task(s). | Over time, this can be used to benchmark how much data movement is occurring over the Nutch crawl network. POSSIBLE IMPROVEMENT: This metric could be improved if a correlation could be made between the volume of data and the source it came from, whether that be a given host, domain, suffix or top-level domain. |
| | FetcherStatus | hitByThrougputThreshold | A total count of the URLs dropped across all fetch queues due to throughput dropping below the threshold too many times. | This aspect of the Nutch Fetcher configuration is designed to prevent slow fetch queues from stalling the overall fetcher throughput. The primary configuration setting is fetcher.throughput.threshold.pages; a more thorough understanding of Fetcher behaviour relating to (slow) throughput also requires an understanding of the related fetcher.throughput.threshold.* settings. POSSIBLE IMPROVEMENT: It would be advantageous to understand which URLs from which hosts in the queue(s) were resulting in slow throughput. This would facilitate investigation into why this was happening. |
| | FetcherStatus | hitByTimeLimit | | |
| FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
| | FetcherStatus | FetchItem.notCreated.redirect | | |
| | FetcherStatus | outlinks_detected | | |
| | FetcherStatus | outlinks_following | | |
| | FetcherStatus | ProtocolStatus.getName() | | |
| | FetcherStatus | redirect_count_exceeded | | |
| | FetcherStatus | redirect_deduplicated | | |
| | FetcherStatus | robots_denied | | |
| | FetcherStatus | robots_denied_maxcrawldelay | | |
| | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
| Generator | Generator | EXPR_REJECTED | | |
| | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
| | Generator | INTERVAL_REJECTED | | |
| | Generator | MALFORMED_URL | | |
| | Generator | SCHEDULE_REJECTED | | |
| | Generator | SCORE_TOO_LOW | | |
| | Generator | STATUS_REJECTED | | |
| | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
| IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
| | IndexerStatus | deleted (IndexingFilter) | | |
| | IndexerStatus | deleted (gone) | | |
| | IndexerStatus | deleted (redirects) | | |
| | IndexerStatus | deleted (robots=noindex) | | |
| | IndexerStatus | errors (IndexingFilter) | | |
| | IndexerStatus | errors (ScoringFilter) | | |
| | IndexerStatus | indexed (add/update) | | |
| | IndexerStatus | skipped (IndexingFilter) | | |
| | IndexerStatus | skipped (not modified) | | |
| Injector | injector | urls_filtered | | |
| | injector | urls_injected | | |
| | injector | urls_merged | | |
| | injector | urls_purged_404 | | |
| | injector | urls_purged_filter | | |
| ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
| QueueFeeder | FetcherStatus | filtered | | |
| (also QueueFeeder) | FetcherStatus | AboveExceptionThresholdInQueue | | |
| ResolverThread | UpdateHostDb | checked_hosts | | |
| | UpdateHostDb | existing_known_host | | |
| | UpdateHostDb | existing_unknown_host | | |
| | UpdateHostDb | new_known_host | | |
| | UpdateHostDb | new_unknown_host | | |
| | UpdateHostDb | purged_unknown_host | | |
| | UpdateHostDb | rediscovered_host | | |
| | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
| SitemapProcessor | Sitemap | existing_sitemap_entries | | |
| | Sitemap | failed_fetches | | |
| | Sitemap | filtered_records | | |
| | Sitemap | filtered_sitemaps_from_hostname | | |
| | Sitemap | new_sitemap_entries | | |
| | Sitemap | sitemaps_from_hostname | | |
| | Sitemap | sitemap_seeds | | |
| UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
| UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
| (also UpdateHostDbReducer) | UpdateHostDb | skipped_not_eligible | | |
| WebGraph | WebGraph.outlinks | added links | | |
| (also WebGraph) | WebGraph.outlinks | removed links | | |
| WARCExporter | WARCExporter | exception | | |
| | WARCExporter | invalid URI | | |
| | WARCExporter | missing content | | |
| | WARCExporter | missing metadata | | |
| | WARCExporter | omitted empty response | | |
| | WARCExporter | records generated | | |
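As noted in the DomainStatistics rows above, some tools name their counters via a Java enum rather than plain strings, which is why metric names such as MyCounter.FETCHED appear in the table. The sketch below mirrors that pattern but is not copied from the Nutch codebase; the mapper class and its input types are hypothetical. With the getCounter(Enum) overload, the enum's class name is by default used as the counter group and the constant name as the counter name.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HostStatsMapper extends Mapper<Text, Text, Text, LongWritable> {

  // Hypothetical enum mirroring the MyCounter pattern used by DomainStatistics.
  public enum MyCounter { FETCHED, NOT_FETCHED, EMPTY_RESULT }

  @Override
  public void map(Text url, Text status, Context context)
      throws IOException, InterruptedException {
    String value = status.toString().trim();
    if (value.isEmpty()) {
      // Record an empty (probably problematic) entry for this URL.
      context.getCounter(MyCounter.EMPTY_RESULT).increment(1);
      return;
    }
    if ("fetched".equalsIgnoreCase(value)) {
      context.getCounter(MyCounter.FETCHED).increment(1);
    } else {
      context.getCounter(MyCounter.NOT_FETCHED).increment(1);
    }
    context.write(url, new LongWritable(1));
  }
}
```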