
Note from Lewis McGibbney

This page is under construction.

Introduction

This page provides a narrative on Nutch application metrics. It details which metrics are captured within which Nutch Jobs and Tasks.

Metrics are important because they convey vital information about any given Nutch (and, by extension, MapReduce) process. They provide accurate measurements of how the process is functioning and a basis for suggesting improvements.

Metrics provide a data-driven mechanism for intelligence gathering within Nutch operations and administration.

Audience

This page is intended for

  • users who wish to learn how Nutch Jobs and Tasks are performing, and
  • developers who wish to further extend/customize Nutch metrics.

Related Development Work

N/A

Building Metrics on MapReduce Contexts

As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb, ParseSegment) utilize MapContexts and ReduceContexts. These Contexts are passed to each Mapper and Reducer during setup, and are also used throughout each Mapper or Reducer task's lifecycle.

Hadoop documentation

The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the role of Contexts in each task's lifecycle.

This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext. Specifically, the getCounter(...) methods provide access to Hadoop Counters, which we discuss below.

Hadoop Counters

A Counter is simply a record comprising a name and a value. As one would expect, a Counter can be incremented in order to count, for example, how many records in total were processed by a task.
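Conceptually, a counter is a (group, name) key mapped to a long value that only ever grows. The following self-contained sketch (a hypothetical SimpleCounters class, not Hadoop's actual implementation) illustrates that idea:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the counter concept; Hadoop's Counter/Counters
// classes are richer, but the core idea is a named, incrementable tally.
class SimpleCounters {
  private final Map<String, Long> values = new HashMap<>();

  // Increment the counter identified by (group, name) by the given amount.
  public void increment(String group, String name, long amount) {
    values.merge(group + "." + name, amount, Long::sum);
  }

  // Read the current value; counters never incremented read as 0.
  public long get(String group, String name) {
    return values.getOrDefault(group + "." + name, 0L);
  }
}
```

In Hadoop itself, context.getCounter("injector", "urls_filtered").increment(1) plays the role of the increment method above.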

The following example shows how Counters are used within the Nutch Injector to count the total number of URLs filtered out during the map phase of the job.

Use of Counters in the Nutch Injector
    @Override
    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        // the key is a URL from the seed list
        String url = key.toString().trim();

        // skip empty lines and comment lines starting with '#'
        if (url.length() == 0 || url.startsWith("#"))
          return;

        url = filterNormalize(url);
        if (url == null) {
          context.getCounter("injector", "urls_filtered").increment(1);
        }
        // ... (remainder of the map method elided)
      }
    }
The call to context.getCounter("injector", "urls_filtered").increment(1) increments the urls_filtered counter of the injector counter group by 1.

The end result is that we generate useful, insightful metrics for each mapper and reducer task for any given Nutch Job.
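To make that end result concrete: each map or reduce task increments its own counters locally, and the framework sums them into job-level totals. The sketch below (illustrative only, not Hadoop's code) shows that aggregation step, with each map standing in for one task's counters keyed by "group.name":

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: job-level counter totals are the per-task
// counter values summed together.
class CounterAggregation {
  static Map<String, Long> aggregate(List<Map<String, Long>> perTask) {
    Map<String, Long> jobTotals = new HashMap<>();
    for (Map<String, Long> taskCounters : perTask) {
      // Sum this task's contribution into the running job totals.
      taskCounters.forEach((key, value) -> jobTotals.merge(key, value, Long::sum));
    }
    return jobTotals;
  }
}
```

So if one map task filtered 3 URLs and another filtered 2, the job-level injector.urls_filtered counter reads 5.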

See below for details on each Nutch metric available.

Metrics Table

The table below provides a canonical, comprehensive collection of Nutch metrics.

Table Ordering Logic

The table is arranged

  1. by the Tool/Object column, alphabetically;
  2. then by Metric Group, alphabetically within each tool;
  3. then by Metric Name, alphabetically within each metric group.


Tool/Object | Metric Group | Metric Name | Description
----------- | ------------ | ----------- | -----------
CleaningJob | CleaningJobStatus | Deleted documents |
CrawlDbFilter | CrawlDB filter | Gone records removed |
CrawlDbFilter | CrawlDB filter | Orphan records removed |
CrawlDbFilter | CrawlDB filter | URLs filtered |
CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) |
DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate |
DomainStatistics | | MyCounter.EMPTY_RESULT |
DomainStatistics | | MyCounter.FETCHED |
DomainStatistics | | MyCounter.NOT_FETCHED |
Fetcher | FetcherStatus | bytes_downloaded |
Fetcher | FetcherStatus | hitByThrougputThreshold |
Fetcher | FetcherStatus | hitByTimeLimit |
FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue |
FetcherThread | FetcherStatus | FetchItem.notCreated.redirect |
FetcherThread | FetcherStatus | outlinks_detected |
FetcherThread | FetcherStatus | outlinks_following |
FetcherThread | FetcherStatus | ProtocolStatus.getName() |
FetcherThread | FetcherStatus | redirect_count_exceeded |
FetcherThread | FetcherStatus | redirect_deduplicated |
FetcherThread | FetcherStatus | robots_denied |
FetcherThread | FetcherStatus | robots_denied_maxcrawldelay |
FetcherThread | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] |
Generator | Generator | EXPR_REJECTED |
Generator | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW |
Generator | Generator | INTERVAL_REJECTED |
Generator | Generator | MALFORMED_URL |
Generator | Generator | SCHEDULE_REJECTED |
Generator | Generator | SCORE_TOO_LOW |
Generator | Generator | STATUS_REJECTED |
Generator | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW |
IndexerMapReduce | IndexerStatus | deleted (duplicates) |
IndexerMapReduce | IndexerStatus | deleted (IndexingFilter) |
IndexerMapReduce | IndexerStatus | deleted (gone) |
IndexerMapReduce | IndexerStatus | deleted (redirects) |
IndexerMapReduce | IndexerStatus | deleted (robots=noindex) |
IndexerMapReduce | IndexerStatus | errors (IndexingFilter) |
IndexerMapReduce | IndexerStatus | errors (ScoringFilter) |
IndexerMapReduce | IndexerStatus | indexed (add/update) |
IndexerMapReduce | IndexerStatus | skipped (IndexingFilter) |
IndexerMapReduce | IndexerStatus | skipped (not modified) |
Injector | injector | urls_filtered |
Injector | injector | urls_injected |
Injector | injector | urls_merged |
Injector | injector | urls_purged_404 |
Injector | injector | urls_purged_filter |
ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] |
QueueFeeder | FetcherStatus | filtered |
QueueFeeder | FetcherStatus | AboveExceptionThresholdInQueue |
ResolverThread | UpdateHostDb | checked_hosts |
ResolverThread | UpdateHostDb | existing_known_host |
ResolverThread | UpdateHostDb | existing_unknown_host |
ResolverThread | UpdateHostDb | new_known_host |
ResolverThread | UpdateHostDb | new_unknown_host |
ResolverThread | UpdateHostDb | purged_unknown_host |
ResolverThread | UpdateHostDb | rediscovered_host |
ResolverThread | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" |
SitemapProcessor | Sitemap | existing_sitemap_entries |
SitemapProcessor | Sitemap | failed_fetches |
SitemapProcessor | Sitemap | filtered_records |
SitemapProcessor | Sitemap | filtered_sitemaps_from_hostname |
SitemapProcessor | Sitemap | new_sitemap_entries |
SitemapProcessor | Sitemap | sitemaps_from_hostname |
SitemapProcessor | Sitemap | sitemap_seeds |
UpdateHostDbMapper | UpdateHostDb | filtered_records |
UpdateHostDbReducer | UpdateHostDb | total_hosts |
UpdateHostDbReducer | UpdateHostDb | skipped_not_eligible |
WebGraph | WebGraph.outlinks | added links |
WebGraph | WebGraph.outlinks | removed links |
WARCExporter | WARCExporter | exception |
WARCExporter | WARCExporter | invalid URI |
WARCExporter | WARCExporter | missing content |
WARCExporter | WARCExporter | missing metadata |
WARCExporter | WARCExporter | omitted empty response |
WARCExporter | WARCExporter | records generated |
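Several of the rows above (for example the MyCounter.* entries under DomainStatistics, and the Generator group) come from enum-based counters rather than free-form group/name strings. The sketch below (using a hypothetical GeneratorCounter enum, not the actual Nutch type) illustrates how an enum constant can supply the counter name, with the declaring class serving as the group; the exact group string Hadoop derives from the enum class is an implementation detail and is simplified here:

```java
// Hypothetical enum standing in for Nutch's enum-based counters
// (e.g. Generator's STATUS_REJECTED or DomainStatistics' MyCounter.FETCHED).
enum GeneratorCounter {
  MALFORMED_URL,
  SCORE_TOO_LOW,
  STATUS_REJECTED
}

class EnumCounterNames {
  // The counter name is taken from the enum constant itself ...
  static String counterName(Enum<?> constant) {
    return constant.name();
  }

  // ... and the group can be derived from the declaring class
  // (simplified: Hadoop's actual group string may differ).
  static String counterGroup(Enum<?> constant) {
    return constant.getDeclaringClass().getSimpleName();
  }
}
```

This is why such metrics appear in the table as code expressions rather than fixed strings: the name is whatever the enum constant (or runtime expression) evaluates to.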

Conclusion
