Related Development Work

N/A

Building Metrics on MapReduce

...

Contexts

As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb, ParseSegment) utilize MapContexts and ReduceContexts. These contexts are passed to the Mapper and Reducer initially during setup, but are also used throughout each Mapper or Reducer task lifecycle.

Info: Hadoop documentation

The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the involvement of Contexts in each task lifecycle.

This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext. Specifically, the getCounter(...) methods provide access to Hadoop Counters, which we discuss below.

Hadoop Counters

A Counter is simply a record comprising a name and a value. As one would expect, Counters can be incremented, for example to count how many records were processed in total by the time a task completes.
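To make the name-and-value idea concrete, here is a minimal sketch in plain Java. It does not use Hadoop's classes: SimpleCounter and SimpleCounters are hypothetical stand-ins that only mimic the getCounter(group, name).increment(n) call shape, assuming a counter is created at zero the first time it is requested.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a counter: a name paired with a long value.
final class SimpleCounter {
  private final String name;
  private long value;

  SimpleCounter(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }

  long getValue() {
    return value;
  }

  void increment(long amount) {
    value += amount;
  }
}

// Hypothetical stand-in for the part of a task context that hands out counters.
final class SimpleCounters {
  // group name -> (counter name -> counter)
  private final Map<String, Map<String, SimpleCounter>> groups = new HashMap<>();

  // Returns the counter for (group, name), creating it at zero on first use.
  SimpleCounter getCounter(String group, String name) {
    return groups
        .computeIfAbsent(group, g -> new HashMap<>())
        .computeIfAbsent(name, n -> new SimpleCounter(n));
  }
}

final class CounterSketch {
  public static void main(String[] args) {
    SimpleCounters context = new SimpleCounters();
    // Two increments against the same (group, name) pair accumulate.
    context.getCounter("injector", "urls_filtered").increment(1);
    context.getCounter("injector", "urls_filtered").increment(1);
    System.out.println(context.getCounter("injector", "urls_filtered").getValue()); // prints 2
  }
}
```

Because repeated lookups of the same group and name return the same object, the sketch accumulates increments in the way the prose describes. Aggregation across tasks, which the real framework handles, is outside its scope.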

The following example shows how Counter's are used within the Nutch Injector to count the total number of URLs filtered during the Map phase of this job.

...

Use of Counters in the Nutch Injector (Java)

    @Override
    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        // if it's a URL from the seed list
        String url = key.toString().trim();

        // skip empty lines and comment lines starting with '#'
        if (url.length() == 0 || url.startsWith("#"))
          return;

        // a null result means the URL was rejected
        url = filterNormalize(url);
        if (url == null) {
          // count each seed URL filtered out during the Map phase
          context.getCounter("injector", "urls_filtered").increment(1);
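The counting logic in the excerpt above can be tried outside Hadoop with a small, self-contained sketch. Everything here is an assumption made for illustration: SeedFilterSketch is a hypothetical class, its filterNormalize is a toy rule (accept only http and https URLs) rather than Nutch's actual filters and normalizers, the counters are held in a plain map keyed by "group.name", and the urls_accepted counter name is invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SeedFilterSketch {

  // Toy stand-in for filterNormalize: returns null when the URL is rejected.
  static String filterNormalize(String url) {
    String lower = url.toLowerCase();
    boolean accepted = lower.startsWith("http://") || lower.startsWith("https://");
    return accepted ? url : null;
  }

  // Applies the same checks as the map() excerpt to each seed line and
  // tallies the outcomes in counters keyed by "group.name".
  static Map<String, Long> countSeeds(List<String> seedLines) {
    Map<String, Long> counters = new HashMap<>();
    for (String line : seedLines) {
      String url = line.trim();
      // Empty lines and '#' comments are skipped without being counted.
      if (url.isEmpty() || url.startsWith("#")) {
        continue;
      }
      url = filterNormalize(url);
      if (url == null) {
        counters.merge("injector.urls_filtered", 1L, Long::sum);
      } else {
        // Hypothetical counter name, used only in this sketch.
        counters.merge("injector.urls_accepted", 1L, Long::sum);
      }
    }
    return counters;
  }

  public static void main(String[] args) {
    List<String> seeds = Arrays.asList(
        "# seed list",
        "",
        "https://nutch.apache.org/",
        "ftp://example.org/file",
        "not a url");
    System.out.println(countSeeds(seeds).get("injector.urls_filtered")); // prints 2
  }
}
```

In this sketch the comment line and the empty line never reach the filter, so only the two rejected entries contribute to urls_filtered, mirroring the early return in the excerpt.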


