...

Related Development Work

N/A

Building Metrics on MapReduce

...

Contexts

As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb, ParseSegment) utilize MapContexts and ReduceContexts. These contexts are passed to the Mapper and Reducer initially during setup but are also used throughout each Mapper or Reducer task lifecycle.

Info: Hadoop documentation

The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the involvement of Contexts in each task lifecycle.

This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext. Specifically, the getCounter(...) methods facilitate access to Hadoop Counters, which we discuss below.

Hadoop Counters

A Counter is simply a record comprising a name and a value. As one would expect, Counters can be incremented, for example to count how many records were processed in total by the time a task completes.
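To make the name-and-value idea concrete, the following is a minimal plain-Java sketch. It deliberately does not use the Hadoop classes: SimpleCounter, SimpleCounters, tally, and the "sketch" group with its two counter names are all hypothetical and exist only for illustration. The one thing borrowed from the text above is the call shape getCounter(group, name).increment(n).

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for the counter idea; these are NOT the Hadoop classes. */
public class CounterSketch {

  /** A counter is a name plus a running long value. */
  public static final class SimpleCounter {
    private final String name;
    private long value;

    public SimpleCounter(String name) {
      this.name = name;
    }

    public void increment(long amount) {
      value += amount;
    }

    public long getValue() {
      return value;
    }

    public String getName() {
      return name;
    }
  }

  /** Hypothetical registry: counters are looked up by group, then by name. */
  public static final class SimpleCounters {
    private final Map<String, Map<String, SimpleCounter>> groups = new TreeMap<>();

    /** Returns the counter for (group, name), creating it at zero on first use. */
    public SimpleCounter getCounter(String group, String name) {
      return groups
          .computeIfAbsent(group, g -> new TreeMap<>())
          .computeIfAbsent(name, SimpleCounter::new);
    }
  }

  /** Tallies seed lines: blank lines and '#' comments are skipped, the rest are counted. */
  public static SimpleCounters tally(String[] seedLines) {
    SimpleCounters counters = new SimpleCounters();
    for (String line : seedLines) {
      String url = line.trim();
      if (url.isEmpty() || url.startsWith("#")) {
        counters.getCounter("sketch", "lines_skipped").increment(1);
      } else {
        counters.getCounter("sketch", "urls_seen").increment(1);
      }
    }
    return counters;
  }

  public static void main(String[] args) {
    SimpleCounters counters =
        tally(new String[] {"http://a.example/", "", "# comment", " http://b.example/ "});
    System.out.println("urls_seen=" + counters.getCounter("sketch", "urls_seen").getValue());
    System.out.println(
        "lines_skipped=" + counters.getCounter("sketch", "lines_skipped").getValue());
  }
}
```

In a real job the framework owns the counter registry and the Context hands counters out, so task code only ever needs the lookup-and-increment step.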

The following example shows how Counters are used within the Nutch Injector to count the total number of URLs filtered during the Map phase of this job.

...

Use of Counters in the Nutch Injector (Java)
    @Override
    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        // if it's a URL from the seed list
        String url = key.toString().trim();

        // remove empty string or string starting with '#'
        if (url.length() == 0 || url.startsWith("#"))
          return;

        url = filterNormalize(url);
        if (url == null) {
          context.getCounter("injector", "urls_filtered").increment(1);



Metrics Table

Conclusion