...
Related Development Work
N/A
Building Metrics on MapReduce
...
Contexts
As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb and ParseSegment utilize MapContexts and ReduceContexts. These contexts are passed to the Mapper and Reducer initially during setup and are then used throughout each Mapper or Reducer task lifecycle.
Info: The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the involvement of Contexts in each task lifecycle.
This is relevant because these Contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext. Specifically, the getCounter(...) methods facilitate access to Hadoop Counters, which we discuss below.
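As a rough illustration of this lifecycle, the following self-contained sketch uses hypothetical stand-in classes (Context, UrlMapper and their methods are simplifications, not the real org.apache.hadoop.mapreduce API): the framework hands one context instance to setup(), the same instance to every map() call, and finally to cleanup(), so counters incremented anywhere in the task accumulate in one place.

```java
import java.util.HashMap;
import java.util.Map;

public class LifecycleSketch {

  /** Hypothetical stand-in for a task context carrying named counters. */
  static class Context {
    private final Map<String, Long> counters = new HashMap<>();

    void incrementCounter(String group, String name, long amount) {
      counters.merge(group + ":" + name, amount, Long::sum);
    }

    long getCounterValue(String group, String name) {
      return counters.getOrDefault(group + ":" + name, 0L);
    }
  }

  /** Hypothetical stand-in mapper: one context flows through setup, map and cleanup. */
  static class UrlMapper {
    void setup(Context context) { /* one-time initialization, context first seen here */ }

    void map(String url, Context context) {
      // count records that would be filtered out, mirroring the Injector example below
      if (url.isEmpty() || url.startsWith("#")) {
        context.incrementCounter("injector", "urls_filtered", 1);
      }
    }

    void cleanup(Context context) { /* one-time teardown, same context instance */ }
  }

  public static void main(String[] args) {
    Context context = new Context();
    UrlMapper mapper = new UrlMapper();
    mapper.setup(context);
    for (String url : new String[] { "http://example.org/", "#comment", "" }) {
      mapper.map(url, context);
    }
    mapper.cleanup(context);
    System.out.println(context.getCounterValue("injector", "urls_filtered")); // prints 2
  }
}
```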
Hadoop Counters
A Counter is simply a record comprising a name and a value. As one would expect, Counters can be incremented in order to count, for example, how many records in total were processed within a task.
The following example shows how Counters are used within the Nutch Injector to count the total number of URLs filtered out during the map phase of this job.
...
```java
@Override
public void map(Text key, Writable value, Context context)
    throws IOException, InterruptedException {
  if (value instanceof Text) {
    // if its a url from the seed list
    String url = key.toString().trim();
    // remove empty string or string starting with '#'
    if (url.length() == 0 || url.startsWith("#"))
      return;
    url = filterNormalize(url);
    if (url == null) {
      context.getCounter("injector", "urls_filtered").increment(1);
    }
    // ...
  }
}
```