Under construction

Note on   from Lewis McGibbney 

This page is under construction.

Introduction

This page provides a narrative on Nutch application metrics. It details which metrics are captured for which Nutch Jobs within which Tasks.

Metrics are important because they provide vital information about any given Nutch (and, by extension, MapReduce) process. They give accurate measurements of how the process is functioning and provide a basis for suggesting improvements.

Metrics provide a data-driven mechanism for intelligence gathering within Nutch operations and administration.

Audience

The page is intended for

  • users who wish to learn about how Nutch Jobs and Tasks are performing, and
  • developers who wish to further extend or customize Nutch metrics.

Related Development Work

N/A

Building Metrics on MapReduce Task Context

As Nutch is a native MapReduce application, the Mapper and Reducer functions of each NutchTool implementation (i.e. CommonCrawlDataDumper, CrawlDb, DeduplicationJob, Fetcher, Generator, IndexingJob, Injector, LinkDb, ParseSegment) utilize MapContext and ReduceContext instances.

This is relevant because these contexts inherit certain methods from the interface org.apache.hadoop.mapreduce.TaskAttemptContext, specifically implementations of  are the entry point for . These contexts are passed to the Mapper and Reducer initially during setup and are also used throughout the task lifecycle.
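As a rough illustration of how a task context can serve as the entry point for metrics, the sketch below models counter access in plain Java. It is not Nutch code and does not depend on Hadoop: the Counter and TaskContext interfaces are hypothetical stand-ins that mirror only the getCounter(String groupName, String counterName), increment(long) and getValue() methods found on Hadoop's TaskAttemptContext and Counter. The group name "sketch" and the counter names are likewise made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for org.apache.hadoop.mapreduce.Counter (two methods only).
interface Counter {
    void increment(long incr);
    long getValue();
}

// Hypothetical stand-in for the counter-related part of TaskAttemptContext.
interface TaskContext {
    Counter getCounter(String groupName, String counterName);
}

// Simple in-memory context so the sketch runs without a Hadoop cluster.
class InMemoryContext implements TaskContext {
    private final Map<String, long[]> counters = new TreeMap<>();

    @Override
    public Counter getCounter(String groupName, String counterName) {
        // One mutable cell per (group, name) pair, created on first access.
        final long[] cell =
            counters.computeIfAbsent(groupName + "." + counterName, k -> new long[1]);
        return new Counter() {
            @Override public void increment(long incr) { cell[0] += incr; }
            @Override public long getValue() { return cell[0]; }
        };
    }
}

class CounterSketch {
    // Mimics the body of a map() call: each record bumps a counter via the context.
    static void map(String url, TaskContext context) {
        String name = url.startsWith("http") ? "urls_accepted" : "urls_rejected";
        context.getCounter("sketch", name).increment(1);
    }

    public static void main(String[] args) {
        TaskContext context = new InMemoryContext();
        for (String url : new String[] {"https://example.org/", "ftp://example.org/"}) {
            map(url, context);
        }
        System.out.println(context.getCounter("sketch", "urls_accepted").getValue());
        System.out.println(context.getCounter("sketch", "urls_rejected").getValue());
    }
}
```

In a real Mapper or Reducer the context is supplied by the framework, so the task code would only contain the equivalent of the map() body above.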

Hadoop documentation

The canonical Hadoop documentation for Mapper and Reducer provides much more detail about the involvement of contexts in each task lifecycle.

For


Metrics Table

Conclusion
