Comment: Added missing thresholds to alerts table.

...

Name

Alert Message

Description

Threshold

Capacity Remaining

There is little or no space capacity remaining in HDFS.

Gives warning/critical alert if the percentage of available space across all HDFS nodes combined is less than the upper/lower threshold.

30-Warning
10-Critical

Under-Replicated Blocks

Number of under-replicated blocks in the HDFS is too high.

Gives warning/critical alert if percentage of under-replicated blocks is more than lower/upper threshold.

1-Warning
5-Critical

Corrupted Blocks

There are corrupted file blocks in HDFS.

Gives critical alert if number of corrupted blocks is more than threshold.

1

DataNodes Down

A significant number of DataNodes are down in the cluster.

Gives warning/critical alert if the percentage of dead HDFS DataNodes in the cluster is more than the lower/upper threshold.

10-Warning
20-Critical

Failed Jobs

MapReduce jobs are failing too frequently.

Gives warning/critical alert if the percentage of failed MapReduce jobs is more than the lower/upper threshold.

10-Warning
40-Critical

Hive Metastore State

Hive Metastore server is not running.

Gives critical alert if a Hive Metastore service is unavailable.

HiveServer State

HiveServer service is not running.

Gives critical alert if a HiveServer service is unavailable.

Invalid TaskTrackers

There are TaskTracker nodes which are in the invalid state.

Gives warning alert if there is at least one graylisted TaskTracker. Gives critical alert if there is at least one blacklisted TaskTracker.

1

JobTracker Service State

JobTracker service is not running.

Gives critical alert if a JobTracker service is unavailable.

Memory Heap Usage

JobTracker is working under high memory pressure.

Gives warning/critical alert if the percentage of used JobTracker memory heap is more than the lower/upper threshold.

80-Warning
90-Critical

Memory Heap Usage

NameNode is working under high memory pressure.

Gives warning/critical alert if percentage of used NameNode memory heap is more than lower/upper threshold.

80-Warning
90-Critical

NameNode Service State

NameNode service is not running.

Gives critical alert if a NameNode service is unavailable.

Oozie Server Service State

Oozie Server service is not running.

Gives critical alert if an Oozie Server service is unavailable.

Secondary NameNode Service State

Secondary NameNode service is not running.

Gives warning alert if a Secondary NameNode service is unavailable.

TaskTracker Service State

 

Turns TaskTracker service to warning state if the TaskTracker service is unavailable.

TaskTrackers Down

A significant number of TaskTrackers are down in the cluster.

Gives warning/critical alert if the percentage of dead MapReduce TaskTrackers is more than the lower/upper threshold.

10-Warning
20-Critical

WebHCat Server Service State

WebHCat Server service is not running.

Gives critical alert if a WebHCat (formerly Templeton) Server service is unavailable.
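
The two-level thresholds in the table can be read as a simple classification rule. The following is a minimal sketch of that rule, not the monitoring system's actual implementation: the names `Severity` and `classify` are hypothetical, and the strict comparisons are an assumption based on the "more than" / "less than" wording of the descriptions. When the critical threshold is below the warning threshold (as for Capacity Remaining), lower values are treated as worse.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value, warning, critical):
    """Classify a metric value against a warning and a critical threshold.

    Hypothetical helper. If critical >= warning, higher values are worse
    (e.g. DataNodes Down: 10-Warning, 20-Critical). If critical < warning,
    lower values are worse (e.g. Capacity Remaining: 30-Warning, 10-Critical).
    """
    if critical >= warning:
        # Assumed "more than" semantics: alert only when the value exceeds the threshold.
        if value > critical:
            return Severity.CRITICAL
        if value > warning:
            return Severity.WARNING
    else:
        # Assumed "less than" semantics: alert only when the value drops below the threshold.
        if value < critical:
            return Severity.CRITICAL
        if value < warning:
            return Severity.WARNING
    return Severity.OK
```

For example, under these assumptions `classify(15, warning=10, critical=20)` yields a warning for DataNodes Down, while `classify(8, warning=30, critical=10)` yields a critical alert for Capacity Remaining.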

...