Comment: Migrated to Confluence 5.3

...

Cluster Summary

This scenario checks the health state of Clusters. The user selects a Cluster by clicking its Cluster Name and then sees an intuitive visualization of:

  • Cluster Services
  • Participating Hosts
  • Live vs. Dead Nodes
  • Space Utilization

After the user selects a Cluster Service, the Participating Hosts list populates automatically.

...

HDFS Service Summary

This scenario checks the health state of HDFS Cluster Services. The user selects the Cluster by clicking the Parent Cluster Name and then sees an intuitive visualization of:

  • Files Summary metrics
  • Block Summary metrics
  • I/O Summary metrics
  • Capacity Remaining

HDFS NameNode

This scenario checks the health state of the NameNode Host Component. The user selects the Cluster by clicking the Parent Cluster Name and then sees an intuitive visualization of:

  • Memory Heap Utilization
  • Thread Status
  • Garbage Collection Time (ms)
  • Average RPC Wait Time

MapReduce Service Summary

This scenario checks the health state of MapReduce Cluster Services. The user selects the Cluster by clicking the Parent Cluster Name and then sees an intuitive visualization of:

  • Jobs Summary
  • TaskTrackers Summary
  • Slots Utilization
  • Maps vs. Reducers

MapReduce JobTracker

This scenario checks the health state of the JobTracker Host Component. The user selects the Cluster by clicking the Parent Cluster Name and then sees an intuitive visualization of:

  • Memory Heap Utilization
  • Threads Status
  • Garbage Collection Time (ms)
  • Average RPC Wait Time

Alerts

The following Alerts are configured by Ambari SCOM:

| Name | Alert Message | Description | Threshold |
|------|---------------|-------------|-----------|
| Capacity Remaining | There is little or no space capacity remaining in HDFS. | Gives a warning/critical alert if the percentage of available space on all HDFS nodes together is less than the upper/lower threshold. | Warning: 30%, Critical: 10% |
| Under-Replicated Blocks | Number of under-replicated blocks in the HDFS is too high. | Gives a warning/critical alert if the percentage of under-replicated blocks is more than the lower/upper threshold. | Warning: 1%, Critical: 5% |
| Corrupted Blocks | There are corrupted file blocks in HDFS. | Gives a critical alert if the number of corrupted blocks is more than the threshold. | 1 |
| DataNodes Down | A significant number of DataNodes are down in the cluster. | Gives a warning/critical alert if the percentage of dead HDFS DataNodes in the cluster is more than the lower/upper threshold. | Warning: 10%, Critical: 20% |
| Failed Jobs | MapReduce jobs are failing too frequently. | Gives a warning/critical alert if the percentage of failed MapReduce jobs is more than the lower/upper threshold. | Warning: 10%, Critical: 40% |
| Invalid TaskTrackers | There are TaskTracker nodes which are in the invalid state. | Gives a critical alert if there is at least one blacklisted TaskTracker. | 1 |
| Memory Heap Usage | JobTracker is working under high memory pressure. | Gives a warning/critical alert if the percentage of used JobTracker memory heap is more than the lower/upper threshold. | Warning: 80%, Critical: 90% |
| Memory Heap Usage | NameNode is working under high memory pressure. | Gives a warning/critical alert if the percentage of used NameNode memory heap is more than the lower/upper threshold. | Warning: 80%, Critical: 90% |
| TaskTrackers Down | A significant number of TaskTrackers are down in the cluster. | Gives a warning/critical alert if the percentage of dead MapReduce TaskTrackers is more than the lower/upper threshold. | Warning: 10%, Critical: 20% |
| TaskTracker Service State | TaskTracker component is not running. | Turns the TaskTracker service to a warning state if the TaskTracker service is unavailable. | N/A |
| NameNode Service State | NameNode component is not running. | Gives a critical alert if a NameNode service is unavailable. | N/A |
| Secondary NameNode Service State | Secondary NameNode component is not running. | Gives a warning alert if a Secondary NameNode service is unavailable. | N/A |
| JobTracker Service State | JobTracker component is not running. | Gives a critical alert if a JobTracker service is unavailable. | N/A |
| Oozie Server Service State | Oozie Server component is not running. | Gives a critical alert if an Oozie Server service is unavailable. | N/A |
| Hive Metastore State | Hive Metastore component is not running. | Gives a critical alert if a Hive Metastore service is unavailable. | N/A |
| HiveServer State | HiveServer component is not running. | Gives a critical alert if a HiveServer service is unavailable. | N/A |
| WebHCat Server Service State | WebHCat Server component is not running. | Gives a critical alert if a WebHCat Server service is unavailable. | N/A |
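
The two-level thresholds listed above run in two directions: Capacity Remaining alerts when the value falls below its thresholds, while the other percentage alerts fire when the value rises above them. The following Python sketch illustrates one way to read such a threshold pair. It is not the Ambari SCOM implementation; the names `Severity` and `classify` are hypothetical, and inferring the direction from the order of the two thresholds is an assumption made for this example.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical alert levels used only in this sketch."""
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value, warning, critical):
    """Classify a metric value against a warning/critical threshold pair.

    Assumption: the direction is inferred from the pair itself. When the
    critical threshold is above the warning threshold (e.g. 80/90 for heap
    usage), higher values are worse. When it is below (e.g. 30/10 for
    capacity remaining), lower values are worse.
    """
    if critical >= warning:
        # Higher is worse: alert when the value is MORE than a threshold.
        if value > critical:
            return Severity.CRITICAL
        if value > warning:
            return Severity.WARNING
        return Severity.OK
    # Lower is worse: alert when the value is LESS than a threshold.
    if value < critical:
        return Severity.CRITICAL
    if value < warning:
        return Severity.WARNING
    return Severity.OK


# Capacity Remaining (30-Warning, 10-Critical): 25% free space left.
capacity_state = classify(25, warning=30, critical=10)

# Memory Heap Usage (80-Warning, 90-Critical): 95% of heap in use.
heap_state = classify(95, warning=80, critical=90)
```

With the values shown, `capacity_state` is `Severity.WARNING` and `heap_state` is `Severity.CRITICAL`. The count-based alerts (Corrupted Blocks, Invalid TaskTrackers) and the service-state alerts have a single level and are not covered by this sketch.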

Viewing

...