3. Monitoring Scenarios

This section describes how to navigate Ambari SCOM, the alerts it configures, how to view alerts, and how to override thresholds.

Navigation

Ambari SCOM

Use the Ambari SCOM main navigation tree to browse cluster, HDFS and MapReduce performance metrics.

Cluster Summary

This scenario checks the health state of a cluster. Select a cluster by clicking its Cluster Name to see visualizations of the following:

Cluster Services
Participating Hosts
Live vs. Dead Nodes
Space Utilization

After you select a Cluster Service, the Participating Hosts list populates automatically.

Cluster Diagram

See a layout of Services and Components across your cluster hosts.

HDFS Service Summary

This scenario checks the health state of the HDFS cluster service. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Files Summary metrics
Block Summary metrics
I/O Summary metrics
Capacity Remaining

HDFS NameNode

This scenario checks the health state of the NameNode host component. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Memory Heap Utilization
Thread Status
Garbage Collection Time (ms)
Average RPC Wait Time

MapReduce Service Summary

This scenario checks the health state of the MapReduce cluster service. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Jobs Summary
TaskTrackers Summary
Slots Utilization
Maps vs. Reducers

MapReduce JobTracker

This scenario checks the health state of the JobTracker host component. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Memory Heap Utilization
Threads Status
Garbage Collection Time (ms)
Average RPC Wait Time

Alerts

The following Alerts are configured by Ambari SCOM:

Name	Alert Message	Description
Capacity Remaining	There is little or no space capacity remaining in HDFS.	Gives warning/critical alert if the percentage of available space on all HDFS nodes together is less than the upper/lower threshold.
Corrupted Blocks	There are corrupted file blocks in HDFS.	Gives critical alert if number of corrupted blocks is more than threshold.
DataNodes Down	A significant number of DataNodes are down in the cluster.	Gives warning/critical alert if percentage of dead HDFS data nodes in cluster is more than lower/upper threshold.
Failed Jobs	MapReduce jobs are failing too frequently.	Gives warning/critical alert if percentage of map-reduce failed jobs is more than lower/upper threshold.
Hive Metastore State	Hive Metastore server is not running.	Gives critical alert if a Hive Metastore service is unavailable.
HiveServer State	HiveServer service is not running.	Gives critical alert if a HiveServer service is unavailable.
Invalid TaskTrackers	There are TaskTracker nodes which are in the invalid state.	Gives warning alert if there is at least one graylisted task-tracker. Gives critical alert if there is at least one blacklisted task-tracker.
JobTracker Service State	JobTracker service is not running.	Gives critical alert if a JobTracker service is unavailable.
Memory Heap Usage	JobTracker is working under high memory pressure.	Gives warning/critical alert if percentage of used JobTracker memory heap is more than lower/upper threshold.
Memory Heap Usage	NameNode is working under high memory pressure.	Gives warning/critical alert if percentage of used NameNode memory heap is more than lower/upper threshold.
NameNode Service State	NameNode service is not running.	Gives critical alert if a NameNode service is unavailable.
Oozie Server Service State	Oozie Server service is not running.	Gives critical alert if an Oozie Server service is unavailable.
Secondary NameNode Service State	Secondary NameNode service is not running.	Gives warning alert if a Secondary NameNode service is unavailable.
TaskTracker Service State		Turns TaskTracker service to warning state if the TaskTracker service is unavailable.
TaskTrackers Down	A significant number of TaskTrackers are down in the cluster.	Gives warning/critical alert if percentage of map reduce dead task-trackers is more than lower/upper threshold.
WebHCat Server Service State	WebHCat Server service is not running.	Gives critical alert if a WebHCat (formerly Templeton) Server service is unavailable.
Under-Replicated Blocks	Number of under-replicated blocks in the HDFS is too high.	Gives warning/critical alert if percentage of under-replicated blocks is more than lower/upper threshold.
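
The threshold rules in the table can be read as a small decision procedure. The following Python sketch is only an illustration of that logic, not the management pack's implementation: the function names, the `Health` enum, and the threshold numbers in the usage note are hypothetical, and the actual default thresholds are not stated here.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    CRITICAL = "critical"


def classify_high(value_pct, lower, upper):
    """Monitors where a larger percentage is worse (e.g. DataNodes Down).

    Warning above the lower threshold, critical above the upper threshold.
    """
    if value_pct > upper:
        return Health.CRITICAL
    if value_pct > lower:
        return Health.WARNING
    return Health.HEALTHY


def classify_low(value_pct, upper, lower):
    """Monitors where a smaller percentage is worse (Capacity Remaining).

    Warning below the upper threshold, critical below the lower threshold.
    """
    if value_pct < lower:
        return Health.CRITICAL
    if value_pct < upper:
        return Health.WARNING
    return Health.HEALTHY


def classify_invalid_tasktrackers(graylisted, blacklisted):
    """Invalid TaskTrackers: any blacklisted node is critical, any graylisted node is a warning."""
    if blacklisted >= 1:
        return Health.CRITICAL
    if graylisted >= 1:
        return Health.WARNING
    return Health.HEALTHY
```

For example, with hypothetical thresholds of 10 and 25 percent, `classify_high(12.0, lower=10.0, upper=25.0)` evaluates to `Health.WARNING`, while `classify_low(5.0, upper=30.0, lower=10.0)` evaluates to `Health.CRITICAL`.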

Viewing Alerts

The Cluster Diagram view shows when an alert has been raised on an object in the cluster. The alert is indicated by a marker on the cluster icon.

You can find out more information about any alerts by accessing the Alert View. The Alert View can be accessed from the Tasks panel on the right. Alert View shows all of the alerts for the selected object. You can see details about any alert or edit its monitor by selecting it in the list.

Another way to see all of the alerts for a specific object, or to override the default thresholds and properties, is to access the Health Explorer. You can bring up the Health Explorer by right-clicking any object in the diagram view and selecting it from the menu. The list on the left shows all of the alerts for the selected object.

You can see the Monitor Properties by right-clicking any alert in the list and selecting Monitor Properties from the menu. This shows details about the monitor that is associated with the alert and allows you to override the properties and thresholds of the monitor.

You can also see the state changes of an object in the Health Explorer by selecting an alert and picking the State Changes tab on the right. This tab shows the time as well as the “from” and “to” state of any state change for the monitor associated with the selected alert. The tab also shows the state of the object that triggered the state change.

Thresholds

By selecting Overrides you can change the default values of the monitor. Check the override box and enter a new value. Then select the destination management pack where the overrides will be stored.
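
Conceptually, an override layers a new value on top of a monitor's default without changing the default itself. The sketch below models that idea in Python purely for illustration; `MonitorSetting`, its fields, and the numeric values are hypothetical and do not correspond to any SCOM or Ambari SCOM API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitorSetting:
    """One overridable monitor parameter (hypothetical model)."""
    name: str
    default: float
    # None models an unchecked override box: the default stays in effect.
    override: Optional[float] = None

    @property
    def effective(self) -> float:
        """Value the monitor would use: the override if set, else the default."""
        return self.default if self.override is None else self.override


# Hypothetical threshold; the real default value is not given in this page.
upper = MonitorSetting(name="UpperThreshold", default=90.0)
before = upper.effective   # default applies
upper.override = 80.0      # "check the override box and enter a new value"
after = upper.effective    # override applies; default is unchanged
```

Keeping the default and the override as separate fields mirrors the fact that overrides are stored in a destination management pack rather than written over the original monitor definition.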
