3. Monitoring Scenarios

This section describes the following:

Navigation
Alerts
Interval Rules

Navigation

Ambari SCOM

Use the Ambari SCOM main navigation tree to browse cluster, HDFS and MapReduce performance metrics.

Cluster Summary

This scenario checks the health state of clusters. Select a cluster by clicking its Cluster Name to see visualizations of the following:

Cluster Services
Participating Hosts
Live vs. Dead Nodes
Space Utilization

After you select a Cluster Service, Participating Hosts populates automatically.

Cluster Diagram

See a layout of Services and Components across your cluster hosts.

HDFS Service Summary

This scenario checks the health state of HDFS cluster services. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Files Summary metrics
Block Summary metrics
I/O Summary metrics
Capacity Remaining

HDFS NameNode

This scenario checks the health state of the NameNode host component. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Memory Heap Utilization
Thread Status
Garbage Collection Time (ms)
Average RPC Wait Time

MapReduce Service Summary

This scenario checks the health state of MapReduce cluster services. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Jobs Summary
TaskTrackers Summary
Slots Utilization
Maps vs. Reducers

MapReduce JobTracker

This scenario checks the health state of the JobTracker host component. Select a cluster by clicking its Parent Cluster Name to see visualizations of the following:

Memory Heap Utilization
Threads Status
Garbage Collection Time (ms)
Average RPC Wait Time

Alerts

The following Alerts are configured by Ambari SCOM:

Name	Alert Message	Description	Threshold
Capacity Remaining	There is little or no space capacity remaining in HDFS.	Gives warning/critical alert if percentage of available space on all HDFS nodes together is less than upper/lower threshold.	30-Warning 10-Critical
Under-Replicated Blocks	Number of under-replicated blocks in the HDFS is too high.	Gives warning/critical alert if percentage of under-replicated blocks is more than lower/upper threshold.	1-Warning 5-Critical
Corrupted Blocks	There are corrupted file blocks in HDFS.	Gives critical alert if number of corrupted blocks is more than threshold.	1
DataNodes Down	A significant number of DataNodes are down in the cluster.	Gives warning/critical alert if percentage of dead HDFS data nodes in cluster is more than lower/upper threshold.	10-Warning 20-Critical
Failed Jobs	MapReduce jobs are failing too frequently.	Gives warning/critical alert if percentage of map-reduce failed jobs is more than lower/upper threshold.	10-Warning 40-Critical
Invalid TaskTrackers	There are TaskTracker nodes which are in the invalid state.	Gives critical alert if there is at least one blacklisted task-tracker.	1
Memory Heap Usage	JobTracker is working under high memory pressure.	Gives warning/critical alert if percentage of used job-tracker memory heap is more than lower/upper threshold.	80-Warning 90-Critical
Memory Heap Usage	NameNode is working under high memory pressure.	Gives warning/critical alert if percentage of used NameNode memory heap is more than lower/upper threshold.	80-Warning 90-Critical
TaskTrackers Down	A significant number of TaskTrackers are down in the cluster.	Gives warning/critical alert if percentage of map reduce dead task-trackers is more than lower/upper threshold.	10-Warning 20-Critical
TaskTracker Service State	TaskTracker component is not running.	Turns TaskTracker service to warning state if the TaskTracker service is unavailable.	N/A
NameNode Service State	NameNode component is not running.	Gives critical alert if a NameNode service is unavailable.	N/A
Secondary NameNode Service State	Secondary NameNode component is not running.	Gives warning alert if a Secondary NameNode service is unavailable.	N/A
JobTracker Service State	JobTracker component is not running.	Gives critical alert if a JobTracker service is unavailable.	N/A
Oozie Server Service State	Oozie Server component is not running.	Gives critical alert if an Oozie Server service is unavailable.	N/A
Hive Metastore State	Hive Metastore component is not running.	Gives critical alert if a Hive Metastore service is unavailable.	N/A
HiveServer State	HiveServer component is not running.	Gives critical alert if a Hive Server service is unavailable.	N/A
WebHCat Server Service State	WebHCat Server component is not running.	Gives critical alert if a WebHCat Server service is unavailable.	N/A
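
The two-threshold alerts in the table above can be read as a simple classification rule. The following is a minimal sketch of that rule, not the management pack's actual implementation. The names `Health` and `classify` are hypothetical, and the sketch assumes that a value exactly equal to a threshold does not trigger the alert, following the "more than" and "less than" wording in the Description column.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


def classify(value, warning, critical):
    """Map a percentage metric to a health state (hypothetical helper).

    The comparison direction is inferred from the thresholds: when the
    critical threshold is above the warning threshold, higher values are
    worse (for example Memory Heap Usage, 80/90). Otherwise lower values
    are worse (for example Capacity Remaining, 30/10).
    """
    if critical > warning:
        # Higher is worse: alert when the value exceeds a threshold.
        if value > critical:
            return Health.CRITICAL
        if value > warning:
            return Health.WARNING
    else:
        # Lower is worse: alert when the value drops below a threshold.
        if value < critical:
            return Health.CRITICAL
        if value < warning:
            return Health.WARNING
    return Health.HEALTHY


# Capacity Remaining: 25% free space is below the warning threshold of 30.
print(classify(25, warning=30, critical=10).value)
# Memory Heap Usage: 95% heap used is above the critical threshold of 90.
print(classify(95, warning=80, critical=90).value)
```

Single-threshold alerts such as Corrupted Blocks and the Service State alerts do not fit this two-level pattern and are left out of the sketch.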

Viewing

The Cluster Diagram view shows when an alert has been raised on an object in the cluster. The alert is indicated by a marker on the cluster icon.

You can find out more information about any alerts by accessing the Alert View. The Alert View can be accessed from the Tasks panel on the right. Alert View shows all of the alerts for the selected object. You can see details about any alert or edit its monitor by selecting it in the list.

Another way to see all of the alerts for a specific object, or to override the default thresholds and properties, is to access the Health Explorer. You can bring up the Health Explorer by right-clicking any object in the diagram view and selecting it from the menu. The list on the left shows all of the alerts for the selected object.

You can see the Monitor Properties by right-clicking any alert in the list and selecting the corresponding item from the menu. This shows details about the monitor that is associated with the alert and allows you to override the properties and thresholds of the monitor.

You can also see the state changes of an object in the Health Explorer by selecting an alert and picking the State Changes tab on the right. This tab shows the time as well as the “from” and “to” state of any state change for the monitor associated with the selected alert. The tab also shows the state of the object that triggered the state change.

Customizing

By selecting Overrides, you can change the default values of the monitor (Critical Threshold, Warning Threshold, Interval). Check the override box and enter a new value. Then select the destination management pack where the overrides will be stored.
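
Conceptually, an override replaces a monitor's default value only for the properties you check, and every other property keeps its default. The sketch below illustrates that idea only and is not how SCOM stores overrides. The function `effective_settings` and the property keys are hypothetical. The threshold defaults are taken from the Memory Heap Usage row of the alerts table, and the interval value is an assumed placeholder.

```python
# Hypothetical defaults for a Memory Heap Usage monitor. The thresholds come
# from the alerts table; the interval value is an assumption for illustration.
DEFAULTS = {
    "warning_threshold": 80,
    "critical_threshold": 90,
    "interval_secs": 900,
}


def effective_settings(defaults, overrides):
    """Return the defaults with the overridden properties replaced."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Reject properties the monitor does not define.
        raise KeyError(f"unknown monitor properties: {sorted(unknown)}")
    return {**defaults, **overrides}


# Override only the warning threshold; the other properties keep their defaults.
settings = effective_settings(DEFAULTS, {"warning_threshold": 70})
print(settings)
```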

Interval Rules

The following table lists performance rules that have default intervals for alert checks that might require additional tuning to suit your environment. Evaluate these rules to determine whether the default intervals are appropriate for your environment. If a default interval is not appropriate for your environment, you should obtain a baseline for the relevant performance counters, and then adjust the interval by applying an override to them.

Name	Description	Interval (secs)
Collect HDFS Blocks Read	This rule collects number of HDFS blocks read.	900
Collect HDFS Blocks Written	This rule collects number of HDFS blocks written.	900
Collect HDFS Bytes Read	This rule collects number of bytes read from HDFS.	900
Collect HDFS Bytes Written	This rule collects number of bytes written to HDFS.	900
Collect HDFS Capacity Non-DFS Used (GB)	This rule collects HDFS capacity used by non-DFS data (GB).	900
Collect HDFS Capacity Remaining (GB)	This rule collects remaining HDFS capacity (GB).	900
Collect HDFS Capacity Total (GB)	This rule collects total HDFS capacity (GB).	900
Collect HDFS Capacity Used (GB)	This rule collects used HDFS capacity (GB).	900
Collect HDFS Corrupted Blocks	This rule collects number of corrupted HDFS blocks.	900
Collect HDFS Dead DataNodes	This rule collects number of dead DataNodes.	900
Collect HDFS Decommissioned DataNodes	This rule collects number of decommissioned DataNodes.	900
Collect HDFS Files Appended	This rule collects number of HDFS files appended.	900
Collect HDFS Files Created	This rule collects number of HDFS files created.	900
Collect HDFS Files Deleted	This rule collects number of HDFS files deleted.	900
Collect HDFS Live DataNodes	This rule collects number of live DataNodes.	900
Collect HDFS Missing Blocks	This rule collects number of missing HDFS blocks.	900
Collect HDFS Pending Deletion Blocks	This rule collects number of HDFS blocks pending deletion.	900
Collect HDFS Pending Replication Blocks	This rule collects number of HDFS blocks pending replication.	900
Collect HDFS Total Blocks	This rule collects total number of HDFS blocks.	900
Collect HDFS Total Files	This rule collects total number of HDFS files.	900
Collect HDFS Under-Replicated Blocks	This rule collects number of under-replicated HDFS blocks.	900
Collect Live vs Dead DataNodes Widget Data	This rule collects data for the Live vs. Dead Nodes widget.	900
Collect Space Utilization Widget Data	This rule collects data for the Space Utilization widget.	900
Collect JVM Errors Logged	This rule collects number of JVM errors logged for Host Component process.	900
Collect JVM Fatal Errors Logged	This rule collects number of JVM fatal errors logged for Host Component process.	900
Collect JVM Heap Memory Committed	This rule collects amount of heap memory committed to Host Component.	900
Collect JVM Heap Memory Used	This rule collects amount of heap memory used by Host Component.	900
Collect JVM Non Heap Memory Committed	This rule collects amount of non-heap memory committed to Host Component.	900
Collect JVM Non Heap Memory Used	This rule collects amount of non-heap memory used by Host Component.	900
Collect JVM Number of Garbage Collections	This rule collects number of garbage collections performed for Host Component process.	900
Collect JVM Threads Blocked	This rule collects number of blocked threads for Host Component process.	900
Collect JVM Threads New	This rule collects number of new threads for Host Component process.	900
Collect JVM Threads Runnable	This rule collects number of runnable threads for Host Component process.	900
Collect JVM Threads Terminated	This rule collects number of terminated threads for Host Component process.	900
Collect JVM Threads Timed Waiting	This rule collects number of timed waiting threads for Host Component process.	900
Collect JVM Threads Waiting	This rule collects number of waiting threads for Host Component process.	900
Collect JVM Time Spent in Garbage Collection (ms)	This rule collects time spent in garbage collection of Host Component process.	900
Collect MapReduce Dead TaskTrackers	This rule collects number of dead TaskTrackers for cluster.	900
Collect MapReduce Jobs Completed	This rule collects number of completed MapReduce jobs for cluster.	900
Collect MapReduce Jobs Failed	This rule collects number of failed MapReduce jobs for cluster.	900
Collect MapReduce Jobs Failed (%)	This rule collects percent of failed MapReduce jobs in cluster.	900
Collect MapReduce Jobs Killed	This rule collects number of killed MapReduce jobs for cluster.	900
Collect MapReduce Jobs Preparing	This rule collects number of preparing MapReduce jobs for cluster.	900
Collect MapReduce Jobs Running	This rule collects number of running MapReduce jobs for cluster.	900
Collect MapReduce Jobs Submitted	This rule collects number of submitted MapReduce jobs for cluster.	900
Collect MapReduce Live TaskTrackers	This rule collects number of live TaskTrackers for cluster.	900
Collect MapReduce Map Slots Reserved	This rule collects number of reserved map slots for cluster.	900
Collect MapReduce Maps Completed	This rule collects number of completed map tasks for cluster.	900
Collect MapReduce Maps Failed	This rule collects number of failed map tasks for cluster.	900
Collect MapReduce Maps Killed	This rule collects number of killed map tasks for cluster.	900
Collect MapReduce Maps Launched	This rule collects number of launched map tasks for cluster.	900
Collect MapReduce Number of TaskTrackers	This rule collects total number of TaskTrackers in cluster.	900
Collect MapReduce Occupied Map Slots	This rule collects number of occupied map slots for cluster.	900
Collect MapReduce Reduced Slots Occupied	This rule collects number of occupied reduce slots for cluster.	900
Collect MapReduce Reduced Slots Reserved	This rule collects number of reserved reduce slots for cluster.	900
Collect MapReduce Reduces Completed	This rule collects number of completed reduce tasks for cluster.	900
Collect MapReduce Reduces Failed	This rule collects number of failed reduce tasks for cluster.	900
Collect MapReduce Reduces Killed	This rule collects number of killed reduce tasks for cluster.	900
Collect MapReduce Reduces Launched	This rule collects number of launched reduce tasks for cluster.	900
Collect MapReduce Running Map Tasks	This rule collects number of running map tasks for cluster.	900
Collect MapReduce Running Reduce tasks	This rule collects number of running reduce tasks for cluster.	900
Collect MapReduce TaskTrackers Blacklisted	This rule collects number of blacklisted TaskTrackers in cluster.	900
Collect MapReduce TaskTrackers Decommissioned	This rule collects number of decommissioned TaskTrackers in cluster.	900
Collect MapReduce TaskTrackers Graylisted	This rule collects number of graylisted TaskTrackers in cluster.	900
Collect MapReduce Waiting Map Tasks	This rule collects number of waiting map tasks for cluster.	900
Collect MapReduce Waiting Reduce tasks	This rule collects number of waiting reduce tasks for cluster.	900
Collect Network Bytes Received	This rule collects bytes received by Host Component.	900
Collect Network Bytes Sent	This rule collects bytes sent by Host Component.	900
Collect Queue Average Wait Time	This rule collects queue average time (ms) of remote procedure calls to Host Component.	900
Collect RPC Authorization Failures	This rule collects number of failed remote procedure call authorization attempts to Host Component.	900
Collect RPC Processing Average Time	This rule collects average processing time (ms) of remote procedure calls to Host Component.	900
Collect RPC Processing Number of Operations	This rule collects number of processing remote procedure calls to Host Component.	900
Collect RPC Queue Number of Operations	This rule collects number of queued remote procedure calls to Host Component.	900
Collect TaskTracker Map Slots	This rule collects number of available map slots on TaskTracker.	900
Collect TaskTracker Reduce Slots	This rule collects number of available reduce slots on TaskTracker.	900
Collect TaskTracker Running Map Tasks	This rule collects number of running map tasks on TaskTracker.	900
Collect TaskTracker Running Reduce tasks	This rule collects number of running reduce tasks on TaskTracker.	900
Collect TaskTracker Shuffle Exceptions Caught	This rule collects number of caught exceptions for shuffle running on TaskTracker.	900
Collect TaskTracker Shuffle Failed Outputs	This rule collects number of failed outputs for shuffle running on TaskTracker.	900
Collect TaskTracker Shuffle Handler Busy (%)	This rule collects percentage of busy shuffle handlers on TaskTracker.	900
Collect TaskTracker Shuffle Output Bytes	This rule collects number of bytes produced by shuffle running on TaskTracker.	900
Collect TaskTracker Shuffle Success Outputs	This rule collects number of successful outputs for shuffle running on TaskTracker.	900
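
When deciding whether the default interval suits your environment, it can help to translate an interval into the number of samples each rule collects per day. The following is a small arithmetic sketch; the function name `samples_per_day` and the alternative interval of 300 seconds are chosen for illustration only and are not recommendations.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def samples_per_day(interval_secs):
    """Number of complete collection cycles that fit into one day."""
    if interval_secs <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return SECONDS_PER_DAY // interval_secs


# Default interval from the table: 900 seconds (15 minutes).
print(samples_per_day(900))   # 96
# Hypothetical shorter interval: 300 seconds (5 minutes).
print(samples_per_day(300))   # 288
```

Shortening the interval from 900 to 300 seconds triples the number of samples each rule collects per monitored object, which is worth weighing against the baseline you obtained for the relevant counters.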
