...
Alerts
The following Alerts are configured by Ambari SCOM:
Name | Alert Message | Description | Threshold |
---|---|---|---|
Capacity Remaining | There is little or no space capacity remaining in HDFS. | Gives warning/critical alert if percentage of available space on all HDFS nodes together is less than upper/lower threshold. | 30-Warning |
Under-Replicated Blocks | Number of under-replicated blocks in the HDFS is too high. | Gives warning/critical alert if percentage of under-replicated blocks is more than lower/upper threshold. | 1-Warning |
Corrupted Blocks | There are corrupted file blocks in HDFS. | Gives critical alert if number of corrupted blocks is more than threshold. | 1 |
DataNodes Down | A significant number of DataNodes are down in the cluster. | Gives warning/critical alert if percentage of dead HDFS data nodes in cluster is more than lower/upper threshold. | 10-Warning |
Failed Jobs | MapReduce jobs are failing too frequently. | Gives warning/critical alert if percentage of map-reduce failed jobs is more than lower/upper threshold. | 10-Warning |
Invalid TaskTrackers | There are TaskTracker nodes which are in the invalid state. | Gives warning alert if there is at least one graylisted task-tracker. Gives critical alert if there is at least one blacklisted task-tracker. | 1 |
Memory Heap Usage | JobTracker is working under high memory pressure. | Gives warning/critical alert if percentage of used job-tracker memory heap is more than lower/upper threshold. | 80-Warning |
Memory Heap Usage | NameNode is working under high memory pressure. | Gives warning/critical alert if percentage of used NameNode memory heap is more than lower/upper threshold. | 80-Warning |
TaskTrackers Down | A significant number of TaskTrackers are down in the cluster. | Gives warning/critical alert if percentage of map reduce dead task-trackers is more than lower/upper threshold. | 10-Warning |
TaskTracker Service State | TaskTracker component is not running. | Turns TaskTracker service to warning state if the TaskTracker service is unavailable. | N/A |
NameNode Service State | NameNode component is not running. | Gives critical alert if a NameNode service is unavailable. | N/A |
Secondary NameNode Service State | Secondary NameNode component is not running. | Gives warning alert if a Secondary NameNode service is unavailable. | N/A |
JobTracker Service State | JobTracker component is not running. | Gives critical alert if a JobTracker service is unavailable. | N/A |
Oozie Server Service State | Oozie Server component is not running. | Gives critical alert if an Oozie Server service is unavailable. | N/A |
Hive Metastore State | Hive Metastore component is not running. | Gives critical alert if a Hive Metastore service is unavailable. | N/A |
HiveServer State | HiveServer component is not running. | Gives critical alert if a Hive Server service is unavailable. | N/A |
WebHCat Server Service State | WebHCat Server component is not running. | Gives critical alert if a WebHCat Server service is unavailable. | N/A |
...
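The warning/critical logic described in the alerts table can be illustrated with a minimal sketch. It is not taken from the management pack itself: the `evaluate` function and the `Severity` type are hypothetical names, and the critical values used in the examples (20 and 10) are assumptions, since the table above only shows the warning values. The sketch assumes that a metric such as DataNodes Down alerts when it rises above its thresholds, while Capacity Remaining alerts when it falls below them.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def evaluate(value: float, warning: float, critical: float) -> Severity:
    """Classify a metric against a warning/critical threshold pair.

    Hypothetical helper: if critical >= warning, the alert fires when the
    value rises above a threshold (for example, percentage of dead
    DataNodes). Otherwise it fires when the value falls below a threshold
    (for example, percentage of remaining capacity).
    """
    if critical >= warning:
        # "More than lower/upper threshold" style alerts.
        if value > critical:
            return Severity.CRITICAL
        if value > warning:
            return Severity.WARNING
    else:
        # "Less than upper/lower threshold" style alerts.
        if value < critical:
            return Severity.CRITICAL
        if value < warning:
            return Severity.WARNING
    return Severity.OK


# 12% of DataNodes are dead; warning at 10 (from the table), critical at 20 (assumed).
print(evaluate(12, warning=10, critical=20))

# 25% capacity remaining; warning at 30 (from the table), critical at 10 (assumed).
print(evaluate(25, warning=30, critical=10))
```

Passing the thresholds as a pair and inferring the direction from their order keeps one function usable for both kinds of alert. A real monitor would more likely configure the direction explicitly.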
Viewing
...
Customizing
...
The following table lists performance rules whose default intervals for alert checks might require additional tuning to suit your environment. Evaluate each rule to determine whether its default interval is appropriate. If it is not, obtain a baseline for the relevant performance counters, and then adjust the interval by applying an override to the rule.
Name | Description | Default Interval (secs) |
---|---|---|
Collect HDFS Blocks Read | This rule collects number of HDFS blocks read. | 900 |
Collect HDFS Blocks Written | This rule collects number of HDFS blocks written. | 900 |
Collect HDFS Bytes Read | This rule collects number of HDFS bytes read. | 900 |
Collect HDFS Bytes Written | This rule collects number of HDFS bytes written. | 900 |
Collect HDFS Capacity Non-DFS Used (GB) | This rule collects HDFS capacity used by non-DFS data (GB). | 900 |
Collect HDFS Capacity Remaining (GB) | This rule collects remaining HDFS capacity (GB). | 900 |
Collect HDFS Capacity Total (GB) | This rule collects total HDFS capacity (GB). | 900 |
Collect HDFS Capacity Used (GB) | This rule collects used HDFS capacity (GB). | 900 |
Collect HDFS Corrupted Blocks | This rule collects number of corrupted HDFS blocks. | 900 |
Collect HDFS Dead DataNodes | This rule collects number of dead DataNodes. | 900 |
Collect HDFS Decommissioned DataNodes | This rule collects number of decommissioned DataNodes. | 900 |
Collect HDFS Files Appended | This rule collects number of HDFS files appended. | 900 |
Collect HDFS Files Created | This rule collects number of HDFS files created. | 900 |
Collect HDFS Files Deleted | This rule collects number of HDFS files deleted. | 900 |
Collect HDFS Live DataNodes | This rule collects number of live DataNodes. | 900 |
Collect HDFS Missing Blocks | This rule collects number of missing HDFS blocks. | 900 |
Collect HDFS Pending Deletion Blocks | This rule collects number of HDFS blocks pending deletion. | 900 |
Collect HDFS Pending Replication Blocks | This rule collects number of HDFS blocks pending replication. | 900 |
Collect HDFS Total Blocks | This rule collects total number of HDFS blocks. | 900 |
Collect HDFS Total Files | This rule collects total number of HDFS files. | 900 |
Collect HDFS Under-Replicated Blocks | This rule collects number of under-replicated HDFS blocks. | 900 |
Collect Live vs Dead DataNodes Widget Data | This rule collects live vs. dead DataNodes data for the widget. | 900 |
Collect Space Utilization Widget Data | This rule collects space utilization data for the widget. | 900 |
Collect JVM Errors Logged | This rule collects number of JVM errors logged. | 900 |
Collect JVM Fatal Errors Logged | This rule collects number of JVM fatal errors logged. | 900 |
Collect JVM Heap Memory Committed | This rule collects amount of heap memory committed to Host Component. | 900 |
Collect JVM Heap Memory Used | This rule collects amount of heap memory used by Host Component. | 900 |
Collect JVM Non Heap Memory Committed | This rule collects amount of non-heap memory committed to Host Component. | 900 |
Collect JVM Non Heap Memory Used | This rule collects amount of non-heap memory used by Host Component. | 900 |
Collect JVM Number of Garbage Collections | This rule collects number of garbage collections performed for Host Component process. | 900 |
Collect JVM Threads Blocked | This rule collects number of blocked threads for Host Component process. | 900 |
Collect JVM Threads New | This rule collects number of new threads for Host Component process. | 900 |
Collect JVM Threads Runnable | This rule collects number of runnable threads for Host Component process. | 900 |
Collect JVM Threads Terminated | This rule collects number of terminated threads for Host Component process. | 900 |
Collect JVM Threads Timed Waiting | This rule collects number of timed waiting threads for Host Component process. | 900 |
Collect JVM Threads Waiting | This rule collects number of waiting threads for Host Component process. | 900 |
Collect JVM Time Spent in Garbage Collection (ms) | This rule collects time spent in garbage collection of Host Component process. | 900 |
Collect MapReduce Dead TaskTrackers | This rule collects number of dead TaskTrackers for cluster. | 900 |
Collect MapReduce Jobs Completed | This rule collects number of completed MapReduce jobs for cluster. | 900 |
Collect MapReduce Jobs Failed | This rule collects number of failed MapReduce jobs for cluster. | 900 |
Collect MapReduce Jobs Failed (%) | This rule collects percent of failed MapReduce jobs in cluster. | 900 |
Collect MapReduce Jobs Killed | This rule collects number of killed MapReduce jobs for cluster. | 900 |
Collect MapReduce Jobs Preparing | This rule collects number of preparing MapReduce jobs for cluster. | 900 |
Collect MapReduce Jobs Running | This rule collects number of running MapReduce jobs for cluster. | 900 |
Collect MapReduce Jobs Submitted | This rule collects number of submitted MapReduce jobs for cluster. | 900 |
Collect MapReduce Live TaskTrackers | This rule collects number of live TaskTrackers for cluster. | 900 |
Collect MapReduce Map Slots Reserved | This rule collects number of reserved map slots for cluster. | 900 |
Collect MapReduce Maps Completed | This rule collects number of completed map tasks for cluster. | 900 |
Collect MapReduce Maps Failed | This rule collects number of failed map tasks for cluster. | 900 |
Collect MapReduce Maps Killed | This rule collects number of killed map tasks for cluster. | 900 |
Collect MapReduce Maps Launched | This rule collects number of launched map tasks for cluster. | 900 |
Collect MapReduce Number of TaskTrackers | This rule collects total number of TaskTrackers in cluster. | 900 |
Collect MapReduce Occupied Map Slots | This rule collects number of occupied map slots for cluster. | 900 |
Collect MapReduce Reduced Slots Occupied | This rule collects number of occupied reduce slots for cluster. | 900 |
Collect MapReduce Reduced Slots Reserved | This rule collects number of reserved reduce slots for cluster. | 900 |
Collect MapReduce Reduces Completed | This rule collects number of completed reduce tasks for cluster. | 900 |
Collect MapReduce Reduces Failed | This rule collects number of failed reduce tasks for cluster. | 900 |
Collect MapReduce Reduces Killed | This rule collects number of killed reduce tasks for cluster. | 900 |
Collect MapReduce Reduces Launched | This rule collects number of launched reduce tasks for cluster. | 900 |
Collect MapReduce Running Map Tasks | This rule collects number of running map tasks for cluster. | 900 |
Collect MapReduce Running Reduce tasks | This rule collects number of running reduce tasks for cluster. | 900 |
Collect MapReduce TaskTrackers Blacklisted | This rule collects number of blacklisted TaskTrackers in cluster. | 900 |
Collect MapReduce TaskTrackers Decommissioned | This rule collects number of decommissioned TaskTrackers in cluster. | 900 |
Collect MapReduce TaskTrackers Graylisted | This rule collects number of graylisted TaskTrackers in cluster. | 900 |
Collect MapReduce Waiting Map Tasks | This rule collects number of waiting map tasks for cluster. | 900 |
Collect MapReduce Waiting Reduce tasks | This rule collects number of waiting reduce tasks for cluster. | 900 |
Collect Network Bytes Received | This rule collects bytes received by Host Component. | 900 |
Collect Network Bytes Sent | This rule collects bytes sent by Host Component. | 900 |
Collect Queue Average Wait Time | This rule collects queue average time (ms) of remote procedure calls to Host Component. | 900 |
Collect RPC Authorization Failures | This rule collects number of failed remote procedure call authorization attempts to Host Component. | 900 |
Collect RPC Processing Average Time | This rule collects average processing time (ms) of remote procedure calls to Host Component. | 900 |
Collect RPC Processing Number of Operations | This rule collects number of processing remote procedure calls to Host Component. | 900 |
Collect RPC Queue Number of Operations | This rule collects number of queued remote procedure calls to Host Component. | 900 |
Collect TaskTracker Map Slots | This rule collects number of available map slots on TaskTracker. | 900 |
Collect TaskTracker Reduce Slots | This rule collects number of available reduce slots on TaskTracker. | 900 |
Collect TaskTracker Running Map Tasks | This rule collects number of running map tasks on TaskTracker. | 900 |
Collect TaskTracker Running Reduce tasks | This rule collects number of running reduce tasks on TaskTracker. | 900 |
Collect TaskTracker Shuffle Exceptions Caught | This rule collects number of caught exceptions for shuffle running on TaskTracker. | 900 |
Collect TaskTracker Shuffle Failed Outputs | This rule collects number of failed outputs for shuffle running on TaskTracker. | 900 |
Collect TaskTracker Shuffle Handler Busy (%) | This rule collects percentage of busy shuffle handlers on TaskTracker. | 900 |
Collect TaskTracker Shuffle Output Bytes | This rule collects number of bytes produced by shuffle running on TaskTracker. | 900 |
Collect TaskTracker Shuffle Success Outputs | This rule collects number of successful outputs for shuffle running on TaskTracker. | 900 |
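When evaluating whether the default 900-second interval suits an environment, it can help to estimate how many samples a given interval produces. The following is a rough sketch only: `samples_per_day` and `daily_sample_load` are hypothetical helper names, and the rule and target counts in the example are illustrative assumptions, not values from the management pack.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_secs: int) -> float:
    """Number of samples one rule collects per day for one monitored target."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    return SECONDS_PER_DAY / interval_secs


def daily_sample_load(interval_secs: int, rule_count: int, target_count: int) -> float:
    """Total samples per day, assuming every rule runs against every target."""
    return samples_per_day(interval_secs) * rule_count * target_count


# Default interval from the table above: one sample every 900 seconds.
print(samples_per_day(900))

# Assumed example: 10 rules applied to 5 monitored targets.
print(daily_sample_load(900, rule_count=10, target_count=5))

# Halving the interval doubles the number of samples collected.
print(samples_per_day(450))
```

Comparing the estimate for the default interval with the estimate for a candidate override gives a quick sense of the extra collection load before the override is applied.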