...
Hadoop slave nodes regularly emit metrics that reflect service health. The service team uses these metrics to check whether a service is currently healthy and to trace back its historical behavior. The following typical use cases were collected from the HadoopService Team, the Kylin Team, and Titan Cluster Users:
- Early warning for unhealthy HBase RegionServers, based on heap usage, RPC handling metrics, region aliveness, etc.
- Troubleshooting through the metrics history dashboard
- When NameNode RPC traffic from clients is very high, identify the source client and also grep the user from the audit log
- Users can flexibly set a threshold for each monitored metric and receive alert notifications without rewriting policies or creating them from scratch
- Notification about HDFS clients that generate abnormal RPC traffic
- Extract the list of DNs/RSs with abnormal RPC processing time
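The threshold and "abnormal DN/RS list" use cases can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Sample` type, the function names, the host names, and the threshold values are hypothetical, and only the metric names come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    host: str      # node that emitted the metric, e.g. a RegionServer
    metric: str    # metric name
    value: float   # latest reading


def find_violations(samples, thresholds):
    """Return the samples whose value exceeds the user-set threshold.

    Metrics without a configured threshold are ignored.
    """
    return [s for s in samples
            if s.metric in thresholds and s.value > thresholds[s.metric]]


def abnormal_hosts(samples, thresholds):
    """Sorted, de-duplicated list of hosts with at least one violation."""
    return sorted({s.host for s in find_violations(samples, thresholds)})


RPC_P99 = "hadoop.hbase.ipcregionserver.ipc.processcalltime_99th_percentile"
HEAP_USED = "hadoop.memory.heapmemoryusage.used"

# Hypothetical per-metric thresholds; the numbers are illustrative only.
THRESHOLDS = {
    RPC_P99: 500.0,
    HEAP_USED: 8 * 1024 ** 3,
}

# Hypothetical readings from three RegionServers.
SAMPLES = [
    Sample("rs-01", RPC_P99, 120.0),
    Sample("rs-02", RPC_P99, 910.0),
    Sample("rs-03", HEAP_USED, 9 * 1024 ** 3),
    Sample("rs-03", "hadoop.hbase.regionserver.server.regioncount", 42),
]

if __name__ == "__main__":
    print(abnormal_hosts(SAMPLES, THRESHOLDS))
```

Keeping thresholds in a plain mapping keyed by metric name means a user changes a limit by editing one value rather than rewriting a policy; a real alerting engine would likely also need time windows and de-duplication of repeated alerts.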
High-Level Monitoring/Alert Flow
[Gliffy diagram: High-Level Monitoring/Alert Flow]
Metrics Collector
The operations team has long struggled with metrics monitoring for HBase clusters, e.g. RegionServer heap usage, RegionServer RPC handling metrics, and region aliveness within each RegionServer. We therefore need a way to collect all of these metrics. One option is to deploy a standalone JMX client on each node; another is to add a JMX sink to Hadoop's metrics system.
...
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
Memory | java.lang:type=Memory | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
Java Direct Memory | java.nio:type=BufferPool,name=direct | MemoryUsed | Java Direct Memory Used | hadoop.bufferpool.direct.memoryused |
JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcCount | | hadoop.hbase.jvm.gccount |
JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcTimeMillis | | hadoop.hbase.jvm.gctimemillis |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | queueSize | | hadoop.hbase.ipcregionserver.ipc.queuesize |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | NumCallsInGeneralQueue | | hadoop.hbase.ipcregionserver.ipc.numcallsingeneralqueue |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | NumActiveHandler | | hadoop.hbase.ipcregionserver.ipc.numactivehandler |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | QueueCallTime_99th_percentile | IPC Queue Time (99th) | hadoop.hbase.ipcregionserver.ipc.queuecalltime_99th_percentile |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | ProcessCallTime_99th_percentile | IPC Process Time (99th) | hadoop.hbase.ipcregionserver.ipc.processcalltime_99th_percentile |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | QueueCallTime_num_ops | | hadoop.hbase.ipcregionserver.ipc.queuecalltime_num_ops |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | ProcessCallTime_num_ops | | hadoop.hbase.ipcregionserver.ipc.processcalltime_num_ops |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | regionCount | | hadoop.hbase.regionserver.server.regioncount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | storeCount | | hadoop.hbase.regionserver.server.storecount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | memStoreSize | | hadoop.hbase.regionserver.server.memstoresize |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | storeFileSize | | hadoop.hbase.regionserver.server.storefilesize |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | totalRequestCount | | hadoop.hbase.regionserver.server.totalrequestcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | ReadRequestCount | | hadoop.hbase.regionserver.server.readrequestcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | WriteRequestCount | | hadoop.hbase.regionserver.server.writerequestcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | splitQueueLength | | hadoop.hbase.regionserver.server.splitqueuelength |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | compactionQueueLength | | hadoop.hbase.regionserver.server.compactionqueuelength |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | flushQueueLength | | hadoop.hbase.regionserver.server.flushqueuelength |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | blockCacheSize | | hadoop.hbase.regionserver.server.blockcachesize |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | blockCacheHitCount | | hadoop.hbase.regionserver.server.blockcachehitcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | blockCacheCountHitPercent | | hadoop.hbase.regionserver.server.blockcachecounthitpercent |
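As a rough illustration of how a collector might turn JMX bean attributes into the metric names listed above, here is a minimal sketch. It assumes the beans arrive as a JSON document shaped like `{"beans": [{"name": ..., <attribute>: ...}]}`, similar to what a daemon's `/jmx` HTTP endpoint typically returns; that shape, the `METRIC_MAP` and `extract_metrics` names, and the sample values are assumptions for illustration, not part of this design.

```python
import json

# (bean name, attribute, optional sub-key) -> metric name.
# The entries are copied from a few rows of the table; the structure is hypothetical.
METRIC_MAP = {
    ("java.lang:type=Memory", "HeapMemoryUsage", "used"):
        "hadoop.memory.heapmemoryusage.used",
    ("java.lang:type=Memory", "NonHeapMemoryUsage", "used"):
        "hadoop.memory.nonheapmemoryusage.used",
    ("Hadoop:service=HBase,name=JvmMetrics", "GcCount", None):
        "hadoop.hbase.jvm.gccount",
    ("Hadoop:service=HBase,name=RegionServer,sub=Server", "regionCount", None):
        "hadoop.hbase.regionserver.server.regioncount",
}


def extract_metrics(payload):
    """Map bean attributes in an assumed JMX JSON payload to metric names.

    Beans or attributes missing from the payload are skipped silently.
    """
    beans = {bean["name"]: bean for bean in payload.get("beans", [])}
    metrics = {}
    for (bean_name, attr, sub_key), metric_name in METRIC_MAP.items():
        bean = beans.get(bean_name)
        if bean is None or attr not in bean:
            continue
        value = bean[attr]
        if sub_key is not None:
            # Composite attributes such as HeapMemoryUsage carry several fields.
            if not isinstance(value, dict) or sub_key not in value:
                continue
            value = value[sub_key]
        metrics[metric_name] = value
    return metrics


# Made-up sample payload; the numbers are not real measurements.
SAMPLE_JSON = """
{"beans": [
  {"name": "java.lang:type=Memory",
   "HeapMemoryUsage": {"init": 1024, "used": 2048, "committed": 4096, "max": 8192}},
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "regionCount": 37}
]}
"""

if __name__ == "__main__":
    for name, value in sorted(extract_metrics(json.loads(SAMPLE_JSON)).items()):
        print(name, value)
```

An explicit lookup table is used rather than deriving the metric name from the bean name, because the names in the table do not follow a single mechanical rule.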
Data Retention
Metrics should be collected at 1-minute granularity or finer (Hadoop emits metrics every 10 seconds). Data older than 30 days is aggregated to 5-minute granularity, and data is kept for half a year.
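The retention rule can be sketched as follows. This is only an illustration under stated assumptions: the aggregation function is taken to be the mean (the requirement does not say which aggregate to use), "half a year" is approximated as 182 days, and `apply_retention` is a hypothetical name.

```python
from collections import defaultdict

DAY = 24 * 60 * 60
RAW_WINDOW = 30 * DAY    # keep full resolution for the most recent 30 days
RETENTION = 182 * DAY    # "half a year", approximated here as 182 days
BUCKET = 5 * 60          # 5-minute aggregation level, in seconds


def apply_retention(points, now):
    """Apply the retention rule to (epoch_seconds, value) pairs.

    - Points older than RETENTION are dropped.
    - Points older than RAW_WINDOW are averaged into 5-minute buckets,
      each stamped with the start of its bucket.
    - Newer points are kept unchanged.
    Returns a list sorted by timestamp.
    """
    kept = []
    buckets = defaultdict(list)
    for ts, value in points:
        age = now - ts
        if age > RETENTION:
            continue
        if age > RAW_WINDOW:
            buckets[ts - ts % BUCKET].append(value)
        else:
            kept.append((ts, value))
    for start, values in buckets.items():
        kept.append((start, sum(values) / len(values)))
    return sorted(kept)
```

For example, three readings taken within the same 5-minute window 40 days ago collapse into a single averaged point, while a reading from 10 days ago is returned unchanged.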
Monitoring Dashboard & Alerting
Metrics Dashboard Overview
Dashboard Chart
In general, we will follow the UI layout used in Ambari. Within that layout, the service health check application will also be included in the service status and summary information.
Metrics Query Pattern:
- Flexibly change the time range from 1 hour to