...
High-Level Monitoring/Alert Flow
[Gliffy diagram: high-level monitoring/alert flow]
Metrics Collector
The operations team consistently struggles to monitor metrics for the HBase cluster, e.g. RegionServer heap usage, RegionServer RPC handling metrics, and region availability on each RegionServer. We therefore need a way to collect all of these metrics. One option is to deploy a standalone JMX client on each node; another is to add a JMX sink to Hadoop's metrics system.
...
Bean Category | Bean Name | Property | Description | Metric Name |
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
Memory | java.lang:type=Memory | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
Java Direct Memory | java.nio:type=BufferPool,name=direct | MemoryUsed | Java Direct Memory Used | hadoop.bufferpool.direct.memoryused |
JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcCount | | hadoop.hbase.jvm.gccount |
JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcTimeMillis | | hadoop.hbase.jvm.gctimemillis |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | queueSize | | hadoop.hbase.ipcregionserver.ipc.queuesize |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | NumCallsInGeneralQueue | | hadoop.hbase.ipcregionserver.ipc.numcallsingeneralqueue |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | NumActiveHandler | | hadoop.hbase.ipcregionserver.ipc.numactivehandler |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | QueueCallTime_99th_percentile | IPC Queue Time (99th) | hadoop.hbase.ipcregionserver.ipc.queuecalltime_99th_percentile |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | ProcessCallTime_99th_percentile | IPC Process Time (99th) | hadoop.hbase.ipcregionserver.ipc.processcalltime_99th_percentile |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | QueueCallTime_num_ops | | hadoop.hbase.ipcregionserver.ipc.queuecalltime_num_ops |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | ProcessCallTime_num_ops | | hadoop.hbase.ipcregionserver.ipc.processcalltime_num_ops |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | regionCount | | hadoop.hbase.regionserver.server.regioncount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | storeCount | | hadoop.hbase.regionserver.server.storecount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | memStoreSize | | hadoop.hbase.regionserver.server.memstoresize |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | storeFileSize | | hadoop.hbase.regionserver.server.storefilesize |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | totalRequestCount | | hadoop.hbase.regionserver.server.totalrequestcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | ReadRequestCount | | hadoop.hbase.regionserver.server.readrequestcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | WriteRequestCount | | hadoop.hbase.regionserver.server.writerequestcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | splitQueueLength | | hadoop.hbase.regionserver.server.splitqueuelength |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | compactionQueueLength | | hadoop.hbase.regionserver.server.compactionqueuelength |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | flushQueueLength | | hadoop.hbase.regionserver.server.flushqueuelength |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | blockCacheSize | | hadoop.hbase.regionserver.server.blockcachesize |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | blockCacheHitCount | | hadoop.hbase.regionserver.server.blockcachehitcount |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | blockCacheCountHitPercent | | hadoop.hbase.regionserver.server.blockcachecounthitpercent |
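The bean-to-metric mapping in the table can be sketched as a small lookup step. This is only an illustration, not the collector's actual implementation: it assumes the bean attributes have already been fetched (by whichever JMX option is chosen) into a nested dictionary, and the names `METRIC_MAP`, `lookup`, and `extract_metrics` are hypothetical. The dotted path `HeapMemoryUsage.used` is an assumed notation for the "HeapMemoryUsage - used" property in the table, and only a few rows are included.

```python
from typing import Any, Dict, Tuple

# (bean name, property path) -> metric name.
# A hand-picked subset of the table; the dotted path is an assumed notation
# for composite attributes such as "HeapMemoryUsage - used".
METRIC_MAP: Dict[Tuple[str, str], str] = {
    ("java.lang:type=Memory", "HeapMemoryUsage.used"):
        "hadoop.memory.heapmemoryusage.used",
    ("java.lang:type=Memory", "NonHeapMemoryUsage.used"):
        "hadoop.memory.nonheapmemoryusage.used",
    ("Hadoop:service=HBase,name=JvmMetrics", "GcCount"):
        "hadoop.hbase.jvm.gccount",
    ("Hadoop:service=HBase,name=RegionServer,sub=Server", "regionCount"):
        "hadoop.hbase.regionserver.server.regioncount",
}


def lookup(attributes: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path into nested attributes; return None if absent."""
    value: Any = attributes
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def extract_metrics(beans: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Return {metric name: value} for every mapped property found in beans."""
    metrics: Dict[str, float] = {}
    for (bean_name, path), metric_name in METRIC_MAP.items():
        value = lookup(beans.get(bean_name, {}), path)
        # Skip missing or non-numeric attributes instead of failing the poll.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            metrics[metric_name] = float(value)
    return metrics
```

Keeping the mapping as data rather than code means new rows from the table can be added without touching the extraction logic.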
Data Retention
Metrics should be collected at least once per minute (Hadoop emits metrics at a 10-second interval). Aggregate data older than 30 days to 5-minute granularity and keep it for half a year.
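The retention rule can be illustrated with a minimal sketch. Several details here are assumptions rather than part of the design: "half a year" is taken as 182 days, the 5-minute aggregate is a plain average (the aggregation function is not specified above), and `apply_retention` is a hypothetical helper operating on in-memory `(timestamp, value)` pairs rather than on a real metrics store.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

DAY = 24 * 60 * 60
RAW_RETENTION = 30 * DAY      # raw points are kept for 30 days
TOTAL_RETENTION = 182 * DAY   # assumption: "half a year" taken as 182 days
BUCKET = 5 * 60               # 5-minute aggregation bucket, in seconds

Point = Tuple[int, float]     # (epoch seconds, value)


def apply_retention(points: List[Point], now: int) -> List[Point]:
    """Drop expired points and roll points older than 30 days into 5-minute averages."""
    recent: List[Point] = []
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        age = now - ts
        if age > TOTAL_RETENTION:
            continue  # older than the total retention window: discard
        if age > RAW_RETENTION:
            # Align to the start of the 5-minute bucket containing ts.
            buckets[ts - ts % BUCKET].append(value)
        else:
            recent.append((ts, value))
    # Assumption: the aggregate is the mean of the points in each bucket.
    rolled = [(start, sum(vals) / len(vals)) for start, vals in buckets.items()]
    return sorted(rolled + recent)
```

For example, five 1-minute points that fall in the same 5-minute window and are older than 30 days collapse into a single averaged point stamped with the window start, while points from the last 30 days pass through unchanged.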
Monitoring Dashboard & Alerting
Metrics Dashboard Overview
Dashboard Chart
In general, we will follow the Ambari UI layout. Within that layout, the service health check application will also be included in the service status and summary information.
Metrics Query Pattern:
- Flexibly change the time range from 1 hour to