...
Problems with Metric Collector host
- Output of "rpm -qa | grep ambari" on the collector host.
- Total available system memory, output of: "free -g"
- Total available disk space and available partitions, output of: "df -h"
- Total number of hosts in the cluster
- Services deployed in the cluster (This is purely to estimate the amount of metric data generated)
- Collector configs: /etc/ams-hbase/conf/hbase-env.sh, /etc/ams-hbase/conf/hbase-site.xml, /etc/ambari-metrics-collector/conf/ams-env.sh, /etc/ambari-metrics-collector/conf/ams-site.xml
- Collector logs: /var/log/ambari-metrics-collector/ambari-metrics-collector.log, /var/log/ambari-metrics-collector/hbase-ams-master-<host>.log, /var/log/ambari-metrics-collector/hbase-ams-master-<host>.out
Note: Additionally, if distributed mode is enabled, /var/log/ambari-metrics-collector/hbase-ams-zookeeper-<host>.log, /var/log/ambari-metrics-collector/hbase-ams-regionserver-<host>.log
- Response to the following URLs:
http://<ams-host>:6188/ws/v1/timeline/metrics/metadata
http://<ams-host>:6188/ws/v1/timeline/metrics/hosts
The response will be JSON and can be attached as a file.
- From AMS HBase Master UI - http://<METRICS_COLLECTOR_HOST>:61310
- Region Count
- StoreFile Count
- JMX Snapshot - http://<METRICS_COLLECTOR_HOST>:61310/jmx
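The collector-host items above can be gathered in one pass. The following is a minimal sketch, not an official Ambari tool: the function names and output file names are illustrative, and it assumes `curl` is installed and that the AMS HBase Master UI runs on the same host as the collector.

```shell
#!/bin/sh
# Sketch: gather the collector-host diagnostics listed above into one directory.
# Ports mirror the URLs in the list; function and file names are hypothetical.

AMS_PORT=6188
HBASE_UI_PORT=61310

# Print the endpoints to snapshot, one per line, as "<file name> <url>".
ams_endpoints() {
    host="$1"
    echo "metadata.json http://${host}:${AMS_PORT}/ws/v1/timeline/metrics/metadata"
    echo "hosts.json http://${host}:${AMS_PORT}/ws/v1/timeline/metrics/hosts"
    echo "hbase-master-jmx.json http://${host}:${HBASE_UI_PORT}/jmx"
}

# Run the host-level commands and fetch each endpoint into the output directory.
collect_ams_diagnostics() {
    host="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    rpm -qa 2>/dev/null | grep ambari > "$outdir/ambari-packages.txt"
    free -g > "$outdir/memory.txt" 2>&1
    df -h > "$outdir/disk.txt" 2>&1
    ams_endpoints "$host" | while read -r name url; do
        # -s silences progress output; --max-time bounds a hung collector.
        curl -s --max-time 30 -o "$outdir/$name" "$url" ||
            echo "failed to fetch $url" >&2
    done
}
```

For example, `collect_ams_diagnostics <ams-host> /tmp/ams-diag` run on the collector host leaves one file per item, ready to attach. Config files and logs still need to be copied from the paths listed above.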
Problems with Metric Monitor host
- Monitor log file: /etc/ambari-metrics-monitor/ambari-metrics-monitor.out
For troubleshooting scale issues, see Configurations - Tuning.
Issue 1: AMS HBase process slow disk writes
...
Behavior | How to detect |
---|---|
High CPU usage | HBase process on the Collector host taking up close to 100% of every core |
HBase Log: Compaction times | grep "Finished memstore flush" hbase-ams-master-<host>.log. This yields MB written in X milliseconds; 128 MBps and above is generally an average speed unless the disk is contended. This search also reveals how many times compaction ran per minute. A value greater than 6 or 8 is a warning that the write volume is far greater than what HBase can hold in memory |
HBase Log: ZK timeout | HBase crashes saying the ZooKeeper session timed out. This happens because in embedded mode the ZK session timeout is limited to a maximum of 30 seconds (HBase issue: fix planned for 2.1.3). The cause is again slow disk reads. |
Collector Log: "waiting for some tasks to finish" | ambari-metrics-collector log shows messages where AsyncProcess writes are queued up |
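
The per-minute count mentioned for the HBase log can be computed directly rather than read off by eye. This is a rough sketch under one assumption: each log line begins with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp, so the first 16 characters identify the minute. The function name is illustrative; adjust the `cut` range if your log layout differs.

```shell
# Count "Finished memstore flush" entries per minute in an AMS HBase master log.
# Assumes every line starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp.
flushes_per_minute() {
    logfile="$1"
    # Keep only the "YYYY-MM-DD HH:MM" prefix of each matching line,
    # group identical minutes, then print "<date> <time> <count>".
    grep "Finished memstore flush" "$logfile" |
        cut -c1-16 |
        sort |
        uniq -c |
        awk '{ print $2, $3, $1 }'
}
```

Running `flushes_per_minute /var/log/ambari-metrics-collector/hbase-ams-master-<host>.log` prints one line per minute; minutes whose count exceeds the 6 to 8 range noted in the table are the ones to look at first.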
...
Resolutions:
Upgrading to 2.1.2+ is highly recommended.
The following is a list of fixes in the 2.1.2 release that should greatly help alleviate slow loading and timeouts:
https://issues.apache.org/jira/browse/AMBARI-12654
...