This page documents common problems discovered with the Ambari Metrics Service (AMS), the symptoms to look out for, and known resolutions.
Important facts to collect from the system:
Problems with Metric Collector host
- Total available system memory, output of "free -g"
- Total available disk space and partitions, output of "df -h"
- Total number of hosts in the cluster
- Services deployed in the cluster (This is purely to estimate the amount of metric data generated)
- Collector configs: /etc/ams-hbase/conf/hbase-env.sh, /etc/ams-hbase/conf/hbase-site.xml, /etc/ambari-metrics-collector/conf/ams-env.sh, /etc/ambari-metrics-collector/conf/ams-site.xml
- Collector logs: /var/log/ambari-metrics-collector/ambari-metrics-collector.log, /var/log/ambari-metrics-collector/hbase-ams-master-<host>.log, /var/log/ambari-metrics-collector/hbase-ams-master-<host>.out
Note: If distributed mode is enabled, also collect /var/log/ambari-metrics-collector/hbase-ams-zookeeper-<host>.log and /var/log/ambari-metrics-collector/hbase-ams-regionserver-<host>.log
Problems with Metric Monitor host
- Monitor log file: /var/log/ambari-metrics-monitor/ambari-metrics-monitor.out
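A minimal collection sketch for gathering the facts listed above in one pass, assuming the default paths shown; the output directory and bundle name are arbitrary, and any file missing on a given deployment is simply skipped:

```bash
#!/usr/bin/env bash
# Diagnostics-collection sketch assuming the default AMS paths above;
# adjust HOST and OUT for your environment.
HOST=$(hostname -f)
OUT=/tmp/ams-diagnostics-$HOST
mkdir -p "$OUT"

free -g > "$OUT/memory.txt"   # total available system memory
df -h   > "$OUT/disks.txt"    # disk space and partitions

# Collector configs and logs (embedded mode); missing files are skipped.
cp /etc/ams-hbase/conf/hbase-env.sh \
   /etc/ams-hbase/conf/hbase-site.xml \
   /etc/ambari-metrics-collector/conf/ams-env.sh \
   /etc/ambari-metrics-collector/conf/ams-site.xml "$OUT/" 2>/dev/null
cp /var/log/ambari-metrics-collector/ambari-metrics-collector.log \
   /var/log/ambari-metrics-collector/hbase-ams-master-"$HOST".log* "$OUT/" 2>/dev/null

# Distributed mode only: zookeeper and regionserver logs.
cp /var/log/ambari-metrics-collector/hbase-ams-zookeeper-"$HOST".log \
   /var/log/ambari-metrics-collector/hbase-ams-regionserver-"$HOST".log "$OUT/" 2>/dev/null

tar czf "$OUT.tar.gz" -C "$(dirname "$OUT")" "$(basename "$OUT")"
echo "Diagnostics bundle: $OUT.tar.gz"
```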
Issue 1: AMS HBase process slow disk writes
The symptoms and resolutions below address the embedded mode of AMS only.
Symptoms:
Behavior | How to detect |
---|---|
High CPU usage | HBase process on Collector host taking up close to 100% of every core |
HBase Log: Flush and compaction times | Run grep "Finished memstore flush" hbase-ams-master-<host>.log. Each match reports the MB written in X milliseconds; roughly 128 MBps and above is an average write speed unless the disk is contended. The same search also shows how many flushes ran per minute; a rate greater than 6 to 8 is a warning that the write volume is far greater than what HBase can hold in memory (see the sketch after this table). |
HBase Log: ZK timeout | HBase crashes reporting that the zookeeper session timed out. This happens because in embedded mode the ZK session timeout is capped at 30 seconds (HBase issue; a fix is planned for 2.1.3). The cause is again slow disk reads. |
Collector Log: "waiting for some tasks to finish" | ambari-metrics-collector log shows messages indicating that AsyncProcess writes are queued up |
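To quantify the flush behavior described above, the log can be grepped directly. A minimal sketch, assuming the default embedded-mode log location and the standard HBase log timestamp format (YYYY-MM-DD HH:MM:SS,mmm); adjust the path if AMS logs elsewhere on your system:

```bash
# Show flush sizes and durations ("Finished memstore flush ... in Yms").
grep "Finished memstore flush" \
  /var/log/ambari-metrics-collector/hbase-ams-master-"$(hostname -f)".log

# Count flushes per minute by truncating the timestamp to HH:MM; a
# sustained rate above 6-8 suggests the write volume exceeds what
# HBase can hold in memory.
grep "Finished memstore flush" \
  /var/log/ambari-metrics-collector/hbase-ams-master-"$(hostname -f)".log \
  | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c
```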
Resolutions:
Configuration Change | Description |
---|---|
ams-hbase-site :: hbase.rootdir | Change this path to a disk mount that is not heavily contended. |
ams-hbase-site :: hbase.tmp.dir | Change this path to a location different from hbase.rootdir |
ams-hbase-env :: hbase_master_heapsize ams-hbase-site :: hbase.hregion.memstore.flush.size | Bump the heap size up so that more data is held in memory, which compensates for slow I/O. If the heap size is increased but resident memory usage does not go up, change the flush size to control how much data can be stored in a memstore per Region. The default is 128 MB; the value is specified in bytes. Be careful when modifying this value: generally keep it between 64 MB (small heap with fast disk writes) and 512 MB (heap larger than 8 GB with average write speed), since holding more data in memory means a longer write to disk during each flush operation. A hedged example of applying these settings follows the table. |
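These properties can be changed through the Ambari UI (Ambari Metrics > Configs) or with the configs.sh helper that ships with the Ambari server. A sketch under assumptions: the Ambari server runs on localhost, the cluster name MyCluster and the admin/admin credentials are placeholders, and the mount points and values shown are examples only:

```bash
# Sketch only: cluster name, credentials, mount points, and values are
# placeholders; verify current values with "get" before changing them.
cd /var/lib/ambari-server/resources/scripts

# Dump the current ams-hbase-site config (flush size defaults to
# 134217728 bytes = 128 MB).
./configs.sh -u admin -p admin get localhost MyCluster ams-hbase-site

# Move the AMS HBase root and tmp dirs to a less contended mount.
./configs.sh -u admin -p admin set localhost MyCluster ams-hbase-site \
  "hbase.rootdir" "file:///fast-disk/ams/hbase"
./configs.sh -u admin -p admin set localhost MyCluster ams-hbase-site \
  "hbase.tmp.dir" "/fast-disk/ams/hbase-tmp"

# Raise the per-region memstore flush size to 256 MB (value in bytes).
./configs.sh -u admin -p admin set localhost MyCluster ams-hbase-site \
  "hbase.hregion.memstore.flush.size" "268435456"
```

The Metrics Collector must be restarted for the changes to take effect.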
Issue 2: Ambari Metrics take a long time to load
Symptoms:
Behavior | How to detect |
---|---|
Graphs: loading time too long; no data available | Check the service pages / host pages for metric graphs |
Socket read timeouts | ambari-server.log shows error messages reporting socket timeouts for metrics (see the grep sketch after this table) |
Ambari UI slowing down | Host page loading time is high and heatmaps do not show data; dashboard loading time is too high; multiple concurrent sessions result in slowness |
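To confirm the socket read timeouts, the server log can be searched directly. A quick check, assuming the default Ambari server log location; the exact message text varies across Ambari versions, so the pattern below is an assumption and may need adjusting:

```bash
# Look for metrics-related timeout errors in the Ambari server log.
grep -i "timeout" /var/log/ambari-server/ambari-server.log | grep -i "metric"
```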
Resolutions:
Upgrading to Ambari 2.1.2 is highly recommended.
The following fixes in the 2.1.2 release should greatly help alleviate the slow loading and timeouts:
https://issues.apache.org/jira/browse/AMBARI-12654
https://issues.apache.org/jira/browse/AMBARI-12983
https://issues.apache.org/jira/browse/AMBARI-13108