...

For each check below: what to get, how to get it, and how to identify the red flag.
1. Is AMS able to handle raw writes*?

How to get: Look for log lines like 'AsyncProcess:1597 - #1, waiting for 13948 actions to finish' in the collector log.

Red flag: If the number of actions to finish keeps increasing and AMS eventually shuts down, it could mean AMS is not able to handle raw writes.
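To see whether these waits are piling up, a grep over the collector log surfaces the most recent occurrences (the path below is the default collector log location; adjust it if your log directory differs):

grep "waiting for .* actions to finish" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | tail -20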

2. How long does it take for the 2-minute cluster aggregator to finish?

How to get: grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | less. Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'.

Red flag: > 2 minutes aggregation time.
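To eyeball cycle durations without paging through the whole log, the two phrases above can be pulled out together; the same pattern works for TimelineHostAggregatorMinute in item 3. This assumes the aggregator name appears on the same log lines, as the grep above implies:

grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | grep -E "Start aggregation cycle|Saving .* metric aggregates"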
3. How long does it take for the 5-minute host aggregator to finish?

How to get: grep "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | less. Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'.

Red flag: > 5 minutes aggregation time.
4. How many metrics are being collected?

How to get:
  • curl http://<ams-host>:6188/ws/v1/timeline/metrics/metadata -o /tmp/metrics_metadata.txt
  • The number of metrics is the output of the command 'grep -o "metricname" /tmp/metrics_metadata.txt | wc -l'.
  • Also find out which component is sending a lot of metrics (see the sketch after this item).

Red flag: > 15000 metrics.
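If jq is available, the same metadata dump can give a per-component breakdown to spot the biggest senders. This is a sketch that assumes the endpoint returns a JSON object keyed by component (appid) with a list of metric entries per key, which may differ across AMS versions:

jq -r 'to_entries[] | "\(.value | length) \(.key)"' /tmp/metrics_metadata.txt | sort -rn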
5. What is the number of regions and store files in AMS HBase?

How to get: From the AMS HBase Master UI: http://<METRICS_COLLECTOR_HOST>:61310

Red flag: > 150 regions or > 2000 store files.
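If the UI is hard to reach from a browser, the Master UI's /jmx servlet can sometimes provide the same counters. Treat this as a best-effort check: whether regionCount / storeFileCount show up there depends on the mode (in embedded mode the RegionServer runs in the same JVM as the Master), so verify against the UI itself.

curl -s http://<METRICS_COLLECTOR_HOST>:61310/jmx | grep -iE '"(regionCount|storeFileCount)"'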

6. How fast is AMS HBase flushing, and how much data is being flushed?

How to get: Check the HBase Master log in embedded mode and the RegionServer log in distributed mode.

grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log | less

Check how often METRIC_RECORD flushes happen and how much data is flushed each time.

Red flag: > 2-3 flushes every second; > 10 flushes in a minute could be a problem. The flush size should be approximately equal to the flush size configured in ams-hbase-site.
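To turn the flush lines into a per-minute count (the number the red flag refers to), an awk over the log timestamps helps. This is a sketch that assumes the standard 'YYYY-MM-DD HH:MM:SS,mmm' prefix on each HBase log line; replace the <> placeholder with the actual log file name:

grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log | grep "METRIC_RECORD" | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c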

7. If AMS is in distributed mode, is there a local DataNode?

How to get: From the cluster (check whether a DataNode is co-located on the Metrics Collector host).

In distributed mode, a local DataNode helps with HBase's short-circuit read feature (http://hbase.apache.org/0.94/book/perf.hdfs.html).
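A quick sanity check on the Metrics Collector host is to look for a running DataNode process (the bracketed pattern keeps grep from matching itself):

ps -ef | grep -i "[d]atanode"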



* A raw write is a batch of metric data points received from services and written to HBase through Phoenix; no reads or aggregation are involved.


Fixing / Recovering from the problem.

...

For each underlying problem below: what it could cause, and the fix / workaround.
Underlying problem: Too many metrics (#4 from above).

What it could cause: ALL of the problems mentioned above.

Fix / Workaround:

#1 : Trying out config changes

  • First, try increasing the memory of the Metrics Collector and of the HBase Master / RegionServer, depending on the mode. (Refer to the memory configurations table at the top of the page.)
  • Configure AMS to read more data in a single Phoenix fetch.
    • Set ams-site: timeline.metrics.service.resultset.fetchSize = 5000 (for < 100 nodes) or 10000 (> 100 nodes) (one way to apply this from the command line is sketched after this list)
  • Increase the HBase RegionServer handler count.
    • Set ams-hbase-site: hbase.regionserver.handler.count = 30
  • If Hive is sending a lot of metrics, do not aggregate Hive table metrics.
    • Set ams-site: timeline.metrics.cluster.aggregation.sql.filters = sdisk_%,boottime,default.General% (only from Ambari 2.5.0)
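As a sketch of one way to apply a property like the fetch size without the UI, Ambari ships a configs.sh helper under /var/lib/ambari-server/resources/scripts/ on the Ambari Server host. The credentials, host, and cluster name below are placeholders, and the script location and arguments can differ across Ambari versions, so verify against your installation:

/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p <password> set <ambari-server-host> <cluster-name> ams-site "timeline.metrics.service.resultset.fetchSize" "10000"

A restart of the Metrics Collector is typically needed for the change to take effect.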

#2 : Reducing number of metrics

If the above config changes do not improve AMS stability, you can whitelist selected metrics, or blacklist the metrics of components that are causing the load issue.

 

Underlying problem: The AMS node has slow disk speed, and the disk is not able to keep up with the high data volume (a quick write-throughput spot check is sketched after the list below).

What it could cause: Raw write and aggregation problems.

Fix / Workaround:
  • On larger clusters (> 800 nodes) running in distributed mode, 3-5 SSDs are suggested on the Metrics Collector node; create a config group for the DataNode on that host so it uses those 3-5 disks as data directories.
  • ams-hbase-site :: hbase.rootdir - Change this path to a disk mount that is not heavily contended.
  • ams-hbase-site :: hbase.tmp.dir - Change this path to a location different from hbase.rootdir.
  • ams-hbase-site :: hbase.wal.dir - Change this path to a location different from hbase.rootdir (from Ambari 2.5.1).
  • Metric whitelisting will help decrease the metric load.
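As a rough way to compare candidate mounts for hbase.rootdir / hbase.tmp.dir / hbase.wal.dir, a direct-I/O dd write test gives a ballpark sequential write throughput. The target path is a placeholder; the test writes a 1 GB file, so check free space and remove the file afterwards:

dd if=/dev/zero of=/path/to/candidate-mount/ams_disk_test bs=1M count=1024 oflag=direct
rm -f /path/to/candidate-mount/ams_disk_test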
 

Underlying problem: Known issues around the HBase normalizer and FIFO compaction, documented in Known Issues (#11 and #13).

What it could cause: Region and store file growth; this can be identified via #5 in the table above.

Fix / Workaround: Follow the workaround steps in the Known Issues doc.

...