...
# | What information to gather? | How to get that information? | How to identify if there is a red flag? | |
---|---|---|---|---|
1 | Is AMS able to handle raw writes*? | Look for log lines like 'AsyncProcess:1597 - #1, waiting for 13948 actions to finish' in the log. | If the number of actions to finish keeps increasing and AMS eventually shuts down, AMS may not be able to handle raw writes. | |
2 | How long does it take for the 2 min cluster aggregator to finish? | Run 'grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less', then look at the time elapsed between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | >2 mins aggregation time | |
3 | How long does it take for the 5 min host aggregator to finish? | Run 'grep -e "TimelineHostAggregatorMinute" -e "TimelineMetricHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less', then look at the time elapsed between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | >5 mins aggregation time | |
4 | How many metrics are being collected? | Run 'curl http://<ams-host>:6188/ws/v1/timeline/metrics/metadata -o /tmp/metrics_metadata.txt'. The number of metrics is the output of 'grep -o "metricname" /tmp/metrics_metadata.txt \| wc -l'. Also find out which component is sending a lot of metrics. | > 15000 metrics | |
5 | What is the number of regions and store files in AMS HBase? | Available from the AMS HBase Master UI at http://<METRICS_COLLECTOR_HOST>:61310 | > 150 regions or > 2000 store files | |
6 | How fast is AMS HBase flushing, and how much data is being flushed? | Check the HBase Master log in embedded mode and the RegionServer (RS) log in distributed mode: 'grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log \| less'. Check how often METRIC_RECORD flushes happen and how much data each one flushes. | >10 flushes in a minute could be a problem. The flush size should be approximately equal to the flush size configured in ams-hbase-site. | |
7 | If AMS is in distributed mode, is there a local DataNode? | From the cluster. | No local DataNode. In distributed mode, a local DataNode helps with the HBase short-circuit read feature. | |
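
Checks 1, 4 and 6 in the table can be partly scripted. The following is a minimal sketch, not an official AMS tool: the function names are hypothetical, and `flushes_per_minute` assumes each log line begins with a `YYYY-MM-DD HH:MM` timestamp, which may not hold for every log4j layout.

```shell
#!/bin/sh
# Hypothetical helpers for the checks above. Each takes a file path as $1.

# Check 1: print the pending-action count from every
# "waiting for N actions to finish" line, in log order.
pending_actions() {
    grep -o 'waiting for [0-9]* actions to finish' "$1" | awk '{ print $3 }'
}

# Check 1: compare the first and last pending-action counts.
# Prints "growing", "stable", or "none" (no matching lines).
pending_trend() {
    pending_actions "$1" | awk '
        NR == 1 { first = $1 + 0 }
                { last = $1 + 0 }
        END {
            if (NR == 0) print "none"
            else if (last > first) print "growing"
            else print "stable"
        }'
}

# Check 4: count "metricname" occurrences in a saved metadata response.
count_metrics() {
    grep -o "metricname" "$1" | wc -l | tr -d ' '
}

# Check 6: count "memstore flush" lines per minute.
# ASSUMPTION: the first 16 characters of each line are "YYYY-MM-DD HH:MM".
flushes_per_minute() {
    grep 'memstore flush' "$1" | cut -c1-16 | sort | uniq -c |
        awk '{ print $2, $3, $1 }'
}
```

For example, after sourcing the functions, `pending_trend <log file>` gives a quick first read on check 1, and `count_metrics /tmp/metrics_metadata.txt` can be compared against the 15000 threshold. A "growing" result only compares the first and last samples, so it is a hint to read the log more closely rather than a verdict.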
...