...
Get the current state of the system
# | What information to gather? | How to get that information? | How to identify a red flag? |
---|---|---|---|
1 | Is AMS able to handle raw writes*? | Look for lines like 'AsyncProcess:1597 - #1, waiting for 13948 actions to finish' in the log. | If the number of actions to finish keeps increasing and AMS eventually shuts down, AMS may not be able to handle raw writes. |
2 | How long does the 2-minute cluster aggregator take to finish? | grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less. Look at the time elapsed between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | Aggregation takes more than 2 minutes. |
3 | How long does the 5-minute host aggregator take to finish? | grep "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less. Look at the time elapsed between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | Aggregation takes more than 5 minutes. |
4 | How many metrics are being collected? | | > 15000 metrics |
5 | How many regions and store files are there in AMS HBase? | Check the AMS HBase Master UI at http://<METRICS_COLLECTOR_HOST>:61310 | > 150 regions; > 2000 store files |
6 | How fast is AMS HBase flushing, and how much data is being flushed? | Check the HBase Master log in embedded mode or the RegionServer log in distributed mode. grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log \| less. Check how often METRIC_RECORD flushes happen and how much data is flushed each time. | More than 10 flushes per minute could indicate a problem. The flush size should be approximately equal to the flush size configured in ams-hbase-site. |
7 | If AMS is in distributed mode, is there a local DataNode? | From the cluster. | A missing local DataNode is a red flag: in distributed mode, a local DataNode helps HBase use its short-circuit read feature. |
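
The log checks in rows 2, 3 and 6 can be automated. The following is a minimal Python sketch, not part of AMS. It assumes that every log line starts with a timestamp of the form `2016-03-01 10:00:00,123`, that the aggregator name appears on the same line as the 'Start aggregation cycle' and 'Saving ## metric aggregates' messages, and that flush lines mention the table name. The function names are hypothetical, so adjust the pattern and strings to match your logs.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: every log line starts with a timestamp like
# "2016-03-01 10:00:00,123". Adjust the pattern and format if yours differs.
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def aggregation_durations(lines, aggregator):
    """Seconds between each 'Start aggregation cycle' line and the next
    'Saving ... metric aggregates' line logged by the given aggregator."""
    durations = []
    started_at = None
    for line in lines:
        if aggregator not in line:
            continue
        match = TS_PATTERN.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), TS_FORMAT)
        if "Start aggregation cycle" in line:
            started_at = stamp
        elif "Saving" in line and "metric aggregates" in line and started_at:
            durations.append((stamp - started_at).total_seconds())
            started_at = None
    return durations


def slow_cycles(durations, limit_seconds):
    """Durations that exceed the red-flag limit (120 s for row 2, 300 s for row 3)."""
    return [d for d in durations if d > limit_seconds]


def flushes_per_minute(lines, table="METRIC_RECORD"):
    """Count 'memstore flush' lines that mention the table, grouped by minute."""
    counts = Counter()
    for line in lines:
        if "memstore flush" not in line or table not in line:
            continue
        match = TS_PATTERN.match(line)
        if match:
            counts[match.group(1)[:16]] += 1  # key is "YYYY-MM-DD HH:MM"
    return counts
```

To use it, open the collector log and pass the file object as `lines`, for example `slow_cycles(aggregation_durations(log, "TimelineClusterAggregatorSecond"), 120)`. For row 6, any minute in `flushes_per_minute(log)` with a count above 10 matches the red flag in the table.
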
...