...
Get the current state of the system
# | What to get? | How to get it? | How to identify a red flag? |
---|---|---|---|
1 | Is AMS able to handle raw writes? | Look for log lines like 'AsyncProcess:1597 - #1, waiting for 13948 actions to finish' in the Metrics Collector log. | If the number of actions waiting to finish keeps increasing and AMS eventually shuts down, it could mean AMS is not able to handle raw writes. |
2 | How long does the 2 min cluster aggregator take to finish? | `grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less`. Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. (A combined sketch of these log checks appears after this table.) | > 2 min aggregation time |
3 | How long does the 5 min host aggregator take to finish? | `grep "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less`. Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | > 5 min aggregation time |
4 | How many metrics are being collected? | | > 15000 metrics |
5 | What is the number of regions and store files in AMS HBase? | This can be obtained from the AMS HBase Master UI: http://<METRICS_COLLECTOR_HOST>:61310 | > 150 regions, > 2000 store files |
6 | How fast is AMS HBase flushing, and how much data is being flushed? | Check the HBase Master log in embedded mode and the RegionServer log in distributed mode: `grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log \| less`. Check how often METRIC_RECORD flushes happen and how much data is flushed. | 2-3 or more flushes every second could be a problem. The flush size should be approximately equal to the flush size configured in ams-hbase-site. |
7 | If AMS is in distributed mode, is there a local DataNode? | Check the cluster topology. | In distributed mode, the absence of a local DataNode is a red flag, since a local DataNode is needed for the HBase short-circuit read feature. |
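The log checks in rows 1, 2, 3, and 6 can be run together from the Metrics Collector host. Below is a minimal shell sketch, assuming the log locations listed in the table; the metadata endpoint used for the row 4 metric count (port 6188 and the `metricname` field) is an assumption and may differ in your AMS version.

```bash
#!/usr/bin/env bash
# Minimal AMS health-check sketch; log paths and the metadata endpoint are
# assumptions based on the defaults listed in the table above.
COLLECTOR_LOG=/var/log/ambari-metrics-collector/ambari-metrics-collector.log

# Row 1: raw-write backlog. A steadily growing "waiting for N actions" count is a red flag.
grep "waiting for .* actions to finish" "$COLLECTOR_LOG" | tail -n 20

# Rows 2 and 3: aggregator runtimes. Compare the timestamps of each
# "Start aggregation cycle" / "Saving ... metric aggregates" pair.
grep -E "TimelineClusterAggregatorSecond|TimelineHostAggregatorMinute" "$COLLECTOR_LOG" \
  | grep -E "Start aggregation cycle|Saving .* metric aggregates" | tail -n 40

# Row 6: memstore flush frequency and size (HBase Master log in embedded mode,
# RegionServer log in distributed mode; adjust the path to your install).
grep "memstore flush" /var/log/metric_collector/hbase-ams-*.log 2>/dev/null | tail -n 20

# Row 4: rough count of distinct metric names known to the collector
# (hypothetical use of the collector metadata endpoint on the default port 6188).
curl -s "http://localhost:6188/ws/v1/timeline/metrics/metadata" \
  | grep -o '"metricname"' | wc -l
```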
Fixing / Recovering from the problem.
The problems above can usually be traced to two or three underlying causes.
Underlying Problem | What it could cause | Fix / Workaround |
---|---|---|
Too many metrics (#4 from above) | It could cause ALL of the problems mentioned above. | #1: Try out config changes. #2: Reduce the number of metrics. If the config changes do not improve AMS stability, you can whitelist selected metrics or blacklist the metrics of the components that are causing the load issue (see the config sketch after this table). |
The AMS node has slow disks, and the disks cannot keep up with the high write volume. | It can cause raw-write and aggregation problems. | |
Known issues around the HBase normalizer and FIFO compaction, documented in Known Issues (#11 and #13). | This can be identified via check #5 in the table above. | Follow the workaround steps in the Known Issues doc. |
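As a rough illustration of the whitelisting/blacklisting fix for the "too many metrics" case: the property names below come from later AMS releases and should be treated as assumptions to verify against the documentation for your Ambari version; the whitelist file path and the appId list are hypothetical examples.

```properties
# Custom ams-site (assumed property names; verify for your Ambari/AMS version)
timeline.metrics.whitelisting.enabled=true
timeline.metrics.whitelist.file=/etc/ambari-metrics-collector/conf/metrics_whitelist   # hypothetical path; one metric name per line
timeline.metrics.apps.blacklist=nimbus,kafka_broker                                    # example component appIds whose metrics are dropped
```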
Other Advanced Configurations
Configuration | Property | Description | Recommended values |
---|---|---|---|
ams-site | phoenix.query.maxGlobalMemoryPercentage | Percentage of total heap memory used by Phoenix threads in the Metrics Collector API/Aggregator daemon. | 20 - 30, based on available memory. Default = 25. |
ams-site | phoenix.spool.directory | Directory for Phoenix spill files (client side). | Set this to a different disk from the hbase.rootdir disk if possible. |
ams-hbase-site | phoenix.spool.directory | Directory for Phoenix spill files (server side). | Set this to a different disk from the hbase.rootdir disk if possible. |
ams-hbase-site | phoenix.query.spoolThresholdBytes | Threshold size in bytes after which results of parallelly executed queries are spooled to disk. | Set this to a higher value based on available memory. Default is 12 MB. |
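For example, a minimal sketch of how these properties might look when applied through Ambari (the spool directory path and the 20 MB threshold are illustrative assumptions, not universal recommendations):

```properties
# ams-site (Metrics Collector / Phoenix client side)
phoenix.query.maxGlobalMemoryPercentage=25
phoenix.spool.directory=/grid/1/ams/phoenix-spool      # hypothetical path on a disk other than hbase.rootdir

# ams-hbase-site (AMS HBase / Phoenix server side)
phoenix.spool.directory=/grid/1/ams/phoenix-spool
phoenix.query.spoolThresholdBytes=20971520             # ~20 MB, raised from the 12 MB default if memory allows
```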