Suggested Memory settings
Cluster Size (hosts) | Recommended Mode | Collector Heapsize (MB) ams-env : metrics_collector_heapsize | HBase Master Heapsize (MB) ams-hbase-env : hbase_master_heapsize | HBase RS Heapsize (MB) ams-hbase-env : hbase_regionserver_heapsize | HBase Master xmn Size (MB) ams-hbase-env : hbase_master_xmn_size | HBase RS xmn Size (MB) ams-hbase-env : regionserver_xmn_size |
---|---|---|---|---|---|---|
1 - 10 | Embedded | 512 | 1408 | 512 | 192 | - |
11 - 20 | Embedded | 1024 | 1920 | 512 | 256 | - |
21 - 100 | Embedded | 1664 | 5120 | 512 | 768 | - |
100 - 300 | Embedded | 4352 | 13056 | 512 | 2048 | - |
300 - 500 | Distributed | 4352 | 512 | 13056 | 102 | 2048 |
500 - 800 | Distributed | 7040 | 512 | 21120 | 102 | 3072 |
800 - 1000 | Distributed | 11008 | 512 | 32768 | 102 | 5120 |
1000+ | Distributed with 2 Metric Collectors (From Ambari 2.5.2) | 13696 | 512 | 32768 | 102 | 5120 |
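For illustration, these values can be applied from the Ambari Server host with the configs.sh helper that ships with Ambari Server (a sketch only; it assumes the script is present at its usual location, and the Ambari host, cluster name, and credentials are placeholders). The example uses the 300 - 500 node, distributed-mode row from the table above; restart Ambari Metrics afterwards for the new heap sizes to take effect.

```bash
# Placeholders -- substitute your own Ambari host, cluster name, and credentials.
AMBARI_HOST=ambari.example.com
CLUSTER=mycluster
CONFIGS=/var/lib/ambari-server/resources/scripts/configs.sh

# 300 - 500 node, distributed-mode values from the table above (all sizes in MB).
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-env metrics_collector_heapsize 4352
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-env hbase_master_heapsize 512
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-env hbase_regionserver_heapsize 13056
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-env hbase_master_xmn_size 102
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-env regionserver_xmn_size 2048
```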
Identifying and tackling scale problems in AMS through configs
How do we find out if AMS is experiencing scale problems?
One or more of the following consequences can be seen on the cluster.
- Metrics Collector shuts down intermittently. Since Auto Restart is enabled for the Metrics Collector by default, this will show up as an alert stating 'Metrics collector has been auto restarted # times the last 1 hour'.
- Partial metrics data is seen:
  - All non-aggregated host metrics are seen (HDFS Namenode metrics / Host summary page on Ambari / System - Servers Grafana dashboard).
  - Aggregated data is not seen (AMS Summary page / System - Home Grafana dashboard / HBase - Home Grafana dashboard).
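A quick first check when these symptoms appear is whether the Metrics Collector is up and answering its REST API at all. The sketch below assumes a default install: the collector host name is a placeholder and 6188 is the default collector port (timeline.metrics.service.webapp.address), so adjust both for your cluster.

```bash
# Placeholder collector host; 6188 is the default Metrics Collector port.
COLLECTOR=metrics-collector.example.com:6188

# Does the collector respond? Fetch the metric metadata it currently serves.
curl -s "http://$COLLECTOR/ws/v1/timeline/metrics/metadata" | head -c 300; echo

# Spot-check one host metric series for recent data points (hostname is a placeholder).
curl -s "http://$COLLECTOR/ws/v1/timeline/metrics?metricNames=cpu_user&appId=HOST&hostname=host1.example.com"
```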
Get the current state of the system
# | What to get? | How to get? | How to identify red flag? |
---|---|---|---|
1 | Is AMS able to handle raw writes? | Look for log lines like 'AsyncProcess:1597 - #1, waiting for 13948 actions to finish' in the collector log. | If the number of actions to finish keeps increasing and AMS eventually shuts down, it could mean AMS is not able to handle raw writes. |
2 | How long does it take for the 2 min cluster aggregator to finish? | Run grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less and look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | > 2 mins aggregation time |
3 | How long does it take for the 5 min host aggregator to finish? | Run grep "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less and look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | > 5 mins aggregation time |
4 | How many metrics are being collected? | (One way to approximate this is sketched below.) | > 15000 metrics |
5 | What is the number of regions and store files in AMS HBase? | From the AMS HBase Master UI: http://<METRICS_COLLECTOR_HOST>:61310 | > 150 regions, > 2000 store files |
6 | How fast is AMS HBase flushing, and how much data is being flushed? | Check the master log in embedded mode and the RegionServer log in distributed mode: grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log \| less. Check how often METRIC_RECORD flushes happen and how much data is flushed. | > 2-3 flushes every second could be a problem. The flush size should be approximately equal to the flush size configured in ams-hbase-site. |
7 | If AMS is in distributed mode, is there a local Datanode? | Check whether a DataNode is running on the Metrics Collector host. | In distributed mode, a local DataNode helps with the HBase read short-circuit feature. |
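The log checks in rows #2, #3 and #6 can be scripted, and the collector's metadata endpoint offers one way to approximate the metric count asked for in row #4. The sketch below assumes the default log location used above and the default collector port; the endpoint path and host name are assumptions based on a stock install.

```bash
LOG=/var/log/ambari-metrics-collector/ambari-metrics-collector.log
COLLECTOR=metrics-collector.example.com:6188   # placeholder host, default port

# Rows #2 and #3: pull out the aggregation-cycle lines so the 'Start aggregation cycle'
# and 'Saving ## metric aggregates' timestamps can be compared against the 2 min / 5 min red flags.
grep -E "TimelineClusterAggregatorSecond|TimelineHostAggregatorMinute" "$LOG" \
  | grep -E "Start aggregation cycle|Saving .* metric aggregates" | tail -40

# Row #4 (one way to approximate it): count the metric names reported by the collector's
# metadata endpoint and compare against the ~15000 red flag.
curl -s "http://$COLLECTOR/ws/v1/timeline/metrics/metadata" \
  | python -c 'import json,sys; d=json.load(sys.stdin); print(sum(len(v) for v in d.values()))'
```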
Fixing / Recovering from the problem.
The above problems could occur because of a few underlying reasons.
Underlying Problem | What it could cause | Fix / Workaround |
---|---|---|
Too many metrics (#4 from above) | It could cause ALL of the problems mentioned above. | #1: Try out config changes. #2: Reduce the number of metrics - if the config changes do not increase AMS stability, you can whitelist selected metrics or blacklist certain components' metrics that are causing the load issue. |
AMS node has slow disk speed; the disk is not able to keep up with the high data volume. | It can cause raw write and aggregation problems. | (A rough disk-throughput check is sketched below.) |
Known issues around the HBase normalizer and FIFO compaction, documented in Known Issues (#11 and #13). | This can be identified via #5 in the above table. | Follow the workaround steps in the Known Issues doc. |
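For the slow-disk cause in the table above, a rough sanity check of sequential write throughput on the disk backing hbase.rootdir can be done with dd (a generic sketch, not an AMS tool; the target path is a placeholder and should point at the filesystem that actually holds the AMS HBase data and WAL).

```bash
# Placeholder path -- point this at the filesystem that holds hbase.rootdir.
TARGET_DIR=/var/lib/ambari-metrics-collector/hbase

# Write ~1 GB and sync before reporting throughput; a figure far below what the
# hardware should deliver suggests the disk cannot keep up with the AMS write volume.
dd if=/dev/zero of="$TARGET_DIR/dd_test.tmp" bs=1M count=1024 conv=fdatasync
rm -f "$TARGET_DIR/dd_test.tmp"
```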
Other Advanced Configurations
Configuration | Property | Description | Recommended Values |
---|---|---|---|
ams-site | phoenix.query.maxGlobalMemoryPercentage | Percentage of total heap memory used by Phoenix threads in the Metrics Collector API/Aggregator daemon. | 20 - 30, based on available memory. Default = 25. |
ams-site | phoenix.spool.directory | Set the directory for Phoenix spill files. (Client side) | Set this to a different disk from the hbase.rootdir disk if possible. |
ams-hbase-site | phoenix.spool.directory | Set the directory for Phoenix spill files. (Server side) | Set this to a different disk from the hbase.rootdir disk if possible. |
ams-hbase-site | phoenix.query.spoolThresholdBytes | Threshold size in bytes after which the results of queries executed in parallel are spooled to disk. | Set this to a higher value based on available memory. Default is 12 MB. |
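As an example, the two spool directories above could be moved off the hbase.rootdir disk, and the spool threshold raised, using the same configs.sh approach shown earlier (host, cluster name, credentials, the spool path, and the 20 MB threshold are all placeholders or illustrative values, not prescribed settings).

```bash
AMBARI_HOST=ambari.example.com
CLUSTER=mycluster
CONFIGS=/var/lib/ambari-server/resources/scripts/configs.sh
SPOOL_DIR=/ams-spool/phoenix   # placeholder: a directory on a disk other than hbase.rootdir

# Client-side (Metrics Collector) and server-side (AMS HBase) Phoenix spool directories.
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-site phoenix.spool.directory "$SPOOL_DIR"
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-site phoenix.spool.directory "$SPOOL_DIR"

# Optionally raise the spool threshold (bytes); 20971520 = 20 MB, versus the 12 MB default noted above.
$CONFIGS -u admin -p admin set $AMBARI_HOST $CLUSTER ams-hbase-site phoenix.query.spoolThresholdBytes 20971520
```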