...
- Metrics Collector shuts down intermittently. Since Auto Restart is enabled for the Metrics Collector by default, this will show up as an alert stating 'Metrics collector has been auto restarted # times the last 1 hour'.
- Partial metrics data is seen.
- All non-aggregated host metrics are seen (HDFS Namenode metrics / Host summary page on Ambari / System - Servers Grafana dashboard).
- Aggregated data is not seen (AMS Summary page / System - Home Grafana dashboard / HBase - Home Grafana dashboard).
Systematically Troubleshooting the scale issue
Get the current state of the system
What to get? | How to get? | How to identify a red flag?
---|---|---
How long does it take for the 2-min cluster aggregator to finish? | `grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less`. Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | > 2 mins aggregation time
How long does it take for the 5-min host aggregator to finish? | `grep "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log \| less`. Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'. | > 5 mins aggregation time
How many metrics are being collected? | | > 15000 metrics could be a problem. Find the component contributing the maximum number of metrics.
What is the number of regions and store files in AMS HBase? | This can be obtained from the AMS HBase Master UI: http://<METRICS_COLLECTOR_HOST>:61310 | > 150 regions / > 2000 store files
How fast is AMS HBase flushing, and how much data is being flushed? | Check the HBase Master log in embedded mode, or the RegionServer log in distributed mode: `grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log \| less`. Check how often METRIC_RECORD flushes happen and how much data is flushed. | More than 2-3 flushes per second could be a problem. The flush size should be approximately equal to the flush size configured in ams-hbase-site.
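The aggregation-time check above can be scripted instead of eyeballed. A minimal sketch is below; the log line layout (a log4j-style `YYYY-MM-DD HH:MM:SS,mmm` timestamp followed by the 'Start aggregation cycle' / 'Saving ## metric aggregates' messages) is an assumption, so adjust the patterns to match your actual ambari-metrics-collector.log.

```python
# Sketch: compute aggregation cycle durations from the collector log.
# The log line format assumed here is illustrative, not guaranteed.
import re
from datetime import datetime

START_RE = re.compile(r"^(\S+ \S+) .*Start aggregation cycle")
END_RE = re.compile(r"^(\S+ \S+) .*Saving \d+ metric aggregates")
TS_FMT = "%Y-%m-%d %H:%M:%S,%f"  # typical log4j timestamp (assumed)

def aggregation_durations(lines):
    """Yield (cycle_start, duration_seconds) for each start/save pair."""
    start = None
    for line in lines:
        m = START_RE.match(line)
        if m:
            start = datetime.strptime(m.group(1), TS_FMT)
            continue
        m = END_RE.match(line)
        if m and start is not None:
            end = datetime.strptime(m.group(1), TS_FMT)
            yield start, (end - start).total_seconds()
            start = None

# Hypothetical sample lines for demonstration:
sample = [
    "2016-03-01 10:00:00,000 INFO TimelineClusterAggregatorSecond: Start aggregation cycle ...",
    "2016-03-01 10:02:30,000 INFO TimelineClusterAggregatorSecond: Saving 12000 metric aggregates",
]
for cycle_start, secs in aggregation_durations(sample):
    # For the 2-min cluster aggregator, > 120 s is a red flag.
    print(f"{cycle_start}: {secs:.0f}s {'RED FLAG' if secs > 120 else 'ok'}")
```

Run it with the real log via `aggregation_durations(open(path))`; a steady upward trend in cycle time is as telling as a single slow cycle.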
Fixing / Recovering from the problem.
Advanced Configurations
Configuration | Property | Description | Recommended values
---|---|---|---
ams-site | phoenix.query.maxGlobalMemoryPercentage | Percentage of total heap memory used by Phoenix threads in the Metrics Collector API/Aggregator daemon. | 20 - 30, based on available memory. Default = 25.
ams-site | phoenix.spool.directory | Set directory for Phoenix spill files. (Client side) | Set this to a different disk from hbase.rootdir if possible.
ams-hbase-site | phoenix.spool.directory | Set directory for Phoenix spill files. (Server side) | Set this to a different disk from hbase.rootdir if possible.
ams-hbase-site | phoenix.query.spoolThresholdBytes | Threshold size in bytes after which results from queries executed in parallel are spooled to disk. | Set this to a higher value based on available memory. Default is 12 MB.
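For reference, the two ams-hbase-site properties above might look like the following in the rendered site configuration. The directory path and the threshold value are illustrative examples, not recommendations for every cluster:

```xml
<!-- Illustrative ams-hbase-site entries; path and size are example values. -->
<property>
  <name>phoenix.spool.directory</name>
  <!-- Put Phoenix spill files on a different disk from hbase.rootdir -->
  <value>/grid/1/ams/phoenix-spool</value>
</property>
<property>
  <name>phoenix.query.spoolThresholdBytes</name>
  <!-- Default is 12 MB (12582912); raise based on available memory -->
  <value>20971520</value>
</property>
```

In Ambari-managed clusters these values are edited through the ams-hbase-site config section rather than by hand, so the XML is only what Ambari ultimately writes out.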