
  • Metrics Collector shuts down intermittently. Since Auto Restart is enabled for Metrics Collector by default, this will show up as an alert stating 'Metrics collector has been auto restarted # times the last 1 hour'.
  • Partial metrics data is seen.
    • All non-aggregated host metrics are seen (HDFS Namenode metrics / Host summary page on Ambari / System - Servers Grafana dashboard).
    • Aggregated data is not seen. (AMS Summary page / System - Home Grafana dashboard / HBase - Home Grafana dashboard).

Systematically Troubleshooting the scale issue

Get the current state of the system

 
For each check below: what to get, how to get it, and how to identify a red flag.
How long does it take for 2 min cluster aggregator to finish?
  • How to get: grep "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | less
    Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'.
  • Red flag: > 2 mins aggregation time
How long does it take for 5 min cluster aggregator to finish?
  • How to get: grep "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | less
    Look for the time taken between 'Start aggregation cycle....' and 'Saving ## metric aggregates'.
  • Red flag: > 5 mins aggregation time
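Scrolling through the full grep output to find matching start and save lines can be tedious. The sketch below narrows the output to just those two markers for one aggregator. The function name `aggregation_markers` is hypothetical, and the marker strings are copied from the text above, so adjust them if the log wording differs on your cluster.

```shell
#!/bin/sh
# Print only the "cycle start" and "saving aggregates" lines for one aggregator,
# so consecutive start/save pairs can be compared by eye.
#   $1: aggregator name as it appears in the log
#   $2: path to the Metrics Collector log
# NOTE: the marker strings are taken from the guide text, not verified
# against every AMS version.
aggregation_markers() {
    grep "$1" "$2" | grep -E "Start aggregation cycle|Saving .* metric aggregates"
}

# Example usage (log path and aggregator names as given in this guide):
# aggregation_markers "TimelineClusterAggregatorSecond" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | less
# aggregation_markers "TimelineHostAggregatorMinute" /var/log/ambari-metrics-collector/ambari-metrics-collector.log | less
```

The time between a start line and the next save line for the same aggregator is the aggregation time to compare against the red-flag values.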
How many metrics are being collected?
  • How to get:
    • curl http://<ams-host>:6188/ws/v1/timeline/metrics/metadata -o /tmp/metrics_metadata.txt
    • Number of metrics is the output of the command 'grep -o "metricname" /tmp/metrics_metadata.txt | wc -l'
  • Red flag: > 15000 metrics could be a problem. Find the component contributing the most to the number of metrics.
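The count and the threshold comparison can be wrapped in one small script. This is a minimal sketch that assumes the metadata response has already been saved to a file with the curl command above. The function names `count_metrics` and `check_metric_count` are hypothetical, and 15000 is the red-flag value from this guide, not a hard limit.

```shell
#!/bin/sh
# Count occurrences of "metricname" in a saved metadata response.
#   $1: path to the saved response (e.g. /tmp/metrics_metadata.txt)
count_metrics() {
    grep -o "metricname" "$1" | wc -l | tr -d ' '
}

# Compare the count with a threshold (default 15000, the red flag in this guide).
#   $1: path to the saved response
#   $2: optional threshold override
check_metric_count() {
    count=$(count_metrics "$1")
    threshold=${2:-15000}
    if [ "$count" -gt "$threshold" ]; then
        echo "RED FLAG: $count metrics (> $threshold)"
    else
        echo "OK: $count metrics"
    fi
}

# Example usage:
# check_metric_count /tmp/metrics_metadata.txt
```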
What is the number of regions and store files in AMS HBase?
  • How to get: This can be got from the AMS HBase Master UI at http://<METRICS_COLLECTOR_HOST>:61310
  • Red flag: > 150 regions, or > 2000 store files

How fast is AMS HBase flushing, and how much data is being flushed?
  • How to get: Check the master log in embedded mode and the RS log in distributed mode.
    grep "memstore flush" /var/log/metric_collector/hbase-ams-<>.log | less
    Check how often METRIC_RECORD flushes happen and how much data is flushed.
  • Red flag: > 2-3 flushes every second could be a problem. The flush size should be approximately equal to the flush size config in ams-hbase-site.
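To judge flush frequency without reading the log line by line, the flush lines can be grouped by second. The sketch below is one way to do that. It assumes every log line begins with a 19-character timestamp of the form `YYYY-MM-DD HH:MM:SS`, which is an assumption about the log layout rather than something this guide states, and the function name `flushes_per_second` is hypothetical.

```shell
#!/bin/sh
# Count "memstore flush" lines per second and list the busiest seconds first.
#   $1: path to the AMS HBase master or RS log
#   $2: optional number of rows to show (default 10)
# ASSUMPTION: each line starts with "YYYY-MM-DD HH:MM:SS" (19 characters);
# adjust the cut range if your log layout differs.
flushes_per_second() {
    grep "memstore flush" "$1" | cut -c1-19 | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Example usage (substitute the actual log file name):
# flushes_per_second /var/log/metric_collector/hbase-ams-<>.log
```

Each output row is a count followed by the second it belongs to. Counts that regularly exceed 2-3 correspond to the red flag described above.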

Fixing / Recovering from the problem.


Advanced Configurations

| Configuration | Property | Description | Minimum Recommended values (Host Count => MB) |
|---|---|---|---|
| ams-site | phoenix.query.maxGlobalMemoryPercentage | Percentage of total heap memory used by Phoenix threads in the Metrics Collector API/Aggregator daemon. | 20 - 30, based on available memory. Default = 25. |
| ams-site | phoenix.spool.directory | Set directory for Phoenix spill files. (Client side) | Set this to a different disk from the hbase.rootdir dir if possible. |
| ams-hbase-site | phoenix.spool.directory | Set directory for Phoenix spill files. (Server side) | Set this to a different disk from the hbase.rootdir dir if possible. |
| ams-hbase-site | phoenix.query.spoolThresholdBytes | Threshold size in bytes after which results from queries executed in parallel are spooled to disk. | Set this to a higher value based on available memory. Default is 12 MB. |
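Since phoenix.query.spoolThresholdBytes is expressed in bytes while the recommendation is given in MB, a small conversion helper avoids arithmetic slips when picking a higher value. This is a sketch only: the helper name `mb_to_bytes` is hypothetical, and it assumes 1 MB means 1024 * 1024 bytes.

```shell
#!/bin/sh
# Convert a size in MB to bytes.
# ASSUMPTION: binary megabytes (1 MB = 1024 * 1024 bytes).
#   $1: size in MB (integer)
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# The 12 MB default mentioned above, expressed in bytes:
mb_to_bytes 12   # prints 12582912
```

The printed value can then be entered for the property in ams-hbase-site, scaled up according to the memory available on the host.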