...

  • Handle raw writes - A raw write is a batch of metric data points received from services and written to HBase through Phoenix. No read or aggregation is involved.
  • Periodically aggregate data - AMS aggregates data across the cluster and across time.
    • Cluster Aggregator - Computes the min, max, avg and sum of a metric (for example, memory) across all hosts. This aggregator is called 'TimelineClusterAggregatorSecond' and runs every 2 minutes. On every run it reads all data from the last 2 minutes, calculates the aggregates, and writes them back. The read is expensive because it has to read non-aggregated data, while the write volume is smaller because only aggregated data is written. For example, in a 100-node cluster, mem_free from 100 hosts becomes 1 aggregate metric value in this aggregator.
    • Time Aggregator - Also called 'downsampling', this aggregator rolls up the data in the time dimension. This lets AMS expire (TTL out) the fine-grained, seconds-precision data and keep the aggregate data for a longer time. For example, if there is a data point every 10 seconds, the 5-minute time aggregator takes the 30 data points from each 5-minute window and creates 1 rolled-up value. There are higher-level downsamplers (1 hour, 1 day) as well, and each uses the output of its immediate predecessor (the 1-hour aggregator reads 5-minute data, the 1-day aggregator reads 1-hour data). However, it is the 5-minute aggregator that is compute-heavy, since it reads all data from the last 5 minutes and downsamples it. Again, the read is very expensive because it has to read non-aggregated data, while the write volume is smaller. This downsampler is called 'TimelineHostAggregatorMinute'.
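
The two aggregation steps above can be illustrated with a minimal Python sketch. This is not the AMS implementation (AMS runs these aggregations against HBase through Phoenix); the names `Aggregate`, `cluster_aggregate` and `time_aggregate`, the tuple layouts, and the assumption that the time aggregator keys its output by metric and host are hypothetical and chosen only to show the shape of the computation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical container for the four aggregates named in the text."""
    minimum: float
    maximum: float
    average: float
    total: float
    count: int


def summarize(values: List[float]) -> Aggregate:
    """Collapse a non-empty list of raw values into one aggregate record."""
    if not values:
        raise ValueError("cannot aggregate an empty list of values")
    total = sum(values)
    return Aggregate(min(values), max(values), total / len(values), total, len(values))


def cluster_aggregate(
    points: Iterable[Tuple[str, str, float]],
) -> Dict[str, Aggregate]:
    """Aggregate across hosts: (metric, host, value) -> one record per metric."""
    by_metric: Dict[str, List[float]] = {}
    for metric, _host, value in points:
        by_metric.setdefault(metric, []).append(value)
    return {metric: summarize(values) for metric, values in by_metric.items()}


def time_aggregate(
    points: Iterable[Tuple[str, str, int, float]],
    window_seconds: int = 300,
) -> Dict[Tuple[str, str, int], Aggregate]:
    """Downsample in time: (metric, host, timestamp_seconds, value) ->
    one record per (metric, host, window_start)."""
    buckets: Dict[Tuple[str, str, int], List[float]] = {}
    for metric, host, timestamp, value in points:
        # Align each timestamp to the start of its window (assumed layout).
        window_start = timestamp - (timestamp % window_seconds)
        buckets.setdefault((metric, host, window_start), []).append(value)
    return {key: summarize(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # 100 hosts reporting mem_free collapse into a single cluster-level record.
    raw = [("mem_free", f"host{i}", float(i)) for i in range(100)]
    print(cluster_aggregate(raw)["mem_free"])

    # 30 points at 10-second spacing collapse into one 5-minute record.
    series = [("mem_free", "host0", t, 1.0) for t in range(0, 300, 10)]
    print(time_aggregate(series))
```

In both functions the input is many raw points and the output is a handful of records, which mirrors the read-heavy, write-light pattern described above.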

Scale problems occur in AMS when one or both of the above operations cannot run smoothly. The 'load' on AMS depends on the following factors:

...