The following is a list of potential improvements to the log cleaner.

Simple Things

  1. Integrate with the system test framework. We have a good integration test; it would be nice to hook that into the nightly test run.
  2. Add Yammer metrics for cleaner rates.

Estimate Survivorship Ratio

The log cleaner works by successively estimating the "dirtiest" log and cleaning that one. Intuitively we want to avoid cleaning logs that have no garbage, as that will generate lots of I/O with no reduction in size. Since we can't know the amount of garbage in a log without actually doing something like the cleaning itself, this is inherently a prediction problem. The better we make this choice, the more efficient the cleaner will be.

I went through several iterations thinking about the right way to rank logs. Originally I thought I might rank them by the total amount of garbage. However, using the absolute amount doesn't make sense. Consider a case where you have a 1GB log that is 90% garbage and a 1TB log that is 10% garbage. It is true that cleaning the 1TB log will free up 100GB of space while cleaning the 1GB log will only free up 900MB; however, cleaning the 1TB log will also take roughly 1000 times longer. So clearly what we want is for the cleaner to free the most space per second of operation (or per byte of cleaner I/O). This points to the right metric to use, which is

  shrinkage = size_after_cleaning / size_before_cleaning

The problem is that we don't really know the size_after_cleaning without actually doing the cleaning.

Clearly

  size_after_cleaning = clean_size + survivorship_percentage * dirty_size 

That is, if 100% of the dirty portion of the log consists of "updates" (survivorship_percentage = 0%), then after cleaning all that will be left is clean_size bytes. If, on the other hand, all the messages in the dirty portion are new and unique and there is no overlap with the clean portion (survivorship_percentage = 100%), then the resulting size will be clean_size + dirty_size.
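
To make this concrete, here is a small worked example in Scala (a sketch only; the helper and the numbers are illustrative, not part of the cleaner code) that plugs the two logs from the example above into these formulas, assuming the survivorship percentage were somehow known in advance:

  object ShrinkageExample {
    // shrinkage = size_after_cleaning / size_before_cleaning
    // size_after_cleaning = clean_size + survivorship_percentage * dirty_size
    def shrinkage(cleanBytes: Long, dirtyBytes: Long, survivorship: Double): Double = {
      val sizeBefore = (cleanBytes + dirtyBytes).toDouble
      val sizeAfter  = cleanBytes + survivorship * dirtyBytes
      sizeAfter / sizeBefore
    }

    def main(args: Array[String]): Unit = {
      val gb = 1024L * 1024 * 1024
      // 1GB log that is 90% garbage: treat the whole log as dirty with 10% survivorship
      println(f"1GB log: ${shrinkage(0L, 1 * gb, 0.10)}%.2f")    // ~0.10
      // 1TB log that is 10% garbage: the whole log is dirty with 90% survivorship
      println(f"1TB log: ${shrinkage(0L, 1024 * gb, 0.90)}%.2f") // ~0.90
    }
  }

Per byte of cleaner I/O the 1GB log is the far better choice: it shrinks to about 10% of its original size, while the 1TB log shrinks to only about 90%.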

Current Code

So how can we estimate the survivorship_percentage? Well, in the current code I avoided this entirely with the following assumption: most data sets are more or less in a steady state, in other words they are mostly taking updates, not inserts. As a result I simply assume that survivorship_percentage = 0% and just use the following ratio for ranking:

  dirty_bytes / (clean_bytes + dirty_bytes)

Specifically, clean_bytes is the size of the already-cleaned portion of the log, from the beginning of the log up to the last cleaner end point, and dirty_bytes is the remaining portion from the last cleaner end point to the beginning of the active segment.
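
Roughly, the selection looks something like the sketch below; LogToClean, grabFilthiest, and the minimum-ratio threshold are illustrative stand-ins rather than the cleaner's actual API:

  // Hypothetical stand-in for the cleaner's per-log bookkeeping.
  case class LogToClean(topicPartition: String, cleanBytes: Long, dirtyBytes: Long) {
    // With survivorship_percentage assumed to be 0%, the expected space freed per
    // byte of cleaner I/O is simply the dirty fraction of the log.
    def cleanableRatio: Double = dirtyBytes.toDouble / (cleanBytes + dirtyBytes)
  }

  object CleanerRanking {
    // Pick the log with the highest dirty ratio, skipping logs below a minimum
    // threshold so we don't bother with cleanings that free almost nothing.
    def grabFilthiest(logs: Seq[LogToClean], minCleanableRatio: Double): Option[LogToClean] =
      logs.filter(_.cleanableRatio >= minCleanableRatio)
          .sortBy(_.cleanableRatio)
          .lastOption
  }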

A Better Approach

This heuristic has an obvious drawback: a log which is being "bootstrapped" and is taking only inserts would get cleaned just as frequently as a log in steady state, even though these cleanings would result in no size reduction.

A better approach would be to use the survivorship ratio from previous cleanings to estimate the size_after_cleaning for the current cleaning. We already calculate these statistics in CleanerStats.

The simplest implementation would just keep an in-memory map of topic/partition=>survivorship_ratio. This would be initialized with survivorship_percentage = 0% (the current heuristic) and each time we do a cleaning we would update this percentage to be the observed survivorship from the last cleaning.
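
A minimal sketch of that map, assuming a CleanerStats-like result reports the bytes read and written by each cleaning (the class and method names here are made up for illustration):

  import scala.collection.concurrent.TrieMap

  class SurvivorshipEstimator {
    // topic/partition => last observed survivorship ratio
    private val ratios = TrieMap.empty[String, Double]

    // Defaults to 0.0 (the current heuristic) until we have a real observation.
    def survivorship(tp: String): Double = ratios.getOrElse(tp, 0.0)

    // size_after_cleaning = clean_size + survivorship_percentage * dirty_size
    def estimatedSizeAfterCleaning(tp: String, cleanBytes: Long, dirtyBytes: Long): Double =
      cleanBytes + survivorship(tp) * dirtyBytes

    // Called after each cleaning with the bytes read and written by that run.
    def recordCleaning(tp: String, bytesRead: Long, bytesWritten: Long): Unit =
      if (bytesRead > 0) ratios.put(tp, bytesWritten.toDouble / bytesRead)
  }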

There are two further improvements.

First, we could periodically save this ratio out to a file so that on restart we don't reduce cleaner efficiency by starting over with a bad estimate.
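
For instance, something along the lines of the following could checkpoint the ratios in a simple one-entry-per-line text format, similar in spirit to the existing cleaner offset checkpoint file (the file format and names here are assumptions, not an existing Kafka file):

  import java.nio.file.{Files, Path}
  import scala.jdk.CollectionConverters._

  object SurvivorshipCheckpoint {
    // Write one "topicPartition ratio" entry per line.
    def save(path: Path, ratios: Map[String, Double]): Unit =
      Files.write(path, ratios.map { case (tp, r) => s"$tp $r" }.toSeq.asJava)

    // Read the file back, returning an empty map if it does not exist yet.
    def load(path: Path): Map[String, Double] =
      if (!Files.exists(path)) Map.empty
      else Files.readAllLines(path).asScala.map { line =>
        val parts = line.split(" ")
        parts(0) -> parts(1).toDouble
      }.toMap
  }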

Second, using only the survivorship_percentage from the previous cleaning might leave us open to some corner cases. There are two cases to consider. The first is some kind of oscillating behavior where the true survivorship ratio swings between 10% and 90%. To handle this it would be nice to use an exponentially weighted average of previous ratios:

  survivorship_ratio = alpha * observed_ratio + (1 - alpha) * survivorship_ratio

Here alpha is a "learning rate" between 0 and 1.0 that controls how quickly we move away from older ratios and towards the most recently observed ratio (the simple proposal above is basically alpha = 1.0). This would help smooth out oscillations at the cost of taking a few cleanings to fully adapt to the observed ratio.
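
A tiny sketch of that update rule, with alpha as a hypothetical configuration value rather than an existing cleaner setting:

  // Exponentially weighted moving average of the survivorship ratio.
  class EwmaSurvivorship(alpha: Double, initial: Double = 0.0) {
    require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")
    private var ratio = initial

    // survivorship_ratio = alpha * observed_ratio + (1 - alpha) * survivorship_ratio
    def update(observed: Double): Double = {
      ratio = alpha * observed + (1 - alpha) * ratio
      ratio
    }

    def current: Double = ratio
  }

With alpha = 1.0 this degenerates to using only the most recent observation (the simple proposal above); smaller values react more slowly but damp out oscillation between cleanings.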

Generational Collection
