The following is a list of potential improvements to the log cleaner.
Simple Things
Integrate with system test framework
We have a good integration test; it would be nice to hook that into the nightly test run.
Add a tool to measure the duplication in a log
It would be nice to have an operational tool to check the duplication within a log. This could be built as a simple consumer that takes a particular topic/partition, consumes that log sequentially, and estimates the duplication. Each key consumed would be checked against a bloom filter: if it is present we count a duplicate, otherwise we add it to the filter. A large enough bloom filter could probably produce an accurate-enough estimate of the duplication rate.
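A minimal sketch of such a tool, assuming Guava's BloomFilter and the standard Java consumer; the broker address, topic/partition, and filter sizing are placeholders to be adjusted for the log being checked:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DuplicationEstimator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        TopicPartition tp = new TopicPartition("my-compacted-topic", 0); // placeholder topic/partition
        // Sized for ~100M distinct keys at a 1% false-positive rate; tune for the log being checked.
        BloomFilter<byte[]> seen =
            BloomFilter.create(Funnels.byteArrayFunnel(), 100_000_000, 0.01);

        long total = 0, duplicates = 0;
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            while (consumer.position(tp) < end) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    total++;
                    byte[] key = record.key();
                    if (key == null)
                        continue;                 // unkeyed messages can't be deduplicated
                    if (seen.mightContain(key))
                        duplicates++;             // probably a duplicate (subject to false positives)
                    else
                        seen.put(key);            // first time this key has been seen
                }
            }
        }
        System.out.printf("%d of %d messages (%.1f%%) appear to be duplicates%n",
                          duplicates, total, 100.0 * duplicates / Math.max(total, 1));
    }
}
```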
Improve dedupe buffer efficiency
Currently we use a fairly naive approach to the approximate deduplication. There are several things that could be improved.
Currently we will only process up to N messages of dirty log in one cleaning, where N = buffer_size / 24 * collision_rate. This may actually be a little conservative. The dirty section of the log may itself have many duplicates, in which case it is actually using up much less space in the dedupe buffer. We could check whether the key is already present in the dedupe buffer and only increment the entry count if it is not.
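A toy sketch of that idea (this is not Kafka's actual dedupe buffer implementation; the class and field names are made up): the entry budget is only charged for keys not already present, so duplicate keys in the dirty section cost nothing extra:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the dedupe buffer: tracks the latest offset per key hash and only
// charges the entry budget when a key is seen for the first time, so a dirty section
// full of duplicate keys does not exhaust the budget prematurely.
public class CountingOffsetMap {
    private final Map<ByteBuffer, Long> latestOffset = new HashMap<>();
    private final int maxEntries;            // roughly buffer_size / 24, times a load factor
    private final MessageDigest digest;

    public CountingOffsetMap(int maxEntries) throws NoSuchAlgorithmException {
        this.maxEntries = maxEntries;
        this.digest = MessageDigest.getInstance("MD5");
    }

    /** Returns false once the map is full of distinct keys and cannot accept a new one. */
    public boolean put(byte[] key, long offset) {
        ByteBuffer hashed = ByteBuffer.wrap(digest.digest(key));
        boolean isNewKey = !latestOffset.containsKey(hashed);
        if (isNewKey && latestOffset.size() >= maxEntries)
            return false;                     // budget exhausted by distinct keys only
        latestOffset.put(hashed, offset);     // duplicates just overwrite, costing nothing extra
        return true;
    }

    public int distinctKeys() {
        return latestOffset.size();
    }
}
```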
Drive-aware scheduling and throttling
...
I went through several iterations thinking about the right way to rank logs. Originally I thought maybe I would rank them by the total amount of garbage. However, using the absolute amount doesn't make sense. Consider a case where you have a 1GB log that is 90% garbage and a 1TB log that is 10% garbage. It is true that cleaning the 1TB log will free up 100GB of space and cleaning the 1GB log will only free up 900MB; however, cleaning the 1TB log will also take 1000 times longer. So clearly what we want is for the cleaner to free the most space per second of operation (or per byte of cleaner I/O). This points to the right metric to use, which is
```
shrinkage = size_after_cleaning / size_before_cleaning
```
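Plugging the two hypothetical logs from above into this metric makes the trade-off concrete:

```java
public class ShrinkageExample {
    public static void main(String[] args) {
        // 1 GB log that is 90% garbage: size_after / size_before = 0.1 / 1.0
        double smallLog = 0.1 / 1.0;
        // 1 TB log that is 10% garbage: size_after / size_before = 900 / 1000
        double largeLog = 900.0 / 1000.0;
        // Lower shrinkage means more space reclaimed per byte of cleaner I/O,
        // so the small log ranks first even though it frees less space in absolute terms.
        System.out.printf("1GB log shrinkage = %.2f, 1TB log shrinkage = %.2f%n",
                          smallLog, largeLog);
    }
}
```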
The problem is that we don't really know the size_after_cleaning without actually doing the cleaning.
Clearly
```
size_after_cleaning = clean_size + survivorship_percentage * dirty_size
```
...
So how can we estimate the survivorship_percentage? Well, in the current code I avoided this entirely with the following assumption: I assume that most data sets are more or less in a steady state, in other words they are mostly taking updates, not inserts. As a result I simply assume that survivorship_percentage = 0% and just use the following ratio for ranking:
```
dirty_bytes / (clean_bytes + dirty_bytes)
```
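A small illustration of ranking with that assumption; the LogStats holder and the sample sizes are invented for the example and are not part of the cleaner code:

```java
import java.util.Comparator;
import java.util.List;

// Rank logs by dirty ratio (survivorship_percentage assumed to be 0):
// logs with the highest fraction of dirty bytes are cleaned first.
public class CleanerRanking {
    record LogStats(String topicPartition, long cleanBytes, long dirtyBytes) {
        double dirtyRatio() {
            return (double) dirtyBytes / (cleanBytes + dirtyBytes);
        }
    }

    static List<LogStats> rank(List<LogStats> logs) {
        return logs.stream()
                   .sorted(Comparator.comparingDouble(LogStats::dirtyRatio).reversed())
                   .toList();
    }

    public static void main(String[] args) {
        List<LogStats> ranked = rank(List.of(
            new LogStats("orders-0", 10_000_000_000L, 1_000_000_000L),   // ~9% dirty
            new LogStats("users-3",   1_000_000_000L, 4_000_000_000L))); // 80% dirty
        ranked.forEach(l -> System.out.printf("%s  dirty ratio = %.2f%n",
                                              l.topicPartition(), l.dirtyRatio()));
    }
}
```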
...
Second, using only the survivorship_percentage from the previous cleaning might leave us open to some corner cases. There are two cases to consider. The first is some kind of oscillating behavior where the true survivorship ratio ranges between 10% and 90%. To handle this it would be nice to use an exponentially weighted average of previous ratios:
```
survivorship_ratio = alpha * observed_ratio + (1 - alpha) * survivorship_ratio
```
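A sketch of how that weighted estimate could be maintained and fed back into the shrinkage calculation; the class and parameter names here are illustrative only:

```java
// Maintains an exponentially weighted estimate of the survivorship ratio and uses it
// to estimate shrinkage = size_after_cleaning / size_before_cleaning before cleaning.
public class SurvivorshipEstimator {
    private final double alpha;          // weight on the most recent observation, e.g. 0.25
    private double survivorshipRatio;    // running estimate, seeded with an initial guess

    public SurvivorshipEstimator(double alpha, double initialRatio) {
        this.alpha = alpha;
        this.survivorshipRatio = initialRatio;
    }

    /** Fold in the ratio actually observed after a cleaning completes. */
    public void update(double observedRatio) {
        survivorshipRatio = alpha * observedRatio + (1 - alpha) * survivorshipRatio;
    }

    /** Estimated shrinkage for a log with the given clean and dirty byte counts. */
    public double estimatedShrinkage(long cleanBytes, long dirtyBytes) {
        double sizeAfter = cleanBytes + survivorshipRatio * dirtyBytes;
        return sizeAfter / (cleanBytes + dirtyBytes);
    }
}
```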
...