...

Here alpha is a "learning rate" between 0 and 1.0 that controls how quickly we move away from older ratios and towards the observed ratio (the simple proposal above is effectively alpha = 1.0); in other words, an exponentially weighted moving average of the observed survivorship ratio. This would help smooth out oscillations at the cost of taking a few cleanings to fully adapt to the observed ratio.
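
As a rough sketch, the update could look like the following. The class and field names are hypothetical (no existing cleaner exposes this under these names); it only illustrates how a cleaner might fold each cleaning's observed ratio into a running estimate.

```java
/**
 * Illustrative sketch only: smooth the observed survivorship ratio with an
 * exponentially weighted moving average before planning the next cleaning.
 */
public class SurvivorshipEstimator {
    private final double alpha;  // learning rate in (0, 1]; 1.0 reproduces the simple proposal above
    private double estimate;     // current smoothed survivorship ratio

    public SurvivorshipEstimator(double alpha, double initialEstimate) {
        this.alpha = alpha;
        this.estimate = initialEstimate;
    }

    /** Fold in the ratio observed during the cleaning that just finished. */
    public void update(double observedRatio) {
        estimate = alpha * observedRatio + (1.0 - alpha) * estimate;
    }

    /** The smoothed ratio used to size the next cleaning. */
    public double estimate() {
        return estimate;
    }
}
```

With alpha = 0.25, for example, about two thirds of the estimate comes from the new regime after four cleanings, which is exactly the smoothing/adaptation trade-off described above.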

Generational Collection

The current code does a full recopy of the log with each cleaning. I believe this is an optimal strategy if the probability of update is uniform for all log entries--that is, if garbage is equally likely to be at the end of the log as anywhere else. However, the update probability is almost always non-uniform (this is why we have caches and memory hierarchies).

If this sounds abstract, consider a practical example. Let's say you have a log containing user account records. Some active users will update their account daily; some inactive users have forgotten their password, changed their email, and can never update their account again. What will happen to the log? Well, every time we clean the log, records that have been updated are removed from the cleaned segments. As a result the log will become sorted by the time since the last update (things at the beginning of the log will be the things updated least recently, and things at the end will be the things just updated). So now we can see the inefficiency in our cleaning--the oldest segments at the beginning of the log will be full of the user accounts for users who will never update their account again, yet we persist in optimistically recopying these segments looking for garbage.

In other words, the survivorship ratio we used in the section above was really an average over the whole log; in reality it would vary significantly within the log, with older segments (usually) having much higher survivorship.

This is exactly the intuition behind generational collection in both log-structured storage (e.g. leveldb and cassandra) and in programming language garbage collection. You can think of our dirty and clean segments as being two generations, but you could add additional generations beyond that. Typically for these systems the following would be true:

  1. Each successive generation increases in size (e.g. maybe a generation is 2x the size in bytes of its predecessor)
  2. All but the newest generation are approximately deduplicated internally, but generations may duplicate each other.
  3. Collection proceeds by collecting a generation into the generation that succeeds it

You can view the generations as an attempt to approximate the histogram of the update frequency by recency.
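
To make the three properties above concrete, here is a hedged sketch of a generational layout: generation sizes grow geometrically, and collecting a full generation merges (and deduplicates) it into its successor. All class and method names are illustrative and the dedupe step is a placeholder; real systems such as leveldb and cassandra differ in the details.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: generations double in size and are collected into their successor. */
public class GenerationalLog {
    static class Generation {
        final long capacityBytes;   // property 1: each generation is ~2x its predecessor
        long usedBytes = 0;
        Generation(long capacityBytes) { this.capacityBytes = capacityBytes; }
    }

    private final List<Generation> generations = new ArrayList<>();

    public GenerationalLog(long newestGenBytes, int numGenerations) {
        long size = newestGenBytes;
        for (int i = 0; i < numGenerations; i++) {
            generations.add(new Generation(size));
            size *= 2;  // property 1: geometric growth in size
        }
    }

    /** Property 3: when generation i fills up, collect it into generation i+1. */
    public void maybeCollect(int i) {
        Generation g = generations.get(i);
        if (g.usedBytes < g.capacityBytes)
            return;
        if (i + 1 < generations.size()) {
            Generation next = generations.get(i + 1);
            // Property 2: deduplicate while copying into the next generation, then cascade.
            next.usedBytes += dedupeAndCopy(g, next);
            g.usedBytes = 0;
            maybeCollect(i + 1);
        } else {
            // Oldest generation has no successor: compact it against itself.
            g.usedBytes = dedupeAndCopy(g, g);
        }
    }

    /** Placeholder for the actual merge/dedupe of records; returns the surviving byte count. */
    private long dedupeAndCopy(Generation from, Generation into) {
        return from.usedBytes; // a real implementation would drop superseded records here
    }
}
```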

A few open questions I have considered but not fully answered:

The first obvious question is: if you do have generations, how many should you have and what size should they be? Existing systems seem to hard code this or have you directly configure it. However it would be better to do this in a data-driven way--that is, to measure the survivorship ratio and choose generations appropriately. Intuitively, if the distribution is uniform with respect to recency then you should just have one big generation, but otherwise you should choose generation boundaries that optimally approximate the observed distribution.
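
One way that data-driven choice might look, assuming the cleaner keeps a survivorship-by-age histogram (a structure no current system exposes, so everything here is hypothetical): place a generation boundary wherever survivorship jumps sharply with age, and fall back to a single generation when the histogram is flat.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: choose generation boundaries from a measured survivorship-by-age histogram. */
public class GenerationPlanner {

    /**
     * survivorshipByAgeBucket[i] is the observed fraction of records in age bucket i
     * (0 = newest) that survived recent cleanings. Returns the bucket indices at which
     * to start a new, older generation.
     */
    public static List<Integer> chooseBoundaries(double[] survivorshipByAgeBucket, double jumpThreshold) {
        List<Integer> boundaries = new ArrayList<>();
        for (int i = 1; i < survivorshipByAgeBucket.length; i++) {
            double jump = survivorshipByAgeBucket[i] - survivorshipByAgeBucket[i - 1];
            if (jump > jumpThreshold) {
                // Survivorship rises sharply here: worth a separate, rarely-cleaned generation.
                boundaries.add(i);
            }
        }
        // A (near) uniform distribution adds no boundaries, leaving one big generation.
        return boundaries;
    }
}
```

For example, a histogram of [0.2, 0.25, 0.8, 0.95] with a threshold of 0.3 yields a single boundary at bucket 2, splitting the log into a young, garbage-rich generation and an old, mostly-static one, whereas a flat histogram yields no boundaries at all.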

Second, though it is true that most existing systems use the concept of generations, I wonder if there is another way that produces a more continuous estimate? I have not fully considered this, but I wonder if it would be possible to somehow use a memory-mapped offset map that would be saved between runs and dynamically choose how far back in the log to collect.