
...

Divide the log into two parts: a head, which consists of segments that have not yet been processed, and a tail, which consists of segments that have previously been deduplicated.

We will use a Map<byte[], Long> S containing the keys in the head of the log and the offset of their latest update. First I will describe the use of the mapping S, then a way to implement it efficiently. We begin by scanning the head of the log from oldest to newest, populating S with the keys in the log; when this scan is complete, S contains the latest offset of each key in the head of the log.

Now we recopy all the log segments and their indexes. For each entry we first check S to see if it contains the key; if it does, we do not recopy the message unless its offset is >= the offset stored in S.

We can make the implementation of S efficient by exploiting the fact that we need not remove all duplicates (just most of them). To do this we implement S as two parallel arrays, byte[size*16] and long[size]. The byte array stores each md5 by using the first 4 bytes of the md5 as an integer position p and storing the md5 beginning at position p*16; the corresponding offset is stored at position p in the long array. In the event of a hash collision a key will be missed in deduplication; this is acceptable, and by changing the seed for the hashing on each compaction we ensure that the key will likely be collected on a subsequent compaction.
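As an illustration, here is a minimal sketch of such a structure in Java. The class name OffsetMap, its methods, and the use of a salt in place of a hash seed are assumptions made for the example, not the actual implementation; the sketch also reduces the 4-byte position modulo the number of slots so that it is self-contained.

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    // Sketch of the map S: parallel arrays holding one md5 per slot and the
    // latest offset for the key hashed into that slot. On a collision in the
    // 4-byte prefix a key is simply lost from the map, which only means some
    // duplicates of that key survive until a later compaction.
    public class OffsetMap {
        private final int slots;          // number of entries the map can hold
        private final byte[] hashes;      // slots * 16 bytes, one md5 per slot
        private final long[] offsets;     // latest offset for the key in each slot
        private final byte[] salt;        // stands in for the per-compaction hash seed
        private final MessageDigest md5;

        public OffsetMap(int slots, byte[] salt) throws NoSuchAlgorithmException {
            this.slots = slots;
            this.hashes = new byte[slots * 16];
            this.offsets = new long[slots];
            Arrays.fill(this.offsets, -1L);
            this.salt = salt;
            this.md5 = MessageDigest.getInstance("MD5");
        }

        // Record the latest offset seen for this key; later puts overwrite earlier ones.
        public void put(byte[] key, long offset) {
            byte[] hash = hash(key);
            int slot = slotFor(hash);
            System.arraycopy(hash, 0, hashes, slot * 16, 16);
            offsets[slot] = offset;
        }

        // Return the stored offset for this key, or -1 if it is absent
        // (or was displaced by a colliding key).
        public long get(byte[] key) {
            byte[] hash = hash(key);
            int slot = slotFor(hash);
            for (int i = 0; i < 16; i++)
                if (hashes[slot * 16 + i] != hash[i])
                    return -1L;
            return offsets[slot];
        }

        private byte[] hash(byte[] key) {
            md5.reset();
            md5.update(salt);             // vary the hashing between compactions
            return md5.digest(key);
        }

        private int slotFor(byte[] hash) {
            // first 4 bytes of the md5 as an integer position, reduced modulo slots
            int p = ByteBuffer.wrap(hash, 0, 4).getInt();
            return Math.abs(p % slots);
        }
    }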

The compacter would statically allocate these large arrays at some configurable size and loop through the partitions, compacting a given partition whenever the head of its log reaches a certain volume of uncompacted data.
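The two-pass procedure described above might then look roughly like the following. LogEntry, LogAppender, buildOffsetMap, and recopy are placeholder names for this sketch, not existing classes.

    import java.util.List;

    // Sketch of one compaction pass: scan the head to build the offset map,
    // then recopy the log dropping entries that a later update supersedes.
    public class LogCleanerSketch {

        // Pass 1: scan the head from oldest to newest, recording the latest
        // offset for each key (later entries overwrite earlier ones).
        static void buildOffsetMap(List<LogEntry> head, OffsetMap map) {
            for (LogEntry entry : head)
                map.put(entry.key(), entry.offset());
        }

        // Pass 2: recopy the segments, keeping an entry only if its key is
        // absent from the map or its offset is >= the latest offset recorded.
        static void recopy(List<LogEntry> log, OffsetMap map, LogAppender cleaned) {
            for (LogEntry entry : log) {
                long latest = map.get(entry.key());
                if (latest < 0 || entry.offset() >= latest)
                    cleaned.append(entry);
            }
        }

        // Minimal placeholder types so the sketch stands alone.
        interface LogEntry { byte[] key(); long offset(); }
        interface LogAppender { void append(LogEntry entry); }
    }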

...

My primary concerns would be (1) not introducing any new bugs, and (2) getting acceptable performance out of the index implementation.

Implementation Phases

  1. Physical changes to message and log to retain key and offset. This should ideally go out with 0.8 since it will be a compatibility change.
  2. Change to logical offsets.
  3. Implement simple "brute force" log compaction for keyed messages.
  4. Nice-to-have
    1. Implement a generational compaction scheme to lessen the I/O in compaction. This is a nice-to-have and could be done much later.
    2. Key-based auditing

Questions and Comments

  1. What are the implications for replication? I am not 100% sure there are no other cases, besides the log search, where we conflate the use of offset as a byte offset and as a message identifier.
  2. Is it odd to have a Message.key() field? If the name of our message class were Record, it would seem natural for a Record to have a primary key, but perhaps less so for a Message. The other alternative would be to make the key a field in the log rather than in the message. I am not sure which is better.
  3. Changing to sequential offsets would imply we should change the terminology from "offset" to a more typical "message id" or "log change number" or something like that, but changing all references to offset is sort of overwhelming.
  4. I am not completely clear on the interaction with our compression implementation.