
...

Divide the log into two portions: a head, consisting of segments that have not yet been processed, and a tail, consisting of segments that have previously been deduplicated.

We will use a Map<byte[], Long> S mapping each key in the head of the log to the offset of its latest update. First I will describe how S is used, then a way to implement it efficiently. We scan the head of the log from oldest to newest, populating S with the keys we encounter; when this scan is complete, S contains the latest offset of each key in the head of the log.
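
A minimal sketch of this first pass, under assumed names (LogEntry and the iteration over head entries are illustrative, not Kafka APIs). Keys are wrapped in ByteBuffer because a raw byte[] has identity equality and cannot serve directly as a map key:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    public class HeadScan {
        // Illustrative entry type: a key plus the offset at which it was written.
        record LogEntry(byte[] key, long offset) {}

        static Map<ByteBuffer, Long> buildOffsetMap(Iterable<LogEntry> headOldestToNewest) {
            Map<ByteBuffer, Long> s = new HashMap<>();
            for (LogEntry entry : headOldestToNewest) {
                // Later writes overwrite earlier ones, so after the scan S holds the
                // latest offset of every key that appears in the head of the log.
                s.put(ByteBuffer.wrap(entry.key()), entry.offset());
            }
            return s;
        }
    }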

Now we recopy all the log segments and their indexes. For each entry we first check S to see if it contains the key; if it does, we recopy the message only if its offset is >= the offset stored in S.
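
A sketch of the recopy pass, reusing the LogEntry type and the map S from the scan sketched above (all names remain illustrative). An entry is kept only if its key is absent from S or if it is the latest update for its key; otherwise it is dropped:

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class Recopy {
        static List<HeadScan.LogEntry> recopy(Iterable<HeadScan.LogEntry> allEntries,
                                              Map<ByteBuffer, Long> s) {
            List<HeadScan.LogEntry> retained = new ArrayList<>();
            for (HeadScan.LogEntry entry : allEntries) {
                Long latest = s.get(ByteBuffer.wrap(entry.key()));
                if (latest == null || entry.offset() >= latest) {
                    retained.add(entry);  // keep: no newer update for this key in the head
                }
                // otherwise a newer value for this key exists, so the entry is not recopied
            }
            return retained;
        }
    }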

We can make the implementation of S efficient by exploiting the fact that we need not remove all duplicates (just most of them). To do this we implement S as two parallel arrays, byte[size*16] and long[size]. The byte array stores the MD5 of each key: the first 4 bytes of the MD5 are interpreted as an integer position p, and the full 16-byte MD5 is stored beginning at position p*16, with the corresponding offset stored at position p in the long array. In the event of a hash collision one key will be missed by deduplication; this is acceptable, and by changing the seed for the hashing on each compaction we can ensure that such a key will likely be collected on a subsequent pass.
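
A rough sketch of this probabilistic map, under the assumptions above. The position is taken modulo the number of slots so it stays within the allocated arrays; the class name and method signatures are illustrative, not the actual Kafka implementation:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class OffsetMap {
        private final int size;        // number of slots
        private final byte[] hashes;   // size * 16 bytes of MD5 digests
        private final long[] offsets;  // size offsets, parallel to hashes
        private final int seed;        // changed each compaction so collisions differ across passes

        public OffsetMap(int size, int seed) {
            this.size = size;
            this.hashes = new byte[size * 16];
            this.offsets = new long[size];
            this.seed = seed;
        }

        public void put(byte[] key, long offset) {
            byte[] md5 = md5(key);
            int p = slot(md5);
            // Store the full digest at p*16 and the offset at p; a colliding key
            // simply overwrites the slot, which is why a few duplicates may survive.
            System.arraycopy(md5, 0, hashes, p * 16, 16);
            offsets[p] = offset;
        }

        public Long get(byte[] key) {
            byte[] md5 = md5(key);
            int p = slot(md5);
            // Report a hit only if the full 16-byte digest matches; otherwise the slot
            // is empty or occupied by a different key and we treat the key as absent.
            for (int i = 0; i < 16; i++) {
                if (hashes[p * 16 + i] != md5[i]) return null;
            }
            return offsets[p];
        }

        private int slot(byte[] md5) {
            // Interpret the first 4 bytes of the digest as an integer position p.
            int h = ((md5[0] & 0xFF) << 24) | ((md5[1] & 0xFF) << 16)
                  | ((md5[2] & 0xFF) << 8) | (md5[3] & 0xFF);
            return (h & 0x7FFFFFFF) % size;
        }

        private byte[] md5(byte[] key) {
            try {
                MessageDigest d = MessageDigest.getInstance("MD5");
                // Mixing in a per-compaction seed changes which keys collide from run to run.
                d.update(new byte[] {(byte) seed, (byte) (seed >> 8),
                                     (byte) (seed >> 16), (byte) (seed >> 24)});
                return d.digest(key);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }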

The compacter would statically allocate these large arrays at some configurable size and loop over the partitions, compacting a given partition whenever the head of its log reaches a configurable volume of uncompacted data.
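
A rough sketch of that driver loop, under assumed names (Partition and uncompactedBytes() are illustrative). The two large arrays are allocated once at a configurable size and reused for every compaction pass:

    public class Compacter {
        interface Partition {
            long uncompactedBytes();  // bytes in the uncompacted head of this partition's log
        }

        private final byte[] hashes;       // size * 16 bytes of MD5 digests, reused each pass
        private final long[] offsets;      // size offsets, parallel to hashes
        private final long minDirtyBytes;  // configurable compaction trigger

        public Compacter(int size, long minDirtyBytes) {
            this.hashes = new byte[size * 16];
            this.offsets = new long[size];
            this.minDirtyBytes = minDirtyBytes;
        }

        public void runOnce(Iterable<Partition> partitions) {
            for (Partition p : partitions) {
                if (p.uncompactedBytes() >= minDirtyBytes) {
                    compact(p);  // scan the head into the arrays with a fresh seed, then recopy
                }
            }
        }

        private void compact(Partition p) {
            // Build the offset map from the head of p's log, recopy the segments and their
            // indexes as described above, then swap in the compacted segments. Omitted here.
        }
    }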

...