Compaction is a def~instant-action that takes as input a set of def~file-slices, merges all the def~log-files in each file slice against its def~base-file, and produces a new compacted file slice, written as a def~commit on the def~timeline. Compaction is only applicable to the def~merge-on-read (MOR) table type; which file slices are chosen for compaction is determined by a def~compaction-policy (by default, the file slice with the largest amount of uncompacted log-file data), evaluated after each def~write-operation.
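To make the default policy concrete, here is a minimal sketch of a size-based selection rule. The types FileSlice, LogFile, and SizeBasedCompactionPolicy are illustrative placeholders, not Hudi's actual classes; the sketch only shows the shape of the decision: among candidate file slices, pick the one carrying the most uncompacted log data.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of a file slice: one base file plus
// the log files written against it since the last compaction.
record LogFile(String path, long sizeBytes) {}

record FileSlice(String baseFilePath, List<LogFile> logFiles) {
    long uncompactedLogBytes() {
        return logFiles.stream().mapToLong(LogFile::sizeBytes).sum();
    }
}

class SizeBasedCompactionPolicy {
    // Mirrors the default policy described above: among the candidate
    // file slices, choose the one with the largest total size of
    // uncompacted log files. Slices with no log files need no compaction.
    Optional<FileSlice> choose(List<FileSlice> candidates) {
        return candidates.stream()
                .filter(s -> !s.logFiles().isEmpty())
                .max(Comparator.comparingLong(FileSlice::uncompactedLogBytes));
    }
}
```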

At a high level, there are two styles of compaction:

  • Synchronous compaction : Here compaction is performed by the writer process itself, synchronously after each write, i.e., the next write operation cannot begin until compaction finishes. This is the simplest style operationally, since no separate compaction process needs to be scheduled, but it offers weaker data-freshness guarantees. It is still very useful in cases where, say, the most recent def~table-partitions can be compacted on every write operation while compaction of late-arriving/older partitions is delayed (see the configuration sketch after this list).
  • Asynchronous compaction : In this style, the compaction process runs concurrently and asynchronously with the def~write-operation on the def~table. This has the obvious benefit that compaction does not block the next batch of writes, yielding near-real-time data freshness. Tools like Hudi DeltaStreamer support a convenient continuous mode, where compaction and write operations happen in this fashion within a single Spark application.
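As a sketch of the synchronous style, the snippet below writes a batch to a MOR table through the Spark datasource with inline compaction enabled, so the writer compacts after a configured number of delta commits before returning. The table name, base path, and the record-key/precombine fields (uuid, ts) are placeholder values for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class InlineCompactionExample {
    // Upserts one batch into a merge-on-read table with synchronous
    // (inline) compaction: the write call does not return until any
    // triggered compaction has finished.
    static void upsert(Dataset<Row> batch, String basePath) {
        batch.write()
             .format("hudi")
             .option("hoodie.table.name", "trips_mor")
             .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
             .option("hoodie.datasource.write.recordkey.field", "uuid")
             .option("hoodie.datasource.write.precombine.field", "ts")
             // Synchronous style: compact inline, here after every 5 delta commits.
             .option("hoodie.compact.inline", "true")
             .option("hoodie.compact.inline.max.delta.commits", "5")
             .mode(SaveMode.Append)
             .save(basePath);
    }
}
```

For the asynchronous style, the same write runs with inline compaction disabled while a separate process (or DeltaStreamer in continuous mode) schedules and executes compactions against the table concurrently.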

Design 

todo