A def~table-type where a def~table's def~commits are fully merged into the def~table during a def~write-operation. This can be seen as "imperative ingestion": "compaction" happens right away. No def~log-files are written, and def~file-slices contain only the def~base-file (e.g., a single parquet file constitutes one file slice).

The Spark DAG for this storage type is relatively simple. The key goal is to group the tagged Hudi record RDD into a series of updates and inserts by using a partitioner. To maintain file sizes, we first sample the input to obtain a workload profile that captures the spread of inserts vs. updates, their distribution among the partitions, etc. With this information, we bin-pack the records such that:

  • For updates, the latest version of that file id is rewritten once, with new values for all records that have changed.
  • For inserts, the records are first packed onto the smallest file in each partition path, until it reaches the configured maximum size.

Any records remaining after that are packed into new file id groups, again meeting the size requirements.
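
The bin-packing described above can be illustrated with a minimal, self-contained Python sketch. It is not Hudi's actual partitioner: the names `Record`, `workload_profile`, `plan_buckets`, `file_sizes`, `max_file_bytes` and `avg_record_bytes` are hypothetical, the profile counts every record instead of sampling, and new file ids are simply labeled `new-0`, `new-1`, and so on.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: str
    partition: str
    file_id: Optional[str]  # set by tagging for updates, None for inserts


def workload_profile(records):
    """Count inserts and updates per partition (no sampling in this sketch)."""
    profile = defaultdict(lambda: {"inserts": 0, "updates": 0})
    for r in records:
        kind = "inserts" if r.file_id is None else "updates"
        profile[r.partition][kind] += 1
    return dict(profile)


def plan_buckets(records, file_sizes, max_file_bytes, avg_record_bytes):
    """Return {(partition, file_id): [record keys]}.

    file_sizes maps partition -> {file_id: current size in bytes}.
    """
    buckets = defaultdict(list)
    inserts = defaultdict(list)
    for r in records:
        if r.file_id is not None:
            # Updates go to the file that already holds the record.
            buckets[(r.partition, r.file_id)].append(r.key)
        else:
            inserts[r.partition].append(r.key)

    per_new_file = max(1, max_file_bytes // avg_record_bytes)
    for partition, keys in inserts.items():
        existing = file_sizes.get(partition, {})
        if existing:
            # Top up the smallest existing file until it reaches the maximum size.
            smallest = min(existing, key=existing.get)
            room = max(0, (max_file_bytes - existing[smallest]) // avg_record_bytes)
            if room:
                buckets[(partition, smallest)].extend(keys[:room])
            keys = keys[room:]
        # Whatever is left is split into new, size-bounded file groups.
        for n, start in enumerate(range(0, len(keys), per_new_file)):
            chunk = keys[start:start + per_new_file]
            buckets[(partition, f"new-{n}")].extend(chunk)
    return dict(buckets)
```

Each resulting bucket corresponds to one file being written: an update bucket rewrites an existing file id once, while an insert bucket either tops up a small file or creates a new file id group.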

Kind of

Related concepts

  1. def~merge-on-read (MOR)