...

The implementation specifics of the two storage types are detailed below.

Copy On Write (COW)

The Spark DAG for this storage type is relatively simple. The key goal here is to group the tagged Hudi record RDD into a series of updates and inserts, using a partitioner. To maintain file sizes, we first sample the input to obtain a `workload profile` that captures the spread of inserts vs. updates, their distribution among the partitions, etc. With this information, we bin-pack the records such that:

  • For updates, the latest version of that file id is rewritten once, with new values for all records that have changed.
  • For inserts, the records are first packed onto the smallest file in each partition path, until it reaches the configured maximum size.

Any remaining records after that are packed into new file id groups, again meeting the size requirements. In this storage type, the index update is a no-op, since the bloom filters are already written as part of committing data. In the case of Copy-On-Write, a single parquet file constitutes one `file slice`, which contains one complete version of the file.
{% include image.html file="hudi_log_format_v2.png" alt="hudi_log_format_v2.png" max-width="1000" %}
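The insert bin-packing described above can be sketched as follows. This is a minimal illustration, not Hudi's actual code: the class, method, and file names are hypothetical, and it assumes a fixed average record size and a single partition path for simplicity.

```java
import java.util.*;

// Hypothetical sketch of insert bin-packing: fill the smallest existing
// files first, then spill the remainder into new file groups, keeping
// every file under the configured maximum size.
class InsertBinPacking {

    static class SmallFile {
        final String fileId;
        final long sizeBytes;
        SmallFile(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Assigns `numRecords` insert records (of roughly `avgRecordSize` bytes
     * each) first to existing small files, smallest first, then to new file
     * id groups. Returns a map of fileId -> number of records assigned.
     */
    static Map<String, Long> packInserts(List<SmallFile> smallFiles,
                                         long numRecords,
                                         long avgRecordSize,
                                         long maxFileSize) {
        Map<String, Long> assignment = new LinkedHashMap<>();
        // Pack the smallest files first, as the text describes.
        smallFiles.sort(Comparator.comparingLong(f -> f.sizeBytes));
        long remaining = numRecords;
        for (SmallFile f : smallFiles) {
            if (remaining == 0) break;
            long capacity = Math.max(0, (maxFileSize - f.sizeBytes) / avgRecordSize);
            long assigned = Math.min(capacity, remaining);
            if (assigned > 0) {
                assignment.put(f.fileId, assigned);
                remaining -= assigned;
            }
        }
        // Any remaining records go into new file id groups, each sized
        // up to the configured limit.
        long perNewFile = maxFileSize / avgRecordSize;
        int newFileIdx = 0;
        while (remaining > 0) {
            long assigned = Math.min(perNewFile, remaining);
            assignment.put("new-file-" + newFileIdx++, assigned);
            remaining -= assigned;
        }
        return assignment;
    }
}
```

In the real implementation the workload profile is obtained by sampling the input RDD, and the resulting bucket assignments drive the custom partitioner that routes records to update and insert buckets.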

Merge On Read (MOR)


...