...
Excerpt |
---|
A def~table-type where a def~table's def~commits are fully merged into def~table |
...
during a def~write-operation. This can be seen as "imperative ingestion", "compaction" of |
...
#todo: verify
Design details
Excerpt |
---|
the happens right away |
...
. No def~log-files are written and def~file-slices contain only def~base-file. |
The Spark DAG for this storage is relatively simple. The key goal is to group the tagged Hudi record RDD into a series of updates and inserts by using a partitioner. To maintain file sizes, we first sample the input to obtain a `workload profile` that captures the spread of inserts vs. updates, their distribution among the partitions, etc. With this information, we bin-pack the records such that
Any remaining records after that are again packed into new file id groups, again meeting the size requirements. In this storage, updating the index is a no-op, since the bloom filters are already written as part of committing data. In the case of Copy-On-Write, a single parquet file constitutes one `file slice`, which contains one complete version of the file. {% include image.html file="hudi_log_format_v2.png" alt="hudi_log_format_v2.png" max-width="1000" %} |
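The grouping step described above can be illustrated with a minimal sketch in plain Python; it is not Hudi's partitioner. Everything in it is an assumption made for illustration: the names `build_workload_profile` and `pack_records`, the `(partition, key, file_id)` record shape, sizes measured in record counts, and the `MAX_FILE_RECORDS` limit are all hypothetical. In particular, the sketch assumes that updates go back to the file id they were tagged with and that inserts first top up existing files that are below the size limit. It also counts every record instead of sampling the input.

```python
from collections import defaultdict
from itertools import count

MAX_FILE_RECORDS = 4  # hypothetical size limit, expressed in records


def build_workload_profile(tagged_records):
    """Count inserts vs. updates per partition (no sampling in this sketch)."""
    profile = defaultdict(lambda: {"inserts": 0, "updates": 0})
    for partition, _key, file_id in tagged_records:
        # A record tagged with a file id already exists, so it is an update.
        kind = "updates" if file_id is not None else "inserts"
        profile[partition][kind] += 1
    return dict(profile)


def pack_records(tagged_records, existing_file_sizes, max_records=MAX_FILE_RECORDS):
    """Assign each record key to a (partition, file_id) bucket.

    existing_file_sizes maps (partition, file_id) to its current record count.
    """
    buckets = {}
    inserts_by_partition = defaultdict(list)
    for partition, key, file_id in tagged_records:
        if file_id is not None:
            # Assumption: an update is routed to the file it was tagged with.
            buckets.setdefault((partition, file_id), []).append(key)
        else:
            inserts_by_partition[partition].append(key)

    new_ids = count(1)
    for partition, pending in inserts_by_partition.items():
        # Assumption: fill the smallest under-sized files first.
        small_files = sorted(
            (size, fid)
            for (p, fid), size in existing_file_sizes.items()
            if p == partition and size < max_records
        )
        for size, fid in small_files:
            room = max_records - size
            take, pending = pending[:room], pending[room:]
            if take:
                buckets.setdefault((partition, fid), []).extend(take)
        # Remaining inserts go into new file groups that respect the limit.
        while pending:
            take, pending = pending[:max_records], pending[max_records:]
            buckets[(partition, f"new-{next(new_ids)}")] = take
    return buckets


# Example input: one update to file "f1" and seven inserts in one partition.
records = [("2024/01", "k1", "f1")] + [
    ("2024/01", f"k{i}", None) for i in range(2, 9)
]
existing = {("2024/01", "f1"): 3, ("2024/01", "f2"): 4}
profile = build_workload_profile(records)
buckets = pack_records(records, existing)
```

With this input, one insert tops up `f1`, the full file `f2` is left alone, and the remaining six inserts are split across two new file groups. A real implementation would measure sizes in bytes and estimate record size from earlier commits, which this sketch does not attempt.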
Kind of
Related concepts
...