Definition

A storage type where a dataset's commits are merged into the dataset when it is read, viewed, or queried.

...

#todo improve to summarize semantics relative to the commit lifecycle, before and after

Design details

In the Merge-On-Read storage model, there are 2 logical components:

a `Writer` for ingesting data (both inserts and updates) into the dataset
a `Compactor` for creating compacted views

At a high level, the Merge-On-Read writer goes through the same stages as the Copy-On-Write writer when ingesting data.

The key difference here is that updates are appended to the latest log (delta) file belonging to the latest file slice, without merging. For inserts, Hudi supports 2 modes:

Inserts to log files - This is done for datasets that have indexable log files (e.g. a global index)
Inserts to parquet files - This is done for datasets that do not have indexable log files, e.g. a bloom index embedded in parquet files. Hudi treats writing new records in the same way as inserting to Copy-On-Write files.
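The write path described above can be sketched with a toy in-memory model. This is not Hudi code: `FileSlice`, `write_batch`, and the `log_files_indexable` flag are hypothetical names introduced for illustration, and writing inserts straight into the base records is a simplification of the Copy-On-Write insert path.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    """Hypothetical stand-in for one file slice (not a Hudi class)."""
    base_records: dict = field(default_factory=dict)  # models the base parquet file
    log_blocks: list = field(default_factory=list)    # models appended log (delta) blocks


def write_batch(file_slice, updates, inserts, log_files_indexable):
    """Apply one batch of writes to the toy file slice."""
    # Updates are appended as a log block; nothing is merged at write time.
    if updates:
        file_slice.log_blocks.append(dict(updates))
    if inserts:
        if log_files_indexable:
            # Mode 1: new records also go to the log.
            file_slice.log_blocks.append(dict(inserts))
        else:
            # Mode 2: new records go to the base file (simplified here).
            file_slice.base_records.update(inserts)
```

With `log_files_indexable=False`, an update to an existing key leaves the base record untouched and only adds a log block, which is the behaviour that distinguishes this writer from Copy-On-Write.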

As in the case of Copy-On-Write, the input tagged records are partitioned such that all upserts destined to a `file id` are grouped together. This upsert batch is written as one or more log blocks to log files. Hudi allows clients to control log file sizes (see [Storage Configs](../configurations)).
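A minimal sketch of that grouping step, assuming tagged records arrive as `(file_id, record)` pairs. The function names and the `max_records_per_block` parameter are hypothetical; Hudi's own limits are configured by size, so the record count here is only a stand-in.

```python
from collections import defaultdict


def group_by_file_id(tagged_records):
    """Group (file_id, record) pairs so each file id gets one upsert batch."""
    batches = defaultdict(list)
    for file_id, record in tagged_records:
        batches[file_id].append(record)
    return dict(batches)


def to_log_blocks(batch, max_records_per_block):
    """Split one upsert batch into one or more log blocks, preserving order."""
    return [
        batch[i:i + max_records_per_block]
        for i in range(0, len(batch), max_records_per_block)
    ]
```

For example, three records tagged for the same file id with `max_records_per_block=2` would yield two log blocks for that file id.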

The WriteClient API is the same for both Copy-On-Write and Merge-On-Read writers. With Merge-On-Read, several rounds of data writes result in the accumulation of one or more log files. All these log files, along with the base parquet file (if it exists), constitute a `file slice`, which represents one complete version of the file.
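To illustrate how a file slice yields one complete version of the file, here is a toy reader that replays log blocks over the base records. It assumes records are keyed and that a later log block overrides an earlier value for the same key; `read_file_slice` is a hypothetical helper, not part of Hudi.

```python
def read_file_slice(base_records, log_blocks):
    """Return the merged view of a file slice without modifying its inputs."""
    # Start from the base file contents, if a base file exists.
    merged = dict(base_records or {})
    # Replay log blocks in write order; later writes win for the same key.
    for block in log_blocks:
        merged.update(block)
    return merged
```

Under the same assumptions, a compaction step could persist this merged result as the next base file, so that subsequent reads have fewer log blocks to replay.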

Kind of

storage type

Related concepts

Copy-On-Write
