
Definition

A storage model / table type in which a file's base data and its delta (log) commits are merged at read time to produce the latest snapshot.

Design details

In the Merge-On-Read storage model, there are two logical components - one for ingesting data (both inserts and updates) into the dataset and another for creating compacted views. The former is hereafter referred to as the `Writer` and the latter as the `Compactor`.

At a high level, the Merge-On-Read Writer goes through the same stages as the Copy On Write (COW) writer when ingesting data.

The key difference is that updates are appended to the latest log (delta) file belonging to the latest file slice, without merging. For inserts, Hudi supports two modes:

1. Inserts to log files - used for datasets with indexable log files (e.g. a global index that can locate records inside log files)
2. Inserts to parquet files - used for datasets without indexable log files, e.g. a bloom index embedded in parquet files. Here Hudi writes new records the same way as an insert to Copy-On-Write files.
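The routing decision between the two insert modes can be sketched as follows (a minimal illustration; the class and method names here are hypothetical, not Hudi's actual API):

```java
// Illustrative sketch of MOR insert routing (hypothetical names, not Hudi code).
public class InsertRouting {

    // If the configured index can look up records inside log files (e.g. a
    // global, externally-backed index), new inserts may be appended to the log.
    // Otherwise (e.g. a bloom index embedded in parquet footers), inserts must
    // go to a base parquet file, exactly as in the Copy-On-Write path.
    static String routeInsert(boolean indexCanLookUpLogFiles) {
        return indexCanLookUpLogFiles ? "LOG_FILE" : "PARQUET_BASE_FILE";
    }

    public static void main(String[] args) {
        System.out.println(routeInsert(true));   // indexable log files -> log file
        System.out.println(routeInsert(false));  // bloom index -> parquet base file
    }
}
```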

As in the Copy On Write (COW) case, the input tagged records are partitioned so that all upserts destined for a `file id` are grouped together. This upsert batch is written as one or more log blocks to log files. Hudi allows clients to control log file sizes (see [Storage Configs](../configurations)).
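For example, log file sizing is governed by storage configs along these lines (key names may vary between Hudi versions; consult the Storage Configs page linked above for the authoritative list):

```properties
# Maximum size of a single log file before rolling over to a new one
hoodie.logfile.max.size=1073741824
# Maximum size of a single data block within a log file
hoodie.logfile.data.block.max.size=268435456
```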

The WriteClient API is the same for both Copy On Write (COW) and Merge On Read (MOR) writers. With Merge On Read (MOR), several rounds of data writes will have accumulated one or more log files. All these log files, along with the base parquet file (if it exists), constitute a `file slice`, which represents one complete version of the file.
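The merge-on-read semantics of a file slice can be sketched with a small self-contained model (illustrative only, not Hudi's implementation): a snapshot read applies the log records over the base file in commit order, with the newest write winning per record key.

```java
import java.util.*;

// Minimal model of a MOR file slice: one base file plus an ordered list of
// log blocks. Reading merges the logs over the base, later writes winning.
public class FileSliceModel {

    static Map<String, String> mergeOnRead(Map<String, String> baseFile,
                                           List<Map<String, String>> logBlocks) {
        Map<String, String> merged = new HashMap<>(baseFile);
        for (Map<String, String> block : logBlocks) {
            merged.putAll(block); // later log blocks overwrite earlier values
        }
        return merged;
    }

    public static void main(String[] args) {
        // Base parquet file contents (record key -> value)
        Map<String, String> base = Map.of("k1", "v1", "k2", "v2");
        // Two delta commits: an update to k2, then an insert of k3
        List<Map<String, String>> logs = List.of(
            Map.of("k2", "v2b"),
            Map.of("k3", "v3"));
        // Snapshot read sees k1=v1, k2=v2b, k3=v3
        System.out.println(mergeOnRead(base, logs));
    }
}
```

Compaction (the `Compactor`'s job) simply materializes this merged view back into a new base parquet file, so subsequent reads need not replay the logs.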

Kind of

  • storage model
