Definition

A def~table-type where a def~table's def~commits are merged into the def~table when it is read / viewed / queried.

This can be seen as "delayed ingestion": "compaction" is deferred and happens later, on demand.

(#todo improve to summarize semantics relative to def~commits lifecycle, before and after)

Design details


In the Merge On Read (MOR) storage model, there are 2 logical components:

  1. a `Writer` for ingesting data (both inserts/updates) into the dataset 
  2. a `Compactor` for creating compacted views

In this def~table-type, records written to the def~table are first written quickly to def~log-files, which are later merged with the def~base-file using a def~compaction action on the timeline. Various def~query-types can be supported, depending on whether the query reads the merged snapshot, the change stream in the logs, or the un-merged base-file alone.
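The three read paths described above can be illustrated with a toy in-memory model. This is a minimal sketch, not Hudi's reader: the `FileSlice` class, its fields, and the three function names are hypothetical and chosen only for illustration, and records are plain key/value pairs.

```python
from dataclasses import dataclass, field

@dataclass
class FileSlice:
    """Toy stand-in for a base file plus the log entries written after it."""
    base: dict = field(default_factory=dict)  # record key -> value (compacted data)
    logs: list = field(default_factory=list)  # (commit_time, key, value), in append order

def read_optimized(fs):
    # Reads the un-merged base file alone: cheap, but may be stale.
    return dict(fs.base)

def snapshot(fs):
    # Merges the log entries over the base file at read time: freshest view.
    merged = dict(fs.base)
    for _commit, key, value in fs.logs:
        merged[key] = value
    return merged

def incremental(fs, since_commit):
    # Returns only the changes logged after the given commit time.
    return [(c, k, v) for c, k, v in fs.logs if c > since_commit]

fs = FileSlice(base={"a": 1, "b": 2})
fs.logs.append(("001", "b", 20))  # an update
fs.logs.append(("002", "c", 3))   # an insert

print(read_optimized(fs))      # {'a': 1, 'b': 2}
print(snapshot(fs))            # {'a': 1, 'b': 20, 'c': 3}
print(incremental(fs, "001"))  # [('002', 'c', 3)]
```

The sketch shows the trade-off in miniature: the base-only read skips the merge work but misses recent writes, while the snapshot read pays the merge cost on every query.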

At a high level, the Merge On Read (MOR) writer goes through the same stages as the def~copy-on-write (COW) writer when ingesting data. The key difference is that updates are appended to the latest log (delta) file belonging to the latest file slice, without merging. For inserts, Hudi supports 2 modes:

  1. Inserts to log files - This is done for def~tables that have indexable log files (e.g. a global index such as def~hbase-index).
  2. Inserts to parquet files - This is done for def~tables that do not have indexable log files, e.g. a def~bloom-index embedded in parquet files. Hudi treats writing new records in the same way as inserting into def~copy-on-write (COW) files.
As with def~copy-on-write (COW), the input tagged records are partitioned such that all upserts destined for a def~file-id are grouped together. This upsert batch is written as one or more log blocks to def~log-files. Hudi allows clients to control log file sizes (see [Storage Configs](../configurations)).
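The grouping step can be sketched as follows. This is an illustrative model only: `group_by_file_id`, `to_log_blocks`, and the `max_records_per_block` cap are hypothetical names, and a record-count cap is used here as a simple stand-in for the size-based limits that Hudi's storage configs control.

```python
from collections import defaultdict

def group_by_file_id(tagged_records):
    """Group (file_id, record) pairs so that each file id gets one upsert batch."""
    batches = defaultdict(list)
    for file_id, record in tagged_records:
        batches[file_id].append(record)
    return dict(batches)

def to_log_blocks(batch, max_records_per_block):
    """Split one upsert batch into one or more log blocks (assumed count-based cap)."""
    return [batch[i:i + max_records_per_block]
            for i in range(0, len(batch), max_records_per_block)]

# Records already tagged with the file id they belong to.
tagged = [("f1", {"key": "a", "v": 1}),
          ("f2", {"key": "x", "v": 7}),
          ("f1", {"key": "b", "v": 2}),
          ("f1", {"key": "a", "v": 3})]

batches = group_by_file_id(tagged)
blocks = {fid: to_log_blocks(b, max_records_per_block=2) for fid, b in batches.items()}
print({fid: len(bl) for fid, bl in blocks.items()})  # {'f1': 2, 'f2': 1}
```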
The WriteClient API is the same for both def~copy-on-write (COW) and Merge On Read (MOR) writers. With Merge On Read (MOR), several rounds of data writes result in the accumulation of one or more log files. All these log files, along with the base parquet file (if it exists), constitute a def~file-slice, which represents one complete version of the file.
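To make the file slice idea concrete, here is a small sketch of how log files accumulate over several write rounds and how a compaction could fold them into a new base file. The dictionary layout and the `compact` helper are assumptions made for illustration and do not reflect Hudi's on-disk format or API.

```python
def compact(base, log_files):
    """Merge all log files over the base file and return the new base file."""
    merged = dict(base or {})    # the base file may be absent (None)
    for log_file in log_files:   # oldest first, so later writes win
        for key, value in log_file:
            merged[key] = value
    return merged

# One file slice: an optional base file plus the log files accumulated since.
file_slice = {"base": {"a": 1}, "logs": []}

# Each round of writes appends a log file to the latest file slice.
file_slice["logs"].append([("a", 2), ("b", 5)])
file_slice["logs"].append([("b", 6)])

# Compaction yields a new file slice whose base file holds the merged data.
new_slice = {"base": compact(file_slice["base"], file_slice["logs"]), "logs": []}
print(new_slice["base"])  # {'a': 2, 'b': 6}
```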

This table type is the most versatile and offers considerable flexibility for writing (the ability to specify different compaction policies, absorb bursty write traffic, etc.) and querying (e.g. trading off data freshness against query performance). At the same time, it can involve a learning curve to master operationally.

Kind of

...

Related concepts

  1. def~copy-on-write (COW)

...