...

The implementation specifics of the two storage types are detailed below.

Copy On Write (COW)

The Spark DAG for this storage is relatively simple. The key goal here is to group the tagged Hudi record RDD into a series of updates and inserts, using a partitioner. To maintain file sizes, we first sample the input to obtain a `workload profile` that captures the spread of inserts vs updates and their distribution among the partitions. With this information, we bin-pack the records such that

...

Any remaining records after that are packed into new file id groups, again meeting the size requirements. For this storage, updating the index is a no-op, since the bloom filters are already written as part of committing data. In the case of Copy-On-Write, a single parquet file constitutes one `file slice`, which contains one complete version of the file.
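
As a rough illustration of the bin-packing described above, here is a minimal, self-contained Java sketch. The names (`CowBinPackingSketch`, `FileGroupBin`, `assignInserts`) are hypothetical and do not correspond to Hudi's actual partitioner; the size math assumes a simple average record size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative sketch only: these classes/methods are hypothetical stand-ins,
// not Hudi's real upsert partitioner.
public class CowBinPackingSketch {

  static class FileGroupBin {
    final String fileId;       // target file id (existing small file or a new one)
    long assignedRecords;      // number of insert records routed to this file id

    FileGroupBin(String fileId) { this.fileId = fileId; }
  }

  /**
   * Assign totalInserts records to bins: first top up existing small files
   * until they reach maxFileSizeBytes, then open new file ids for the rest.
   */
  static List<FileGroupBin> assignInserts(List<String> smallFileIds,
                                          List<Long> smallFileSizes,
                                          long totalInserts,
                                          long avgRecordSizeBytes,
                                          long maxFileSizeBytes) {
    List<FileGroupBin> bins = new ArrayList<>();
    long remaining = totalInserts;

    // 1. Top up existing small files (keeps file count low and sizes healthy).
    for (int i = 0; i < smallFileIds.size() && remaining > 0; i++) {
      long headroom = Math.max(0, maxFileSizeBytes - smallFileSizes.get(i));
      long capacity = headroom / avgRecordSizeBytes;
      if (capacity == 0) {
        continue;
      }
      FileGroupBin bin = new FileGroupBin(smallFileIds.get(i));
      bin.assignedRecords = Math.min(capacity, remaining);
      remaining -= bin.assignedRecords;
      bins.add(bin);
    }

    // 2. Remaining records are packed into brand-new file id groups, each sized
    //    close to (but not above) the configured max file size.
    long recordsPerNewFile = Math.max(1, maxFileSizeBytes / avgRecordSizeBytes);
    while (remaining > 0) {
      FileGroupBin bin = new FileGroupBin(UUID.randomUUID().toString());
      bin.assignedRecords = Math.min(recordsPerNewFile, remaining);
      remaining -= bin.assignedRecords;
      bins.add(bin);
    }
    return bins;
  }
}
```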

{% include image.html file="hudi_log_format_v2.png" alt="hudi_log_format_v2.png" max-width="1000" %}

...

Merge On Read (MOR)

In the Merge-On-Read storage model, there are 2 logical components - one for ingesting data (both inserts/updates) into the dataset and another for creating compacted views. The former is hereby referred to as the `Writer` while the latter is referred to as the `Compactor`.

At a high level, the Merge-On-Read Writer goes through the same stages as the Copy-On-Write writer in ingesting data. The key difference here is that updates are appended to the latest log (delta) file belonging to the latest file slice, without merging. For inserts, Hudi supports 2 modes:

1. Inserts to log files - This is done for datasets that have indexable log files (for e.g. a global index)
2. Inserts to parquet files - This is done for datasets that do not have indexable log files, for e.g. a bloom index embedded in parquet files. Hudi treats writing new records in the same way as inserting to Copy-On-Write files.

As in the case of Copy-On-Write, the input tagged records are partitioned such that all upserts destined to a `file id` are grouped together. This upsert batch is written as one or more log blocks to log files. Hudi allows clients to control log file sizes (see [Storage Configs](../configurations)).

The WriteClient API is the same for both Copy-On-Write and Merge-On-Read writers. With Merge-On-Read, several rounds of data writes will have resulted in the accumulation of one or more log files. All these log files, along with the base parquet file (if it exists), constitute a `file slice` which represents one complete version of the file.
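
The routing logic described above can be sketched roughly as follows. `TaggedRecord`, `WriteBucket` and `route` are hypothetical stand-ins for Hudi's internal writer structures, intended only to show how updates always target log files while inserts fall back to parquet files when the index cannot look up log files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the types below are hypothetical, not Hudi's classes.
public class MorWriteRoutingSketch {

  static class TaggedRecord {
    final String key;
    final String fileId;   // null if the index found no existing file group (an insert)
    TaggedRecord(String key, String fileId) { this.key = key; this.fileId = fileId; }
  }

  /** Bucket of records destined for one file id; written either as a log block
   *  (delta) or as a new base parquet file, mirroring the two insert modes above. */
  static class WriteBucket {
    final String fileId;
    final boolean asLogBlock;
    final List<TaggedRecord> records = new ArrayList<>();
    WriteBucket(String fileId, boolean asLogBlock) { this.fileId = fileId; this.asLogBlock = asLogBlock; }
  }

  /**
   * Route tagged records: updates append to the latest log file of the file id
   * they were tagged with; inserts go to log files only when the index can later
   * find them there, otherwise they are written to new parquet files just like
   * Copy-On-Write inserts.
   */
  static List<WriteBucket> route(List<TaggedRecord> records, boolean indexCanLookupLogFiles) {
    Map<String, WriteBucket> buckets = new HashMap<>();
    for (TaggedRecord r : records) {
      boolean isUpdate = r.fileId != null;
      // New file ids would really be assigned per the size requirements; simplified here.
      String fileId = isUpdate ? r.fileId : "new-file-id";
      boolean toLog = isUpdate || indexCanLookupLogFiles;
      buckets.computeIfAbsent(fileId + ":" + toLog, k -> new WriteBucket(fileId, toLog))
             .records.add(r);
    }
    return new ArrayList<>(buckets.values());
  }
}
```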

Compactor

Realtime readers perform an in-situ merge of these delta log files to provide the most recent (committed) view of the dataset. To keep query performance in check and eventually achieve read-optimized performance, Hudi supports compacting these log files asynchronously to create read-optimized views.
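
Conceptually, the in-situ merge for a single file slice overlays delta records on the base file's records in commit order. The sketch below uses plain maps as hypothetical stand-ins for base and log records, and ignores deletes and payload-level merge semantics.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the maps stand in for rows read from the base parquet
// file and from the delta log blocks of one file slice.
public class RealtimeMergeSketch {

  /**
   * In-situ merge for one file slice: start from the base file's records and
   * overlay log records in commit order, so the latest committed value for each
   * key wins.
   */
  static Map<String, String> mergeView(Map<String, String> baseRecordsByKey,
                                       List<Map<String, String>> logBlocksInCommitOrder) {
    Map<String, String> merged = new LinkedHashMap<>(baseRecordsByKey);
    for (Map<String, String> logBlock : logBlocksInCommitOrder) {
      merged.putAll(logBlock); // newer delta records overwrite older values
    }
    return merged;
  }
}
```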

Asynchronous Compaction involves 2 steps:

Compaction Schedule : The Hudi Write Client exposes an API to create compaction plans, each of which contains the list of `file slices` to be compacted atomically in a single compaction commit. Hudi allows pluggable strategies for choosing the file slices for each compaction run. This step is typically done inline by the Writer process, as Hudi expects only one schedule to be generated at a time, which allows Hudi to enforce the constraint that pending compaction plans do not step on each other's file slices. This constraint in turn allows multiple concurrent `Compactors` to run at the same time. Some of the common strategies used for choosing `file slices` for compaction are:

- BoundedIO - Limit the number of file slices chosen for a compaction plan by the expected total IO (read + write) needed to complete the compaction run
- Log File Size - Prefer file slices with larger amounts of delta log data to be merged
- Day Based - Prefer file slices belonging to the latest day-based partitions
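
As a rough illustration of such a pluggable selection strategy, the sketch below combines a BoundedIO-style budget with a log-file-size preference. `CandidateSlice` and `select` are hypothetical and do not reflect Hudi's actual compaction strategy interface.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: hypothetical types, not Hudi's real strategy classes.
public class BoundedIoStrategySketch {

  static class CandidateSlice {
    final String fileId;
    final long baseFileBytes;      // parquet data to be read and rewritten
    final long totalLogBytes;      // delta log data to be merged in
    CandidateSlice(String fileId, long baseFileBytes, long totalLogBytes) {
      this.fileId = fileId; this.baseFileBytes = baseFileBytes; this.totalLogBytes = totalLogBytes;
    }
    // Rough IO estimate: read base + logs, write a new base of similar size.
    long estimatedIoBytes() { return 2 * baseFileBytes + totalLogBytes; }
  }

  /** Pick file slices with the most accumulated log data first, stopping once the
   *  expected total IO for the plan would exceed the configured budget. */
  static List<CandidateSlice> select(List<CandidateSlice> candidates, long ioBudgetBytes) {
    List<CandidateSlice> plan = new ArrayList<>();
    candidates.sort(Comparator.comparingLong((CandidateSlice c) -> c.totalLogBytes).reversed());
    long spent = 0;
    for (CandidateSlice c : candidates) {
      if (spent + c.estimatedIoBytes() > ioBudgetBytes && !plan.isEmpty()) {
        break;
      }
      plan.add(c);
      spent += c.estimatedIoBytes();
    }
    return plan;
  }
}
```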

Compactor : Hudi provides a separate API in the Write Client to execute a compaction plan. The compaction plan (just like a commit) is identified by a timestamp. Most of the design and implementation complexity in Async Compaction comes from guaranteeing snapshot isolation to readers and writers when multiple concurrent compactors are running. A typical compactor deployment involves launching a separate Spark application which executes pending compactions as they become available. The core logic of compacting file slices in the Compactor is very similar to that of merging updates in a Copy-On-Write table; the only difference is that, in the case of compaction, there is an additional step of merging the records in the delta log files.
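
A typical compactor deployment could be sketched as the loop below. `CompactionClient` and its two methods are hypothetical placeholders for the corresponding write-client calls, not Hudi's actual API.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: CompactionClient is a hypothetical placeholder.
public class StandaloneCompactorSketch {

  interface CompactionClient {
    /** Instants (timestamps) of compaction plans that are scheduled but not yet executed. */
    List<String> pendingCompactionInstants();
    /** Execute the plan identified by the given instant and commit the result. */
    void compact(String compactionInstant);
  }

  /** Long-running loop typical of a separate compactor application: pick up
   *  pending plans as the writer schedules them and execute each one. */
  static void runLoop(CompactionClient client, long pollIntervalSeconds) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      for (String instant : client.pendingCompactionInstants()) {
        // Each plan pins its own set of file slices, so concurrent compactors
        // never step on each other's files.
        client.compact(instant);
      }
      TimeUnit.SECONDS.sleep(pollIntervalSeconds);
    }
  }
}
```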

Here are the main APIs to look up and execute a compaction plan.

...