...

Everything is a log : Hudi also has an append-only, cloud-storage-friendly design that lets Hudi manage data across all the major cloud providers seamlessly, implementing principles from def~log-structured-storage systems.

...

key-value data model : On the writer side, a Hudi table is modeled as a key-value dataset, where each def~record has a unique def~record-key. Additionally, a record key may also include the def~partitionpath under which the record is partitioned and stored. This often helps in cutting down the search space during index lookups.
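For illustration, consider a hypothetical trips record with a unique `uuid` field, stored under a date-based partition; internally Hudi represents this key pair with a `HoodieKey`. A minimal sketch (the field values and partitioning scheme are made up for this example):

```scala
import org.apache.hudi.common.model.HoodieKey

// Hypothetical record: {"uuid": "trip-4711", "driver": "d-9", "ts": 1700000000, "date": "2023/11/14"}
// def~record-key    -> value of the uuid field
// def~partitionpath -> value of the date field, relative to the table basepath
val key = new HoodieKey("trip-4711", "2023/11/14")
println(key.getRecordKey)      // trip-4711
println(key.getPartitionPath)  // 2023/11/14
```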

Table Layout

With an understanding of the key technical motivations for the project, let's now dive deeper into the design of the system itself. At a high level, components for writing Hudi tables are embedded into an Apache Spark job using one of the supported ways, and they produce a set of files on def~backing-dfs-storage that represents a Hudi def~table. Query engines like Apache Spark, Presto and Apache Hive can then query the table, with certain guarantees (that we will discuss below).
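As a hedged sketch of that flow (the bucket paths, table name and field names below are hypothetical; the option keys are Hudi's Spark datasource write configs, whose defaults and exact spellings can vary across releases), a Spark job could write and then query a Hudi table like this:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val basePath = "s3://my-bucket/warehouse/trips"                     // def~backing-dfs-storage location (hypothetical)
val batch = spark.read.json("s3://my-bucket/raw/trips/2023-11-14/") // incoming batch (hypothetical)

// Writing embeds the Hudi writer components into this Spark job and lays out files under basePath.
batch.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")        // def~record-key
  .option("hoodie.datasource.write.partitionpath.field", "date")    // def~partitionpath
  .option("hoodie.datasource.write.precombine.field", "ts")         // keep the latest record per key
  .mode(SaveMode.Append)
  .save(basePath)

// Query engines (Spark here) then read the table back with snapshot guarantees.
val trips = spark.read.format("hudi").load(basePath)                // older releases may need a path glob
trips.createOrReplaceTempView("trips")
spark.sql("select uuid, driver, ts from trips where date = '2023/11/14'").show()
```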

...

  • upsert() support with fast, pluggable indexing
  • Incremental queries that scan only new data efficiently (see the sketch after this list)
  • Atomically publish data with rollback support, Savepoints for data recovery
  • Snapshot isolation between writer & queries using def~mvcc style design
  • Manages file sizes, layout using statistics
  • Self-managed def~compaction of updates/deltas against existing records.
  • Timeline metadata to audit changes to data
  • Data deletions to support GDPR and compliance.
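For example, the incremental query capability from the list above is exposed through the Spark datasource. A minimal sketch, assuming a table at a hypothetical `basePath` (option keys per the Hudi datasource read configs; older releases spell some of them differently):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-incremental").getOrCreate()
val basePath = "s3://my-bucket/warehouse/trips"   // hypothetical table location

// Scan only records written after the given instant on the timeline,
// instead of re-reading the entire table.
val newData = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20231114093000")  // exclusive start instant
  .load(basePath)

newData.show()
```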

Timeline

Excerpt Include: def~timeline

...

Excerpt Include: def~instant-action

...

Excerpt Include: def~instant-state

...

Hudi writing is implemented as a Spark library, which makes it easy to integrate into existing data pipelines or ingestion libraries (which we will refer to as `Hudi clients`). Hudi clients prepare an `RDD[HoodieRecord]` that contains the data to be upserted, and the Hudi upsert/insert is merely a Spark DAG that can be broken into two big pieces (a client-level sketch follows the list below).

  • Indexing : A big part of Hudi's efficiency comes from indexing the mapping from record keys to the file ids to which they belong. This index also helps the `HoodieWriteClient` separate upserted records into inserts and updates, so they can be treated differently. `HoodieReadClient` supports operations such as `filterExists` (used for de-duplication of a table) and an efficient batch `read(keys)` API that can read out the records corresponding to the keys using the index, much more quickly than a typical scan via a query. The index is atomically updated on each commit, and is rolled back when commits are rolled back.
  • Storage : The storage part of the DAG is responsible for taking an `RDD[HoodieRecord]` that has been tagged as an insert or update via index lookup, and writing it out efficiently onto storage.
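A minimal client-level sketch of that two-piece DAG, assuming the `HoodieWriteClient` API named above with the stock `OverwriteWithLatestAvroPayload` payload (class locations, builder methods and signatures differ across Hudi releases, so treat this as illustrative rather than exact):

```scala
import org.apache.hudi.HoodieWriteClient            // class location/name differs by Hudi release
import org.apache.hudi.common.model.{HoodieRecord, OverwriteWithLatestAvroPayload}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}

// records: the RDD[HoodieRecord] prepared by the Hudi client; basePath, schema and
// table name are placeholders for this sketch.
def upsertBatch(jsc: JavaSparkContext,
                records: JavaRDD[HoodieRecord[OverwriteWithLatestAvroPayload]],
                basePath: String,
                avroSchemaJson: String): Unit = {
  val cfg = HoodieWriteConfig.newBuilder()
    .withPath(basePath)          // def~table-basepath on DFS
    .withSchema(avroSchemaJson)  // Avro schema of the incoming records
    .forTable("trips")           // hypothetical table name
    .build()

  val client = new HoodieWriteClient[OverwriteWithLatestAvroPayload](jsc, cfg)
  val instantTime = client.startCommit()     // new instant on the def~timeline
  // upsert() runs both pieces of the DAG described above:
  //   1. Indexing - look up the index and tag each record as an insert or an update
  //   2. Storage  - pack the tagged records into file groups and write them out
  val writeStatuses = client.upsert(records, instantTime)
  client.commit(instantTime, writeStatuses)   // atomically publish the new commit
}
```

The Spark datasource path shown earlier drives this same client underneath.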

Data Files

Hudi organizes a table into a directory structure under a def~table-basepath on DFS. If the table is partitioned by some columns, then there are additional def~table-partitions under the base path, which are folders containing data files for that partition, very similar to Hive tables. Each partition is uniquely identified by its def~partitionpath, which is relative to the basepath.
Within each partition, files are organized into def~file-groups, uniquely identified by a def~file-id. Each file group contains several def~file-slices, where each slice contains a def~base-file (e.g. parquet) produced at a certain commit/compaction def~instant-time, along with a set of def~log-files that contain inserts/updates to the base file since the base file was last written. Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices, and the cleaning action gets rid of unused/older file slices to reclaim space on DFS.
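Purely as an illustration of that layout (paths, partition values and file names below are made up, and the real file naming convention encodes additional tokens such as write tokens and log file versions), a partitioned table on storage might look like:

```
s3://my-bucket/warehouse/trips/                  <- def~table-basepath
  .hoodie/                                       <- timeline metadata (commits, cleans, compactions, ...)
  2023/11/14/                                    <- def~partitionpath (one folder per partition)
    <fileId1>_..._20231114093000.parquet         <- base file of file group <fileId1> at instant 20231114093000
    .<fileId1>_20231114093000.log.1_...          <- log file holding updates against that base file
    <fileId1>_..._20231114110000.parquet         <- newer file slice produced by compaction
  2023/11/15/
    <fileId2>_..._20231115081500.parquet
```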

Index

Hudi provides efficient upserts by consistently mapping a given hoodie key (def~record-key + def~partitionpath combination) to a def~file-id, via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file group. In short, the mapped file group contains all versions of a group of records.
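The index implementation itself is pluggable and is selected through write configs. A tiny sketch (the config key is `hoodie.index.type`; the available values depend on the release, with a bloom-filter based index being the common default for the Spark writer):

```scala
// Hedged sketch: picking the index used to map keys to file groups. These options can be
// passed alongside the other write options, e.g. df.write.format("hudi").options(indexOptions)...
val indexOptions: Map[String, String] = Map(
  "hoodie.index.type" -> "BLOOM"   // e.g. GLOBAL_BLOOM enforces key uniqueness across partitions
)
```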

...

Excerpt Include: def~index

Table Types

The implementation specifics of the two def~table-types are detailed below.

Excerpt Include: def~table-type
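When writing through the Spark datasource, the table type is chosen up front via a write config. A small sketch (the key shown is the current datasource config name and may be spelled differently in older releases; values are `COPY_ON_WRITE` and `MERGE_ON_READ`):

```scala
// Hedged sketch: selecting the def~table-type for a table at write time (copy-on-write is the default)
val tableTypeOption: (String, String) =
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ"
// e.g. df.write.format("hudi").option(tableTypeOption._1, tableTypeOption._2)...
```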

Copy On Write Table

def~copy-on-write

Excerpt Include: def~copy-on-write


(Figure: Copy On Write table)



Merge On Read Table

def~merge-on-read

Excerpt Include: def~merge-on-read


(Figure: Merge On Read table)


Writing

Write Operations

It may be helpful to understand the 3 different write operations provided by the Hudi datasource or the delta streamer tool and how best to leverage them. These operations can be chosen/changed across each commit/deltacommit issued against the dataset (a sketch of selecting the operation via the Spark datasource follows the descriptions below).

def~upsert-operation: This is the default operation, where the input records are first tagged as inserts or updates by looking up the index, and the records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. This operation is recommended for use-cases like database change capture, where the input almost certainly contains updates.
def~insert-operation: This operation is very similar to upsert in terms of heuristics/file sizing, but completely skips the index lookup step. Thus, it can be a lot faster than upserts for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the dataset can tolerate duplicates, but just needs the transactional writes/incremental pull/storage management capabilities of Hudi.
def~bulk-insert-operation: Both upsert and insert operations keep input records in memory to speed up storage heuristics computations (among other things), and thus can be cumbersome for initially loading/bootstrapping a Hudi dataset. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well to several hundred TBs of initial load. However, it only does a best-effort job at sizing files, versus guaranteeing file sizes like inserts/upserts do.
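As a hedged sketch of switching among these operations through the Spark datasource (the delta streamer tool exposes an equivalent per-run choice), the operation is just another write option; everything other than the config keys and their values below is a placeholder:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Pick the write operation per batch: "upsert" (default), "insert", or "bulk_insert"
def writeBatch(batch: DataFrame, basePath: String, operation: String): Unit = {
  batch.write.format("hudi")
    .option("hoodie.datasource.write.operation", operation)
    .option("hoodie.table.name", "trips")                            // hypothetical table name
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode(SaveMode.Append)
    .save(basePath)
}

// e.g. writeBatch(initialLoadDf, basePath, "bulk_insert")  // initial bootstrap of the dataset
//      writeBatch(cdcBatchDf, basePath, "upsert")          // ongoing database change capture
```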

...