...

Table Types 

The implementation specifics of the two def~table-types are detailed below.

...

  • Indexing : A big part of Hudi's efficiency comes from maintaining an index that maps record keys to the file ids they belong to. This index helps the `HoodieWriteClient` separate upserted records into inserts and updates, so they can be treated differently (see the sketch after this list). `HoodieReadClient` supports operations such as `filterExists` (used for de-duplication of a table) and an efficient batch `read(keys)` API that can read out the records corresponding to the keys using the index, much more quickly than a typical scan via a query. The index is atomically updated on each commit, and is rolled back when commits are rolled back.
  • Storage : The storage part of the DAG is responsible for taking an `RDD[HoodieRecord]` whose records have been tagged as inserts or updates via index lookup, and writing it out efficiently onto storage.
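
The insert/update separation mentioned above can be pictured with a short, self-contained sketch. This is not the real `HoodieWriteClient` or index code; `HoodieKey`, `IncomingRecord`, and the `Map`-backed index below are hypothetical stand-ins used purely for illustration.

```java
import java.util.*;

// Minimal sketch of tagging upserted records as inserts vs. updates via an index lookup.
// The types and the Map-backed index are hypothetical stand-ins, not Hudi internals.
public class IndexTaggingSketch {

    record HoodieKey(String recordKey, String partitionPath) {}
    record IncomingRecord(HoodieKey key, String payload) {}

    public static void main(String[] args) {
        // Index: hoodie key -> file id of the file group the record was first written to.
        Map<HoodieKey, String> index = new HashMap<>();
        index.put(new HoodieKey("uuid-1", "2023/01/01"), "file-group-7");

        List<IncomingRecord> upserts = List.of(
            new IncomingRecord(new HoodieKey("uuid-1", "2023/01/01"), "v2"),   // already indexed
            new IncomingRecord(new HoodieKey("uuid-9", "2023/01/01"), "v1"));  // never seen before

        List<IncomingRecord> inserts = new ArrayList<>();
        Map<String, List<IncomingRecord>> updatesByFileId = new HashMap<>();

        // Tag each record via the index: a hit means an update routed to the mapped
        // file group, a miss means an insert.
        for (IncomingRecord r : upserts) {
            String fileId = index.get(r.key());
            if (fileId == null) {
                inserts.add(r);
            } else {
                updatesByFileId.computeIfAbsent(fileId, id -> new ArrayList<>()).add(r);
            }
        }
        System.out.println("inserts=" + inserts.size() + ", updates=" + updatesByFileId);
    }
}
```

In the real write path, the update buckets would be routed to the file groups named by the index, while inserts go into new file groups.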

Data Files

Hudi organizes a dataset into a directory structure under a `basepath` on DFS. A dataset is broken up into partitions, which are folders containing data files for that partition, very similar to Hive tables. Each partition is uniquely identified by its `partitionpath`, which is relative to the basepath.
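
As a rough illustration, a date-partitioned dataset might be laid out like this (the table name, partition values, and file names are hypothetical, and the naming scheme is simplified):

```
/data/trips/                        <-- basepath
  .hoodie/                          <-- table metadata and timeline
  2023/01/01/                       <-- partitionpath, relative to the basepath
    <fileId1>_<instantTime>.parquet
  2023/01/02/
    <fileId2>_<instantTime>.parquet
```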

Within each partition, files are organized into `file groups`, uniquely identified by a `file id`. Each file group contains several `file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, along with a set of log files (`*.log.*`) that contain inserts/updates to the base file since the base file was produced. Hudi adopts an MVCC design, where the compaction action merges logs and base files to produce new file slices, and the cleaning action gets rid of unused/older file slices to reclaim space on DFS.
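
The relationships between file groups, file slices, base files, and log files can be sketched as a small, hypothetical data model; the classes below mirror the concepts in this section, not Hudi's internal implementation.

```java
import java.util.*;

// Hypothetical model of a file group versioned into file slices under MVCC.
class FileSlice {
    final String baseInstantTime;                      // commit/compaction instant that produced the base file
    final String baseFile;                             // base columnar file (*.parquet)
    final List<String> logFiles = new ArrayList<>();   // *.log.* files holding later inserts/updates

    FileSlice(String fileId, String instantTime) {
        this.baseInstantTime = instantTime;
        this.baseFile = fileId + "_" + instantTime + ".parquet";
    }
}

class FileGroup {
    final String fileId;
    final Deque<FileSlice> slices = new ArrayDeque<>();   // newest slice first

    FileGroup(String fileId, String firstInstant) {
        this.fileId = fileId;
        slices.addFirst(new FileSlice(fileId, firstInstant));
    }

    // Writers append log files against the latest slice.
    void appendLog(String logFile) {
        slices.peekFirst().logFiles.add(logFile);
    }

    // Compaction merges the latest base file and its logs into a new base file,
    // starting a fresh slice at the given instant; older slices remain until cleaned.
    void compact(String instantTime) {
        slices.addFirst(new FileSlice(fileId, instantTime));
    }

    // Cleaning reclaims DFS space by retaining only the newest `retain` slices.
    void clean(int retain) {
        while (slices.size() > retain) {
            slices.removeLast();
        }
    }
}
```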

Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file group, via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
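
In terms of the tagging sketch earlier, this invariant simply means the index entry for a key is written once and thereafter only read. A minimal illustration, with a plain `Map` and hypothetical file id standing in for the index:

```java
import java.util.*;

// Once a hoodie key (record key + partition path) is mapped to a file group, later
// upserts of that key resolve to the same file group, which therefore holds every
// version of the record. The Map-backed "index" below is a hypothetical stand-in.
public class StableMappingSketch {
    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();      // hoodie key -> file id

        String hoodieKey = "2023/01/01|uuid-1";           // partition path + record key
        index.putIfAbsent(hoodieKey, "file-group-7");     // first version of the record

        // A later upsert of the same key is routed to the same file group.
        String target = index.getOrDefault(hoodieKey, "a-new-file-group");
        System.out.println(hoodieKey + " -> " + target);  // prints file-group-7
    }
}
```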

...