upsert() support with fast, pluggable indexing
Incremental queries that scan only new data efficiently
Atomically publish data with rollback support, Savepoints for data recovery
Snapshot isolation between writer & queries using def~mvcc style design
Manages file sizes, layout using statistics
Self managed def~compaction of updates/deltas against existing records.
Timeline metadata to audit changes to data
GDPR, Data deletions, Compliance.

Design Principles

Streaming Reads/Writes : Hudi is designed, from ground-up, for streaming records in and out of large datasets, borrowing principles from database design and def~log-structured-storage systems. To that end, Hudi provides def~index implementations, that can quickly map a record's key to the file location it resides at. Similarly, for streaming data out, Hudi adds and tracks record level metadata via def~hoodie-special-columns, that enables providing a precise incremental stream of all changes that happened.

Self-Managing : Hudi recognizes the different expectation of data freshness (write friendly) vs query performance (read/query friendliness) users may have, and supports three different def~query-types that provide real-time snapshots, incremental streams or purely columnar data that slightly older. At each step, Hudi strives to be self-managing (e.g: autotunes the writer parallelism, maintains file sizes) and self-healing (e.g: auto rollbacks failed commits), even if it comes at cost of slightly additional runtime cost (e.g: caching input data in memory to profile the workload). The core premise here, is that, often times operational costs of these large data pipelines without such operational levers/self-managing features built-in, dwarf the extra memory/runtime costs associated.

Everything is a log : Hudi also has an append-only, cloud data storage friendly design, that lets Hudi manage data on across all the major cloud providers seamlessly, implementing principles from def~log-structured-storage systems.

Timeline

Excerpt Include

	def~timeline
	def~timeline
nopanel	true

...

This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates since the index is part of the data file itself.

At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.

DAG with Range Pruning:

draw.io Diagram

border	true
viewerToolbar	true

fitWindow	false
diagramName	hoodie-bloom-index-dag
simpleViewer	false
width	400
diagramWidth	1003
revision	2

...

<WIP>

Incremental Queries

<WIP>

Read Optimized Queries

<WIP>

Space shortcuts

Page tree

Versions Compared

Old Version 42

New Version 43

Key

Design Principles

Timeline

Incremental Queries

Read Optimized Queries

Space shortcuts

Page tree

Page History

Versions Compared

Old Version 42

New Version 43

Key

Design Principles

Timeline

Incremental Queries

Read Optimized Queries