...
- upsert() support with fast, pluggable indexing
- Incremental queries that scan only new data efficiently
- Atomically publish data with rollback support, Savepoints for data recovery
- Snapshot isolation between writer & queries using def~mvcc style design
- Manages file sizes, layout using statistics
- Self managed def~compaction of updates/deltas against existing records.
- Timeline metadata to audit changes to data
- GDPR, Data deletions, Compliance.
Design Principles
Streaming Reads/Writes : Hudi is designed, from ground-up, for streaming records in and out of large datasets, borrowing principles from database design and def~log-structured-storage systems. To that end, Hudi provides def~index implementations, that can quickly map a record's key to the file location it resides at. Similarly, for streaming data out, Hudi adds and tracks record level metadata via def~hoodie-special-columns, that enables providing a precise incremental stream of all changes that happened.
Self-Managing : Hudi recognizes the different expectation of data freshness (write friendly) vs query performance (read/query friendliness) users may have, and supports three different def~query-types that provide real-time snapshots, incremental streams or purely columnar data that slightly older. At each step, Hudi strives to be self-managing (e.g: autotunes the writer parallelism, maintains file sizes) and self-healing (e.g: auto rollbacks failed commits), even if it comes at cost of slightly additional runtime cost (e.g: caching input data in memory to profile the workload). The core premise here, is that, often times operational costs of these large data pipelines without such operational levers/self-managing features built-in, dwarf the extra memory/runtime costs associated.
Everything is a log : Hudi also has an append-only, cloud data storage friendly design, that lets Hudi manage data on across all the major cloud providers seamlessly, implementing principles from def~log-structured-storage systems.
Timeline
Excerpt Include | ||||||
---|---|---|---|---|---|---|
|
...
This index is built by adding bloom filters with a very high false positive tolerance (e.g: 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, and also nicer handling of rollbacks & partial updates since the index is part of the data file itself.
At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition, against the incoming records, using a Spark join. Much of the engineering effort towards the Bloom index has gone into scaling this join by caching the incoming RDD[HoodieRecord] and dynamically tuning join parallelism, to avoid hitting Spark limitations like 2GB maximum for partition size. As a result, Bloom Index implementation has been able to handle single upserts upto 5TB, in a reliable manner.
DAG with Range Pruning:
draw.io Diagram | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
...
<WIP>
Incremental Queries
<WIP>
Read Optimized Queries
<WIP>