...
- upsert() support with fast, pluggable indexing
- Incremental queries that scan only new data efficiently
- Atomically publish data with rollback support, Savepoints for data recovery
- Snapshot isolation between writer & queries
- Savepoints for data recovery
- queries using def~mvcc style design
- Manages file sizes, layout using statistics
- Self managed def~compaction of updates/deltas against existing records.
- Timeline metadata to track lineageaudit changes to data
- GDPR, Data deletions, Compliance.
Design Principles
Hudi is designed, from ground-up, for streaming records in and out of large datasets, borrowing principles from database design and def~log-structured-storage systems. To that end, Hudi provides def~index implementations, that can quickly map a record's key to the file location it resides at. Similarly, for streaming data out, Hudi adds and tracks record level metadata via def~hoodie-special-columns, that enables providing a precise incremental stream of all changes that happened.
Hudi recognizes the different expectation of data freshness (write friendly) vs query performance (read/query friendliness) users may have, and supports three different def~query-types that provide real-time snapshots, incremental streams or purely columnar data that slightly older. At each step, Hudi strives to be self-managing (e.g: autotunes the writer parallelism, maintains file sizes) and self-healing (e.g: auto rollbacks failed commits), even if it comes at cost of slightly additional runtime cost (e.
...
Concepts
...
g: caching input data in memory to profile the workload). The core premise here, is that, often times operational costs of these large data pipelines without such operational levers/self-managing features built-in, dwarf the extra memory/runtime costs associated.
Hudi also has an append-only, cloud data storage friendly design, that lets Hudi manage data on across all the major cloud providers seamlessly.
Timeline
Excerpt Include | ||||||
---|---|---|---|---|---|---|
|
Storage/Writing
The implementation specifics of the two def~table-types are detailed below.
Excerpt Include | ||||||
---|---|---|---|---|---|---|
|
Copy On Write
def~copy-on-write
...