...

  • upsert() support with fast, pluggable indexing
  • Incremental queries that scan only new data efficiently
  • Atomically publish data with rollback support
  • Snapshot isolation between writer & queries using def~mvcc style design
  • Savepoints for data recovery
  • Manages file sizes, layout using statistics
  • Self-managed def~compaction of updates/deltas against existing records
  • Timeline metadata to track lineage and audit changes to data
  • GDPR, Data deletions, Compliance.
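
A minimal sketch of how atomic publishing, rollback, and reader isolation from the list above could fit together, assuming a toy in-memory timeline. `ToyTimeline` and all of its method names are hypothetical illustrations, not Hudi APIs, and the real system keeps this state on storage rather than in memory.

```python
class ToyTimeline:
    """Toy commit timeline; a hypothetical illustration, not Hudi's code."""

    def __init__(self):
        # Each instant is [commit_time, state, data], kept in arrival order.
        self.instants = []

    def begin(self, commit_time, data):
        # Data is staged as "inflight" and stays invisible to readers.
        self.instants.append([commit_time, "inflight", dict(data)])

    def complete(self, commit_time):
        # A single state flip publishes the whole commit at once.
        self._find(commit_time)[1] = "completed"

    def rollback(self, commit_time):
        # Drop a failed (still inflight) commit; readers never saw it.
        instant = self._find(commit_time)
        if instant[1] != "inflight":
            raise ValueError("only inflight commits can be rolled back")
        self.instants.remove(instant)

    def snapshot(self):
        # Readers merge completed commits only, oldest to newest.
        view = {}
        for _, state, data in self.instants:
            if state == "completed":
                view.update(data)
        return view

    def _find(self, commit_time):
        for instant in self.instants:
            if instant[0] == commit_time:
                return instant
        raise KeyError(commit_time)


timeline = ToyTimeline()
timeline.begin("001", {"a": 1})
timeline.complete("001")
timeline.begin("002", {"a": 10, "b": 2})
during_write = timeline.snapshot()    # commit 002 is still inflight
timeline.rollback("002")
after_rollback = timeline.snapshot()  # the failed commit left no trace
```

In this sketch a query that runs while commit `002` is inflight sees only the data from commit `001`, and rolling back `002` changes nothing that a reader could observe.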

Design Principles

Hudi is designed from the ground up for streaming records in and out of large datasets, borrowing principles from database design and def~log-structured-storage systems. To that end, Hudi provides def~index implementations that can quickly map a record's key to the file location it resides at. Similarly, for streaming data out, Hudi adds and tracks record-level metadata via def~hoodie-special-columns, which enables a precise incremental stream of all changes that happened.
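
The two ideas in this paragraph can be illustrated with a toy model, assuming an in-memory table. `ToyTable`, its methods, and the `_commit_time` field are hypothetical names chosen for this sketch (the field stands in for one of the def~hoodie-special-columns); none of this is Hudi's actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy keyed table with an index and per-record commit metadata."""

    # index: record key -> id of the file that currently holds the record
    index: dict = field(default_factory=dict)
    # files: file id -> {record key -> row}
    files: dict = field(default_factory=dict)

    def upsert(self, commit_time, records, new_file_id):
        """Apply (key, value) pairs; updates go to the file the index names."""
        for key, value in records:
            # The index lookup decides update (known key) vs insert (new key).
            file_id = self.index.setdefault(key, new_file_id)
            row = {"key": key, "value": value, "_commit_time": commit_time}
            self.files.setdefault(file_id, {})[key] = row

    def incremental(self, since):
        """Return rows written by commits strictly after `since`."""
        rows = [
            row
            for file_rows in self.files.values()
            for row in file_rows.values()
            if row["_commit_time"] > since
        ]
        return sorted(rows, key=lambda r: r["key"])


table = ToyTable()
table.upsert("001", [("a", 1), ("b", 2)], new_file_id="f1")
table.upsert("002", [("b", 20), ("c", 3)], new_file_id="f2")
changed = table.incremental(since="001")  # rows for keys "b" and "c"
```

The update to key `b` is routed back to file `f1` through the index instead of creating a second copy, and the per-record commit time is what lets the incremental read return only the rows touched after commit `001`.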

Hudi recognizes the different expectations of data freshness (write friendliness) vs query performance (read/query friendliness) users may have, and supports three different def~query-types that provide real-time snapshots, incremental streams, or purely columnar data that is slightly older. At each step, Hudi strives to be self-managing (e.g: autotunes the writer parallelism, maintains file sizes) and self-healing (e.g: auto rollbacks failed commits), even if it comes at slightly additional runtime cost (e.g: caching input data in memory to profile the workload). The core premise here is that the operational costs of these large data pipelines, without such operational levers/self-managing features built in, often dwarf the extra memory/runtime costs associated.

...

Concepts

...

Hudi also has an append-only, cloud data storage friendly design that lets Hudi manage data across all the major cloud providers seamlessly.

Timeline

def~timeline

Storage/Writing

The implementation specifics of the two def~table-types are detailed below.

def~table-type

Copy On Write 

def~copy-on-write

...