...
With an understanding of the key technical motivations for the project, let's now dive deeper into the design of the system itself. At a high level, the components for writing Hudi datasets are embedded into an Apache Spark job using one of the supported mechanisms, and they produce a set of files on def~backing-dfs-storage that represents a Hudi def~dataset. Query engines like Apache Spark, Presto and Apache Hive can then query the dataset, with certain guarantees (which we will discuss below).
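To make this concrete, the sketch below shows one common path: writing a batch of records as a Hudi dataset through the Spark datasource and reading it back with Spark SQL. It is a minimal, illustrative sketch rather than canonical usage; the table name `trips`, the columns `trip_id`, `region`, `ts`, and the paths are hypothetical, and it assumes a Hudi release where the short `hudi` datasource format and the configuration keys shown are available.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-write-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // required by Hudi
  .getOrCreate()

val basePath = "hdfs:///tmp/hudi/trips"             // hypothetical dataset location on DFS
val batch = spark.read.json("hdfs:///tmp/trips_in") // hypothetical input batch

// Write the batch as a Hudi dataset; the record key, partition path and
// precombine field drive indexing, file layout and de-duplication.
batch.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)

// Spark (and, once synced to a metastore, Presto/Hive) can then query the
// dataset like a regular table; readers only see completed commits.
val snapshot = spark.read.format("hudi").load(basePath)
snapshot.createOrReplaceTempView("trips")
spark.sql("select region, count(*) from trips group by region").show()
```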
There are three main components to a def~dataset:
- Set of def~data-files that actually contain the records that were written to the dataset.
- A def~index (which could be implemented in many ways), that maps a given record to a subset of the data-files that contains the record.
- Ordered sequence of def~timeline-metadata about all the write operations done on the dataset, akin to a database transaction log (a small inspection sketch follows this list).
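The timeline in particular can be inspected programmatically. The sketch below is a rough illustration using Hudi's table meta client; the class and method names follow the 0.x Java/Scala API and may differ across releases, and `spark` and `basePath` are assumed to come from the hypothetical write sketch above.

```scala
import org.apache.hudi.common.table.HoodieTableMetaClient

// Point the meta client at the dataset's base path on the backing DFS;
// timeline metadata lives under <basePath>/.hoodie.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(spark.sparkContext.hadoopConfiguration)
  .setBasePath(basePath)
  .build()

// Completed commit actions, in instant-time order.
val commits = metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants()
println(s"Completed commits: ${commits.countInstants()}")

if (commits.lastInstant().isPresent) {
  val latest = commits.lastInstant().get()
  println(s"Latest instant: ${latest.getTimestamp}, action: ${latest.getAction}")
}
```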
Additionally, Hudi provides the following capabilities for writers and queries, as well as on the underlying data itself, which make it a great building block for large def~data-lakes:
- upsert() support with fast, pluggable indexing
- Atomically publish data with rollback support
- Snapshot isolation between writer & queries
- Savepoints for data recovery
- Manages file sizes and layout using statistics
- Async compaction of row & columnar data
- Timeline metadata to track lineage
- Unified, optimized analytical storage
- Data deletions for GDPR and compliance (see the delete sketch following this list)
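As a brief illustration of the upsert and deletion capabilities above, the sketch below issues an upsert followed by a hard delete through the Spark datasource. The input paths and DataFrames are hypothetical and continue the earlier example; the configuration keys are standard Hudi datasource options, though defaults can vary by release.

```scala
import org.apache.spark.sql.SaveMode

val updatesDf = spark.read.json("hdfs:///tmp/trips_updates") // hypothetical changed records
val deletesDf = spark.read.json("hdfs:///tmp/trips_deletes") // hypothetical records to remove

// Upsert: records whose keys already exist are updated, new keys are inserted.
updatesDf.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)

// Hard delete (e.g. for a GDPR request): pass the records to remove and
// switch the write operation to "delete".
deletesDf.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)
```

Each of these writes publishes atomically as a new commit on the timeline, so concurrent queries continue to see the previous snapshot until the commit completes.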
Concepts
Timeline
Storage/Writing
...