...
With an understanding of the key technical motivations for the project, let's now dive deeper into the design of the system itself. At a high level, the components for writing Hudi datasets are embedded into an Apache Spark job using one of the supported mechanisms, and they produce a set of files on def~backing-dfs-storage that represents a Hudi def~dataset. Query engines like Apache Spark, Presto and Apache Hive can then query the dataset, with certain guarantees (which we will discuss below).
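To make this concrete, the sketch below shows one common path: writing a batch of records as a Hudi dataset through the Spark datasource and reading it back with Spark SQL. It is a minimal, illustrative sketch rather than canonical usage; the table name `trips`, the columns `trip_id`, `region`, `ts`, and the paths are hypothetical, and it assumes a Hudi release where the short `hudi` datasource format and the configuration keys shown are available.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-write-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // required by Hudi
  .getOrCreate()

val basePath = "hdfs:///tmp/hudi/trips"             // hypothetical dataset location on DFS
val batch = spark.read.json("hdfs:///tmp/trips_in") // hypothetical input batch

// Write the batch as a Hudi dataset; the record key, partition path and
// precombine field drive indexing, file layout and de-duplication.
batch.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)

// Spark (and, once synced to a metastore, Presto/Hive) can then query the
// dataset like a regular table; readers only see completed commits.
val snapshot = spark.read.format("hudi").load(basePath)
snapshot.createOrReplaceTempView("trips")
spark.sql("select region, count(*) from trips group by region").show()
```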
There are three main components to a def~dataset:
- Set of def~data-files that actually contain the records that were written to the dataset.
- A def~index (which could be implemented in many ways), that maps a given record to a subset of the data-files that contains the record.
- Ordered sequence of def~timeline-metadata about all the write operations done on the dataset, akin to a database transaction log (a small inspection sketch follows this list).
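The timeline in particular can be inspected programmatically. The sketch below is a rough illustration using Hudi's table meta client; the class and method names follow the 0.x Java/Scala API and may differ across releases, and `spark` and `basePath` are assumed to come from the hypothetical write sketch above.

```scala
import org.apache.hudi.common.table.HoodieTableMetaClient

// Point the meta client at the dataset's base path on the backing DFS;
// timeline metadata lives under <basePath>/.hoodie.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(spark.sparkContext.hadoopConfiguration)
  .setBasePath(basePath)
  .build()

// Completed commit actions, in instant-time order.
val commits = metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants()
println(s"Completed commits: ${commits.countInstants()}")

if (commits.lastInstant().isPresent) {
  val latest = commits.lastInstant().get()
  println(s"Latest instant: ${latest.getTimestamp}, action: ${latest.getAction}")
}
```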
Additionally, Hudi provides the following capabilities for writers and queries, as well as on the underlying data itself, which make it a great building block for large def~data-lakes:
- upsert() support with fast, pluggable indexing
- Atomically publish data with rollback support
- Snapshot isolation between writer & queries
- Savepoints for data recovery
- Manages file sizes and layout using statistics
- Async compaction of row & columnar data
- Timeline metadata to track lineage
- Unified, optimized analytical storage
- Data deletions for GDPR and compliance (see the delete sketch following this list)
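As a brief illustration of the upsert and deletion capabilities above, the sketch below issues an upsert followed by a hard delete through the Spark datasource. The input paths and DataFrames are hypothetical and continue the earlier example; the configuration keys are standard Hudi datasource options, though defaults can vary by release.

```scala
import org.apache.spark.sql.SaveMode

val updatesDf = spark.read.json("hdfs:///tmp/trips_updates") // hypothetical changed records
val deletesDf = spark.read.json("hdfs:///tmp/trips_deletes") // hypothetical records to remove

// Upsert: records whose keys already exist are updated, new keys are inserted.
updatesDf.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)

// Hard delete (e.g. for a GDPR request): pass the records to remove and
// switch the write operation to "delete".
deletesDf.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)
```

Each of these writes publishes atomically as a new commit on the timeline, so concurrent queries continue to see the previous snapshot until the commit completes.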
Concepts
Timeline
Storage/Writing
...