...

With an understanding of the key technical motivations for the project, let's now dive deeper into the design of the system itself. At a high level, the components for writing Hudi datasets are embedded into an Apache Spark job using one of the supported integration points, and the job produces a set of files on def~backing-dfs-storage that represents a Hudi def~dataset. Query engines like Apache Spark, Presto, and Apache Hive can then query the dataset, with certain guarantees (that we will discuss below).
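To make this concrete, below is a minimal sketch of writing to a Hudi dataset through the Spark datasource, one of the supported integration points. The table name, field names, and paths are hypothetical, and exact option keys can vary slightly across Hudi releases.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      // Kryo serialization, as recommended in Hudi's Spark setup.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Hypothetical input: records with a unique key, a partition field and an ordering field.
    val df = spark.read.json("/tmp/input/trips.json")

    df.write.format("hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "trip_id")
      .option("hoodie.datasource.write.partitionpath.field", "trip_date")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      // The files written under this base path make up the Hudi def~dataset.
      .save("/data/hudi/trips")

Each such write produces data files plus a new entry on the dataset's timeline, which is what query engines use to decide which files are visible.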

There are three main components to a def~dataset:

  1. A set of def~data-files that actually contain the records written to the dataset.
  2. A def~index (which can be implemented in many ways) that maps a given record to the subset of data-files containing that record.
  3. An ordered sequence of def~timeline-metadata recording all write operations done on the dataset, akin to a database transaction log (see the layout sketch after this list).
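For a concrete picture of the first and third components, the sketch below (using a hypothetical base path) lists the contents of the .hoodie directory, where the timeline is stored as small metadata files, alongside the partition directories that hold the data files.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical base path of a Hudi dataset on the def~backing-dfs-storage.
    val basePath = new Path("/data/hudi/trips")
    val fs = FileSystem.get(basePath.toUri, new Configuration())

    // Data files live under partition directories beneath the base path, while the
    // timeline is kept as small metadata files under <basePath>/.hoodie, roughly one
    // per action (commit, deltacommit, clean, compaction, ...).
    fs.listStatus(new Path(basePath, ".hoodie"))
      .map(_.getPath.getName)
      .sorted
      .foreach(println)  // e.g. 20240101093000.commit, 20240101101500.clean, hoodie.properties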

Additionally, Hudi provides the following capabilities for writers, for queries, and on the underlying data, which make it a great building block for large def~data-lakes.

  • upsert() support with fast, pluggable indexing
  • Atomically publish data with rollback support
  • Snapshot isolation between writer & queries 
  • Savepoints for data recovery
  • Manages file sizes and layout using statistics
  • Async compaction of row & columnar data
  • Timeline metadata to track lineage (see the query sketch after this list)
  • Unified, optimized analytical storage with managed file sizing
  • Data deletions to support GDPR and other compliance requirements
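On the query side, the snapshot-isolation and timeline capabilities above translate into different query modes against the same dataset. The sketch below, reusing the SparkSession from the earlier write sketch, shows a snapshot read and an incremental read through the Spark datasource; the begin instant time is hypothetical, and the option keys shown are the ones used by recent Hudi releases (older releases used slightly different names).

    // Snapshot query: the latest committed state of the dataset, isolated from in-flight writes.
    val snapshot = spark.read.format("hudi").load("/data/hudi/trips")
    snapshot.createOrReplaceTempView("trips")
    spark.sql("select trip_date, count(*) from trips group by trip_date").show()

    // Incremental query: only records committed to the timeline after a given instant.
    val changes = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20240101093000")  // hypothetical instant
      .load("/data/hudi/trips")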

Concepts


Timeline



Storage/Writing

...