...

  • Update/Delete Records : Hudi supports updating and deleting records using fine-grained file/record-level indexes, while providing transactional guarantees for the write operation. Queries process the last committed snapshot to produce results.
  • Change Streams : Hudi also provides first-class support for obtaining an incremental stream of all the records that were updated/inserted/deleted in a given table, from a given point in time, and unlocks a new def~incremental-query category.
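
As a rough illustration of these two primitives, the toy model below stamps every upsert or delete with a commit time, so that a snapshot query returns the latest committed state and an incremental query returns only the records changed after a given instant. It is a minimal in-memory sketch: `ToyTable`, `commit`, `snapshot` and `incremental` are hypothetical names chosen for this example and are not Hudi's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table; not Hudi's API."""

    # record key -> (commit time of last change, value or None if deleted)
    _rows: Dict[str, Tuple[int, Optional[dict]]] = field(default_factory=dict)
    _last_commit: int = 0

    def commit(self, upserts: Dict[str, dict], deletes: Sequence[str] = ()) -> int:
        """Apply a batch of upserts and deletes as a single commit."""
        commit_time = self._last_commit + 1
        staged = dict(self._rows)  # work on a copy, then swap it in at the end
        for key, value in upserts.items():
            staged[key] = (commit_time, value)
        for key in deletes:
            staged[key] = (commit_time, None)  # tombstone marks a delete
        self._rows, self._last_commit = staged, commit_time
        return commit_time

    def snapshot(self) -> Dict[str, dict]:
        """Latest committed state, with deleted records filtered out."""
        return {k: v for k, (_, v) in self._rows.items() if v is not None}

    def incremental(self, since: int) -> Dict[str, Optional[dict]]:
        """Records changed after commit `since`; None means deleted."""
        return {k: v for k, (t, v) in self._rows.items() if t > since}


table = ToyTable()
c1 = table.commit({"a": {"n": 1}, "b": {"n": 2}})
c2 = table.commit({"a": {"n": 10}}, deletes=["b"])
print(table.snapshot())       # {'a': {'n': 10}}
print(table.incremental(c1))  # {'a': {'n': 10}, 'b': None}
```

The incremental result carries the delete of `b` as a `None` value, which is one simple way a downstream consumer could learn about deletions without rescanning the whole table.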

...

With an understanding of the key technical motivations for the project, let's now dive deeper into the design of the system itself. At a high level, components for writing Hudi tables are embedded into an Apache Spark job using one of the supported ways, and the job produces a set of files on def~backing-dfs-storage that represents a Hudi def~table. Query engines like Apache Spark, Presto, and Apache Hive can then query the table, with certain guarantees (that we will discuss below).

...

  1. A set of def~data-files that actually contain the records that were written to the table.
  2. An def~index (which could be implemented in many ways) that maps a given record to a subset of the data-files that contain the record.
  3. An ordered sequence of def~timeline-metadata about all the write operations done on the table, akin to a database transaction log.
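
The three components listed above can be pictured with a small data-structure sketch. This is only an illustration under simplifying assumptions: data files are plain Python lists, the index is a dictionary, and the names `TableLayout`, `TimelineEntry`, `write` and `files_for` are hypothetical rather than taken from the Hudi codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TimelineEntry:
    instant_time: str         # ordering key for the write operation
    action: str               # label chosen for illustration, e.g. "commit"
    files_written: List[str]  # data files touched by this write


@dataclass
class TableLayout:
    # 1. data files: file name -> records stored in that file
    data_files: Dict[str, List[dict]] = field(default_factory=dict)
    # 2. index: record key -> subset of data files containing that key
    index: Dict[str, Set[str]] = field(default_factory=dict)
    # 3. timeline: ordered log of write operations
    timeline: List[TimelineEntry] = field(default_factory=list)

    def write(self, instant_time: str, file_name: str, records: List[dict]) -> None:
        """Store records in a data file, update the index, log the write."""
        self.data_files[file_name] = records
        for record in records:
            self.index.setdefault(record["key"], set()).add(file_name)
        self.timeline.append(TimelineEntry(instant_time, "commit", [file_name]))

    def files_for(self, key: str) -> Set[str]:
        """Use the index to narrow a lookup to the files that hold `key`."""
        return self.index.get(key, set())


layout = TableLayout()
layout.write("001", "f1.parquet", [{"key": "a", "v": 1}])
layout.write("002", "f2.parquet", [{"key": "a", "v": 2}, {"key": "b", "v": 3}])
print(sorted(layout.files_for("a")))  # ['f1.parquet', 'f2.parquet']
```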

...

Hudi currently provides two choices for indexes: `BloomIndex` and `HBaseIndex`, which map a record key to the file id to which it belongs. This enables us to speed up upserts significantly, without scanning over every record in the table. Hudi indices can be classified based on their ability to look up records across partitions. A `global` index does not need partition information to find the file-id for a record key, but a `non-global` index does.
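
The difference between the two classes of index can be sketched with plain dictionaries. The snippet below is an assumption-laden illustration, not how `BloomIndex` or `HBaseIndex` are implemented: `NonGlobalIndex` and `GlobalIndex` are hypothetical classes, and the partition paths and file ids are made up.

```python
from typing import Dict, Optional, Tuple


class NonGlobalIndex:
    """Lookup requires the partition: entries are keyed by (partition, record key)."""

    def __init__(self) -> None:
        self._map: Dict[Tuple[str, str], str] = {}

    def put(self, partition: str, record_key: str, file_id: str) -> None:
        self._map[(partition, record_key)] = file_id

    def lookup(self, partition: str, record_key: str) -> Optional[str]:
        return self._map.get((partition, record_key))


class GlobalIndex:
    """Lookup needs only the record key, which is tracked across all partitions."""

    def __init__(self) -> None:
        self._map: Dict[str, Tuple[str, str]] = {}

    def put(self, partition: str, record_key: str, file_id: str) -> None:
        self._map[record_key] = (partition, file_id)

    def lookup(self, record_key: str) -> Optional[Tuple[str, str]]:
        return self._map.get(record_key)


non_global = NonGlobalIndex()
non_global.put("2020/01/01", "k1", "file-1")
print(non_global.lookup("2020/01/01", "k1"))  # file-1
print(non_global.lookup("2020/01/02", "k1"))  # None

global_index = GlobalIndex()
global_index.put("2020/01/01", "k1", "file-1")
print(global_index.lookup("k1"))  # ('2020/01/01', 'file-1')
```

In this sketch the global variant has to hold one entry per record key for the whole table, while the non-global variant can be consulted one partition at a time; that trade-off is a property of the toy model and is only suggestive of the real indexes.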

...