Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Update/Delete Records : Hudi provides support for updating/deleting records, using fine grained file/record level indexes, while providing transactional guarantees for the write operation. Queries process  the last such committed snapshot, to produce results.
  • Change Streams : Hudi also provides first-class support for obtaining an incremental stream of all the records that were updated/inserted/deleted in a given dataset, from a given point-in-time, and unlocks a new def~incremental-query category.


Image RemovedImage Added
These primitives work closely hand-in-glove and unlock stream/incremental processing capabilities directly on top of def~DFS-abstractions. If you are familiar def~stream-processing, this is very similar to consuming events from a def~kafka-topic and then using a def~state-stores to accumulate intermediate results incrementally.

...

  • Increased Efficiency : Ingesting data often needs to deal with updates (resulting from def~database-change-capture), deletions (due to def~data-privacy-regulations) and enforcing def~unique-key-constraints (to ensure def~data-quality of event streams/analytics). However, due to lack of standardized support for such functionality using a system like Hudi, data engineers often resort to big batch jobs that reprocess entire day's events or reload the entire upstream database every run, leading to massive waste of def~computational-resources. Since Hudi supports record level updates, it brings an order of magnitude improvement to these operations, by only reprocessing changes records and rewriting only the part of the def~table, that was updated/deleted, as opposed to rewriting entire def~datasetdef~table-partitions or even the entire def~table.
  • Faster ETL/Derived Pipelines : An ubiquitous next step, once the data has been ingested from external sources is to build derived data pipelines using Apache Spark/Apache Hive or any other data processing framework to  def~ETL the ingested data for a variety of use-cases like def~data-warehousing, def~machine-learning-feature-extraction, or even just def~analytics. Typically, such processes again rely on def~batch-processing jobs expressed in code or SQL, that process all input data in bulk and recompute all the output results. Such data pipelines can be sped up dramatically, by querying one or more input tables using an def~incremental-query instead of a regular def~snapshot-query, resulting once again in only processing the incremental changes from upstream tables and then def~upsert or delete the target derived table, like above.
  • Access to fresh data :  It's not everyday, that you will find that reduced resource usage also result in improved performance, since typically we add more resources (e.g memory) to improve performance metric (e.g query latency) . By fundamentally shifting away from how datasets have been traditionally managed for may be the first time since the dawn of the big data era, Hudi, in fact, realizes this rare combination. A sweet side-effect of incrementalizing def~batch-processing is that the pipelines also much much smaller amount of time to run, putting data into hands of organizations much much quickly, than it has been possible with def~data-lakes before.
  • Unified Storage : Building upon all the three benefits above, faster and lighter processing right on top of existing def~data-lakes mean lesser need for specialized storage or  def~data-marts, simply for purposes of gaining access to near real-time data.

...

  1. Set of  def~data-files that actually contain the records that were written to the dataset.
  2. An def~index (which could be implemented in many ways), that maps a given record to a subset of the data-files that contains the record.
  3. Ordered sequence of def~timeline-metadata about all the write operations done on the dataset, akin to a database transaction log.

Image RemovedImage Added


Hudi provides the following capabilities for writers, queries and on the underlying data, which makes it a great building block for large def~data-lakes.

  • upsert() support with fast, pluggable indexing
  • Atomically publish data with rollback support
  • Snapshot isolation between writer & queries 
  • Savepoints for data recovery
  • Manages file sizes, layout using statistics
  • Async compaction of row & columnar dataSelf managed def~compaction of updates/deltas against existing records.
  • Timeline metadata to track lineage
  • GDPR, Data deletions, Compliance.

...

The implementation specifics of the two def~dataset def~table-types are detailed below.


Copy On Write 

...