
Info: def~ annotations

In an effort to keep this page crisp for reading, any concepts that we need to explain are annotated with a def~ annotation and hyperlinked off. You can contribute immensely to our docs by writing the missing pages for annotated terms. These are marked in purple. Please mention any PMC/Committers on these pages for review, and mark the links blue.

Introduction

Apache Hudi (Hudi for short, here on) allows you to store vast amounts of data on top of existing def~hadoop-compatible-storage, while providing two primitives that enable def~stream-processing on def~data-lakes, in addition to typical def~batch-processing.

Specifically,

  • Update/Delete Records : Hudi provides support for updating/deleting records, using fine-grained file/record level indexes, while providing transactional guarantees for the write operation. Queries process the last such committed snapshot to produce results.
  • Change Streams : Hudi also provides first-class support for obtaining an incremental stream of all the records that were updated/inserted/deleted in a given dataset, from a given point-in-time, unlocking a new category of incremental queries (a short sketch of both primitives follows this list).

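To make the two primitives concrete, here is a minimal, hedged sketch using the Spark datasource (Scala, spark-shell), assuming the Hudi Spark bundle is on the classpath. The table name (trips), field names (uuid, ts, region, fare), base path and begin instant are illustrative only, and option keys can vary across Hudi versions.

    // Sketch only: upsert a small batch, then incrementally pull the changes.
    import org.apache.spark.sql.SaveMode
    import spark.implicits._

    val basePath = "file:///tmp/hudi_trips"

    // Primitive 1 - Update/Delete Records: upsert routes each record, via the index,
    // to the file that holds its previous version and rewrites only that part.
    val inputDF = Seq(
      ("id-1", "2023-01-01T00:00:00Z", "us-east", 12.3),
      ("id-2", "2023-01-01T00:05:00Z", "us-west", 4.5)
    ).toDF("uuid", "ts", "region", "fare")

    inputDF.write.format("hudi").
      option("hoodie.table.name", "trips").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.partitionpath.field", "region").
      mode(SaveMode.Append).
      save(basePath)

    // Primitive 2 - Change Streams: read only records committed after a given instant,
    // instead of rescanning the whole dataset.
    val changesDF = spark.read.format("hudi").
      option("hoodie.datasource.query.type", "incremental").
      option("hoodie.datasource.read.begin.instanttime", "20230101000000").
      load(basePath)
    changesDF.show()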

Together these primitives unlock stream/incremental processing capabilities directly on top of def~DFS-abstractions. If you are familiar with def~stream-processing, this is very similar to consuming events from a def~kafka-topic and then using a def~state-store to accumulate intermediate results incrementally.

It has several architectural advantages.

  • Increased Efficiency : Ingesting data often needs to deal with updates (resulting from def~database-change-capture), deletions (due to def~data-privacy-regulations) and enforcing def~unique-key-constraints (to ensure def~data-quality of event streams/analytics). However, due to the lack of standardized support for such functionality, data engineers without a system like Hudi often resort to big batch jobs that reprocess an entire day's events or reload the entire upstream database every run, leading to massive waste of def~computational-resources. Since Hudi supports record-level updates, it brings an order of magnitude improvement to these operations, by only reprocessing changed records and rewriting only the part of the def~dataset that was updated/deleted, as opposed to rewriting entire def~dataset-partitions or even the entire def~dataset.
  • Faster ETL/Derived Pipelines : A ubiquitous next step, once the data has been ingested from external sources, is to build derived data pipelines using Apache Spark/Apache Hive or any other data processing framework to def~ETL the ingested data for a variety of use-cases like def~data-warehousing, def~machine-learning-feature-extraction, or even just def~analytics. Typically, such processes again rely on def~batch-processing jobs expressed in code or SQL, which process all input data in bulk and recompute all the output results (a sketch of the incremental alternative follows this list).
  • Access to fresh data : 
  • Unified Storage : 

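Building on the sketch above, the efficiency and faster-ETL points can be illustrated with a hedged incremental pipeline: instead of recomputing the whole output, the job pulls only the records committed upstream since its last run and upserts the derived rows downstream. The paths, checkpoint value and transformation below are hypothetical; only the Hudi datasource options are meant to reflect real configuration.

    // Sketch only: incremental pull from an upstream Hudi table, upsert into a derived table.
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._

    val upstreamPath   = "file:///tmp/hudi_trips"
    val derivedPath    = "file:///tmp/hudi_trip_fares"
    val lastCheckpoint = "20230101000000"   // commit instant processed by the previous run

    // Read only what changed upstream after the checkpoint.
    val changed = spark.read.format("hudi").
      option("hoodie.datasource.query.type", "incremental").
      option("hoodie.datasource.read.begin.instanttime", lastCheckpoint).
      load(upstreamPath)

    // Drop Hudi's meta columns from the incremental read, then recompute derived rows
    // only for the affected records (placeholder transform).
    val data    = changed.drop(changed.columns.filter(_.startsWith("_hoodie_")): _*)
    val derived = data.withColumn("fare_with_tax", col("fare") * lit(1.1))

    // Upsert the derived rows; unaffected parts of the derived table are left untouched.
    derived.write.format("hudi").
      option("hoodie.table.name", "trip_fares").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.partitionpath.field", "region").
      mode(SaveMode.Append).
      save(derivedPath)

    // The next run's checkpoint can be taken from Hudi's commit-time meta column.
    val nextCheckpoint = changed.agg(max("_hoodie_commit_time")).first.getString(0)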

System Overview


<WIP>




Additionally, 

...

  • Unified, Optimized analytical storage
  • GDPR, Data deletions, Compliance.
  • Building block for great data lakes!


...

  • Optimized storage/sizing

Concepts


Timeline



Storage/Writing

...