...

In this page hierarchy, we explain the concepts, design and the overall architectural underpinnings of Apache Hudi. This content is intended to be the technical documentation of the project and will be kept up to date with its evolving design and implementation.

Introduction

Apache Hudi (Hudi for short, from here on) allows you to store vast amounts of data on top of existing Hadoop-compatible storage, while providing two primitives that enable stream processing on data lakes, in addition to typical batch processing.

Specifically,

  • Update/Delete Records : Hudi provides support for updating/deleting records, using fine-grained file/record-level indexes, while providing transactional guarantees for the write operation. 
  • Change Streams : Hudi also provides first-class support for obtaining an incremental stream of change records, i.e., all the records that were updated/inserted/deleted in a given dataset (both primitives are sketched in the example below). 
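
The sketch below illustrates both primitives through the Spark datasource API. It is a minimal, illustrative example: the option keys follow Hudi's documented datasource configs but can vary across releases, and the table name, field names (`uuid`, `ts`, `partitionpath`) and base path are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object HudiPrimitivesSketch {
  // Hypothetical table location, for illustration only.
  val basePath = "s3://my-bucket/hudi/trips"

  // Primitive 1 (update/delete records): upsert a batch with record-level granularity.
  // Switching the operation to "delete" removes the records identified by the record key.
  def upsert(df: DataFrame): Unit =
    df.write.format("org.apache.hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")   // or "delete"
      .mode(SaveMode.Append)
      .save(basePath)

  // Primitive 2 (change streams): incrementally pull only the records that changed
  // after a given commit instant, instead of rescanning the whole dataset.
  def changesSince(spark: SparkSession, beginInstantTime: String): DataFrame =
    spark.read.format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", beginInstantTime)
      .load(basePath)
}
```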

Unlocking such stream/incremental processing capabilities on these DFS abstractions has several advantages.

  • Near real-time data ingestion to cloud storage/DFS
  • Batch jobs on steroids
  • Stream processing on batch data
  • Unified, optimized analytical storage
  • GDPR, data deletions, and compliance
  • Building block for great data lakes!


System Overview


<WIP>




Concepts

...

The implementation specifics of the two storage types are detailed below.
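
Before diving into each, here is a minimal sketch of how a writer selects between the two storage types at write time. The option key follows Hudi's documented datasource configs, and the table/field names and `basePath` are placeholders carried over from the earlier sketch; exact defaults can differ across releases.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object StorageTypeSketch {
  // df, table/field names and basePath are illustrative, as in the earlier sketch.
  def writeMergeOnRead(df: DataFrame, basePath: String): Unit =
    df.write.format("org.apache.hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or "COPY_ON_WRITE" (the default)
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Append)
      .save(basePath)
}
```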

Copy-On-Write

...

(Excerpt included here from the Copy-On-Write (COW) page.)





Merge-On-Read

...

(Excerpt included here from the Merge-On-Read (MOR) page.)



Hudi writing is implemented as a Spark library, which makes it easy to integrate into existing data pipelines or ingestion libraries (which we will refer to as `Hudi clients`). Hudi clients prepare an `RDD[HoodieRecord]` that contains the data to be upserted, and a Hudi upsert/insert is merely a Spark DAG that can be broken into two big pieces.
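
For pipelines that work below the datasource level, a Hudi client can hand the prepared `RDD[HoodieRecord]` to the write client directly. The sketch below follows the shape of the Java client API (`SparkRDDWriteClient`, `HoodieWriteConfig`); exact class names, packages and signatures have shifted between Hudi releases, and the table name and path are placeholders, so treat it as illustrative rather than exact.

```scala
import org.apache.hudi.client.SparkRDDWriteClient
import org.apache.hudi.client.common.HoodieSparkEngineContext
import org.apache.hudi.common.model.{HoodieRecord, HoodieRecordPayload}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}

object WriteClientSketch {
  // `records` is the RDD[HoodieRecord] prepared by the Hudi client (exposed to the
  // Java client API as a JavaRDD); `basePath` is the table's storage location.
  def upsertBatch[T <: HoodieRecordPayload[T]](jsc: JavaSparkContext,
                                               records: JavaRDD[HoodieRecord[T]],
                                               basePath: String): Unit = {
    val config = HoodieWriteConfig.newBuilder()
      .withPath(basePath)
      .forTable("trips")                                     // placeholder table name
      .build()
    val client = new SparkRDDWriteClient[T](new HoodieSparkEngineContext(jsc), config)
    try {
      val instantTime = client.startCommit()                 // open a new commit on the timeline
      val statuses    = client.upsert(records, instantTime)  // the upsert Spark DAG runs here
      client.commit(instantTime, statuses)                   // publish the commit atomically
    } finally {
      client.close()
    }
  }
}
```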

...