...

In this page hierarchy, we explain the concepts, design and the overall architectural underpinnings of Apache Hudi. This content is intended to be the technical documentation of the project and will be kept up to date with its evolving design and implementation.

Introduction

Apache Hudi (Hudi for short, from here on) allows you to store vast amounts of data on top of existing Hadoop-compatible storage, while providing two primitives that enable stream processing on data lakes, in addition to typical batch processing.

Specifically,

  • Update/Delete Records : Hudi provides support for updating/deleting records, using fine-grained file/record-level indexes, while providing transactional guarantees for the write operation. 
  • Change Streams : Hudi also provides first-class support for obtaining an incremental stream of change records, i.e., all the records that were updated/inserted/deleted in a given dataset (both primitives are sketched in the example below). 
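
The sketch below illustrates both primitives through the Spark datasource API. It is a minimal, illustrative example: the option keys follow Hudi's documented datasource configs but can vary across releases, and the table name, field names (`uuid`, `ts`, `partitionpath`) and base path are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object HudiPrimitivesSketch {
  // Hypothetical table location, for illustration only.
  val basePath = "s3://my-bucket/hudi/trips"

  // Primitive 1 (update/delete records): upsert a batch with record-level granularity.
  // Switching the operation to "delete" removes the records identified by the record key.
  def upsert(df: DataFrame): Unit =
    df.write.format("org.apache.hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")   // or "delete"
      .mode(SaveMode.Append)
      .save(basePath)

  // Primitive 2 (change streams): incrementally pull only the records that changed
  // after a given commit instant, instead of rescanning the whole dataset.
  def changesSince(spark: SparkSession, beginInstantTime: String): DataFrame =
    spark.read.format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", beginInstantTime)
      .load(basePath)
}
```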

Unlocking such stream/incremental processing capabilities on these DFS abstractions has several advantages.

  • Near real-time data ingestion to cloud storage/DFS
  • Batch jobs on steroids
  • Stream processing on batch data
  • Unified, optimized analytical storage
  • GDPR, data deletions, and compliance
  • Building block for great data lakes!


System Overview


<WIP>




Concepts

...

The implementation specifics of the two storage types are detailed below.
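
Before diving into each, here is a minimal sketch of how a writer selects between the two storage types at write time. The option key follows Hudi's documented datasource configs, and the table/field names and `basePath` are placeholders carried over from the earlier sketch; exact defaults can differ across releases.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object StorageTypeSketch {
  // df, table/field names and basePath are illustrative, as in the earlier sketch.
  def writeMergeOnRead(df: DataFrame, basePath: String): Unit =
    df.write.format("org.apache.hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or "COPY_ON_WRITE" (the default)
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Append)
      .save(basePath)
}
```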

Copy-On-Write

...

(Excerpt included here from the Copy-On-Write (COW) page.)





Merge-On-Read

...

(Excerpt included here from the Merge-On-Read (MOR) page.)



Hudi writing is implemented as a Spark library, which makes it easy to integrate into existing data pipelines or ingestion libraries (which we will refer to as `Hudi clients`). Hudi clients prepare an `RDD[HoodieRecord]` that contains the data to be upserted, and a Hudi upsert/insert is merely a Spark DAG that can be broken into two big pieces.
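
For pipelines that work below the datasource level, a Hudi client can hand the prepared `RDD[HoodieRecord]` to the write client directly. The sketch below follows the shape of the Java client API (`SparkRDDWriteClient`, `HoodieWriteConfig`); exact class names, packages and signatures have shifted between Hudi releases, and the table name and path are placeholders, so treat it as illustrative rather than exact.

```scala
import org.apache.hudi.client.SparkRDDWriteClient
import org.apache.hudi.client.common.HoodieSparkEngineContext
import org.apache.hudi.common.model.{HoodieRecord, HoodieRecordPayload}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}

object WriteClientSketch {
  // `records` is the RDD[HoodieRecord] prepared by the Hudi client (exposed to the
  // Java client API as a JavaRDD); `basePath` is the table's storage location.
  def upsertBatch[T <: HoodieRecordPayload[T]](jsc: JavaSparkContext,
                                               records: JavaRDD[HoodieRecord[T]],
                                               basePath: String): Unit = {
    val config = HoodieWriteConfig.newBuilder()
      .withPath(basePath)
      .forTable("trips")                                     // placeholder table name
      .build()
    val client = new SparkRDDWriteClient[T](new HoodieSparkEngineContext(jsc), config)
    try {
      val instantTime = client.startCommit()                 // open a new commit on the timeline
      val statuses    = client.upsert(records, instantTime)  // the upsert Spark DAG runs here
      client.commit(instantTime, statuses)                   // publish the commit atomically
    } finally {
      client.close()
    }
  }
}
```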

...