
Background & Motivation 

In this page hierarchy, we explain the concepts, design, and overall architectural underpinnings of Apache Hudi. This content is intended to be the technical documentation of the project and will be kept up to date as the project evolves.

Introduction



<WIP>

System Overview


<WIP>

Implementation





Concepts


Timeline

Storage/Writing

The implementation specifics of the two storage types are detailed below.

Copy On Write (COW)

(Excerpt included from the Copy On Write (COW) page.)

Merge On Read (MOR)

(Excerpt included from the Merge On Read (MOR) page.)

...

Hudi writing is implemented as a Spark library, which makes it easy to integrate into existing data pipelines or ingestion libraries (which we will refer to as `Hudi clients`). Hudi clients prepare an `RDD[HoodieRecord]` that contains the data to be upserted, and a Hudi upsert/insert is simply a Spark DAG that can be broken into two big pieces.
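For concreteness, a minimal sketch of what such a Hudi client driver might look like is shown below; the method name `upsertBatch` and the inputs (`basePath`, `tableName`, `schemaStr`, `recordsRdd`) are illustrative placeholders, and the exact builder options and package locations vary across Hudi versions:

```
// Rough sketch only (not from the Hudi codebase): upsertBatch and its inputs are
// placeholders; builder options and package locations vary across Hudi versions.
public JavaRDD<WriteStatus> upsertBatch(JavaSparkContext jsc, String basePath, String tableName,
                                        String schemaStr, JavaRDD<HoodieRecord> recordsRdd) {
  // Configure the target dataset on storage.
  HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
      .withPath(basePath)
      .withSchema(schemaStr)
      .forTable(tableName)
      .build();
  HoodieWriteClient client = new HoodieWriteClient(jsc, cfg);

  // Every upsert runs under a new commit on the timeline.
  String commitTime = client.startCommit();

  // Kicks off the Spark DAG described above: index lookup to tag inserts vs. updates,
  // followed by the storage/write phase.
  JavaRDD<WriteStatus> statuses = client.upsert(recordsRdd, commitTime);

  // Atomically finalize the commit on the timeline.
  client.commit(commitTime, statuses);
  return statuses;
}
```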

...

  • Indexing : A big part of Hudi's efficiency comes from indexing the mapping from record keys to the file ids to which they belong. This index also helps the `HoodieWriteClient` separate upserted records into inserts and updates, so they can be treated differently. `HoodieReadClient` supports operations such as `filterExists` (used for de-duplication of a table) and an efficient batch `read(keys)` API that can read out the records corresponding to the keys using the index much more quickly than a typical scan via a query (see the sketch after this list). The index is updated atomically on each commit and is rolled back when commits are rolled back.
  • Storage : The storage part of the DAG is responsible for taking an `RDD[HoodieRecord]` that has been tagged as an insert or update via index lookup, and writing it out efficiently onto storage.
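A rough sketch of the index-backed de-duplication path mentioned above (assuming a `HoodieReadClient(jsc, basePath)` constructor; exact signatures vary across Hudi versions):

```
// Sketch of index-backed de-duplication via HoodieReadClient.filterExists.
public JavaRDD<HoodieRecord> dropAlreadyPresent(JavaSparkContext jsc, String basePath,
                                                JavaRDD<HoodieRecord> incoming) {
  HoodieReadClient readClient = new HoodieReadClient(jsc, basePath);
  // Consults the index to keep only records whose keys are not already in the dataset,
  // avoiding a full scan of the data files.
  return readClient.filterExists(incoming);
}
```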

...

File Layout

Indexing




Hudi currently provides two choices for indexes, `BloomIndex` and `HBaseIndex`, to map a record key to the file id to which it belongs. This enables us to speed up upserts significantly, without scanning over every record in the dataset. Hudi indices can be classified based on their ability to look up records across partitions. A `global` index does not need partition information to find the file id for a record key, but a `non-global` index does.
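As a hedged illustration, the index implementation is typically chosen through the write configuration; the builder calls below are a sketch, the available `IndexType` values depend on the Hudi version, and `basePath` is a placeholder:

```
// Sketch: choosing the index implementation via the write config.
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withIndexConfig(
        HoodieIndexConfig.newBuilder()
            .withIndexType(HoodieIndex.IndexType.BLOOM) // or HoodieIndex.IndexType.HBASE
            .build())
    .build();
```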

...

This index is built by adding bloom filters, tuned to a very low false positive rate (e.g., 1/10^9), to the parquet file footers. The advantage of this index over HBase is the obvious removal of a big external dependency, as well as nicer handling of rollbacks and partial updates, since the index is part of the data file itself.

At runtime, checking the Bloom Index for a given set of record keys effectively amounts to checking all the bloom filters within a given partition against the incoming records, using a Spark join. Much of the engineering effort on the Bloom Index has gone into scaling this join, by caching the incoming `RDD[HoodieRecord]` and dynamically tuning join parallelism to avoid hitting Spark limitations such as the 2GB maximum partition size. As a result, the Bloom Index implementation has been able to reliably handle single upserts of up to 5TB.
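The knobs involved can be sketched as write-config properties; the property keys and values below are assumptions drawn from the description above, so consult the configuration reference for your Hudi version before relying on them:

```
// Sketch: bloom-index tuning knobs surfaced as write-config properties (assumed keys).
Properties indexProps = new Properties();
indexProps.setProperty("hoodie.index.bloom.num_entries", "60000");  // example: expected entries per file's filter
indexProps.setProperty("hoodie.index.bloom.fpp", "0.000000001");    // ~1/10^9 false positive rate
indexProps.setProperty("hoodie.bloom.index.parallelism", "0");      // 0 = auto-tune the lookup join
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(basePath)        // placeholder
    .withProps(indexProps)
    .build();
```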

DAG with Range Pruning:

(draw.io diagram: hoodie-bloom-index-dag)

Storage

The implementation specifics of the two storage types are detailed below.

Copy On Write (COW)

...

Merge On Read (MOR)

(Excerpt included from the Merge On Read (MOR) page.)

Realtime Readers will perform an in-situ merge of these delta log-files to provide the most recent (committed) view of the dataset. To keep query performance in check and eventually achieve read-optimized performance, Hudi supports compacting these log-files asynchronously to create read-optimized views. Asynchronous Compaction involves 2 steps:

...

Compaction

...

Main API in HoodieWriteClient for running Compaction:

```
/**
 * Performs Compaction corresponding to instant-time
 * @param compactionInstantTime Compaction Instant Time
 * @return RDD of WriteStatus capturing the results of compaction
 * @throws IOException
 */
public JavaRDD<WriteStatus> compact(String compactionInstantTime) throws IOException;
```

To look up all pending compactions, use the API defined in HoodieReadClient:

```
/**
 * Return all pending compactions with instant time for clients to decide what to compact next.
 * @return list of (compaction instant time, compaction plan) pairs
 */
public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
```

API for scheduling compaction:

```
/**
 * Schedules a new compaction instant
 * @param extraMetadata optional extra metadata to be stored with the compaction plan
 * @return Compaction Instant timestamp if a new compaction plan is scheduled
 */
Optional<String> scheduleCompaction(Optional<Map<String, String>> extraMetadata) throws IOException;
```

Refer to the hoodie-client/src/test/java/HoodieClientExample.java class for an example of how compaction is scheduled and executed.
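Putting these APIs together, a compaction driver might look roughly like the sketch below (assuming an already-constructed `HoodieWriteClient` named `writeClient` and `java.util.Optional`; exception handling and commit bookkeeping are omitted):

```
// Sketch: scheduling and executing compaction with the APIs shown above.
Optional<String> compactionInstant = writeClient.scheduleCompaction(Optional.empty()); // no extra metadata
if (compactionInstant.isPresent()) {
  // Execute the compaction plan registered at that instant; returns per-file results.
  JavaRDD<WriteStatus> statuses = writeClient.compact(compactionInstant.get());
  // Inspect statuses and finalize the compaction as appropriate for the pipeline.
}
```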

...


File Sizing

Querying

...