Proposers
Approvers
Status
Current state:
| Current State | |
| --- | --- |
| UNDER DISCUSSION | |
| IN PROGRESS | |
| ABANDONED | |
| COMPLETED | |
| INACTIVE | |
Discussion thread:
JIRA: HUDI-897
Released: <Hudi Version>
Abstract
The business scenarios for a data lake mainly include the analysis of databases, logs, and files.
Databricks Delta Lake also targets these three scenarios. [1]
Background
At present, Hudi supports the database CDC scenario well, where change data is incrementally written into Hudi, and bulk-loading files into Hudi is also in progress.
However, there is no good native support for the log scenario (high-throughput writes, no updates or deletes, and lots of small files). Such data can currently be written via inserts without deduplication, but small files are still merged on the write side:
- In copy-on-write mode, with "hoodie.parquet.small.file.limit" at its 100 MB default, every batch spends time merging into existing small files, which reduces write throughput (see the sketch after this list).
- Merge-on-read is not a good fit for this scenario either.
- What the scenario actually needs is to write Parquet files batch by batch on the write path, and then provide asynchronous compaction afterwards (similar to Delta Lake).
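For concreteness, the current insert path looks roughly like the Spark DataSource write below (a minimal sketch; the table name, path, and key fields are illustrative, not part of this RFC):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LogIngestSketch {
  // Insert a micro-batch of log events without deduplication. Because
  // hoodie.parquet.small.file.limit is 100 MB, incoming records are routed
  // into existing files below that size, so each batch pays the
  // copy-on-write merge cost described above.
  def writeBatch(spark: SparkSession, logBatch: DataFrame): Unit = {
    logBatch.write
      .format("hudi")
      .option("hoodie.datasource.write.operation", "insert") // no dedup
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.table.name", "log_events")
      .option("hoodie.parquet.small.file.limit", (100L * 1024 * 1024).toString)
      .mode(SaveMode.Append)
      .save("/data/hudi/log_events")
  }
}
```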
Implementation
1. On the write side, write each batch to new Parquet files based on the snapshot mechanism. Merging stays enabled by default; the user can turn auto-merge off for higher write throughput (see the sketch after this list).
2. Hudi will support asynchronous merging of small Parquet files, similar to Databricks Delta Lake's OPTIMIZE command [2].
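A rough sketch of both steps follows. Setting "hoodie.parquet.small.file.limit" to 0 disables small-file handling today, which approximates step 1; `optimizeSketch` is a hypothetical helper (its name, parameters, and staging path are placeholders) that only illustrates the OPTIMIZE-style rewrite of step 2, and a real table service would publish the rewritten files through a commit on the Hudi timeline so readers switch atomically:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object SmallFileOptimizeSketch {
  // Step 1: with the small-file limit at 0, small-file handling is disabled
  // and every insert batch simply writes new Parquet files.
  val appendOnlyOpts: Map[String, String] = Map(
    "hoodie.datasource.write.operation" -> "insert",
    "hoodie.parquet.small.file.limit"   -> "0"
  )

  // Step 2 (hypothetical): bin small Parquet files into ~targetBytes outputs.
  // The flat listing and staging directory are placeholders; a real service
  // would select files per partition and commit the rewrite atomically.
  def optimizeSketch(spark: SparkSession, tablePath: String, targetBytes: Long): Unit = {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val small = fs.listStatus(new Path(tablePath))
      .filter(s => s.isFile && s.getPath.getName.endsWith(".parquet"))
      .filter(_.getLen < targetBytes)
    if (small.nonEmpty) {
      val totalBytes = small.map(_.getLen).sum
      val numOut = math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
      spark.read
        .parquet(small.map(_.getPath.toString): _*)
        .coalesce(numOut) // rewrite many small files into a few large ones
        .write.mode("overwrite")
        .parquet(tablePath + "/.optimize_staging")
    }
  }
}
```

The options map would be merged into the writer options from the earlier sketch; whether auto-merge stays on by default, as item 1 proposes, is a separate configuration decision.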
[1] https://databricks.com/product/delta-lake-on-databricks
[2] https://docs.databricks.com/delta/optimizations/file-mgmt.html
Rollout/Adoption Plan
- No impact on existing users, since this only adds new functionality.
Test Plan
- Unit tests
- Integration tests
- Test on a cluster with a larger dataset.