- Proposers
- Approvers
- Status
- Abstract
- Background
- Batch import and export
- Batch insert
- Implementation
- Concurrency
- Limitation and Solution
- Independent Operation
- Low latency
- Configuration
Proposers
Approvers
- @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
- @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
Status
Current state:
Current State | |
---|---|
UNDER DISCUSSION | |
IN PROGRESS | |
ABANDONED | |
COMPLETED | |
INACTIVE | |
Discussion thread: here
JIRA: here
Released: <Hudi Version>
Abstract
Hudi supports single-record insertion statements. However, when a large amount of data must be written to Hudi at one time, a batch operation is required. The most commonly used batch methods are import and export, and we can import and export batch data through Hudi itself rather than through HDFS directly. By its nature, the batch operation for analytical data is based on the Hadoop MapReduce model; finally, the data is imported into HDFS.
Background
Hudi batch operation is divided into 2 main components:
- Parallel data insertion in the form of a data stream.
- Transforming the original data directly into Parquet files according to the table information, copying the data to the corresponding location in HDFS, and finally managing the data.
However, batch operations in Hudi may incur high latency when listing and reading objects. This latency makes patterns such as loading a stream materialize as thousands of small objects. As is well known, storing many small files is a problem in HDFS, and the performance impact is worse in this scenario because of the complex metadata involved.
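A minimal sketch of the mitigation implied above: grouping incoming records into larger batches, so that a stream produces a few large files instead of thousands of small objects. The class name and record type here are illustrative, not part of Hudi.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching idea behind the small-file discussion:
// instead of writing each record as its own small object, group
// records into larger batches so far fewer files hit the store.
// Batcher and its String record type are placeholders.
public class Batcher {

    // Group records into batches of at most batchSize elements.
    public static List<List<String>> batch(List<String> records, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                records.subList(i, Math.min(i + batchSize, records.size()))));
        }
        return batches;
    }
}
```

Each batch would then be written as a single larger file, keeping the object count (and the metadata overhead) proportional to the batch count rather than the record count.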
Implementation
Write Implementation
How batch operation works
We use concurrency to achieve the batch operation. The main class hudiimport is used to import an RDD dataset into Hudi. The import procedure is as follows:
(1) We apply the fileparse class to parse the files to be imported, such as txt, csv, and tsv. We also define a badfileparseexception to handle malformed files.
(2) We apply the lineparse class to parse the command line, and define a badparselineexception to handle invalid command lines.
(3) We apply multithreading to run the fileparse class in batches, accelerating the import process.
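The steps above could be sketched as follows, assuming a simple delimited-text source format. The fileparse and badfileparseexception names come from this proposal; the method signatures and thread-pool wiring are invented here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the proposed import path: parse source files (txt/csv/tsv)
// on a thread pool before handing records to the writer.
public class FileParse {

    // Thrown when a source file contains a malformed line (per the RFC).
    public static class BadFileParseException extends RuntimeException {
        public BadFileParseException(String msg) { super(msg); }
    }

    // Split one delimited line into trimmed fields; reject empty lines.
    public static List<String> parseLine(String line, String delimiter) {
        if (line == null || line.isEmpty()) {
            throw new BadFileParseException("empty line");
        }
        List<String> fields = new ArrayList<>();
        for (String f : line.split(delimiter, -1)) {
            fields.add(f.trim());
        }
        return fields;
    }

    // Parse many lines concurrently; each task parses one line.
    public static List<List<String>> parseAll(List<String> lines, String delimiter, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String line : lines) {
                futures.add(pool.submit(() -> parseLine(line, delimiter)));
            }
            List<List<String>> out = new ArrayList<>(futures.size());
            for (Future<List<String>> f : futures) {
                try {
                    out.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new BadFileParseException("parse failed: " + e.getMessage());
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

A fixed-size pool keeps the degree of parallelism bounded regardless of how many source lines arrive, which matches step (3)'s goal of accelerating parsing without unbounded thread creation.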
Hudi provides two standard command-line tools, hudiimport and hudiexport, for batch data operations.
Import data example
hudiimport -h master:1990 -l lake -t table hdfs://master:8020/input -i source.txt
Here, -h specifies the host and port to connect to, -l refers to the data lake name, -t refers to the table name, and -i refers to the source file. source.txt is then imported into Hudi.
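As an illustration of the lineparse step for flags like these, a minimal parser might look as follows. The class and exception names follow this proposal, but the parsing logic is a sketch, not the actual Hudi CLI.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed lineparse step: turn the hudiimport flags
// (-h host:port, -l lake, -t table, -i source file) into a key/value
// map, rejecting dangling flags via badparselineexception.
public class LineParse {

    // Thrown when the command line is malformed (per the RFC).
    public static class BadParseLineException extends RuntimeException {
        public BadParseLineException(String msg) { super(msg); }
    }

    // Parse "-flag value" pairs; a bare token (e.g. an HDFS path)
    // is stored under the hypothetical "input" key.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                if (i + 1 >= args.length || args[i + 1].startsWith("-")) {
                    throw new BadParseLineException("missing value for " + args[i]);
                }
                opts.put(args[i].substring(1), args[++i]);
            } else {
                opts.put("input", args[i]); // bare positional argument
            }
        }
        return opts;
    }
}
```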
Export data example
hudiexport -h master:1990 -l lake -t table -e target.txt
Limitation and Solution
Independent Operation
Hudi provides serializable operations. However, each table has its own operation log. Sharing the operation log across multiple tables would remove this limitation.
Low latency
Hudi is limited by the latency of the underlying object storage. It is difficult to achieve millisecond streaming latency using batch operations, so Hudi runs parallel jobs instead.
Performance Evaluation
Todo: performance comparison
- Single insertion and average single-insert performance comparison
- Import and export performance comparison
Rollout/Adoption Plan
- No impact on existing users, because this only adds a new batch method.
- New configurations will be added to the documentation for tuning.
Test Plan
Similar to the existing batch tests:
Unit tests
Integration testing
Black box testing
Performance testing
Stress testing
TPC-DS testing
1 Comment
Ethan Guo
Marking this as abandoned since there's no discussion for a long time.