                                  Concurrency

Proposers

Approvers

  • @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]

Status

Current State

  • UNDER DISCUSSION
  • IN PROGRESS
  • ABANDONED (tick)
  • COMPLETED
  • INACTIVE


Discussion thread: here

JIRA: here

Released: <Hudi Version>

Abstract

Hudi supports single insert statements. However, when a large amount of data must be written to Hudi at one time, a batch operation is required. The most commonly used batch methods are import and export. With them, batch data can be imported and exported through Hudi rather than directly through HDFS. Batch operations on analytical data are typically based on Hadoop MapReduce, and the imported data is ultimately stored in HDFS.

...

Implementation

Write Implementation

How batch operation works

We use concurrency to achieve the batch operation. The main class, hudiimport, imports an RDD dataset into Hudi. The import procedure is as follows:
(1) The fileparse class parses the file to be imported, such as a txt, csv, or tsv file. A badfileparseexception is defined to handle files that cannot be parsed.
(2) The lineparse class parses the command line. A badparselineexception is defined to handle invalid command lines.
(3) Multithreading runs the fileparse class on many files in parallel, which batches the work and accelerates the import.
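
Steps (1) and (3) above can be sketched roughly as follows. This is a minimal illustration in Python, not the proposed implementation. The names parse_file, parse_all, BadFileParseException, and the extension-to-delimiter mapping are hypothetical stand-ins for the fileparse class and badfileparseexception. Files are passed as in-memory text to keep the example self-contained, and the write into Hudi is omitted.

```python
import csv
import io
import os
from concurrent.futures import ThreadPoolExecutor


class BadFileParseException(Exception):
    """Hypothetical stand-in for the proposal's badfileparseexception."""


# Assumed mapping from file extension to field delimiter.
DELIMITERS = {".csv": ",", ".tsv": "\t", ".txt": "\t"}


def parse_file(name, text):
    """Hypothetical stand-in for fileparse: split one file's text into rows."""
    suffix = os.path.splitext(name)[1].lower()
    if suffix not in DELIMITERS:
        raise BadFileParseException(f"unsupported file type: {name}")
    rows = list(csv.reader(io.StringIO(text), delimiter=DELIMITERS[suffix]))
    # Treat a file whose rows differ in column count as a bad file.
    if len({len(row) for row in rows}) > 1:
        raise BadFileParseException(f"inconsistent column count in {name}")
    return rows


def parse_all(files, max_workers=4):
    """Parse many files on a thread pool; return (parsed, failed) dicts."""
    parsed, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(parse_file, name, text)
            for name, text in files.items()
        }
        for name, future in futures.items():
            try:
                parsed[name] = future.result()
            except BadFileParseException as exc:
                # One bad file is recorded without aborting the batch.
                failed[name] = str(exc)
    return parsed, failed


files = {
    "a.csv": "1,alice\n2,bob\n",
    "b.tsv": "3\tcarol\n",
    "c.csv": "4,dave\n5\n",  # second row is missing a column
    "d.bin": "??",           # unsupported extension
}
parsed, failed = parse_all(files)
```

Here parsed holds the rows of a.csv and b.tsv, while failed records c.csv and d.bin. Collecting failures per file, rather than raising on the first one, is a design choice of this sketch so that a single bad file does not stop the whole batch.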

Hudi provides two standard command-line tools, hudiimport and hudiexport, to perform batch data operations.

...

Hudiimport -h master:1990 -l lake -t table -e target.txt
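
To illustrate how a command line like the one above might be parsed and validated, here is a small Python sketch. It is not the lineparse class itself. The meaning of each flag is an assumption, since the proposal does not define them: -h is taken as host:port, -l as the lake name, -t as the table name, and -e as the file to import. BadParseLineException and parse_import_args are hypothetical names.

```python
import argparse


class BadParseLineException(Exception):
    """Hypothetical stand-in for the proposal's badparselineexception."""


class _RaisingParser(argparse.ArgumentParser):
    def error(self, message):
        # argparse exits the process on a bad command line by default;
        # raise an exception instead so the caller can handle it.
        raise BadParseLineException(message)


def parse_import_args(argv):
    # add_help=False frees -h, which argparse otherwise reserves for --help.
    parser = _RaisingParser(prog="hudiimport", add_help=False)
    parser.add_argument("-h", dest="host", required=True)   # assumed: host:port
    parser.add_argument("-l", dest="lake", required=True)   # assumed: lake name
    parser.add_argument("-t", dest="table", required=True)  # assumed: table name
    parser.add_argument("-e", dest="file", required=True)   # assumed: input file
    args = parser.parse_args(argv)

    host, sep, port = args.host.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise BadParseLineException(f"expected host:port, got {args.host!r}")
    return {
        "host": host,
        "port": int(port),
        "lake": args.lake,
        "table": args.table,
        "file": args.file,
    }


opts = parse_import_args("-h master:1990 -l lake -t table -e target.txt".split())
```

A command line that omits a flag, or gives a host without a numeric port, raises BadParseLineException instead of terminating the process.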

Limitation and Solution

Independent Operation 

  Hudi provides serializable operations. However, each table has its own operation log, so operations on different tables are independent of one another. Sharing the operation log across multiple tables would remove this limitation.

Low latency

   Hudi is limited by the latency of the underlying object storage, so it is difficult to achieve millisecond-level streaming latency with batch operations. To mitigate this, Hudi runs jobs in parallel.

...

Performance Evaluation

Todo: performance comparison

...