RFC - 16: Abstraction for HoodieInputFormat and RecordReader
Proposers
- @Yanjia Gary Li
Approvers
- Vinoth Chandar APPROVED
- lamber-ken APPROVED
- Bhavani Sudha APPROVED
Status
Current state:
...
Released: <Hudi Version>
Abstract
Currently, reading a Hudi Merge on Read table depends on MapredParquetInputFormat from Hive and RecordReader<NullWritable, ArrayWritable> from the org.apache.hadoop.mapred package, which is the first-generation MapReduce API.
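The gap between the two MapReduce API generations can be seen in their reader contracts: the first generation fills caller-provided, reusable key/value holders via next(key, value), while the second generation owns the current pair and exposes nextKeyValue()/getCurrentKey()/getCurrentValue(). The sketch below uses simplified stand-in interfaces (not the real Hadoop classes) to show how a single API-agnostic record source could be exposed through adapters for either generation, which is the shape of abstraction this RFC is after:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the two Hadoop API generations (NOT the real Hadoop classes).

// First generation (shape of org.apache.hadoop.mapred.RecordReader):
// the caller supplies reusable key/value holders that next() fills in.
interface OldGenRecordReader<K, V> {
    boolean next(K key, V value); // returns false at end of input
}

// Second generation (shape of org.apache.hadoop.mapreduce.RecordReader):
// the reader owns the current key/value pair.
interface NewGenRecordReader<K, V> {
    boolean nextKeyValue();
    K getCurrentKey();
    V getCurrentValue();
}

// An API-agnostic record source; in Hudi this is where the merging logic would live.
class RecordSource implements Iterator<String[]> {
    private final Iterator<String[]> rows;
    RecordSource(List<String[]> data) { this.rows = data.iterator(); }
    public boolean hasNext() { return rows.hasNext(); }
    public String[] next() { return rows.next(); }
}

// Adapter exposing the shared source through the first-generation contract.
class OldGenAdapter implements OldGenRecordReader<StringBuilder, StringBuilder> {
    private final RecordSource source;
    OldGenAdapter(RecordSource source) { this.source = source; }
    public boolean next(StringBuilder key, StringBuilder value) {
        if (!source.hasNext()) return false;
        String[] row = source.next();
        key.setLength(0); key.append(row[0]);     // reuse caller's holders
        value.setLength(0); value.append(row[1]);
        return true;
    }
}

// Adapter exposing the same source through the second-generation contract.
class NewGenAdapter implements NewGenRecordReader<String, String> {
    private final RecordSource source;
    private String key;
    private String value;
    NewGenAdapter(RecordSource source) { this.source = source; }
    public boolean nextKeyValue() {
        if (!source.hasNext()) return false;
        String[] row = source.next();
        key = row[0];
        value = row[1];
        return true;
    }
    public String getCurrentKey() { return key; }
    public String getCurrentValue() { return value; }
}

public class ApiGenerationsDemo {
    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[]{"k1", "v1"}, new String[]{"k2", "v2"});

        OldGenAdapter oldReader = new OldGenAdapter(new RecordSource(data));
        StringBuilder k = new StringBuilder();
        StringBuilder v = new StringBuilder();
        while (oldReader.next(k, v)) {
            System.out.println("old-gen: " + k + "=" + v);
        }

        NewGenAdapter newReader = new NewGenAdapter(new RecordSource(data));
        while (newReader.nextKeyValue()) {
            System.out.println("new-gen: " + newReader.getCurrentKey()
                + "=" + newReader.getCurrentValue());
        }
    }
}
```

Because both adapters delegate to the same RecordSource, the Hudi-specific logic only has to be written once, regardless of which API generation the query engine uses.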
...
So, I am proposing to support both generations of the API and to abstract out the Hudi record-merging logic. This will give us flexibility when adding support for other query engines.
Background
Problems to solve:
- Decouple Hudi-related logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
- Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-related logic into them.
- Wrap the FileInputFormat from the query engine to take advantage of its optimizations. Taking Spark SQL as an example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader<Row> from the Spark codebase with the Hudi merging logic, and extend the support to OrcFileFormat in the future.
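The wrapping idea in the last bullet can be sketched in engine-neutral terms: the engine's reader produces base-file rows, and a thin Hudi layer merges in newer versions of each record from the delta logs. The class and field names below (MergingRecordIterator, deltaUpdates, the Map-based row representation) are illustrative assumptions, not actual Hudi APIs, and the sketch only covers log updates, not inserts or deletes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the Hudi merging logic lives in one engine-agnostic class,
// and each query engine's reader is wrapped around it. Rows are modeled as
// simple Maps for illustration; a real implementation would use the engine's row type.
class MergingRecordIterator implements Iterator<Map<String, String>> {
    private final Iterator<Map<String, String>> baseFileRows;      // rows from the engine's Parquet reader
    private final Map<String, Map<String, String>> deltaUpdates;   // delta-log records keyed by record key

    MergingRecordIterator(Iterator<Map<String, String>> baseFileRows,
                          Map<String, Map<String, String>> deltaUpdates) {
        this.baseFileRows = baseFileRows;
        this.deltaUpdates = deltaUpdates;
    }

    public boolean hasNext() { return baseFileRows.hasNext(); }

    public Map<String, String> next() {
        Map<String, String> row = baseFileRows.next();
        // If a newer version of this record exists in the delta logs, return it instead.
        Map<String, String> updated = deltaUpdates.get(row.get("_hoodie_record_key"));
        return updated != null ? updated : row;
    }
}

public class MergeDemo {
    public static void main(String[] args) {
        List<Map<String, String>> base = Arrays.asList(
            row("key1", "old"), row("key2", "unchanged"));
        Map<String, Map<String, String>> deltas = new HashMap<>();
        deltas.put("key1", row("key1", "new"));

        MergingRecordIterator it = new MergingRecordIterator(base.iterator(), deltas);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }

    static Map<String, String> row(String key, String value) {
        Map<String, String> m = new HashMap<>();
        m.put("_hoodie_record_key", key);
        m.put("value", value);
        return m;
    }
}
```

Because the merge layer only depends on an Iterator of rows, the same class could sit behind a mapred RecordReader, a mapreduce RecordReader, or a Spark FileFormat wrapper.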
Implementation
https://github.com/apache/incubator-hudi/pull/1592
Rollout/Adoption Plan
- No impact on existing users, because the existing Hive-related InputFormat won't be changed, except that some methods were relocated to the HoodieInputFormatUtils class. We will test that this does not impact Hive queries.
- New Spark Datasource support for Merge on Read tables will be added.
Test Plan
- Unit tests
- Integration tests
- Test on a cluster with a larger dataset.