RFC-16: Abstraction for HoodieInputFormat and RecordReader

Table of Contents
Proposers
Approvers
- Vinoth Chandar : [APPROVED/REQUESTED_INFO/REJECTED]
- lamber-ken : APPROVED
- Bhavani Sudha : APPROVED
- ...
Status
Current state:
...
Released: <Hudi Version>
Abstract
Currently, reading a Hudi Merge on Read table depends on MapredParquetInputFormat from Hive and RecordReader&lt;NullWritable, ArrayWritable&gt; from the org.apache.hadoop.mapred package, which is the first-generation MapReduce API.
...
So, I am proposing to support both generations of the API and to abstract out the Hudi record merging logic. This will give us flexibility when adding support for other query engines.
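To make the "two generations" point concrete, here is a minimal, self-contained sketch of the difference in reader style. These are simplified stand-in interfaces, not the actual Hadoop ones (the real org.apache.hadoop.mapred.RecordReader and org.apache.hadoop.mapreduce.RecordReader have additional methods such as createKey, initialize, and close): the old API fills caller-supplied key/value holders, while the new API is a cursor that owns the current key and value.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

public class ApiGenerations {
    // Old style (like org.apache.hadoop.mapred): the caller allocates the
    // key/value objects and next(...) mutates them in place.
    interface OldStyleRecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
    }

    // New style (like org.apache.hadoop.mapreduce): the reader is a cursor
    // that owns the current key/value pair.
    interface NewStyleRecordReader<K, V> {
        boolean nextKeyValue() throws IOException;
        K getCurrentKey();
        V getCurrentValue();
    }

    // A toy new-style reader backed by an in-memory list, just to show the
    // cursor pattern an engine-agnostic abstraction would target.
    static class ListRecordReader implements NewStyleRecordReader<Integer, String> {
        private final Iterator<String> rows;
        private int key = -1;
        private String value;

        ListRecordReader(List<String> data) {
            this.rows = data.iterator();
        }

        public boolean nextKeyValue() {
            if (!rows.hasNext()) return false;
            key++;
            value = rows.next();
            return true;
        }

        public Integer getCurrentKey() { return key; }
        public String getCurrentValue() { return value; }
    }

    public static void main(String[] args) throws IOException {
        NewStyleRecordReader<Integer, String> reader =
                new ListRecordReader(List.of("a", "b"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
    }
}
```

An abstraction layer that targets the cursor shape can back either generation: an old-style reader is easy to adapt behind nextKeyValue(), which is why supporting both APIs mostly means wrapping, not rewriting, the merging logic.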
Background
Problems to solve:
- Decouple Hudi-related logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
- Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-related logic into them.
- Wrap the FileInputFormat from the query engine to take advantage of its optimizations. For Spark SQL, for example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader&lt;Row&gt; from the Spark codebase with the Hudi merging logic, and extend support to OrcFileFormat in the future.
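The wrapping idea in the last bullet can be sketched engine-agnostically. The class below is hypothetical (MergingRecordReader is an illustrative name, not from the Hudi codebase) and only shows the update path of Merge on Read: records already merged from the log files shadow base-file rows with the same record key. A real implementation would also handle deletes and log-only inserts, and would delegate to the engine's actual reader (e.g. ParquetRecordReader) instead of an in-memory list.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Decorator over a "base file" reader that overlays newer log records,
// exposing the same cursor shape as mapreduce-style readers.
public class MergingRecordReader {
    private final Iterator<Map.Entry<String, String>> baseRows; // base rows: record key -> payload
    private final Map<String, String> logUpdates;               // merged log records: key -> newer payload
    private String currentKey;
    private String currentValue;

    MergingRecordReader(List<Map.Entry<String, String>> baseRows,
                        Map<String, String> logUpdates) {
        this.baseRows = baseRows.iterator();
        this.logUpdates = logUpdates;
    }

    // Advance the cursor; a log record with the same key shadows the base row.
    public boolean nextKeyValue() {
        if (!baseRows.hasNext()) return false;
        Map.Entry<String, String> row = baseRows.next();
        currentKey = row.getKey();
        currentValue = logUpdates.getOrDefault(currentKey, row.getValue());
        return true;
    }

    public String getCurrentKey() { return currentKey; }
    public String getCurrentValue() { return currentValue; }
}
```

Because the merging logic lives entirely in the wrapper, the same decorator can sit on top of a Hive mapred reader, a mapreduce reader, or a Spark ParquetRecordReader&lt;Row&gt; with only the delegate changing.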
Implementation
https://github.com/apache/incubator-hudi/pull/1592
WIP
Rollout/Adoption Plan
- No impact on existing users, because the existing Hive-related InputFormat won't be changed, except that some methods were relocated to the HoodieInputFormatUtils class. We will test that this does not affect Hive queries.
- New Spark Datasource support for Merge on Read table will be added
Test Plan
- Unit tests
- Integration tests
- Test on a cluster with a larger dataset.
...