RFC - 16: Abstraction for HoodieInputFormat and RecordReader

Table of Contents

Proposers

Approvers

Status

Current state

  • (tick) UNDER DISCUSSION
  • IN PROGRESS
  • ABANDONED
  • COMPLETED
  • INACTIVE

...

JIRA: HUDI-69, HUDI-822

Released: <Hudi Version>

Abstract

Currently, reading a Hudi Merge on Read table depends on MapredParquetInputFormat from Hive and RecordReader<NullWritable, ArrayWritable> from the org.apache.hadoop.mapred package, which is the first-generation MapReduce API.

...

So, I am proposing to support both generations of the API and to abstract out the Hudi record merging logic. This will give us flexibility when adding support for other query engines.
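As a rough illustration of what "abstracting the record merging logic" could look like, the merge step can be expressed independently of either InputFormat generation. This is a hypothetical sketch; all class and method names below are illustrative and do not exist in the Hudi codebase:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: engine-agnostic record merging, keyed by record key,
// keeping the record with the highest ordering (precombine) value.
public class RecordMergeSketch {

    // A minimal stand-in for a Hudi record: key plus an ordering field.
    record SimpleRecord(String key, long orderingValue, String payload) {}

    // The merging logic itself knows nothing about mapred vs. mapreduce APIs;
    // an InputFormat/RecordReader for any engine could delegate to it.
    static Map<String, SimpleRecord> merge(List<SimpleRecord> baseFileRecords,
                                           List<SimpleRecord> logFileRecords) {
        Map<String, SimpleRecord> merged = new HashMap<>();
        for (SimpleRecord r : baseFileRecords) {
            merged.put(r.key(), r);
        }
        // A log record replaces the base record only if it is at least as new.
        for (SimpleRecord r : logFileRecords) {
            merged.merge(r.key(), r,
                (oldRec, newRec) ->
                    newRec.orderingValue() >= oldRec.orderingValue() ? newRec : oldRec);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SimpleRecord> base = List.of(new SimpleRecord("a", 1, "base-a"),
                                          new SimpleRecord("b", 1, "base-b"));
        List<SimpleRecord> log = List.of(new SimpleRecord("a", 2, "log-a"));
        Map<String, SimpleRecord> out = merge(base, log);
        System.out.println(out.get("a").payload()); // log-a
        System.out.println(out.get("b").payload()); // base-b
    }
}
```

Because nothing here references mapred, mapreduce, or Hive types, the same merge routine could back a RecordReader built on either API generation.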

Background

Problems to solve:

  • Decouple Hudi-related logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
  • Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-related logic into them.
  • Wrap the FileInputFormat from the query engine to take advantage of its optimizations. Taking Spark SQL as an example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader<Row> from the Spark codebase with the Hudi merging logic, and extend support to OrcFileFormat in the future.
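The wrapping idea in the last bullet is essentially a decorator: a Hudi reader wraps the engine's own record reader and overlays merged values before handing records back. A hypothetical, engine-neutral sketch (none of these class names exist in Hudi, and a real implementation would also have to emit log-only inserts and handle deletes):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the wrapper pattern: a merging reader decorates
// whatever record iterator the query engine produces (Parquet rows, ORC rows,
// ...) without the engine's reader knowing anything about Hudi.
public class MergingReaderSketch {

    static class HoodieMergingReader<T> implements Iterator<T> {
        private final Iterator<T> baseReader;        // the engine's reader, unchanged
        private final Map<String, T> logRecords;     // record key -> latest log record
        private final Function<T, String> keyExtractor;

        HoodieMergingReader(Iterator<T> baseReader,
                            Map<String, T> logRecords,
                            Function<T, String> keyExtractor) {
            this.baseReader = baseReader;
            this.logRecords = logRecords;
            this.keyExtractor = keyExtractor;
        }

        @Override
        public boolean hasNext() {
            return baseReader.hasNext();
        }

        // For each record from the base file, return the log version if one exists.
        @Override
        public T next() {
            T base = baseReader.next();
            return logRecords.getOrDefault(keyExtractor.apply(base), base);
        }
    }
}
```

The same wrapper shape would apply whether the underlying reader is Hive's MapredParquetInputFormat, a mapreduce-based reader, or Spark's ParquetRecordReader<Row>; only the record type T and the key extractor change.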


Implementation

WIP

Rollout/Adoption Plan

  • No impact on existing users, because the existing Hive-related InputFormat won't be changed, except that some methods were relocated to the HoodieInputFormatUtils class. We will test that this does not affect Hive queries.
  • New Spark DataSource support for Merge on Read tables will be added.

Test Plan

  • Unit tests
  • Integration tests
  • Test on a cluster with a larger dataset.

...