RFC-16: Abstraction for HoodieInputFormat and RecordReader

Table of Contents
Proposers
Approvers
- Vinoth Chandar : [APPROVED/REQUESTED_INFO/REJECTED]
- lamber-ken : APPROVED
- Bhavani Sudha : APPROVED
- ...
Status
Current state:
...
Released: <Hudi Version>
Abstract
Currently, reading a Hudi Merge on Read table depends on MapredParquetInputFormat from Hive and RecordReader&lt;NullWritable, ArrayWritable&gt; from the org.apache.hadoop.mapred package, which is the first-generation MapReduce API.
...
So, I am proposing to support both generations of the API and to abstract out the Hudi record merging logic. This will give us flexibility when adding support for other query engines.
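To make the "two generations" point concrete, here is a minimal, self-contained sketch of the difference in reader style. These are simplified stand-in interfaces, not the actual Hadoop ones (the real org.apache.hadoop.mapred.RecordReader and org.apache.hadoop.mapreduce.RecordReader have additional methods such as createKey, initialize, and close): the old API fills caller-supplied key/value holders, while the new API is a cursor that owns the current key and value.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

public class ApiGenerations {
    // Old style (like org.apache.hadoop.mapred): the caller allocates the
    // key/value objects and next(...) mutates them in place.
    interface OldStyleRecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
    }

    // New style (like org.apache.hadoop.mapreduce): the reader is a cursor
    // that owns the current key/value pair.
    interface NewStyleRecordReader<K, V> {
        boolean nextKeyValue() throws IOException;
        K getCurrentKey();
        V getCurrentValue();
    }

    // A toy new-style reader backed by an in-memory list, just to show the
    // cursor pattern an engine-agnostic abstraction would target.
    static class ListRecordReader implements NewStyleRecordReader<Integer, String> {
        private final Iterator<String> rows;
        private int key = -1;
        private String value;

        ListRecordReader(List<String> data) {
            this.rows = data.iterator();
        }

        public boolean nextKeyValue() {
            if (!rows.hasNext()) return false;
            key++;
            value = rows.next();
            return true;
        }

        public Integer getCurrentKey() { return key; }
        public String getCurrentValue() { return value; }
    }

    public static void main(String[] args) throws IOException {
        NewStyleRecordReader<Integer, String> reader =
                new ListRecordReader(List.of("a", "b"));
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
    }
}
```

An abstraction layer that targets the cursor shape can back either generation: an old-style reader is easy to adapt behind nextKeyValue(), which is why supporting both APIs mostly means wrapping, not rewriting, the merging logic.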
Background
Problems to solve:
- Decouple Hudi-related logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
- Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-related logic into them.
- Wrap the FileInputFormat from the query engine to take advantage of its optimizations. For Spark SQL, for example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader&lt;Row&gt; from the Spark codebase with the Hudi merging logic, and extend support to OrcFileFormat in the future.
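The wrapping idea in the last bullet can be sketched engine-agnostically. The class below is hypothetical (MergingRecordReader is an illustrative name, not from the Hudi codebase) and only shows the update path of Merge on Read: records already merged from the log files shadow base-file rows with the same record key. A real implementation would also handle deletes and log-only inserts, and would delegate to the engine's actual reader (e.g. ParquetRecordReader) instead of an in-memory list.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Decorator over a "base file" reader that overlays newer log records,
// exposing the same cursor shape as mapreduce-style readers.
public class MergingRecordReader {
    private final Iterator<Map.Entry<String, String>> baseRows; // base rows: record key -> payload
    private final Map<String, String> logUpdates;               // merged log records: key -> newer payload
    private String currentKey;
    private String currentValue;

    MergingRecordReader(List<Map.Entry<String, String>> baseRows,
                        Map<String, String> logUpdates) {
        this.baseRows = baseRows.iterator();
        this.logUpdates = logUpdates;
    }

    // Advance the cursor; a log record with the same key shadows the base row.
    public boolean nextKeyValue() {
        if (!baseRows.hasNext()) return false;
        Map.Entry<String, String> row = baseRows.next();
        currentKey = row.getKey();
        currentValue = logUpdates.getOrDefault(currentKey, row.getValue());
        return true;
    }

    public String getCurrentKey() { return currentKey; }
    public String getCurrentValue() { return currentValue; }
}
```

Because the merging logic lives entirely in the wrapper, the same decorator can sit on top of a Hive mapred reader, a mapreduce reader, or a Spark ParquetRecordReader&lt;Row&gt; with only the delegate changing.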
Implementation
https://github.com/apache/incubator-hudi/pull/1592
WIP
Rollout/Adoption Plan
- No impact on existing users, because the existing Hive-related InputFormat won't be changed, except that some methods were relocated to the HoodieInputFormatUtils class. We will test that this does not affect Hive queries.
- New Spark Datasource support for Merge on Read table will be added
Test Plan
- Unit tests
- Integration tests
- Test on a cluster with a larger dataset.
...