
RFC - 16: Abstract HoodieInputFormat and RecordReader

Proposers

  • @Yanjia Li

Approvers

  • @<approver1 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • @<approver2 JIRA username> : [APPROVED/REQUESTED_INFO/REJECTED]
  • ...

Status

Current state:

  • UNDER DISCUSSION (tick)
  • IN PROGRESS
  • ABANDONED
  • COMPLETED
  • INACTIVE


Discussion thread: here

JIRA: HUDI-69, HUDI-822

Released: <Hudi Version>

Abstract

Currently, reading a Hudi Merge on Read table depends on MapredParquetInputFormat from Hive and RecordReader<NullWritable, ArrayWritable> from the org.apache.hadoop.mapred package, which is the first-generation MapReduce API.

This dependency makes it very difficult for other query engines (e.g. Spark SQL) to reuse the existing record merging code, because the first-generation MapReduce API is incompatible with the APIs those engines are built on.

Hive does not support the second-generation MapReduce API at this point (see cwiki-proposal, HIVE-3752).

For the Spark datasource, FileInputFormat and RecordReader are based on org.apache.hadoop.mapreduce, which is the second-generation API. Based on online discussions, mixing the first- and second-generation APIs leads to unexpected behavior.
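To make the incompatibility concrete, here is a minimal sketch (assuming an input format, a split, and the job/task context are already available) of how a client pulls records through each generation of the API; neither read loop can be reused against the other reader type.

import java.io.IOException;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class RecordReaderGenerations {

  // First-generation API (org.apache.hadoop.mapred): the path Hive and the
  // current HoodieRealtimeRecordReader use. The caller pre-creates reusable
  // key/value holders and pulls records with next(key, value).
  static long readWithMapredApi(org.apache.hadoop.mapred.InputFormat<NullWritable, ArrayWritable> format,
                                org.apache.hadoop.mapred.InputSplit split,
                                JobConf jobConf) throws IOException {
    org.apache.hadoop.mapred.RecordReader<NullWritable, ArrayWritable> reader =
        format.getRecordReader(split, jobConf, Reporter.NULL);
    NullWritable key = reader.createKey();
    ArrayWritable value = reader.createValue();
    long count = 0;
    while (reader.next(key, value)) {
      count++; // Hudi's base-file/log merging currently happens inside this reader
    }
    reader.close();
    return count;
  }

  // Second-generation API (org.apache.hadoop.mapreduce): the path Spark's
  // ParquetRecordReader uses. The reader is initialized with a
  // TaskAttemptContext and exposes nextKeyValue()/getCurrentValue() instead.
  static long readWithMapreduceApi(org.apache.hadoop.mapreduce.InputFormat<NullWritable, ArrayWritable> format,
                                   org.apache.hadoop.mapreduce.InputSplit split,
                                   TaskAttemptContext context) throws IOException, InterruptedException {
    org.apache.hadoop.mapreduce.RecordReader<NullWritable, ArrayWritable> reader =
        format.createRecordReader(split, context);
    reader.initialize(split, context);
    long count = 0;
    while (reader.nextKeyValue()) {
      ArrayWritable value = reader.getCurrentValue();
      count++;
    }
    reader.close();
    return count;
  }
}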

So, I am proposing to support both generations of the API and to abstract the Hudi record merging logic. This will give us flexibility when adding support for other query engines.

Background

Problems to solve:

  • Decouple Hudi-related logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
  • Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-related logic into them.
  • Wrap the FileInputFormat from the query engine to take advantage of its optimizations. Taking Spark SQL as an example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader<Row> from the Spark codebase with the Hudi merging logic, and extend the support to OrcFileFormat in the future. (A sketch of this layering is shown after this list.)
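As a rough sketch of the intended layering (the names HoodieMergedRecordIterator and HoodieMapReduceRealtimeRecordReader below are hypothetical placeholders, not a final API), the record merging logic would sit behind an engine-agnostic iterator, and each generation's RecordReader would become a thin adapter over it:

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical engine-agnostic core: iterates the merged view of base-file
// records and log records for one file slice. It knows nothing about
// mapred vs. mapreduce, or Hive vs. Spark.
interface HoodieMergedRecordIterator extends Iterator<ArrayWritable>, Closeable {
}

// Hypothetical second-generation (org.apache.hadoop.mapreduce) reader: a thin
// adapter that maps the new API onto the shared merging core.
class HoodieMapReduceRealtimeRecordReader extends RecordReader<NullWritable, ArrayWritable> {

  private final HoodieMergedRecordIterator mergedIterator;
  private ArrayWritable currentValue;

  HoodieMapReduceRealtimeRecordReader(HoodieMergedRecordIterator mergedIterator) {
    this.mergedIterator = mergedIterator;
  }

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    // In a real implementation, the base-file reader and log scanner behind
    // mergedIterator would be wired up from the split here.
  }

  @Override
  public boolean nextKeyValue() {
    if (!mergedIterator.hasNext()) {
      return false;
    }
    currentValue = mergedIterator.next();
    return true;
  }

  @Override
  public NullWritable getCurrentKey() {
    return NullWritable.get();
  }

  @Override
  public ArrayWritable getCurrentValue() {
    return currentValue;
  }

  @Override
  public float getProgress() {
    return 0.0f; // progress reporting omitted in this sketch
  }

  @Override
  public void close() throws IOException {
    mergedIterator.close();
  }
}

The existing mapred-based HoodieRealtimeRecordReader would shrink to a similar adapter over the same iterator, and a Spark HoodieParquetFileFormat could wrap ParquetFileFormat and feed its rows through the same merging core.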


Implementation

<Describe the new thing you want to do in appropriate detail, how it fits into the project architecture. Provide a detailed description of how you intend to implement this feature.This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.>

Rollout/Adoption Plan

  • <What impact (if any) will there be on existing users?>
  • <If we are changing behavior how will we phase out the older behavior?>
  • <If we need special migration tools, describe them here.>
  • <When will we remove the existing behavior?>

Test Plan

<Describe in few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?>






