...

Table of Contents

Proposers

Approvers

  • Vinoth Chandar [APPROVED/REQUESTED_INFO/REJECTED]
  • lamber-ken [APPROVED/REQUESTED_INFO/REJECTED]
  • ...

...

  • Decouple Hudi-related logic from the existing HoodieParquetInputFormat, HoodieRealtimeInputFormat, HoodieRealtimeRecordReader, etc.
  • Create new classes that use the org.apache.hadoop.mapreduce APIs and wrap the Hudi-related logic in them.
  • Wrap the query engine's FileInputFormat to take advantage of its optimizations. For Spark SQL, for example, we can create a HoodieParquetFileFormat by wrapping ParquetFileFormat and ParquetRecordReader<Row> from the Spark codebase with Hudi merging logic, and extend support to OrcFileFormat in the future.
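
The wrapping idea in the last bullet can be sketched without any Spark or Hadoop dependency. The snippet below is only an illustration under stated assumptions: MergingReaderSketch, Rec, MergingIterator, and readAll are hypothetical names, not Hudi or Spark classes, and the merge rule (a log record replaces the base row with the same key, an empty log value deletes it, and log-only keys are emitted at the end) is a simplification chosen for the example rather than the actual Hudi merging logic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

// Illustrative sketch only: every name here is hypothetical, not a Hudi or Spark class.
class MergingReaderSketch {

    /** Minimal stand-in for a row: a record key plus one payload value. */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
        @Override public String toString() { return key + "=" + value; }
    }

    /**
     * Wraps an engine-provided iterator over base-file rows and overlays log records by key.
     * Assumed semantics: Optional.of(v) is an update or insert, Optional.empty() is a delete.
     */
    static final class MergingIterator implements Iterator<Rec> {
        private final Iterator<Rec> base;                      // the wrapped reader
        private final Map<String, Optional<String>> pending;   // log records not yet applied
        private Iterator<Map.Entry<String, Optional<String>>> logOnly;
        private Rec next;

        MergingIterator(Iterator<Rec> base, Map<String, Optional<String>> logRecords) {
            this.base = base;
            this.pending = new LinkedHashMap<>(logRecords);    // defensive copy, keeps log order
            advance();
        }

        /** Finds the next merged record, or sets next to null when both sources are drained. */
        private void advance() {
            while (base.hasNext()) {
                Rec r = base.next();
                if (!pending.containsKey(r.key)) { next = r; return; }   // untouched base row
                Optional<String> update = pending.remove(r.key);
                if (update.isPresent()) { next = new Rec(r.key, update.get()); return; }
                // Empty value: the row was deleted in the log, so skip it.
            }
            // Base is exhausted: emit log records whose keys never appeared in the base file.
            if (logOnly == null) logOnly = pending.entrySet().iterator();
            while (logOnly.hasNext()) {
                Map.Entry<String, Optional<String>> e = logOnly.next();
                if (e.getValue().isPresent()) {
                    next = new Rec(e.getKey(), e.getValue().get());
                    return;
                }
            }
            next = null;
        }

        @Override public boolean hasNext() { return next != null; }

        @Override public Rec next() {
            if (next == null) throw new NoSuchElementException();
            Rec out = next;
            advance();
            return out;
        }
    }

    /** Convenience helper: drains the merged view into a list of "key=value" strings. */
    static List<String> readAll(List<Rec> baseRows, Map<String, Optional<String>> logRecords) {
        List<String> out = new ArrayList<>();
        Iterator<Rec> it = new MergingIterator(baseRows.iterator(), logRecords);
        while (it.hasNext()) out.add(it.next().toString());
        return out;
    }
}
```

With base rows a=1, b=2, c=3 and log records that update b to 20, delete c, and add d=4, readAll returns a=1, b=20, d=4. The point of the design is that the wrapped iterator stays whatever the engine supplies, so engine-side optimizations in the underlying reader are untouched and only the merge step is Hudi-specific.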



Implementation

WIP

Rollout/Adoption Plan

  • No impact on existing users, because the existing Hive-related InputFormat classes won't change, except that some methods were relocated to the HoodieInputFormatUtils class. We will test that this does not affect Hive queries.
  • New Spark Datasource support for Merge on Read tables will be added.

Test Plan

  • Unit tests
  • Integration tests
  • Tests on a cluster with a larger dataset