Proposer

...

  • @rahuledavalath 

Approver

  • Vinoth Chandar [APPROVED/REQUESTED_INFO/REJECTED]
  • Balaji Varadarajan [APPROVED/REQUESTED_INFO/REJECTED]

Status

Current state: [One of "Under Discussion", "Accepted", "Rejected"]

Discussion thread: here

JIRA: here

Released: <Hudi Version>

Abstract

Hudi DeltaStreamer does not directly support pulling CSV data from Kafka or HDFS logs. Currently, the only way to ingest CSV data into a Hudi dataset is to first convert it to JSON or Avro before pulling it in through DeltaStreamer. This HIP proposes a mechanism to support CSV sources directly.

Background

N/A

Prior doc link: https://docs.google.com/document/d/1bj-xpkRomVtbzvLb_4BRngDIGkkMR5yzxXRRzkA7QVo/edit#heading=h.di66rda5xhp2

Implementation

...

Extend the DeltaStreamer by implementing a CSV source (Kafka/HDFS).

  • We can use the existing FilebasedSchemaProvider class to decode the CSV data.
  • If the CSV data does not contain a header, use _c0, _c1, _c2, etc. as the field names in the source schema, according to each field's position in the record. (This applies to Kafka data and to CSV files without a header.)
  • If a header is present in the CSV data, use the header information for the field names.
  • Introduce a configuration property to indicate whether a header is present.
  • Introduce a new configuration property to support configurable delimiters.
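
The field-naming rules above could look roughly like the following sketch. It is illustrative only and is not Hudi code: the class name CsvFieldNames and both property keys are hypothetical, since this proposal does not yet name the new properties. It also splits naively on the delimiter and does not handle quoted fields, which a real CSV source would need to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

final class CsvFieldNames {
    // Hypothetical property keys, for illustration only; the proposal does not name them.
    static final String HEADER_PROP = "hoodie.deltastreamer.csv.header";
    static final String DELIMITER_PROP = "hoodie.deltastreamer.csv.delimiter";

    /** Derives source-schema field names from the first line of the CSV input. */
    static List<String> resolve(String firstLine, Properties props) {
        boolean hasHeader = Boolean.parseBoolean(props.getProperty(HEADER_PROP, "false"));
        String delimiter = props.getProperty(DELIMITER_PROP, ",");
        // Quote the delimiter so characters such as '|' are not treated as regex syntax;
        // the -1 limit keeps trailing empty columns.
        String[] columns = firstLine.split(Pattern.quote(delimiter), -1);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columns.length; i++) {
            // With a header, take the name from the line; otherwise name by position.
            names.add(hasHeader ? columns[i].trim() : "_c" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(DELIMITER_PROP, "|");
        // No header configured: positional names.
        System.out.println(resolve("1|alice|2019-07-01", props)); // [_c0, _c1, _c2]
        // Header configured: names come from the first line.
        props.setProperty(HEADER_PROP, "true");
        System.out.println(resolve("id|name|ts", props)); // [id, name, ts]
    }
}
```

Defaulting to no header and a comma delimiter is an assumption made for this sketch; the actual defaults would be decided as part of this HIP.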

Rollout/Adoption Plan

  • There will be no impact on existing users; this is a new feature.

Test Plan

Unit tests and manual integration tests.