HIPRFC-1 : CSV Source Support for Delta Streamer
Proposer
- @rahuledavalath
Approver
- Vinoth Chandar : [APPROVED/REQUESTED_INFO/REJECTED]
- Balaji Varadarajan : REQUESTED_INFO
Status
Current state: "Under Discussion"
...
Prior doc link : https://docs.google.com/document/d/1bj-xpkRomVtbzvLb_4BRngDIGkkMR5yzxXRRzkA7QVo/edit#heading=h.di66rda5xhp2
Abstract
Hudi DeltaStreamer does not directly support pulling data in CSV format from Kafka or HDFS logs. Currently, the only way to ingest CSV data into a Hudi dataset is to first convert it to JSON or Avro and then pull it in through DeltaStreamer. This HIP proposes a mechanism to directly support sources in CSV format.
Background
Introduce any background context that is relevant or necessary to understand the feature and design choices.
Implementation
Extend DeltaStreamer by implementing a CSV source (Kafka/HDFS):
- We can use the existing FilebasedSchemaProvider class for decoding CSV data.
- If the CSV data does not contain a header, use _c0, _c1, _c2, etc. as the field names in the source schema, according to each field's position in the record. (This applies to Kafka data and to CSV files without a header.)
- If a header is present in the CSV data, use the header values as the field names.
- Introduce a configuration property that indicates whether a header is present.
- Introduce a configuration property for the delimiter, so that any delimiter can be configured.
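The field-naming and delimiter rules above can be sketched in plain Java. This is a minimal illustration only, not the proposed implementation: the class name `CsvSourceSketch` and the property keys `example.csv.header` and `example.csv.delimiter` are hypothetical (the proposal does not name them), no Hudi classes are used, and quoted fields are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Hudi.
class CsvSourceSketch {
    // Hypothetical property keys, chosen for illustration only.
    static final String HEADER_PROP = "example.csv.header";
    static final String DELIMITER_PROP = "example.csv.delimiter";

    // Splits a line on the configured delimiter (default ",").
    // Pattern.quote makes the delimiter literal; limit -1 keeps trailing empty cells.
    // Assumption: no quoted or escaped fields.
    static String[] split(String line, Properties props) {
        String delimiter = props.getProperty(DELIMITER_PROP, ",");
        return line.split(Pattern.quote(delimiter), -1);
    }

    // Derives field names from the first line: header values if the header
    // property is "true", otherwise positional names _c0, _c1, _c2, ...
    static List<String> fieldNames(String firstLine, Properties props) {
        boolean hasHeader = Boolean.parseBoolean(props.getProperty(HEADER_PROP, "false"));
        String[] cells = split(firstLine, props);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < cells.length; i++) {
            names.add(hasHeader ? cells[i].trim() : "_c" + i);
        }
        return names;
    }

    // Maps one data line to field-name/value pairs, preserving column order.
    static Map<String, String> toRecord(String line, List<String> names, Properties props) {
        String[] cells = split(line, props);
        if (cells.length != names.size()) {
            throw new IllegalArgumentException(
                "Expected " + names.size() + " fields but found " + cells.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < cells.length; i++) {
            record.put(names.get(i), cells[i]);
        }
        return record;
    }
}
```

For a headerless source, `fieldNames` is called on the first data line only to count the columns, and that line is then parsed with `toRecord` like any other. With a header, the first line is consumed for names and skipped as data. Quoting the delimiter with `Pattern.quote` lets characters such as `|` be configured without regex escaping.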
Rollout/Adoption Plan
There won’t be any impact on existing users; this is purely a new feature.
Test Plan
Unit tests and manual integration tests.
...