 RFC-1 : CSV Source Support for Delta Streamer



  • @rahuledavalath 



Current state: "Under Discussion"


Prior doc link :


Hudi Delta Streamer does not directly support pulling data in CSV format from Kafka or HDFS logs. The only way to ingest CSV data into a Hudi dataset today is to first convert it to JSON or Avro before pulling it in through Delta Streamer. This HIP proposes a mechanism to directly support sources in CSV format.


Introduce any background context that is relevant or necessary to understand the feature and design choices.


Extend the DeltaStreamer by implementing a CSV source (Kafka/HDFS).

  • We can use the existing FilebasedSchemaProvider class for decoding CSV data.
  • If the CSV data does not contain a header, we can use _c0, _c1, _c2, etc. as the field names in the source schema, according to each field's position in the record. (This applies to Kafka data and header-less CSV files.)
  • If a header is present in the CSV data, the header information must be used for the field names.
  • A configuration property is needed to indicate whether a header is present.
  • A new configuration property is needed to support configurable delimiters.
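
The field-naming rules above can be sketched in a few lines. This is only an illustration of the intended behaviour, written in Python for brevity; it is not Hudi code. The function names and the has_header and delimiter parameters are hypothetical stand-ins for the two configuration properties proposed above, whose real names are not decided here.

```python
import csv
import io
from typing import Dict, Iterator, List


def default_field_names(width: int) -> List[str]:
    """Positional names _c0, _c1, ... used when the input has no header."""
    return [f"_c{i}" for i in range(width)]


def parse_csv(text: str, has_header: bool = False,
              delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV record, keyed by field name.

    has_header and delimiter are hypothetical parameters standing in for
    the proposed configuration properties.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    field_names = None
    for row in reader:
        if not row:
            continue  # csv.reader returns [] for blank lines; skip them
        if field_names is None:
            if has_header:
                field_names = row  # first non-blank row supplies the names
                continue
            field_names = default_field_names(len(row))
        if len(row) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} fields, got {len(row)}: {row}")
        yield dict(zip(field_names, row))


# Header-less, pipe-delimited input (for example, a Kafka message payload).
no_header = list(parse_csv("1|alice\n2|bob\n", has_header=False, delimiter="|"))

# Comma-delimited file whose first line is a header.
with_header = list(parse_csv("id,name\n1,alice\n", has_header=True))
```

Here no_header is keyed by _c0 and _c1, while with_header is keyed by id and name. All values stay as strings in this sketch; mapping them to typed fields would be the job of the schema supplied through FilebasedSchemaProvider.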

Rollout/Adoption Plan

  • There won’t be any impact on existing users. This is just a new feature.

Test Plan

  • Unit tests and manual integration tests.
