...

  • Use `spark.read.format("csv")` to read CSV files, relying on Spark's internal CSV parsing logic to convert them to a Dataset of Rows
  • Define new Hudi configurations for the CSV source that are on par with Spark CSV options, and pass these configs to the reader through `.option()`
  • The smallest unit of incremental pull will be one CSV file.  Assuming the CSV files are named after monotonically increasing timestamps, the filename of the last ingested CSV file can serve as the checkpoint.  A directory holding all CSV files to be ingested will be given in the config.
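To make the checkpointing scheme above concrete, here is a minimal sketch (not from this proposal; the function name and filename layout are assumptions) of how the last ingested filename can act as the checkpoint when files in the source directory are named after monotonically increasing timestamps:

```python
from pathlib import Path
from typing import List, Optional

def next_files_to_ingest(source_dir: str, last_checkpoint: Optional[str]) -> List[str]:
    """Return CSV filenames newer than the last checkpoint, in ingestion order.

    Because filenames are assumed to be monotonically increasing timestamps
    (e.g. "20230101120000.csv"), lexicographic order equals time order, and
    the last ingested filename is a sufficient checkpoint.
    """
    names = sorted(p.name for p in Path(source_dir).glob("*.csv"))
    if last_checkpoint is None:
        return names  # first run: ingest everything
    return [n for n in names if n > last_checkpoint]
```

After each batch, the last element of the returned list would be stored as the new checkpoint, so a restart resumes from the first not-yet-ingested file.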

Rollout/Adoption Plan

  • There will be no impact on existing users; this is a new feature.

  • New configurations will be added to the documentation.

...