
(UC) Integrate Hudi with Apache Beam so that Beam's sliding data window abstractions can run on top of Parquet files incrementally updated through `Hudi`

[Image: simplifying ML workflows with Apache Beam]

source: https://qcon.ai/system/files/presentation-slides/simplifying_ml_workflows_with_apache_beam.pdf

Hypothesis: the `sliding data window` abstraction from Apache Beam (also present in Spark and Flink) can eliminate most, perhaps all, of the ad-hoc attempts to handle incremental data inside analysis code.
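To make the hypothesis concrete, here is a minimal sketch of what the sliding-window abstraction does, written in plain Python rather than Beam's own API (in Beam this would be `beam.WindowInto(beam.window.SlidingWindows(size, period))`). The window size, period, and event stream are illustrative assumptions, not values from this document.

```python
def sliding_windows(events, size, period):
    """Assign each (timestamp, value) event to every window of length
    `size` whose start is a multiple of `period` and which contains the
    timestamp, i.e. start <= ts < start + size."""
    windows = {}
    for ts, value in events:
        # First window start covering ts: the smallest multiple of
        # `period` strictly greater than ts - size (floor division in
        # Python rounds toward negative infinity, so this also works
        # for timestamps smaller than `size`).
        start = ((ts - size) // period + 1) * period
        while start <= ts:
            windows.setdefault((start, start + size), []).append(value)
            start += period
    return windows

# Hypothetical event stream: (timestamp, value) pairs.
w = sliding_windows([(1, "a"), (6, "b"), (11, "c")], size=10, period=5)
# Each event lands in every 10-unit window (spaced 5 apart) covering it,
# which is exactly the bookkeeping analysis code otherwise does by hand.
```

Incremental data handled ad hoc inside analysis code usually reimplements exactly this assignment of late-arriving records to overlapping time buckets; the abstraction moves it out of user code.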

(UC) Use Hudi to build file-based data lakes that are self-updating as new data arrives

`Hudi` works with one `data set` at a time, but when building a `data lake` we need to relate `data set`s to each other logically.

The first kind of relationship is relational (key-based joins between tables), but we also need graph, array, and other forms of relations between data sets.
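The distinction between these two kinds of relations can be sketched in plain Python with hypothetical toy data sets (the names `users` and `orders` are illustrative, not from this document):

```python
# Two "data sets": each would be its own Hudi-managed table in a lake.
users = {1: "alice", 2: "bob"}           # data set A: user id -> name
orders = [(101, 1), (102, 1), (103, 2)]  # data set B: (order id, user id)

# Relational view: join orders to users on the user-id foreign key,
# producing flat (order id, user name) rows.
joined = [(order_id, users[user_id]) for order_id, user_id in orders]

# Graph view: the same relation expressed as adjacency, user -> orders,
# which supports traversal-style queries rather than flat joins.
adjacency = {}
for order_id, user_id in orders:
    adjacency.setdefault(user_id, []).append(order_id)
```

The underlying data is identical; only the shape of the relation differs, which is why a lake built from per-data-set tools like Hudi still needs a layer that models these relationships.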