Hypothesis: The `sliding window` abstraction from Apache Beam (also present in Spark and Flink) can eliminate most (maybe all?) of the ad-hoc attempts to handle incremental data inside analysis code.
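
A minimal sketch of the hypothesis, assuming the Beam Python SDK; the event schema, keys, and timestamps below are made up. The point is that the sliding window is declared once and the runner owns the incremental recomputation, leaving the analysis code as a plain per-window aggregate.

```python
import apache_beam as beam
from apache_beam.transforms import window

# Hypothetical timestamped events; in practice these arrive incrementally.
events = [
    {"key": "sensor-a", "value": 3.0, "ts": 1_700_000_000},
    {"key": "sensor-a", "value": 5.0, "ts": 1_700_000_600},
    {"key": "sensor-b", "value": 1.0, "ts": 1_700_000_900},
]

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(events)
        # Event time comes from the data itself, not from arrival order.
        | "Stamp" >> beam.Map(
            lambda e: window.TimestampedValue((e["key"], e["value"]), e["ts"]))
        # One-hour windows, recomputed every ten minutes as data arrives.
        | "Window" >> beam.WindowInto(
            window.SlidingWindows(size=3600, period=600))
        # The analysis code is just an aggregate; no incremental bookkeeping.
        | "Sum" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```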

(UC) Use Hudi to build file-based data lakes that are self-updating as new data arrives (data fabric)
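
A sketch of what the self-updating part could look like with Hudi's Spark data source (PySpark here; the table path, schema, and record keys are hypothetical, and the Hudi Spark bundle jar must be on the classpath). Each arriving batch is upserted into the table on storage, so readers always see one merged, current `data set`.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         # Kryo serialization is the setting Hudi's docs recommend for Spark.
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Hypothetical newly arrived batch.
incoming = spark.createDataFrame(
    [("id-1", "2024-01-02", 42.0)],
    ["record_id", "event_date", "value"],
)

hudi_options = {
    "hoodie.table.name": "events",
    # Record key identifies the row to merge into; precombine field breaks
    # ties when the same key arrives more than once in a batch.
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.precombine.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Each batch is upserted in place on the file system; no manual dedup code.
(incoming.write.format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("/tmp/lake/events"))  # hypothetical base path
```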

`Hudi` works with one `data set` at a time, but when building a `data lake` we need to relate `data set`s both structurally and logically (business semantics) so that `feature store`s can be built from raw data.

(UC) Use Hudi to build file-based feature stores (data fabric)
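
A sketch of the feature-store side, assuming the same hypothetical table as above: Hudi's incremental query returns only records committed after a given instant, so feature tables can be refreshed from new arrivals without rescanning the raw data. The begin instant and the feature computation are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

# Pull only the records committed since the last processed Hudi instant.
incremental = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime",
                       "20240101000000")  # placeholder instant
               .load("/tmp/lake/events"))  # hypothetical base path

# Hypothetical feature: mean value per record key over the new slice only.
features = (incremental
            .groupBy("record_id")
            .agg(F.avg("value").alias("value_mean")))

features.write.mode("append").parquet("/tmp/feature_store/events_features")
```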

The first kind is relational data, but we also need graph, array, and other forms of relations in data, ideally in a unified `data fabric`.

Resources on how the Dremio relational cache works, for inspiration on how `Hudi` might fit in.

Technologies on the radar

  1. Apache Arrow
  2. Dremio
  3. Weld (Stanford Project DAWN)