...

  • Not about "batch" or "stream" - this only defines how data first arrives (in the `data lake`), not how analysis code uses the data programmatically
  • A more appropriate mindset might be the `sliding window` from `Apache Beam`
  • Not about "pipelines" - this implies a unidirectional, single pass over the data
  • Most complex algorithms, analyses and machine learning (ML) applications in a production/commercial setting are multi-pass, iterative and must run "online".
  • This means that each component of the analysis produces data that others consume.
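
To make the `sliding window` mindset concrete, here is a minimal sketch in plain Python. It does not use the Apache Beam API; the function names `sliding_windows` and `windowed_sums`, the sample events and the window parameters are all assumptions made for illustration. Each event is assigned to every window of length `size` that contains it, with a new window starting every `period` time units, so the same data point is consumed by several overlapping analyses rather than passing through once.

```python
from collections import defaultdict


def sliding_windows(timestamp, size, period):
    """Return the start of every window [start, start + size) containing timestamp."""
    # Latest window start at or before the timestamp, aligned to the period.
    start = timestamp - (timestamp % period)
    starts = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        starts.append(start)
        start -= period
    return sorted(starts)


def windowed_sums(events, size, period):
    """Sum event values per window; one event contributes to several windows."""
    sums = defaultdict(int)
    for timestamp, value in events:
        for start in sliding_windows(timestamp, size, period):
            sums[start] += value
    return dict(sorted(sums.items()))


# Hypothetical (timestamp, value) events.
events = [(1, 10), (4, 20), (7, 30), (12, 40)]
result = windowed_sums(events, size=10, period=5)
print(result)
```

With `size=10` and `period=5`, every event lands in two overlapping windows, which is the point of the contrast with a one-pass pipeline.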

...

"Our data is deeply nested and cross linked"
https://youtu.be/jvt4v2LTGK0?t=455


Tip: Hudi keeps data in files
This is where Hudi comes into the picture by allowing data to be kept in files, not just input data but also output data.
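
The idea of keeping both inputs and outputs in files can be sketched as follows. This is not Hudi code: plain JSON files in a temporary directory stand in for tables, and every name here (`read_table`, `write_table`, `aggregate`, `flag`, `run_pass`, the table names and the threshold) is hypothetical. The sketch only shows the shape of the workflow, where one component's output file is another component's input and the whole analysis is re-run as new data arrives.

```python
import json
import tempfile
from pathlib import Path


def read_table(root, name):
    """Read a 'table' (a JSON file) or return an empty list if it does not exist yet."""
    path = root / f"{name}.json"
    return json.loads(path.read_text()) if path.exists() else []


def write_table(root, name, rows):
    """Write a 'table' back to a file so other components can consume it."""
    (root / f"{name}.json").write_text(json.dumps(rows))


def aggregate(root):
    """Component 1: reads 'events', writes per-user totals to 'totals'."""
    totals = {}
    for row in read_table(root, "events"):
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    write_table(root, "totals", [{"user": u, "total": t} for u, t in sorted(totals.items())])


def flag(root, threshold):
    """Component 2: reads 'totals' (the output of component 1), writes 'flags'."""
    flagged = [row["user"] for row in read_table(root, "totals") if row["total"] >= threshold]
    write_table(root, "flags", flagged)


def run_pass(root, new_events, threshold=100):
    """One pass: append newly arrived events, then re-run both components."""
    write_table(root, "events", read_table(root, "events") + new_events)
    aggregate(root)
    flag(root, threshold)
    return read_table(root, "flags")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = run_pass(root, [{"user": "a", "amount": 60}, {"user": "b", "amount": 120}])
    second = run_pass(root, [{"user": "a", "amount": 50}])
    print(first, second)
```

Because the intermediate `totals` table lives in a file just like the raw `events`, any further component could read it without going through a separate database.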


So, when we see architectures whose "streaming pipelines" read from files and write to databases, we can tell that those architectures are not useful for this vision of "continuous deep analytics".

...