
...

Not about "batch" or "stream" - this only defines how data first arrives (in the `data lake`), not how analysis code uses the data programmatically
A more appropriate mindset might be the `sliding window` from `Apache Beam`
Not about "pipelines" - this implies a unidirectional, single pass over the data
Most complex algorithms, analyses and machine learning (ML) applications in a production/commercial setting are multi-pass and iterative, and must run "online".
This means that each component of the analysis produces data that others consume.
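
As a rough illustration of the `sliding window` mindset, and of one component producing data that another consumes, here is a minimal sketch in plain Python. It does not use the Apache Beam API; the function names `assign_sliding_windows` and `window_means`, the integer timestamps and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def assign_sliding_windows(
    events: Iterable[Tuple[int, float]], size: int, period: int
) -> Dict[int, List[float]]:
    """Assign each (timestamp, value) event to every overlapping window.

    A window starts at each multiple of `period` and covers
    [start, start + size), so one event can land in several windows.
    """
    windows: Dict[int, List[float]] = defaultdict(list)
    for ts, value in events:
        # Latest window start at or before this timestamp.
        start = ts - (ts % period)
        # Walk back through every earlier window that still covers ts.
        while start > ts - size:
            windows[start].append(value)
            start -= period
    return dict(sorted(windows.items()))


def window_means(windows: Dict[int, List[float]]) -> Dict[int, float]:
    """A second component that consumes the output of the first."""
    return {start: sum(vals) / len(vals) for start, vals in windows.items()}


# Hypothetical sample data: (timestamp, value) pairs.
events = [(0, 1.0), (5, 2.0), (12, 3.0)]
windows = assign_sliding_windows(events, size=10, period=5)
means = window_means(windows)
```

With `size=10` and `period=5`, every event falls into two overlapping windows, so the same data is revisited rather than consumed in a single pass.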

...

"Our data is deeply nested and cross linked"
https://youtu.be/jvt4v2LTGK0?t=455

Tip: Hudi keeps data in files

This is where Hudi comes into the picture by allowing data to be kept in files, not just input data but also output data.

So, when we see architectures built on "streaming pipelines" that read from files and write to databases, we can tell that those architectures are not useful for this vision of "continuous deep analytics".
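
The idea of keeping both input and output data in files, so that a later pass can read what an earlier pass wrote, might be sketched as below. Plain JSON-lines files in a temporary directory stand in for Hudi tables here; this is not Hudi's API, and the helper names, the record layout and the "pull each score towards the mean" step are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

Record = Dict[str, object]


def write_records(path: Path, records: List[Record]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path: Path) -> List[Record]:
    """Read back the JSON-lines file written by write_records."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def score_pass(in_path: Path, out_path: Path) -> None:
    """One analysis pass: read scores from a file, refine them, write a file."""
    records = read_records(in_path)
    mean = sum(r["score"] for r in records) / len(records)
    # Illustrative refinement step: move each score halfway towards the mean.
    refined = [{"id": r["id"], "score": (r["score"] + mean) / 2} for r in records]
    write_records(out_path, refined)


def run_iterations(initial: List[Record], passes: int) -> List[Record]:
    """Run several passes; the output file of each pass is the input of the next."""
    with tempfile.TemporaryDirectory() as tmp:
        current = Path(tmp) / "pass_0.jsonl"
        write_records(current, initial)
        for i in range(1, passes + 1):
            nxt = Path(tmp) / f"pass_{i}.jsonl"
            score_pass(current, nxt)
            current = nxt
        return read_records(current)


result = run_iterations(
    [{"id": "a", "score": 0.0}, {"id": "b", "score": 4.0}], passes=2
)
```

Because every pass writes its output in the same form it reads its input, the analysis can be repeated or extended with further passes without leaving the file-based store.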

...
