
Purpose

To further the initial vision from https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop through use cases, patterns, applications, code, etc.

General context

While "analytics" covers a broad area, of interest here is the distinction between `experimental` and `production`; we focus on `production`.

  • `experimental` = 
    1. exploratory data analysis
    2. development in notebooks
    3. essentially ad-hoc choice of tools
    4. generally batch only, "one off" manual execution
    5. small data, manual sampling
    6. models are trained offline
    7. the end result being reports, diagrams, etc.
  • `production` = pretty much the opposite
    1. end result is data science applications 
    2. ran in production 
    3. with large `data set`s that do not fit in RAM 
    4. hence the algorithms / analyses must be incremental
    5. use of managed `data set`s: `data lake`s, `feature store`s
    6. models are trained online 
    7. with awareness of `concept drift`, `distribution drift`, `adversarial attacks` and adaptation
    8. complex orchestration between the core analysis and the decision layer, model monitoring, and other application logic and business processes, some involving human interaction

Of course the boundary is not exact, but the challenges are very different, and the mindset and means to build solutions for `production` differ substantially from those for `experimental`.

A visual metaphor for the transition from `experimental` to `production` may be this:

source https://github.com/productml/blurr

 

Why "continuous"

  • Not about "batch" or "stream" - this (only) defines how the data first arrives 
  • Not about "pipelines" - this implies a unidirectional, single pass over the data
  • Most complex algorithms / analyses / ML applications in a production/commercial setting are multi-pass, iterative and must run "online".
  • This means that each component of the analysis produces data that others consume.

Putting this data in Kafka, separate databases, etc. breaks not just the logical cohesion of the analysis but also leaves no room for optimization.
See the premise of Weld : https://www.weld.rs/assets/weld-strata.pdf

We need "continuous" in both the time dimension and in the physical dimension.
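The incremental, multi-pass consumption described above can be sketched as follows. This is purely a conceptual illustration (the `Dataset` class and its methods are hypothetical, not any real library's API), assuming each dataset is an append-only timeline of commits that downstream stages pull from incrementally:

```python
# Conceptual sketch: two analysis stages share one storage layer, and each
# stage consumes only the records committed after the last commit it has
# already processed -- an "incremental pull". All names are illustrative.

class Dataset:
    """A dataset modeled as an append-only timeline of commits."""
    def __init__(self):
        self.commits = []          # list of (commit_time, records)

    def commit(self, commit_time, records):
        self.commits.append((commit_time, records))

    def incremental_pull(self, since):
        """Return records from commits strictly after `since`."""
        return [rec for t, recs in self.commits if t > since for rec in recs]

# Stage 1 writes raw events; stage 2 reads them incrementally and writes
# aggregates back to the same storage layer, so a stage 3 could continue.
events = Dataset()
aggregates = Dataset()

events.commit(1, [("user_a", 10), ("user_b", 5)])
events.commit(2, [("user_a", 7)])

last_seen = 0
totals = {}
for user, value in events.incremental_pull(since=last_seen):
    totals[user] = totals.get(user, 0) + value
aggregates.commit(2, sorted(totals.items()))
last_seen = 2

# Later commits are picked up without reprocessing the old data.
events.commit(3, [("user_b", 1)])
print(events.incremental_pull(since=last_seen))   # only the commit-3 records
```

The point of the sketch is that "producer" and "consumer" stages talk through the same managed storage, not through ad-hoc side channels like Kafka topics or separate databases.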


Why "deep"

  • In terms of algorithms there is deep learning, of course.
  • But also, in terms of the "data fabric", we need to handle multi-dimensional, heterogeneous data with rich business meaning, and abstractions over data.

Since modern data analytics started, arguably with Spark, these new technologies have essentially ripped apart the "database".

See "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
https://h2020-streamline-project.eu/wp-content/uploads/2017/11/lara.pdf

Also, #BigData analytics seems to choose "files over databases".
https://youtu.be/jvt4v2LTGK0?t=345

"Our data is deeply nested and cross linked"
https://youtu.be/jvt4v2LTGK0?t=455

Hint: this is where Hudi comes into the picture by allowing data to be kept in files, not just input data but also output data.
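As a configuration sketch of what this looks like with Hudi's Spark datasource (not runnable standalone: it assumes a live `SparkSession` with the Hudi bundle on the classpath, and an existing DataFrame `upserts`; the table name, path, and field names here are illustrative):

```python
# Illustrative PySpark + Hudi sketch: the output of an analysis stage stays
# in plain files on the data lake, yet downstream consumers can pull it
# incrementally. Assumes `spark` is a SparkSession with the Hudi bundle
# loaded and `upserts` is a DataFrame with `uuid` and `ts` columns.

hudi_options = {
    "hoodie.table.name": "analysis_output",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Write (or update in place) one stage's output -- still just files.
(upserts.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/data/lake/analysis_output"))

# A downstream stage pulls only the records committed after a given
# instant, instead of re-reading the whole dataset.
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/data/lake/analysis_output"))
```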

So, when we see architectures that use "streaming pipelines" that read from files and write to databases we can tell that those architectures are not useful for this vision of "continuous deep analytics".

The initial Uber vision that led to Hudi was published in https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop.
It seems to me that the ability to keep data in plain files is key to building the above vision.

In Uber #Michelangelo https://eng.uber.com/michelangelo/ there is still a distinction in the architecture, implementation, and programming model between "batch" and "streaming", and the data is placed in distinct kinds of "feature stores" for batch and continuous analyses.

The thread of discussion that Vinoth and I have is about proving the initial vision with code, to see how far we can chew at this issue.


