Hudi for Continuous Deep Analytics

Purpose

To further the initial vision from https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop through use cases, patterns, applications, code, etc.

General context

While "analytics" covers a broad area of interest, here we distinguish between `experimental` and `production`, and we focus on `production`.

`experimental` =
1. exploratory data analysis
2. development in notebooks
3. essentially ad-hoc choice of tools
4. generally batch only, "one off" manual execution
5. small data, manual sampling
6. models are trained offline
7. the end result being reports, diagrams, etc.

`production` = pretty much the opposite
1. end result is data science applications
2. run in production
3. with large `data set`s that do not fit in RAM
4. hence the algorithms / analyses must be incremental
5. use of managed `data set`s : `data lake`s, `feature store`s
6. models are trained online
7. with awareness of `concept drift`, `distribution drift`, `adversarial attacks` and adaptation
8. complex orchestration between core analysis and decision layer, model monitoring and other application logic and business processes, some involving human interactions
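
As an illustration of what "incremental" means in point 4 of the `production` list, here is a minimal sketch of Welford's online algorithm for a running mean and variance. It is not taken from the sources cited on this page; the class name `RunningStats` and the sample values are purely illustrative. The point is that each observation is consumed once and only constant state is kept, so the full `data set` never has to fit in RAM.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Fold one new observation into the state; no past values are stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined (NaN) for fewer than two observations.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # illustrative values
    stats.update(value)

print(stats.count, stats.mean, stats.variance)
```

The same shape (small state plus an `update` step) is what lets an analysis resume when new data lands, instead of recomputing from scratch.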

Of course the boundary is not exact, but the challenges are very different, and the mindset and means needed to build solutions for `production` are very different from those for `experimental`.

A visual metaphor for the transition from `experimental` to `production` may be this:

(image; source: https://github.com/productml/blurr)

Why "continuous"

Not about "batch" or "stream": these (only) define how data first arrives (in the `data lake`), not how analysis code uses the data programmatically.
A more appropriate mindset might be the `sliding window` from `Apache Beam`.
Not about "pipelines": this implies a unidirectional, single pass over the data.
Most complex algorithms / analyses / machine learning applications in a production/commercial setting are multi-pass and iterative, and must run "online".
This means that each component of the analysis produces data that others consume.
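
To make the `sliding window` mindset concrete, here is a minimal pure-Python sketch of how an event is assigned to overlapping windows of length `size` that start every `period` time units. It mirrors the general idea behind sliding windows in `Apache Beam` but does not use the Beam API; the function names `assign_windows` and `windowed_sums` and the sample events are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def assign_windows(timestamp: int, size: int, period: int) -> List[Window]:
    """Return every window of length `size`, started every `period`, that contains `timestamp`."""
    # Start of the most recent window that begins at or before the timestamp.
    last_start = timestamp - (timestamp % period)
    # Walk backwards over earlier window starts that still cover the timestamp.
    return [(s, s + size) for s in range(last_start, last_start - size, -period)]


def windowed_sums(
    events: Iterable[Tuple[int, float]], size: int, period: int
) -> Dict[Window, float]:
    """Sum event values per sliding window; each event contributes to several windows."""
    totals: Dict[Window, float] = defaultdict(float)
    for timestamp, value in events:
        for window in assign_windows(timestamp, size, period):
            totals[window] += value
    return dict(totals)


events = [(1, 10.0), (7, 5.0), (12, 1.0)]  # (event time, value), illustrative
sums = windowed_sums(events, size=10, period=5)
for window in sorted(sums):
    print(window, sums[window])
```

Because each event touches only the few windows that contain it, a late or updated event requires re-evaluating only those windows, regardless of whether the data originally arrived as "batch" or "stream".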

Putting this data in Kafka, separate databases, etc. not only breaks the logical cohesion of the analysis but also leaves no room for optimization.
See the premise of Weld: https://www.weld.rs/assets/weld-strata.pdf

We need "continuous" in both the time dimension and in the physical dimension.

Why "deep"

In terms of algorithms we have deep learning, of course.
But in terms of "data fabric" we also need to handle multi-dimensional, heterogeneous data that is rich in business meaning, as well as abstractions over that data:
- relational data - need #RelationalAlgebra
- arrays / matrices / tensors - need #LinearAlgebra
- graph data - need graph representations and algos (https://www.slideshare.net/oracle4engineer/using-graphs-for-data-analysis)

Since modern data analytics started, arguably with Spark, these new technologies have essentially ripped apart the "database".

See "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
https://h2020-streamline-project.eu/wp-content/uploads/2017/11/lara.pdf

Also, #BigData analytics seems to choose "files over databases".

"Particle physics, 10,000 times faster" by Jim Pivarski
"Our data is deeply nested and cross linked"
https://youtu.be/jvt4v2LTGK0?t=455

Hudi keeps data in files

This is where `Hudi` comes into the picture: it allows data to be kept in files, not just input data but also output data.

*Hypothesis*: the ability to keep data in basic files is key to building the above vision.

This page about Uber #Michelangelo, https://eng.uber.com/michelangelo/, suggests that the architecture, implementation and programming model still distinguish between "batch" and "streaming", and that batch and continuous analyses keep their data in distinct kinds of physical repositories.

Activity

Here we will maintain a growing list of use cases that we find useful in the above context.

This is also an open invitation to others who share an interest in this `Continuous Deep Analytics` paradigm to contribute use cases, problems, needs, designs, ideas and code, and in every way help further the vision.

Some of these may lead to `Hudi` HIPs, some to extensions, and others to broader solutions that go beyond `Hudi` itself but in which `Hudi` plays a part.

Use cases

(UC) Integrate `Hudi` with Apache Beam so that Beam's sliding data window abstractions can run on top of Parquet files that are incrementally updated through `Hudi`.
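
The interaction this use case relies on can be sketched with a toy in-memory model: a table that accepts upserts, records a commit number for each batch, and can return only the records changed after a given commit. This is not the `Hudi` or Beam API; `ToyTable`, `upsert` and `incremental_pull` are hypothetical names used only to illustrate the upsert and incremental-pull pattern described in the incremental processing article linked under Purpose.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, float]  # (record key, event time, value)


class ToyTable:
    """In-memory stand-in for an upsertable table with a commit timeline (not the Hudi API)."""

    def __init__(self) -> None:
        # record key -> (commit number, event time, value); latest version only
        self._rows: Dict[str, Tuple[int, int, float]] = {}
        self._next_commit = 1

    def upsert(self, batch: List[Record]) -> int:
        """Insert or overwrite records by key and return the commit number of this batch."""
        commit = self._next_commit
        self._next_commit += 1
        for key, event_time, value in batch:
            self._rows[key] = (commit, event_time, value)
        return commit

    def incremental_pull(self, after_commit: int) -> List[Record]:
        """Return the latest version of every record written after `after_commit`."""
        return sorted(
            (key, event_time, value)
            for key, (commit, event_time, value) in self._rows.items()
            if commit > after_commit
        )


table = ToyTable()
c1 = table.upsert([("a", 1, 10.0), ("b", 7, 5.0)])
c2 = table.upsert([("b", 7, 6.0), ("c", 12, 1.0)])  # "b" is updated, "c" is new

changed = table.incremental_pull(after_commit=c1)
print(changed)
```

A windowing job that remembers the last commit it processed would then only need to re-evaluate the windows containing the event times of the changed records, rather than rescanning every file.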
