
Purpose

To further the initial vision from https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop and to define "incremental processing" through use cases, patterns of functionality, designs, applications (such as feature extraction and model training), algorithms, code, etc., and by building on other technologies and state-of-the-art research.

General context

While "analytics" covers a broad area of functionality, here we distinguish between `experimental` and `production` and we focus on `production`.

  • `experimental` is defined as 
    1. exploratory data analysis
    2. development in notebooks
    3. essentially ad-hoc choice of tools
    4. generally batch only, "one off", manual execution
    5. small data, manual sampling
    6. models are trained offline
    7. the end result being reports, diagrams, etc.
  • `production` = pretty much the opposite
    1. the end result is enterprise data science applications
    2. run in production
    3. with large, multi-dimensional `data set`s that do not fit in RAM and are logically infinite
    4. hence the algorithms / analysis must be incremental
    5. use of managed `data set`s: `data lake`s, `feature store`s
    6. models are trained incrementally
      1. offline periodically and refreshed / deployed every few hours/days
    7. with awareness of `concept drift`, `distribution drift`, `adversarial attacks` and able to adapt
    8. use of complex orchestration between the core analysis and the decision layer, model monitoring and other application logic and business processes, some involving human interaction

Of course the boundary is not exact, but the challenges are very different, and the mindset and means needed to build solutions for `production` are very different from those for `experimental`.

A visual metaphor for the transition from `experimental` to `production`:

(figure omitted; source: https://github.com/productml/blurr)

Why "continuous"

  • Not about "batch" or "stream" - this (only) defines how data arrives first time (in the `data lake`)  but not fully define how analysis code uses data programmatically
  • A more appropriate mindset might `sliding window`  from `Apache Beam` which provides powerful "window` semantics 
  • Not about "pipelines" - this implies an unidirectional, one-pass over data
  • Complex algorithms / analyses / machine learning applications in a production/commercial setting are multi-pass, iterative and tend to run "online"
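
A minimal sketch of the `sliding window` mindset mentioned above, using the Apache Beam Python SDK; the event data, keys and window sizes are made up for illustration:

```python
import apache_beam as beam
from apache_beam.transforms import window

# Hypothetical events: (user_id, amount, unix_timestamp) - made up for the sketch
events = [
    ("u1", 10.0, 1_000_000_000),
    ("u2",  5.0, 1_000_000_030),
    ("u1",  7.0, 1_000_000_090),
]

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     # attach event-time timestamps so windowing is driven by the data itself
     | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
     # sliding windows: 2 minutes long, with a new window starting every 30 seconds
     | beam.WindowInto(window.SlidingWindows(size=120, period=30))
     # the aggregate is recomputed per window, never over the full history
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```

The same analysis code applies whether the collection is bounded (batch) or unbounded (streaming), which is exactly the sense in which "batch vs. stream" stops being the defining distinction.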

Putting this data in Kafka, separate databases, etc. breaks not just the logical cohesion of the analysis but also leaves no room for global optimization.
See the premise of Weld: https://www.weld.rs/assets/weld-strata.pdf


Hypothesis: we need "continuous" analytics in the time dimension, in the data dimension and in terms of the programming model, drawing on ideas such as:

  • relational algebra
  • linear algebra
  • differential programming
  • probabilistic programming
  • computational graphs
  • differential dataflow
  • functional programming
    • lazy evaluation (see the sketch after this list)
    • monadic comprehensions
    • etc
  • etc
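
As a purely illustrative, self-contained sketch of the `lazy evaluation` item above (all names and data are made up), relational-style and numerical steps can compose into a single deferred computation instead of materializing intermediate results between separate systems - the same premise Weld argues for:

```python
# Sketch only: a lazy generator chain where relational-style operators
# (filter, projection) and a numerical step compose into one deferred
# computation; nothing is materialized until the result is consumed.

def read_events():
    # stand-in for an unbounded / very large source; nothing is loaded eagerly
    for i in range(1_000_000):
        yield {"user": f"u{i % 100}", "value": float(i)}

def select(rows, predicate):
    # relational-algebra style filter
    return (r for r in rows if predicate(r))

def project(rows, fn):
    # relational-algebra style projection / map
    return (fn(r) for r in rows)

# the whole pipeline is a single lazy chain; no work has happened yet
pipeline = project(
    select(read_events(), lambda r: r["value"] > 10),
    lambda r: r["value"] * 2.0,  # a (trivial) numerical kernel
)

# only consuming the pipeline triggers execution, one element at a time
print(sum(x for _, x in zip(range(5), pipeline)))
```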

Why "deep"

  • In terms of algorithms we have deep learning, of course.
  • But also in terms of the "data fabric": we need to handle multi-dimensional, heterogeneous data with rich business meaning, and abstractions over that data.

Since modern data analytics started, arguably with Spark, these new technologies have essentially ripped apart the "database".

See "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
https://h2020-streamline-project.eu/wp-content/uploads/2017/11/lara.pdf

Also, #BigData analytics seems to choose "files over databases".


This is where `Hudi` comes into the picture, by allowing data to be kept in files - not just input data but also output data.
Hypothesis: the ability to keep data in Parquet files throughout the analysis is key to building the above vision (a sketch follows).
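
A minimal sketch of what this could look like from PySpark with the Hudi Spark datasource, assuming the Hudi Spark bundle is on the classpath; the table name, fields and path are hypothetical, and option names may differ between Hudi versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-parquet-sketch").getOrCreate()

# hypothetical analysis output: (record key, partition date, value)
df = spark.createDataFrame([("id-1", "2019-01-01", 42.0)], ["uuid", "ds", "value"])

hudi_options = {
    "hoodie.table.name": "analysis_output",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "ds",
    "hoodie.datasource.write.operation": "upsert",
}

# both the inputs and the outputs of the analysis stay as Parquet files,
# managed (indexed, upserted, compacted) by Hudi rather than by a database
(df.write.format("org.apache.hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/data/lake/analysis_output"))
```

Because the result is itself a Hudi-managed Parquet `data set`, a downstream analysis can consume it incrementally, just like the raw input.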


Activity

This page about Uber's #Michelangelo platform, https://eng.uber.com/michelangelo/, suggests there is still a distinction in the architecture, implementation and programming model between "batch" and "streaming", and that data is placed in different kinds of physical repositories for batch versus continuous analyses.

Below is a growing list of use cases that we find useful in the above context.

This is also an open invitation to others who share an interest in the `Continuous Deep Analytics` paradigm to contribute use cases, problems, needs, designs, ideas, code, and in every way help further the vision.

Some of these may lead to `Hudi` HIPs, some to extensions, and others to broader solutions beyond `Hudi` itself, but where `Hudi` plays a part.

Use cases

(UC) Ability to support deletes
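
A hedged sketch of how deletes might be expressed through the same Spark datasource, assuming a `delete` write operation keyed on the record key; the option names mirror the upsert sketch earlier on this page and may not be available in every Hudi version:

```python
# Sketch only: delete specific records by key, assuming the Hudi datasource
# accepts a "delete" write operation; reuses the `spark` session from the
# earlier sketch, and all field / path names are hypothetical.
deletes = spark.createDataFrame([("id-1", "2019-01-01")], ["uuid", "ds"])

delete_options = {
    "hoodie.table.name": "analysis_output",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "ds",
    "hoodie.datasource.write.operation": "delete",
}

(deletes.write.format("org.apache.hudi")
    .options(**delete_options)
    .mode("append")
    .save("/data/lake/analysis_output"))
```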

(UC) Integrate Hudi with Apache Beam so that the sliding data window abstractions of Beam can run on top of Parquet files incrementally updated through `Hudi`

source: https://qcon.ai/system/files/presentation-slides/simplifying_ml_workflows_with_apache_beam.pdf

Hypothesis: the `sliding data window` abstraction from Apache Beam (also present in Spark and Flink) can eliminate most (all?) of the ad-hoc attempts to handle incremental data inside analysis code.
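
On the `Hudi` side, the incremental view is what would feed such a window. A minimal sketch with the Hudi Spark datasource (option names vary between Hudi versions; the path and commit instant are hypothetical, and `spark` is the session from the earlier sketch):

```python
# Sketch only: pull just the rows written since a given commit instant, so the
# downstream (windowed) analysis sees the increment rather than the full table.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    # normally the last instant already consumed by the analysis
    "hoodie.datasource.read.begin.instanttime": "20190101000000",
}

new_rows = (spark.read.format("org.apache.hudi")
            .options(**incremental_options)
            .load("/data/lake/analysis_output"))

# new_rows could then be timestamped and fed into a sliding-window computation
# such as the Beam sketch earlier on this page
```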

(UC) Use Hudi to build file based data lakes that are incrementally self updating (data fabric)

`Hudi` works with one `data set` at a time, but when building a `data lake` we need to relate `data set`s structurally and logically (business semantics), so that `feature store`s can be built from raw data.
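
An illustrative-only sketch (dataset names, keys and paths are made up) of relating two Hudi-managed `data set`s into a derived feature `data set` that is itself written back through `Hudi`:

```python
# Sketch only: relate two Hudi-managed data sets and persist the derived
# features as another Hudi-managed Parquet data set (all names hypothetical).
users = spark.read.format("org.apache.hudi").load("/data/lake/users")
orders = spark.read.format("org.apache.hudi").load("/data/lake/orders")

features = (orders.groupBy("user_id")
            .count()
            .withColumnRenamed("count", "order_count")
            .join(users.select("user_id", "signup_ds"), "user_id"))

feature_options = {
    "hoodie.table.name": "user_features",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.partitionpath.field": "signup_ds",
    "hoodie.datasource.write.precombine.field": "signup_ds",
    "hoodie.datasource.write.operation": "upsert",
}

(features.write.format("org.apache.hudi")
    .options(**feature_options)
    .mode("append")
    .save("/data/lake/user_features"))
```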

(UC) Use Hudi to build file based feature stores (data fabric)

The first kind is relational data, but we also need graph, array and other forms of relations in data, ideally in a unified `data fabric`.

Resources on how the Dremio relational cache works can serve as inspiration for how `Hudi` might fit in.

Technologies on the radar

  1. Apache Arrow
  2. Dremio
  3. Project DAWN Weld (https://www.weld.rs/)

Resources / reading list

  1. "Evaluating End-to-End Optimization for Data Analytics Applications in Weld"
    1. http://www.vldb.org/pvldb/vol11/p1002-palkar.pdf
  2. "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
    1. https://h2020-streamline-project.eu/wp-content/uploads/2017/11/lara.pdf
  3. "Accessible Machine Learning through Data Workflow Management"
    1. https://eng.uber.com/machine-learning-data-workflow-management/