...
To further the initial vision from https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop by defining "incremental processing" through use cases, patterns of functionality, designs, applications (such as feature extraction and model training), algorithms, code, etc., and by building on other technologies and state-of-the-art research.
...
- `experimental` is defined as
- exploratory data analysis
- development in notebooks
- essentially ad-hoc choice of tools
- generally batch only, "one off", manual execution
- small data, manual sampling
- models are trained offline
  - the end result being reports, diagrams, etc.
- `production` = pretty much the opposite
  - the end result is enterprise data science applications
  - run in production
  - with large, multi-dimensional `data set`s that do not fit in RAM and are logically infinite
    - hence the algorithms / analysis must be incremental (see the sketch after this list)
  - use of managed `data set`s: `data lake`s, `feature store`s
  - models are trained online / incrementally
    - or offline periodically and refreshed / deployed every few hours/days
  - with awareness of `concept drift`, `distribution drift` and `adversarial attacks`, and able to adapt
  - use complex orchestration between the core analysis / decision layer, model monitoring, and other application logic and business processes, some involving human interaction
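To make "incremental" concrete, here is a minimal sketch (names are illustrative) using Welford's online algorithm: it maintains a running mean and variance over a logically infinite stream in O(1) memory, where the batch equivalent would rescan a data set that may not fit in RAM.

```python
# Welford's online algorithm: incremental mean/variance in O(1) memory.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a stream
    stats.update(x)
print(stats.mean, stats.variance)  # 5.0 4.571...
```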
...
- relational algebra
- linear algebra
- differentiable programming
- probabilistic programming
- computational graphs
- differential dataflow
- functional programming
- lazy evaluation
- monadic comprehensions
- etc
Why "deep"
- In terms of algorithms we have deep learning, of course.
- But also in terms of the "data fabric": we need to handle multi-dimensional, heterogeneous data with rich business meaning, plus abstractions over that data (a toy sketch follows the list below):
- relational data - need #RelationalAlgebra
- arrays / matrices / tensors - need #LinearAlgebra
- graph data - need graph representations and algos (https://www.slideshare.net/oracle4engineer/using-graphs-for-data-analysis)
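A toy, self-contained sketch of why one analysis crosses these algebras: the same (hypothetical) interaction table is aggregated relationally, pivoted into a matrix for linear algebra, and re-read as bipartite graph edges.

```python
import numpy as np
import pandas as pd

# Hypothetical raw interactions; the relational view of the data.
df = pd.DataFrame({
    "user":   ["a", "a", "b", "c"],
    "item":   ["x", "y", "x", "z"],
    "rating": [5.0, 3.0, 4.0, 2.0],
})

# Relational algebra: group-by aggregation.
mean_rating_per_item = df.groupby("item")["rating"].mean()

# Linear algebra: pivot into a user x item matrix and factorize it.
m = df.pivot_table(index="user", columns="item", values="rating", fill_value=0.0)
u, s, vt = np.linalg.svd(m.to_numpy(), full_matrices=False)

# Graph view: the same rows read as bipartite user-item edges.
edges = list(df[["user", "item"]].itertuples(index=False, name=None))

print(mean_rating_per_item, s, edges, sep="\n")
```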
...
Some of these may lead to `Hudi` HIPs, some to extensions, and others to broader solutions beyond `Hudi` itself, but where `Hudi` plays a part.
Use cases
(UC) Ability to support deletes
- Application to GDPR "right to be forgotten" requirement.
- Context: https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
- Status: WIP, https://github.com/apache/incubator-hudi/pull/635
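As a sketch of what this use case looks like from user code, the following PySpark snippet issues record-level deletes through Hudi's Spark datasource. The option names follow recent Hudi documentation (they postdate, and may differ from, the in-flight PR above); the table path, key fields and user ids are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-forget").getOrCreate()

# Hypothetical: the records of users who invoked "right to be forgotten".
to_forget = (spark.read.format("hudi")
    .load("/data/lake/users")
    .where("user_id in ('u-123', 'u-456')"))

(to_forget.write.format("hudi")
    .option("hoodie.table.name", "users")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "delete")  # tombstone these keys
    .mode("append")
    .save("/data/lake/users"))
```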
(UC) Integrate Hudi with Apache Beam so that Beam's sliding data window abstractions can run on top of Parquet files incrementally updated through `Hudi`
...
Hypothesis: the `sliding data window` abstraction from Apache Beam (also present in Spark and Flink) can eliminate most (if not all) of the ad-hoc attempts to handle incremental data inside analysis code.
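To ground the hypothesis, a small self-contained example of Beam's sliding-window abstraction (Python SDK). The in-memory `events` list is a hypothetical stand-in for records that the envisioned integration would read incrementally from a Hudi-managed data set; no such Beam IO exists yet.

```python
import apache_beam as beam
from apache_beam import window

# Hypothetical events: (user_id, amount, event_time_in_seconds).
events = [
    ("alice", 3.0, 0), ("alice", 5.0, 30),
    ("bob", 2.0, 70), ("alice", 1.0, 110),
]

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     # Attach event-time timestamps so windowing is by event time.
     | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
     # 60-second windows sliding every 30 seconds.
     | beam.WindowInto(window.SlidingWindows(size=60, period=30))
     # Per-key, per-window aggregation; no hand-rolled incremental state.
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```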
(UC) Use Hudi to build file-based data lakes that are incrementally self-updating (a sketch of the pattern follows)
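A sketch of that pattern, assuming a recent Hudi release: pull only the records committed since the last processed instant with an incremental query, then upsert them into a derived table. The checkpoint handling, paths and field names are hypothetical simplifications.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-refresh").getOrCreate()

# Hypothetical: in a real job this instant would be persisted as a checkpoint.
last_commit = "20240101000000"

# Read only what changed in the upstream table since `last_commit`.
changed = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_commit)
    .load("/data/lake/raw_events"))

# Upsert the delta into the downstream table, keeping it self-updating.
(changed.write.format("hudi")
    .option("hoodie.table.name", "derived_features")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/data/lake/derived_features"))
```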
...
(data fabric)
`Hudi` works with one `data set` at a time, but when building a `data lake` we need to relate `data set`s structurally and logically (business semantics) so that `feature store`s can be built from raw data.
...
- Apache Arrow
- Dremio
- Weld, from Stanford's DAWN project (https://www.weld.rs/)
Resources / reading list
- "Evaluating End-to-End Optimization for Data Analytics Applications in Weld"
- "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
- "Accessible Machine Learning through Data Workflow Management"