...
- `experimental` is defined as
- exploratory data analysis
- development in notebooks
- essentially ad-hoc choice of tools
- generally batch only, "one off", manual execution
- small data, manual sampling
- models are trained offline
- the end result being reports, diagrams, etc.
- `production` = pretty much the opposite
- the end result is enterprise data science applications
- run in production
- with large, multi-dimensional `data set`s that do not fit in RAM and are logically infinite
- hence the algorithms / analysis must be incremental
- use of managed `data set`s : `data lake`s, `feature store`s
- models are trained online / incrementally
- or offline periodically, refreshed / deployed every few hours / days
- with awareness of `concept drift`, `distribution drift`, `adversarial attacks` and able to adapt
- use complex orchestration between the core analysis and decision layer, model monitoring, and other application logic and business processes, some involving human interaction
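The "incremental" requirement above (single pass, bounded memory, logically infinite input) can be illustrated with a classic online algorithm. This is a generic sketch, not Hudi-specific: Welford's method keeps running mean and variance updated per record in O(1) state, so the stream never needs to fit in RAM.

```python
# Welford's online algorithm: incremental mean / variance over a stream.
# Each update is O(1) in time and memory, so the input can be unbounded.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # sample variance; 0.0 until we have at least two observations
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
# mean = 5.0, sample variance = 32/7
```

The same shape (small mutable state, per-record update, result readable at any time) is what makes a model or statistic refreshable between batch deployments.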
...
Some of these may lead to `Hudi` HIPs, some to extensions, and others to broader solutions beyond `Hudi` itself but where `Hudi` plays a part.
Use cases
(UC) Ability to support deletes
- Applies to the GDPR "right to be forgotten" requirement.
- Context : https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
- Status : wip https://github.com/apache/incubator-hudi/pull/635
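The mechanics behind this use case can be sketched without the Hudi API: in a store of immutable files, a record-level delete means locating the file that holds the record key and rewriting only that file. The names below (`files`, `key_index`, `delete_record`) are hypothetical, for illustration only.

```python
# Hypothetical sketch (NOT the Hudi API) of record-level delete over
# immutable file groups: a key index maps record key -> file, and a
# delete rewrites only the file containing that key.
files = {
    "part-0": {"user-1": {"email": "a@example.com"},
               "user-2": {"email": "b@example.com"}},
    "part-1": {"user-3": {"email": "c@example.com"}},
}
key_index = {"user-1": "part-0", "user-2": "part-0", "user-3": "part-1"}

def delete_record(key):
    """Drop `key` by rewriting the single file that contains it."""
    file_id = key_index.pop(key)
    files[file_id] = {k: v for k, v in files[file_id].items() if k != key}

# GDPR erasure request for user-2: part-0 is rewritten, part-1 untouched.
delete_record("user-2")
```

Keyed, targeted rewrites like this are what make "right to be forgotten" tractable on a data lake, versus rewriting whole partitions.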
(UC) Integrate `Hudi` with Apache Beam so that Beam's sliding data window abstractions can run on top of Parquet files incrementally updated through `Hudi`
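The sliding-window semantics referred to above can be sketched in plain Python (this is not the Beam API, just its window-assignment rule): windows of length `size` start every `period`, and each timestamped element lands in every window whose span covers it.

```python
# Pure-Python sketch of Beam-style sliding windows (not the Beam API).
# Windows of length `size` start at multiples of `period`; an element at
# timestamp ts is assigned to every window [start, start + size) with
# start <= ts < start + size.
from collections import defaultdict

def assign_sliding_windows(events, size, period):
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // period) * period  # latest window start covering ts
        while start > ts - size:         # walk back over overlapping windows
            if start >= 0:
                windows[(start, start + size)].append(value)
            start -= period
    return dict(windows)

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
wins = assign_sliding_windows(events, size=2, period=1)
# window (0, 2) holds ["a", "b"]; window (2, 4) holds ["c", "d"]
```

With overlapping windows like these, each new Parquet commit only affects the handful of open windows that cover its timestamps, which is the property that makes the incremental pairing with `Hudi` attractive.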
...
- Apache Arrow
- Dremio
- Project DAWN / Weld (https://www.weld.rs/)
Resources / reading list
- "Evaluating End-to-End Optimization for Data Analytics Applications in Weld"
- "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
- "Accessible Machine Learning through Data Workflow Management"