...

  • `experimental` is defined as:
    1. exploratory data analysis
    2. development in notebooks
    3. essentially ad-hoc choice of tools
    4. generally batch only, "one off", manual execution
    5. small data, manual sampling
    6. models are trained offline
    7. the end result being reports, diagrams, etc.
  • `production` = pretty much the opposite
    1. the end results are enterprise data science applications
    2. run in production
    3. with large, multi-dimensional `data set`s that do not fit in RAM and are logically infinite
    4. hence the algorithms / analysis must be incremental
    5. use of managed `data set`s: `data lake`s, `feature store`s
    6. models are trained incrementally (_"training offline periodically and refreshed / deployed every few hours/days"_)
    7. with awareness of `concept drift`, `distribution drift`, `adversarial attacks` and able to adapt
    8. use complex orchestration between the core analysis and decision layer, model monitoring, and other application logic and business processes, some involving human interaction
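
To make the "must be incremental" requirement concrete, here is a minimal sketch of an incremental statistic: a running mean and variance (Welford's method) that is updated one batch at a time, so the full data set never has to fit in RAM. The names `RunningStats` and `consume` are hypothetical, chosen for illustration only, and nothing here uses a `Hudi` API.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance, updated one value at a time (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 when no values have been seen yet.
        return self.m2 / self.count if self.count else 0.0


def consume(batches, stats=None):
    """Fold an iterable of batches into the statistics.

    Passing a previously returned `stats` object continues from where the
    last run stopped, so each run only touches newly arrived batches.
    """
    if stats is None:
        stats = RunningStats()
    for batch in batches:
        for x in batch:
            stats.update(x)
    return stats


# Hypothetical usage: two batches now, one more batch later.
stats = consume([[1.0, 2.0], [3.0, 4.0]])
stats = consume([[5.0]], stats)
```

The state kept between runs is three numbers, independent of how much data has been seen, which is the property the `production` setting needs from its algorithms.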

...

Some of these may lead to `Hudi` HIPs, some to extensions, and others to broader solutions that go beyond `Hudi` itself but in which `Hudi` plays a part.

Use cases

(UC) Ability to support deletes

(UC) Integrate `Hudi` with Apache Beam so that Beam's sliding data window abstractions can run on top of Parquet files incrementally updated through `Hudi`
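
As a rough illustration of the sliding-window use case, the sketch below assigns timestamped records to overlapping windows and updates per-window sums as each new batch arrives. It is plain Python and does not call the Apache Beam or `Hudi` APIs; `sliding_windows`, `apply_batch`, and the integer timestamps are assumptions made for the example.

```python
from collections import defaultdict


def sliding_windows(timestamp, size, period):
    """Return every [start, end) window of length `size`, starting at a
    multiple of `period`, that contains `timestamp`."""
    start = timestamp - (timestamp % period)  # latest window start <= timestamp
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


def apply_batch(window_sums, batch, size, period):
    """Fold one batch of (timestamp, value) records into the per-window sums.

    Only the windows touched by the new records are updated, so earlier
    batches never need to be re-read.
    """
    for timestamp, value in batch:
        for window in sliding_windows(timestamp, size, period):
            window_sums[window] += value
    return window_sums


# Hypothetical usage: windows of length 10 that start every 5 time units.
sums = defaultdict(float)
apply_batch(sums, [(7, 1.0), (12, 2.0)], size=10, period=5)
apply_batch(sums, [(8, 4.0)], size=10, period=5)  # a later incremental batch
```

In the integration described above, each batch would correspond to the records changed since the last read, rather than an in-memory list.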

...

  1. "Evaluating End-to-End Optimization for Data Analytics Applications in Weld"
    1. http://www.vldb.org/pvldb/vol11/p1002-palkar.pdf
  2. "Bridging the Gap: Towards Optimization Across Linear and Relational Algebra"
    1. https://h2020-streamline-project.eu/wp-content/uploads/2017/11/lara.pdf
  3. "Accessible Machine Learning through Data Workflow Management"
    1. https://eng.uber.com/machine-learning-data-workflow-management/