...

To further the initial vision from https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop through use cases, patterns, designs, applications, code, etc., building on other technologies and state-of-the-art research.

General context

While "analytics" covers a broad area of functionality, here we distinguish between `experimental` and `production`, and we focus on `production`.

  • `experimental` is defined as 
    1. exploratory data analysis
    2. development in notebooks
    3. essentially ad-hoc choice of tools
    4. generally batch only, "one off", manual execution
    5. small data, manual sampling
    6. models are trained offline
    7. the end result being reports, diagrams, etc.
  • `production` = pretty much the opposite
    1. the end result is enterprise data science applications
    2. run in production
    3. with large, multi-dimensional `data set`s that do not fit in RAM and are logically infinite
    4. hence the algorithms / analysis must be incremental
    5. use of managed `data set`s: `data lake`s, `feature store`s
    6. models are trained online
    7. with awareness of `concept drift`, `distribution drift`, `adversarial attacks` and the ability to adapt
    8. use complex orchestration between core analysis and decision layer, model monitoring and other application logic and business processes, some involving human interactions
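
To make the "incremental" requirement concrete, here is a minimal sketch of a statistic that is updated one value at a time, so the full `data set` never has to fit in RAM. It uses Welford's online algorithm for mean and variance. The names `RunningStats` and `consume` are hypothetical, chosen for illustration only; they do not come from any project referenced on this page.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean and variance maintained incrementally (Welford's algorithm).

    Hypothetical helper for illustration; not part of any referenced library.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Fold one new observation into the state in O(1) time and memory.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 before any value has been seen.
        return self.m2 / self.count if self.count else 0.0


def consume(stream, stats=None):
    """Update `stats` with every value in `stream` and return it.

    Passing the previous `stats` back in continues the computation,
    so new data never forces a recomputation over old data.
    """
    if stats is None:
        stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Calling `consume` on a first batch and then again on a later batch, passing the returned state back in, should give the same result as a single pass over all the data. The state carried between calls is three numbers, regardless of how much data has been seen.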

...

source https://github.com/productml/blurr 

Why "continuous"

  • Not about "batch" or "stream" - this only defines how data first arrives (in the `data lake`) but does not fully define how analysis code uses data programmatically
  • A more appropriate mindset might be the `sliding window` from `Apache Beam`, which provides powerful "window" semantics
  • Not about "pipelines" - this implies a unidirectional, one-pass over data
  • Complex algorithms / analyses / machine learning applications in a production/commercial setting are multi-pass, iterative and tend to run "online".
  • This means that each component of the analysis produces data that others consume.
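
The `sliding window` idea mentioned above can be sketched in plain Python. This is not the `Apache Beam` API; it only illustrates the semantics of overlapping windows of a fixed `size` that start every `period` time units, under the assumption that events are `(timestamp, value)` pairs with numeric timestamps. The function names `sliding_windows` and `windowed_sums` are hypothetical.

```python
from collections import defaultdict


def sliding_windows(timestamp, size, period):
    """Return every [start, end) window that contains `timestamp`.

    Windows have length `size` and start at multiples of `period`.
    Illustrative helper only; not an Apache Beam function.
    """
    # Latest window start that is <= timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def windowed_sums(events, size, period):
    """Sum event values per sliding window, one event at a time."""
    sums = defaultdict(float)
    for timestamp, value in events:
        for window in sliding_windows(timestamp, size, period):
            sums[window] += value
    return dict(sums)
```

Because each event is assigned to its windows independently, the same code applies whether the events are replayed from the `data lake` or arrive one by one; how the data arrived is decoupled from how the analysis consumes it.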

Putting this data in Kafka, separate databases, etc. not only breaks the logical cohesion of the analysis but also leaves no room for global optimization.
See the premise of Weld : https://www.weld.rs/assets/weld-strata.pdf
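
A toy sketch of what "room for global optimization" can mean: when each analysis step hands a fully materialized result to the next, nothing can be optimized across the boundary; when the steps are visible together, they can be fused into a single pass. This is plain Python, not the Weld API, and all function names are hypothetical.

```python
def normalize(values, scale):
    # Step 1, isolated: builds a full intermediate copy of the data.
    return [v / scale for v in values]


def clip(values, upper):
    # Step 2, isolated: builds a second full intermediate copy.
    return [min(v, upper) for v in values]


def total_materialized(values, scale, upper):
    """Each step materializes its output before the next one starts."""
    normalized = normalize(values, scale)
    clipped = clip(normalized, upper)
    return sum(clipped)


def total_fused(values, scale, upper):
    """The same computation fused into one pass with no intermediates."""
    return sum(min(v / scale, upper) for v in values)
```

Both functions compute the same value. The fused form is only possible when the steps are expressed in one place, which is the cohesion that is lost when intermediate data is pushed into separate systems.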


Hypothesis : We need "continuous" analytics in both the time dimension and in the physical dimension.

...