Self Link:  https://s.apache.org/beam-design-docs

Documents by category

Project Incubation (2016)

  • Original Drive Folder for Incubation Docs [Google Drive folder]
  • Technical Vision [doc], [slides]
  • Repository Structure [doc]
  • Flink runner: Current status and development roadmap [doc]
  • Spark Runner Technical Vision [doc]
  • PPMC deep dive [slides]

...

  • Checkpoints [doc]
  • A New DoFn [doc], [slides]
  • Proposed Splittable DoFn API changes [doc]
  • Splittable DoFn (Obsoletes Source API) [doc]
    • Reimplementing Beam API classes on top of Splittable DoFn on top of Source API [doc]
    • New TextIO features based on SDF [doc]
    • Watch transform [doc]
    • Bundles w/ SplittableDoFns [doc]
    • Custom Runner-issued Checkpoint [doc]
  • State and Timers for DoFn [doc]
    • Portable OrderedListState [doc]
  • ContextFn [doc]
  • Static Display Data [doc]
  • Lateness (and Panes) in Apache Beam [doc]
  • Triggers in Apache Beam [doc]
  • Triggering is for sinks [doc] (not implemented)
  • Guard against “Trigger Finishing” [doc]
  • Pipeline Drain [doc]
  • Pipelines Considered Harmful [doc]
  • Side-Channel Inputs [doc]
  • Dynamic Pipeline Options [doc]
  • SDK Support for Reading Dynamic PipelineOptions [doc]
  • Fine-grained Resource Configuration in Beam [doc]
  • External Join with KV Stores [doc]
  • Error Reporting Callback (WIP) [doc]
  • Snapshotting and Updating Beam Pipelines [doc]
  • Requiring PTransform to set a coder on its resulting collections [mail]
  • Support of @RequiresStableInput annotation [doc], [mail]
  • [PROPOSAL] @OnWindowExpiration [mail]
  • AutoValue Coding and Row Support [doc]
  • HyperLogLog++ Integration with Apache Beam [doc]
  • Retractions [doc]
  • @RequiresTimeSortedInput annotation for stateful DoFns [doc]
  • GroupIntoBatches with Runner Determined Sharding [doc]
  • Runner and Fn API StateSpec Mismatch [doc]

IO / Filesystem

  • IOChannelFactory Redesign [doc]
  • Configurable BeamFileSystem [doc]
  • New API for writing files in Beam [doc]
  • Dynamic file-based sinks [doc]
  • Beam GCP Debuggability Metrics [doc]
  • KafkaIO
    • Event Time and Watermarks in KafkaIO [doc]
    • Exactly-once Kafka sink [doc]
    • KafkaIO Dynamic Read [doc]
  • CDAP IO [doc]
  • Schema Aware Beam IOs [doc]
  • Client-Side Throttling Overview [doc]

Metrics

  • Defining and Adding SDK Metrics via FN API [doc]
  • Histogram Style Metrics [doc]
  • Get Metrics API: Metric Extraction via proto RPC API. [doc]
  • Metrics API [doc]
  • I/O Metrics [doc]
  • Metrics extraction independent from runners / execution engines [doc]
  • Watermark Metrics [doc]
  • Support Dropwizard Metrics in Beam [doc]
  • Beam GCP Debuggability Metrics [doc]

...

  • More Expressive PAsserts [doc]
  • Mergebot design document [doc]
  • Performance tests for commonly used file-based I/O PTransforms [doc]
  • Performance tests results analysis and basic regression detection [doc]
  • Eventual PAssert [doc]
  • Testing I/O Transforms in Apache Beam [doc]
  • Reproducible Environment for Jenkins Tests By Using Container [doc]
  • Keeping precommit times fast [doc]
  • Increase Beam post-commit tests stability [doc]
  • Beam-Site Automation Reliability [doc]
  • Managing outdated dependencies [doc]
  • Automation For Beam Dependency Check [doc]
  • Test performance of core Apache Beam operations [doc]
  • Add static code analysis quality gates to Beam [doc]
  • Portable batch & streaming load tests in all sdks [doc]
  • Storing, displaying and detecting anomalies in test results [doc]
  • Add ARM Support to Beam SDK Container Images [doc]

Deployment

  • Beam on Flink on Kubernetes [doc]

...

  • Beam Python User State and Timer APIs [doc]
  • Python Kafka connector [doc]
  • Python 3 support [doc]
  • Splittable DoFn for Python SDK [doc]
  • Parquet IO for Python SDK [doc]
  • Building Python Wheels [doc]
  • Beam Type Hints for Python 3 [doc]
  • Pandas Dataframe API for Beam [doc]
  • Batched DoFns [doc]
  • PEP 585 Type Hints for Python 3.9+ [doc]
  • The Current State of Beam Python Type Hinting (as of 2.52.0) [doc]
  • Enrichment transform [doc]

Go

  • Apache Beam Go SDK design [doc]
  • Go SDK Vanity Import Path [doc] (unimplemented)
    • Needs to be adjusted to account for Go Modules.
  • Go SDK Integration Tests [doc]
  • Design RFC
    • Assumes Beam knowledge, but points out how Go's features informed the SDK design.
  • User Defined Coders + Original Schema Sketch 
  • Splittable DoFns for the Go SDK [doc]
  • Self-Checkpointing SDFs for the Go SDK [doc]
  • Bundle Finalization in the Go SDK [doc]
  • Watermark Estimation in the Go SDK [doc]
  • State and Timers in the Go SDK [doc]
  • Using Generics for Registration [doc]
  • Side Input Window Mapping [doc]
  • MultiMap Side Input Support [doc]
  • One-Pagers:
    • Investigation: Go Expansion Service Auto-Startup for Dev Environments [doc]

Machine Learning

  • Custom Inference Functions [doc]
  • Model Updates using Side Inputs [doc]
  • RunInference: ML Inference in Beam [doc]
  • beam.MLTransform [doc]
  • Embeddings in MLTransform [doc]
  • TensorFlow Model Handler [doc]
  • Hugging Face Model Handler [doc]
  • Per Key Inference [doc]
  • Benchmarking RunInference with Multi-Process Shared Models [doc]

Other

  • Euphoria - High-Level Java 8 DSL [doc]
  • Apache Beam Code Review Guide [doc]
  • Nexmark - Nexmark
  • Slowly Changing Side Inputs (or Slowly Changing Dimensions Support) [doc]

Some of the documents are available on this Google Drive.

...