Apache Beam

Welcome! This is the Apache Beam Wiki, with tips, tricks, and detailed guides for contributors.

If you want to learn about how to use Apache Beam, start with https://beam.apache.org.

IDE Tips

Technical/Design Documents

Portability Framework

- The model protos contain all aspects of the portability API and are the source of truth. The proto definitions supersede any design documents. The main design documents are the following:
- Runner API. Pipeline representation and discussion on primitive/composite transforms and optimizations.
- Job API. Job submission and management protocol.
- Fn API. Execution-side control and data protocols and overview.
- Container contract. Execution-side docker container invocation and provisioning protocols. See CONTAINERS.md for how to build container images.
- Cross language. Options and tradeoffs for how to handle various kinds of multi-language/multi-SDK pipelines.

Metrics

- Metrics architecture inside the runners

Nexmark

- Nexmark code

Gradle

- Optimize Gradle Settings

Works In Progress

Portability Framework

The primary Beam vision: Any SDK on any runner. This is a cross-cutting effort across Java, Python, and Go, and every Beam runner.

Read more

Python 3 Support

Work is in progress to add Python 3 support to Beam. The current goal is to make the Beam codebase compatible with both Python 2.7 and Python 3.4.

Contributions are welcome! If you are interested in helping, select an unassigned issue in the Kanban board and assign it to yourself. Comment on the issue if you cannot assign it yourself. When submitting a new PR, please tag @RobbeSneyders, @aaltay, and @tvalentyn.

Next Java LTS version support (Java 11 / 18.9)

Work to support the next LTS release of Java is in progress. For more details about the scope and the individual tasks, see the JIRA ticket.

JIRA: BEAM-2530
Contact: Ismaël Mejía

IO Performance Testing

We are also working on writing Performance Tests for IOs and developing a Performance Testing Framework for them. Contributions are welcome in the following areas:

- developing more IO Performance Tests (IOITs)
- providing the necessary Kubernetes infrastructure (e.g. for databases or filesystems to be used in tests)
- running Performance Tests on runners other than Dataflow and Direct
- improving the existing Performance Testing Framework and its documentation

See the documentation and the initial proposal (for file-based tests).

If you’re willing to help in this area, tag the following people in PRs: @chamikaramj, @DariuszAniszewski, @lgajowy, @szewi, @kkucharc.

Euphoria Java 8 DSL

An easy-to-use Java 8 DSL for the Beam Java SDK. It provides a high-level abstraction of Beam transformations that is easy to both read and write. It can be used as a complement to existing Beam pipelines (convertible back and forth). For a glimpse of the API, see the WordCount example.

Feature branch: dsl-euphoria
JIRA: dsl-euphoria / BEAM-3900
Contact: David Moravek

Improving the contributor experience

The goal is to make it easier to write code, run tests, and release. Current work includes investigating Docker for Jenkins builds, automating the release process, and improving the reliability of tests.

Ideas and help welcome! Contact: Alan Myrvold, Mark Liu, Yifan Zou

Beam SQL

Beam SQL has many areas open to contribution: support for new operators, new connectors, performance measurement and improvement, more complete specification and testing, etc.

JIRA: dsl-sql
Contact: Kenneth Knowles

Add benchmarks to continuous integration

Run Nexmark benchmark queries after each commit for the Spark, Flink, and Direct runners, and export response times to performance dashboards.

JIRA: nexmark-perfkit
Contact: Etienne Chauchot

Extract metrics in a runner agnostic way

Metrics are pushed by the runners to configurable sinks (an HTTP REST sink is available). This is already enabled in the Flink and Spark runners. Work is in progress for Dataflow.

JIRA: runner-agnostic-metrics
Contact: Etienne Chauchot
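
As a rough illustration of the "push to a configurable sink" idea described above, here is a minimal sketch. It is not Beam's actual API: the names MetricsSink, InMemorySink, HttpSink, and push_metrics, the JSON payload shape, and any URL passed to HttpSink are all hypothetical and chosen only for this example.

```python
import json
import urllib.request
from typing import Dict, List, Protocol


class MetricsSink(Protocol):
    """Hypothetical sink interface: receives one serialized metrics snapshot."""

    def write(self, payload: str) -> None: ...


class InMemorySink:
    """Keeps payloads in a list, which is handy for tests."""

    def __init__(self) -> None:
        self.payloads: List[str] = []

    def write(self, payload: str) -> None:
        self.payloads.append(payload)


class HttpSink:
    """POSTs each payload as JSON to a configured URL (illustrative only)."""

    def __init__(self, url: str, timeout: float = 5.0) -> None:
        self.url = url
        self.timeout = timeout

    def write(self, payload: str) -> None:
        request = urllib.request.Request(
            self.url,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # The response body is ignored; an HTTP error raises an exception.
        with urllib.request.urlopen(request, timeout=self.timeout):
            pass


def push_metrics(counters: Dict[str, int], sink: MetricsSink) -> str:
    """Serialize a counter snapshot and hand it to the configured sink."""
    payload = json.dumps({"counters": counters}, sort_keys=True)
    sink.write(payload)
    return payload


# The caller only depends on the sink interface, so swapping InMemorySink
# for HttpSink requires no change to push_metrics.
sink = InMemorySink()
push_metrics({"elements_read": 42, "elements_written": 40}, sink)
print(sink.payloads[0])
```

Because the pushing side only knows about the sink interface, the same code path could serve any runner; that is the sense in which the extraction is runner agnostic in this sketch.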
