Welcome! This is the Apache Beam Wiki, with tips, tricks, and detailed guides for contributors.
If you want to learn about how to use Apache Beam, start with https://beam.apache.org
IDE Tips
Technical/Design Documents
- Portability Framework
- The model protos contain all aspects of the portability API and are the source of truth. The proto definitions supersede any design documents. The main design documents are the following:
Runner API. Pipeline representation and discussion on primitive/composite transforms and optimizations.
Job API. Job submission and management protocol.
Fn API. Execution-side control and data protocols and overview.
Container contract. Execution-side Docker container invocation and provisioning protocols. See CONTAINERS.md for how to build container images.
- Cross language. Options and tradeoffs for how to handle various kinds of multi-language/multi-SDK pipelines.
- Metrics
- Nexmark
- Gradle
Works In Progress
Portability Framework
The primary Beam vision: Any SDK on any runner. This is a cross-cutting effort across Java, Python, and Go, and every Beam runner.
Apache Spark 2.0 Runner
- Feature branch: runners-spark2
- Contact: Jean-Baptiste Onofré
JStorm Runner
- Docs
- Feature branch: jstorm-runner
- JIRA: runner-jstorm / BEAM-1899
- Contact: Pei He
MapReduce Runner
- Feature branch: mr-runner
- JIRA: runner-mapreduce / BEAM-165
- Contact: Pei He
Tez Runner
- Feature branch: tez-runner
- JIRA: runner-tez / BEAM-2709
Go SDK
- Contact: Henning Rohde
Python 3 Support
Work is in progress to add Python 3 support to Beam. The current goal is to make the Beam codebase compatible with both Python 2.7 and Python 3.4.
Contributions are welcome! If you are interested in helping, you can select an unassigned issue in the Kanban board and assign it to yourself. Comment on the issue if you cannot assign it yourself. When submitting a new PR, please tag @RobbeSneyders, @aaltay, and @tvalentyn.
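As a rough illustration of what "compatible with both Python 2.7 and Python 3.4" means in practice, here is a minimal sketch of version-neutral helpers. It is not taken from the Beam codebase; the names `PY2`, `text_type`, `to_text`, and `sorted_items` are hypothetical and chosen only for this example.

```python
import sys

# Hypothetical helper flag: True only when running on Python 2.
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name exists only on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return value as a text string on both Python 2 and Python 3."""
    if isinstance(value, bytes):
        # On Python 2, bytes is an alias for str, so this also covers
        # plain string literals there.
        return value.decode(encoding)
    return text_type(value)


def sorted_items(mapping):
    """Return (key, value) pairs as a sorted list on both versions.

    dict.items() is a list on Python 2 and a view on Python 3;
    sorted() accepts either and always returns a list.
    """
    return sorted(mapping.items())
```

The design choice in this sketch is to isolate version differences in a few small helpers so that the rest of the code does not branch on the interpreter version.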
Next Java LTS version support (Java 11 / 18.9)
Work to support the next LTS release of Java is in progress. For details about the scope and the individual tasks, please see the JIRA ticket.
- JIRA: BEAM-2530
- Contact: Ismaël Mejía
IO Performance Testing
We are also working on writing Performance Tests for IOs and developing a Performance Testing Framework for them. Contributions are welcome in the following areas:
- developing more IO Performance Tests (IOITs)
- providing the necessary Kubernetes infrastructure (e.g. for databases or filesystems to be used in tests)
- running Performance Tests on runners other than Dataflow and Direct
- improving the existing Performance Testing Framework and its documentation
See the documentation and the initial proposal (for file-based tests).
If you’re willing to help in this area, tag the following people in PRs: @chamikaramj, @DariuszAniszewski, @lgajowy, @szewi, @kkucharc
Euphoria Java 8 DSL
An easy-to-use Java 8 DSL for the Beam Java SDK. It provides a high-level abstraction of Beam transformations that is easy to both read and write. It can be used as a complement to existing Beam pipelines (convertible back and forth). For a glimpse of the API, see the WordCount example.
- Feature branch: dsl-euphoria
- JIRA: dsl-euphoria / BEAM-3900
- Contact: David Moravek
Improving the contributor experience
Making it easier to write code, run tests, and release. We are investigating using Docker for Jenkins builds, automating the release process, and improving the reliability of tests.
Ideas and help welcome! Contact: Alan Myrvold, Mark Liu, Yifan Zou
Beam SQL
Beam SQL has many areas open to contribution: support for new operators, new connectors, performance measurement and improvement, more complete specification and testing, etc.
- JIRA: dsl-sql
- Contact: Kenneth Knowles
Add benchmarks to continuous integration
Run Nexmark benchmark queries after each commit on the Spark, Flink, and Direct runners, and export response times to performance dashboards.
- JIRA: nexmark-perfkit
- Contact: Etienne Chauchot
Extract metrics in a runner agnostic way
Metrics are pushed by the runners to configurable sinks (an HTTP REST sink is available). This is already enabled in the Flink and Spark runners. Work is in progress for Dataflow.
- JIRA: runner-agnostic-metrics
- Contact: Etienne Chauchot
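The push model described above, where a runner hands metrics to whichever sinks are configured, can be sketched roughly as follows. This is an illustration under stated assumptions, not Beam's actual API: the class names `MetricsSink` and `JsonPayloadSink`, the function `push_metrics`, and the snapshot layout are all hypothetical. The second class only builds the JSON body that an HTTP REST sink might send and performs no network call.

```python
import json


class MetricsSink(object):
    """Hypothetical sink interface: receives one snapshot of metrics."""

    def write(self, snapshot):
        raise NotImplementedError


class JsonPayloadSink(MetricsSink):
    """Stand-in for an HTTP REST sink.

    It serializes each snapshot to the JSON body it would send and
    stores it in memory instead of making a request.
    """

    def __init__(self):
        self.payloads = []

    def write(self, snapshot):
        self.payloads.append(json.dumps(snapshot, sort_keys=True))


def push_metrics(snapshot, sinks):
    """Runner-side step: hand the same snapshot to every configured sink."""
    for sink in sinks:
        sink.write(snapshot)


# Example: one counter value pushed to a single configured sink.
snapshot = {"counters": {"elements_read": 42}}  # assumed snapshot layout
sink = JsonPayloadSink()
push_metrics(snapshot, [sink])
```

Because the runner only depends on the sink interface in this sketch, adding a new destination means adding a new sink class rather than changing runner code, which is the sense in which the extraction is runner agnostic.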