Problem: Flink's build is too slow

We want to reduce Flink's local and CI build times. This page collects the options.

§1 Optimize current setup

We currently use Maven + Travis CI + custom scripts. These proposals keep this setup but refine it.

Enable JVM reuse for IT cases in more modules (Solution 1)

Benefits:

  • Speedups in blink planner (7 minutes saved)
  • Considered easy to implement

...

  • Not all tests are doing a proper cleanup

Custom differential build scripts (Solution 2)

Benefits:

  • Only build & test affected modules

...

  • Needs a defensive/pessimistic design to catch all potential issues
  • Development and maintenance of "homegrown" scripts that work around Maven limitations
  • Reinventing the wheel to compensate for the limitations of a bad build tool (Maven)
  • Complex, non-standard build system
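The core of such a differential build, selecting the modules affected by a change, could be sketched roughly as follows. This is a minimal illustration and not Flink's actual tooling: the flat module layout, the dependency graph in `DEPENDENCIES`, and the helper names are all hypothetical. A change to a file outside any known module falls back to building everything, in line with the defensive design mentioned above.

```python
from collections import deque

# Hypothetical module graph: module -> modules it depends on directly.
# A real script would have to derive this from the Maven poms.
DEPENDENCIES = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-streaming-java": ["flink-runtime"],
    "flink-connector-kafka": ["flink-streaming-java"],
    "flink-table-planner": ["flink-streaming-java"],
}


def module_of(path):
    """Return the module owning a changed file, or None if it is unknown."""
    top = path.split("/", 1)[0]
    return top if top in DEPENDENCIES else None


def affected_modules(changed_files):
    """Return the changed modules plus everything that transitively depends on them."""
    # Invert the graph: module -> modules that depend on it.
    dependents = {module: [] for module in DEPENDENCIES}
    for module, deps in DEPENDENCIES.items():
        for dep in deps:
            dependents[dep].append(module)

    changed = set()
    for path in changed_files:
        module = module_of(path)
        if module is None:
            # Pessimistic fallback: an unmapped file (e.g. the root pom.xml)
            # may affect any module, so build and test all of them.
            return set(DEPENDENCIES)
        changed.add(module)

    # Breadth-first walk over the dependents of every changed module.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

The resulting set could then be handed to Maven's `-pl` option. Maven's own `-amd` (also-make-dependents) flag covers part of this logic, so a custom script mostly adds the mapping from changed files to modules and the fallback rules.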

Only run smoke tests when PR is opened, run heavy tests on demand (Solution 3)

Benefits:

  • Execute fewer tests, heavy tests on demand

...

  • Custom implementation with "ci-bot" likely
  • Committers need to know which test runs to request / run
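How a bot might decide which test suites to run for a pull request can be sketched as below. Everything here is an assumption made for illustration: the `@ci-bot run <suite>` comment syntax, the suite names, and the function name are hypothetical and do not describe an existing bot.

```python
import re

# Hypothetical suite names. "smoke" runs automatically when a PR is opened;
# the others only run when someone requests them in a comment.
DEFAULT_SUITES = {"smoke"}
ON_DEMAND_SUITES = {"kafka", "e2e", "table"}

# Hypothetical command syntax: "@ci-bot run <suite> [<suite> ...]" at the start of a line.
COMMAND = re.compile(r"^@ci-bot\s+run\s+(.+)$", re.MULTILINE)


def suites_to_run(pr_comments):
    """Return (suites to execute, unknown suite names that were requested)."""
    suites = set(DEFAULT_SUITES)
    unknown = set()
    for comment in pr_comments:
        for match in COMMAND.finditer(comment):
            for name in match.group(1).split():
                if name in ON_DEMAND_SUITES:
                    suites.add(name)
                else:
                    # Report typos back instead of silently ignoring them.
                    unknown.add(name)
    return suites, unknown
```

Returning the unknown names lets the bot tell the requester which suites exist, which eases the problem that committers need to know what to request.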

Move more tests into cron builds (Solution 4)

Benefits:

  • Almost no custom implementation needed (a cheap version of Solution 3)

...

  • Poor developer experience: People expect to get fast feedback on their changes
  • Failures in cron builds potentially go unnoticed for quite some time (months)
  • Potential of lower long-term build quality

Work towards parallelizing the build better

Benefits:

  • Moving to a build infrastructure with more CPU cores will allow us to run more build / test workloads concurrently

...

  • Maven checkstyle plugin
  • Kafka tests (30 minutes of sequential execution)
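Why sequential stages such as the Kafka tests limit the gain from more CPU cores can be illustrated with a simple lower bound. This is a back-of-the-envelope sketch: the function name is hypothetical, the 30 minutes is the Kafka figure quoted above, and the 240 minutes of total work is a made-up number used only for illustration.

```python
def build_time_lower_bound(total_work_min, longest_sequential_min, cores):
    """Lower bound on wall-clock minutes.

    No schedule can finish faster than the longest sequential stage,
    nor faster than the total work spread evenly over all cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return max(longest_sequential_min, total_work_min / cores)


# 30 minutes: sequential Kafka tests (from the list above).
# 240 minutes: hypothetical total single-core work, for illustration only.
for cores in (2, 8, 32):
    print(cores, build_time_lower_bound(240, 30, cores))
```

Under these assumed numbers, going beyond 8 cores no longer lowers the bound, because the sequential stage dominates. Splitting or parallelizing such stages is therefore a precondition for benefiting from larger machines.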

Use Gradle Enterprise Global Build Cache

Gradle Enterprise provides a Maven plugin for global build caches.

...

  • Relies on a proprietary product
  • Unclear if it works for anonymous Flink contributors


§2 Switch Build System

We currently use Maven + custom scripts

Use Gradle (Solution 5)

Benefits:

  • Supports incremental builds and tests
  • Supports a remote build cache, allowing an incremental build without having the earlier increments locally (through "Gradle Enterprise")
  • All build tasks can be solved in code, instead of Maven+scripts

...

  • Apache Kafka is using Gradle
  • Apache Beam migrated from Maven to Gradle by having both build systems side-by-side during the transition
  • Gradle supports Kotlin (as an alternative to Groovy) for the build scripts, but Kotlin support is new and has potential limitations
  • Arvid Heise is willing to support a PoC
    • ~1 week for the PoC (some modules only, not all problems solved)
    • The PoC must cover CI as well
  • Problems to solve
    • Shading & layered shading
    • Inclusion of NOTICE files into the final build (producing valid Apache releases in general)
    • Support for mixed Scala / Java projects
    • Javadocs for mixed Scala / Java projects
    • Java 9+ support
    • API compatibility checks
    • Checkstyle
    • Ensuring dependency convergence
  • Unclear whether we can use the Gradle Enterprise build cache for free as an open source project, and how it works securely over the public internet

Use Bazel

Benefits:

  • Supports incremental builds

...

  • A quick search for shading with Bazel didn't reveal promising results

§3 Switch Build Infrastructure

We currently use Travis CI

...

  • Travis's future is uncertain due to company ownership changes
  • Travis build caches are unreliable / used in a hacky way
  • Travis only provides a build environment with 2 CPUs and 7.5 GB of memory, on which a build currently takes 3.5 hours. Other vendors provide bigger build instances, on which the build can finish in ~1.3 hours
    • Travis provides bigger build environments in paid plans.

Move to another hosted CI service (Solution 6)

Benefits:

  • Low maintenance overhead of a hosted service
  • Similar experience to the current setup

...

  • Azure Pipelines (recommended by community)
    • 10 instances (6 hours each on a machine with 2 cores and 7 GB of memory)
    • Open source projects can add an unlimited number of self-hosted "worker" machines
    • Artefact caching is in preview only: https://docs.microsoft.com/en-us/azure/devops/pipelines/caching/?view=azure-devops 
    • Requires write access to the apache/flink GH repo, which Apache does not allow: INFRA-17030 (ASF JIRA)
  • GitHub CI
    • Closed Beta
    • Seems to be based on Azure Pipelines
  • Circle CI

Paid options:

  • Google Cloud Build
    • 32-core builders, at a high price (almost 4x the compute instances' price)

Move to a self-hosted CI service

Benefits:

  • Lower costs compared to hosted CI service

...

  • Cloud providers
    • Google: $1500/month for 2x 32-core machines
  • Dedicated Servers


§4 Split Repository (TODO)

Flink currently follows a mono-repository approach. Splitting the repository would divide the build time problem into smaller ones. This approach has some additional benefits and issues outside of the build time.

Benefits:

  • Split build time issues
  • Unstable (or worse, permanently failing) tests no longer affect the entire project (the probability of such tests grows with the size of the project)
  • Easier to track pull requests per repository

Problems:

  • Git history
    • fork off new repositories
    • rewrite history
  • Shared Maven dependencies / plugin configurations
    • Idea: set up a new parent pom
  • Building a common documentation out of many repositories
  • End-to-end tests
    • How to share / split tooling
    • In general, tooling will be spread across different repos
  • Releases / Versioning / "internal" dependencies
    • a) Single release across all repositories
    • b) Synced releases
    • c) Separate releases

Approaches:

...

  • Saves ~1 hour of build time

...

See separate page (WIP)