Discussion thread: https://lists.apache.org/thread/h4cmv7l3y8mxx2t435dmq4ltco4sbrgb
Vote thread: https://lists.apache.org/thread/qx91dkm7mornvqqbr5n5sg1z9hj06t55
Result thread: https://lists.apache.org/thread/tycrzlg6wrgx742l70m9xvjcvlhwo335
Release: tba

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Even though it’s debatable whether the CI system is “part of” Apache Flink and, therefore, deserves its own FLIP, I went ahead and created this FLIP because the change affects Flink's users and, more fundamentally, its developers. The structure of this FLIP deviates slightly from the usual template (in terms of headlines).

The goal of this FLIP is to create a basis on which the Flink community can decide whether they want to migrate away from Azure CI and move to GitHub Actions. It’s not the goal to bring in a ready-made, nothing-to-improve GitHub Actions workflow. This FLIP shall enable us to start testing what Apache INFRA can offer on the GitHub Actions side by allowing the installation of GitHub Actions workflows in Flink's core repository.

It's not meant as a final decision on whether we move to GitHub Actions or not. This final decision should be covered by a discussion at the end of the trial run (ideally, with the release of Flink 1.19).

Azure CI vs GitHub Actions

Current Setup

  • Azure CI is used for CI runs due to its generous open-source policies in the past
  • Even though it exists, no proper GitHub integration of the Azure CI pipelines is set up
  • Azure CI requires write access to the repository which is not allowed by Apache. The following workaround was installed:
    • flink-ci GitHub organization was created (owned by Ververica)
    • flink-ci/flink-mirror is used to mirror Apache Flink and run Azure CI on it
    • flink-ci/git-repo-sync is used to synchronize master and the release branches to flink-ci/flink-mirror. It’s deployed on a GCP (AFAIR) VM that’s owned by Ververica right now. This limits the group of people that can jump in if a problem occurs.
    • flink-ci/ci-bot is used to monitor created PRs, synchronizing the PRs with flink-ci/flink-mirror, linking the PR with the corresponding CI run and updating the PR’s labels. It’s deployed on a GCP (AFAIR) VM that’s owned by Ververica right now. This limits the group of people that can jump in if a problem occurs.
  • Secret management is currently handled by a few PMC members within the Azure CI project apache-flink/apache-flink (AFAIR). The Azure CI project is also owned by Ververica which limits the PMC from having full access, too.

Flink’s CI workflows

  • Basic CI workflow
    • Source Code check (Spotless, Checkstyle, Rat plugin)
    • Runs all stages for Java 8 and Hadoop 2.10.2 (JUnit & Integration tests)
    • E2E tests
    • Binary API compatibility checks (japicmp)
    • Is executed for each PR
    • PRs do not have secrets included (i.e. certain tests like those relying on external S3 infrastructure are not executed)
  • Nightly CI workflow
    • Runs on master and the release branches of the two most recently released versions of Flink (currently: 1.18 and 1.17 with a grace period for 1.16 until the release of 1.16.3)
    • Runs the following setups aside from the basic CI workflow:
      • Java 11, Java 17 (both with Hadoop 2.10.2)
      • Java 8 with Hadoop 3.1.3
      • Basic CI workflow with Adaptive Scheduler enabled
      • cron_azure that runs on Azure-provided VMs (rather than the self-hosted runners which are provided by Alibaba). See FLINK-18370 for more details.
    • The nightly workflow can be used to test other features (like it’s currently done with the AdaptiveScheduler) over a longer period of time
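
To illustrate how the nightly runs fan out, here is a minimal Python sketch that enumerates the branch and setup combinations listed above. The class, function and setup names are made up for illustration and are not part of Flink's CI scripts; cron_azure is left out because it describes where a job runs rather than a Java/Hadoop combination.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class NightlySetup:
    """One nightly configuration (names are hypothetical)."""
    name: str
    java: int
    hadoop: str
    adaptive_scheduler: bool = False


# The setups listed for the nightly workflow, on top of the basic CI workflow.
NIGHTLY_SETUPS = [
    NightlySetup("java11", 11, "2.10.2"),
    NightlySetup("java17", 17, "2.10.2"),
    NightlySetup("hadoop313", 8, "3.1.3"),
    NightlySetup("adaptive-scheduler", 8, "2.10.2", adaptive_scheduler=True),
]


def nightly_branches(released_versions: List[str], supported: int = 2) -> List[str]:
    """Return master plus the release branches of the most recent versions."""
    newest = sorted(
        released_versions,
        key=lambda v: tuple(int(part) for part in v.split(".")),
        reverse=True,
    )[:supported]
    return ["master"] + [f"release-{version}" for version in newest]


def nightly_jobs(released_versions: List[str]) -> List[Tuple[str, str]]:
    """Cross every nightly branch with every nightly setup."""
    branches = nightly_branches(released_versions)
    return [(branch, setup.name) for branch, setup in product(branches, NIGHTLY_SETUPS)]
```

With the versions mentioned above, `nightly_branches(["1.16", "1.17", "1.18"])` selects master, release-1.18 and release-1.17; the grace period for 1.16 would need an explicit exception.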

Azure-hosted vs Alibaba-hosted Runners

Flink’s CI workflows put quite a bit of pressure on the runner set that was provided by Apache. That was one reason why it was decided in the past to rely on custom runners. Currently, we use 6 (correct me if I’m wrong here) VMs that are provided by Alibaba. Each of these machines has 5 Azure CI runner agents (again, correct me if I’m wrong about the numbers) deployed.

The JUnit/ITCase stages (i.e. core, table, …) are executed on Alibaba machines (for both PR and nightly runs). The e2e tests and cron_azure run on Azure-provided VMs.

Limitations of GitHub Actions in the past

GitHub Actions allowed organization-wide runners in the past. The Apache Software Foundation provided runners of its own so that Apache projects could run workflows beyond the limited number of runners GitHub offers in general (10 parallel runs).

As mentioned in the previous section, Flink’s CI is quite processing-intensive. Relying only on Apache Infra’s GitHub runners might delay the execution of jobs too much. An alternative is to use self-hosted runners again. But Apache Infra imposed limits (by limiting the number of API tokens for setting up a runner) on how many agents are allowed to run per project. The major concern so far is the security of the provided machines: VMs that deploy self-hosted runners do not destroy themselves after each run, which makes them vulnerable to malicious code.
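
The delay concern can be made concrete with a back-of-the-envelope sketch in Python. It uses the figures quoted in this document (6 Alibaba VMs with 5 agents each, 10 parallel runs on GitHub's side); the number of queued jobs is a purely hypothetical input, and the model ignores differing job durations.

```python
import math


def scheduling_waves(jobs: int, parallel_slots: int) -> int:
    """Number of sequential waves needed to work off `jobs` equally long jobs."""
    if parallel_slots <= 0:
        raise ValueError("parallel_slots must be positive")
    return math.ceil(jobs / parallel_slots)


ALIBABA_SLOTS = 6 * 5   # 6 VMs with 5 runner agents each (figures from this document)
GITHUB_SLOTS = 10       # general GitHub limit mentioned above

queued_jobs = 60        # hypothetical backlog, not a measured value

waves_on_alibaba = scheduling_waves(queued_jobs, ALIBABA_SLOTS)
waves_on_github = scheduling_waves(queued_jobs, GITHUB_SLOTS)
```

Under these assumptions, the same backlog needs three times as many waves on 10 slots as on 30, which is the kind of delay the paragraph above refers to.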

Apache Infra did some experimenting on self-hosted runners in collaboration with Apache Airflow (see ashb/runner with releases/pr-security-options branch) where they only allow certain groups of users (e.g. committers) to run their workflows on self-hosted machines. Any other group would have to rely on GitHub’s runners. We would have to release and deploy our custom runners because the corresponding fix (Draft PR #783 with fix from @ashb with the Airflow runner extension) hasn't ended up in actions/runner yet (Closed issue #494 that covers permission control; Feature request for permission control). Any new release of GitHub's runners requires us to update our runners as well because GitHub will reject runners that don't match the expected version. @ashb came up with an Automated release workflow for Airflow's custom GitHub runner to work around the issue.
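
The permission model described above boils down to a routing decision per workflow run. The following Python sketch shows the idea only; it is not how the Airflow runner extension is implemented. The function name and the committer set are hypothetical, and the runner labels are taken from the hardware tables in this document.

```python
from typing import List, Set

# Labels as listed in the hardware specification tables of this document.
SELF_HOSTED_LABELS = ["self-hosted", "asf-runner"]
GITHUB_HOSTED_LABELS = ["ubuntu-latest"]


def select_runner_labels(author: str, committers: Set[str]) -> List[str]:
    """Route trusted authors to self-hosted runners, everyone else to GitHub's runners.

    `committers` is a hypothetical allow-list; where it would come from
    (e.g. the project's committer roster) is outside the scope of this sketch.
    """
    if author in committers:
        return list(SELF_HOSTED_LABELS)
    return list(GITHUB_HOSTED_LABELS)
```

For example, `select_runner_labels("alice", {"alice"})` would pick the self-hosted labels, while an unknown author falls back to the GitHub-hosted runner.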

Ephemeral self-hosted runners provided by Apache INFRA

Recently, Apache INFRA started to experiment with ephemeral runners based on Kubernetes clusters. Apache Flink could register for the trial period. We will have to see how well these runners scale and how fast they are in processing Flink’s workflows.

One advantage of switching to GitHub Actions here is that we might be able to get ARM support out of the box. Apache INFRA seems to provide runners on ARM machines.

Hardware Specifications

  • Apache INFRA self-hosted ephemeral runners (source: ASF Infra provided self-hosted runners)

    Arch    Azure Label        Workflow label                Cores                                              Memory  Disk
    x64     Standard_DS2_v2    "self-hosted" & "asf-runner"  2x vCPU Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz  7G      1TB
    MacOS*  Standard_D2pds_v5  "self-hosted" & "asf-arm"     2x vCPU                                            8G      1TB
  • GitHub-hosted runners (source: GitHub documentation)

    Arch    Workflow Label                Cores  Memory  Disk
    x64     ubuntu-(latest|22.04|20.04)   2      7GB     14GB
    MacOS*  macos-latest                  2      7GB     14GB
  • Azure CI VMs (source: Azure pipeline documentation)

    Arch    Workflow label                Cores  Memory  Disk
    x64     ubuntu-(latest|22.04|20.04)   2      7GB     14GB SSD (with 10GB free disk space minimum)
    MacOS*  macos-(latest|12|11)          3      14GB    14GB SSD (with 10GB free disk space minimum)

* I didn't find any architecture specifications for MacOS VMs. It looks like self-hosted runners in Azure CI can run on both x64 and ARM (see Azure Pipeline docs).

Apache INFRA RoundTable Discussion

  • Attended Apache INFRA roundtable about self-hosted runners on Dec 6, 2023
  • In the past, there was an Apache-wide limit on the number of runners (250). This limit does not exist anymore. The runner infrastructure is currently not fully utilized.
  • Apache INFRA provides project-specific runners. They also allow companies to donate machines that would be used to host runners. Apache INFRA would still be in control of hosting and managing the VMs.
  • In contrast to hosting the runner deployments as dedicated processes within the VM, there's a project that utilizes Kubernetes and provides self-hosted ephemeral runners: voltrondata-labs/gha-controller-infra (mentioned by Jacob Wujciak from Apache Arrow)
  • Develocity (aka Gradle Enterprise) can be used as an extension to get CI functionality (e.g. test stability analytics) which is currently not provided by GitHub Actions
  • ARM runners are available through Apache INFRA in beta starting in January 2024
  • Likelihood of ephemeral runners going away is quite low: Apache INFRA is pushing for more projects to try it out (there will probably be more Roundtable sessions in the future to gather feedback from Apache projects)

Secret Management

Currently, the secrets (e.g. for S3 API access tokens) are maintained by certain PMC members with access to the corresponding configuration in the Azure CI project. This responsibility will be moved to Apache Infra. They are in charge of handling secrets in the Apache organization. As a consequence, updating secrets becomes a bit more complicated. This can still be considered an improvement from a legal standpoint because the responsibility is transferred from an individual company (i.e. Ververica, which maintains the Azure CI project) to the Apache Software Foundation.

GitHub Actions Experiments

The experiments with GitHub Actions workflows happened on the private fork XComp:flink. Currently, both the basic and the extended CI workflows are installed. A few fixes were necessary, but the pipelines run quite smoothly. My impression so far is, though, that more timeout-related test instabilities are popping up, which need to be fixed while experimenting with and monitoring the GitHub Actions workflows.

Available Resources

Apache Flink discussions of the past

Apache Infra

GitHub Documentation

Public Interfaces

Developer-facing changes

  • No AzureCI account is necessary anymore for forks to run CI outside of a PR
  • flink-ci/ci-bot can be deprecated (it’s currently running on a VM operated by Ververica)
  • flink-ci/git-repo-sync can be deprecated (it’s currently running on a VM operated by Ververica)
  • flink-ci/flink-mirror can be deprecated (flink-ci is currently owned by Ververica)

What’s not part of this change

flink-ci/flink-ci-docker is a repository under the flink-ci organization (owned by Ververica) which was used to generate the Docker image for the CI builds. That’s not the case anymore (or at least right now). Instead, Docker images are built based on Chesnay’s fork zentol/flink-ci-docker.

The current setup is unfortunate. We should move the Docker CI image repository also under the Apache umbrella. Two options exist:

The first option seems to be the most straightforward solution. But we’re also using the CI image for the externalized connectors. Having the CI image as part of the Flink code would add a dependency on Flink again, which we wanted to avoid.

Anyway, this change is out-of-scope for the GHA migration and won’t be handled in more detail in this FLIP.

Migration Plan

The migration plan contains three parallel stages:

  • PR CI: Experimenting with the GHA integration for PRs.
  • Nightly CI: Experimenting with the extended CI runs that are triggered on master and release-1.18 every day.
  • Apache INFRA runners: Experimenting with the different offerings of Apache INFRA. The GitHub runners seem to be good enough for the pipeline. But we might gain more performance (e.g. due to parallelization) with what Apache INFRA can offer.

The phases below cover these three stages: PR CI (on Apache INFRA runners and GitHub-hosted runners), Nightly CI (extended; on Apache INFRA-hosted runners), and Apache INFRA runners.


Setup Phase

1st phase: Add the basic CI workflow for GitHub Actions (trigger: "on push" and "workflow_dispatch" for branches). This would allow the GHA triggering in forks (outside of PRs) using the GitHub-hosted runners* and Apache INFRA-hosted runners for master and release-1.18. This allows us to control the number of workflow runs that are handled by Apache INFRA.

Add the extended CI workflow, which is enabled on master and the release branches and utilizes the Apache INFRA runners. Set up daily scheduled runs for master and release-1.18 (1.17 would require even more backports which I would like to avoid). Copy (or recreate) the secrets for nightly runs in GHA.

Join trial program for ephemeral GHA runners

2nd phase: Add an additional trigger for PR creation. This is deferred to a second phase because it might pick up Apache INFRA's runners for forks (i.e. PRs) and put more pressure on the Apache INFRA resources.

Experimenting Phase

Identify additional functionality that’s provided by Flink’s CI bot and that needs to be added to the GitHub Actions workflow (e.g. for labeling we could utilize the boring-cyborg bot that is already used for externalized connectors)

Investigate the performance & test stability based on GitHub’s generic runners for merged PRs to master and the release branches

Experiment with Apache INFRA's persistent and ephemeral runners.

Final Vote: Azure CI vs GitHub Actions

Another vote will be conducted on Flink's dev mailing list in which the community will be able to decide whether the migration to GitHub Actions shall be performed. In case of a successful vote, Azure CI-related resources will be disabled/removed. All GHA-related changes will be reverted if the community decides to stick to Azure CI.

Finalization Phase

Disable Flink’s CI bot for PRs if all issues are covered and GHA is on par with the current Azure CI version.

Enable nightly runs for all release branches if all issues are resolved and Flink 1.19 is released (we need to deprecate 1.17 because we don't want to backport all the CI script changes). Disable repo synchronization (git-repo-sync). Switch to the right setup for runners as soon as we have a better understanding of what Apache INFRA can offer.

Documentation Phase

Proper in-code documentation (i.e. README) of the CI system.


* The experiments I conducted in my XComp/flink repo suggest that GitHub-hosted runners are already providing good performance to run the CI pipeline.
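
The staged trigger rollout described in the Setup Phase can be sketched in a few lines of Python. This is only an illustration of the decision logic, not an actual workflow definition: push, workflow_dispatch and pull_request are GitHub Actions event names, while the function names and the repository name "apache/flink" are assumptions made for this example.

```python
from typing import List

# Branches that would use Apache INFRA-hosted runners in the Setup Phase.
INFRA_BRANCHES = ("master", "release-1.18")


def workflow_triggers(phase: int) -> List[str]:
    """Events enabled for the basic CI workflow in the given setup phase."""
    triggers = ["push", "workflow_dispatch"]
    if phase >= 2:
        # The PR trigger is added later to limit the load on Apache INFRA.
        triggers.append("pull_request")
    return triggers


def uses_infra_runners(repository: str, branch: str) -> bool:
    """True if a run would land on Apache INFRA-hosted runners.

    Forks (any repository other than the assumed "apache/flink") stay on
    GitHub-hosted runners.
    """
    return repository == "apache/flink" and branch in INFRA_BRANCHES
```

In this model a push to a fork such as XComp/flink triggers the workflow in both phases but never occupies an Apache INFRA runner.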

Rejected Alternatives

  • Integration of Azure CI into GitHub
  • Use Apache Infra’s Jenkins deployment
  • Apache Pulsar restricts CI runs to reviewed PRs only. Contributors are asked to create a branch in their fork as well to use GitHub’s runners, instead. The project itself relies on Apache’s hosted runners. (see related PR)
    • There is a discussion about it in the follow-ups of this Infra Slack post
    • There are concerns shared about losing contributors due to the extra work that’s necessary to set this up.
  • Discussion on whether to use Jenkins back in 2014 (where the Flink community went with Travis out of convenience)