
Status

State: Draft
Discussion Thread: https://lists.apache.org/thread.html/8bfee0f0a52b7b8b9ec63724c48f76a7002aeacf7c86c5a23b48f172@%3Cdev.airflow.apache.org%3E
JIRA:
Created:

Motivation

We have recently started to experience a lot of problems with Travis CI. They are documented in https://lists.apache.org/thread.html/8bfee0f0a52b7b8b9ec63724c48f76a7002aeacf7c86c5a23b48f172@%3Cdev.airflow.apache.org%3E and in https://lists.apache.org/list.html?dev@airflow.apache.org, but to summarise:

  • Travis CI has little incentive to support OSS projects
  • Their service has been deteriorating
  • We have little to no influence on fixing problems, even when we involve Apache Infrastructure
  • Apache Infrastructure actively encourages us to use our own solution and secure some funds (which we did)
  • Google initially donated 3,000 USD for running the builds in GCP
  • Google is working on a long-term regular donation once we make it work and know how much funding we need
  • GitLab CI is open to supporting OSS projects with up to 50,000 minutes of builds/month - it is a Docker-first and Kubernetes-executor-capable CI system
  • We have great support from GitLab, including direct contacts with the GitLab CI team
  • The recent AIP-10 Multi-layered and multi-stage official Airflow CI image change enabled Docker-first execution of all steps of our builds

Considerations

  • We considered the cost of running the builds (it seems that 3,000 USD will be enough for several months).
    We can utilise pre-emptible instances and an auto-scaling Kubernetes cluster to handle our scenario. Details are to be worked out (we should focus on getting it up and running first and then optimise it)
  • Google promised regular funding for the project
  • The system has to be easy to integrate with GitHub, including passing the status of the build back to GitHub
  • The system should be self-maintainable, with as little dedicated Development/Ops maintenance as possible
  • Old Travis CI builds should keep working (contributors should be able to run builds from their own Travis CI or GitLab CI forks as needed)

 What change do you propose to make?

GitHub Actions:

The architecture of the proposed solution is shown here:

The following components are part of the CI infrastructure:

  • Apache Airflow Code Repository - our code repository at https://github.com/apache/airflow
  • Apache Airflow Forks - forks of the Apache Airflow Code Repository from which contributors make Pull Requests
  • GitHub Actions (GA) - UI + execution engine for our jobs
  • GA CRON trigger - GitHub Actions CRON triggering our jobs
  • GA Workers - virtual machines running our jobs at GitHub Actions (max 20 in parallel)
  • GitHub Private Image Registry - image registry used as a build cache for CI jobs. It is at https://docker.pkg.github.com/apache/airflow/airflow (see the sketch after this list)
  • DockerHub Public Image Registry - publicly available image registry at DockerHub. It is at https://hub.docker.com/repository/docker/apache/airflow
  • DockerHub Build Workers - virtual machines running build jobs at DockerHub
  • Official Images (future) - these are official images that are prominently visible on DockerHub. We aim for our images to become official images so that you will be able to pull them with `docker pull apache-airflow`
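To make the caching flow concrete, here is a minimal, hypothetical sketch of a workflow fragment that uses the GitHub Private Image Registry as a build cache; the `master-ci` tag, the build context and the login step are illustrative assumptions, not the actual configuration:

```yaml
# Hypothetical workflow fragment: use the GitHub Private Image Registry as a
# build cache. The image tag and build details are illustrative assumptions.
jobs:
  build-ci-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Log in to the GitHub package registry
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login docker.pkg.github.com -u "${{ github.actor }}" --password-stdin
      - name: Pull the latest cached CI image (tolerate a cache miss)
        run: docker pull docker.pkg.github.com/apache/airflow/airflow:master-ci || true
      - name: Build incrementally on top of the cached layers
        run: docker build --cache-from docker.pkg.github.com/apache/airflow/airflow:master-ci -t docker.pkg.github.com/apache/airflow/airflow:master-ci .
      - name: Refresh the cache image (only in runs with write permission)
        if: github.event_name == 'push'
        run: docker push docker.pkg.github.com/apache/airflow/airflow:master-ci
```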

CI run categories

The following CI job runs are currently run for Apache Airflow; each of them has a different purpose and context.

- Pull Request Run - These runs are the results of PRs from the forks made by contributors. Most builds for Apache Airflow fall into this category. They are executed in the context of the "Fork", not the main Airflow Code Repository, which means that they have only "read" permission to all the GitHub resources (container registry, code repository). This is necessary because the code in those PRs (including the CI job definition) might be modified by people who are not committers for the Apache Airflow Code Repository. The main purpose of those jobs is to check if a PR builds cleanly, if the tests run properly and if the PR is ready to review and merge.
- Direct Push/Merge Run - These runs are the results of direct pushes by the committers or of Pull Requests merged by the committers. They execute in the context of the Apache Airflow Code Repository and also have write permission for GitHub resources (container registry, code repository). The main purpose of the run is to check if the code after merge still holds all the assertions - whether it still builds and all tests are green. This is needed because conflicting changes from multiple PRs might cause build and test failures after merge, even if they do not fail in isolation. Also, those runs have already been reviewed and confirmed by the committers, so they can be used to do some housekeeping - for now they push the most recent image built in the PR to the GitHub Private Registry, which is our image cache for all the builds.
- Scheduled Run - These runs are the results of (nightly) triggered jobs - they execute nightly, only for well-defined branches: master and v1-10-test. Their main purpose is to check that external dependency changes have no impact on the Apache Airflow code (for example, transitive dependency releases that fail the build). They also check that the Docker images can be built from scratch (again, to see that dependencies have not changed - for example downloaded package releases etc.). Another reason for the nightly build is that it tags the most recent master or v1-10-test code with the "master-nightly" and "v1-10-test" tags respectively, so that the DockerHub build can pick up the moved tag and prepare a nightly "public" build on DockerHub.
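As a minimal sketch, these three run categories map directly onto GitHub Actions workflow triggers; the branch list matches the branches mentioned above, while the exact CRON time is an assumption:

```yaml
# Illustrative trigger block covering the three run categories.
on:
  pull_request:                      # Pull Request Run: PRs from contributor forks (read-only context)
    branches: [master, v1-10-test]
  push:                              # Direct Push/Merge Run: committer pushes and merged PRs (write context)
    branches: [master, v1-10-test]
  schedule:
    - cron: '0 2 * * *'              # Scheduled Run: nightly; the 02:00 UTC time is an assumption
```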

All runs consist of the same jobs, but the jobs behave slightly differently or are skipped in different run types. Those jobs often have a matrix run strategy which runs several different variations of the job (for example with different backend types/Python versions, or different types of tests to run); a minimal sketch of such a matrix follows the table below. Here is a summary of the run types with regard to the jobs they run:

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run |
| --- | --- | --- | --- | --- |
| Static checks 1 | Performs the first set of static checks | Yes | Yes | Yes * |
| Static checks 2 | Performs the second set of static checks | Yes | Yes | Yes * |
| Docs | Builds documentation | Yes | Yes | Yes * |
| Build Prod Image | Builds the production image | Yes | Yes | Yes * |
| Prepare Backport packages | Prepares Backport Packages for 1.10.* | Yes | Yes | Yes * |
| Pyfiles | Counts how many Python files changed in the change; used to determine if tests should be run | Yes | Yes (but it is not used) | Yes (but it is not used) |
| Tests | Runs all the combinations of Pytest tests for Python code | Yes (if pyfiles count > 0) | Yes | Yes * |
| Quarantined tests | Tests that are flaky and that we need to fix | Yes (if pyfiles count > 0) | Yes | Yes * |
| Requirements | Checks if requirement constraints in the code are up-to-date | Yes (fails if a requirement is missing) | Yes (fails if a requirement is missing) | Yes * (eager dependency upgrade; does not fail for changed requirements) |
| Push Prod image | Pushes production images to the GitHub Private Image Registry to cache the built images for the following runs | - | Yes | - |
| Push CI image | Pushes CI images to the GitHub Private Image Registry to cache the built images for the following runs | - | Yes | - |
| Tag Repo nightly | Tags the repository with the nightly tag (a lightweight tag that moves nightly) | - | - | Yes (triggers the DockerHub build for the public registry) |

* Builds all images from scratch
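As mentioned above, here is a minimal sketch of how a matrix job with the Pyfiles-style gating could look; the Python versions, backends and the test entrypoint are illustrative assumptions:

```yaml
# Illustrative matrix job: every backend/Python combination becomes a separate
# worker, and tests are skipped in PR runs when no Python files changed.
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.6, 3.7]          # versions are an assumption
        backend: [sqlite, mysql, postgres]  # backends are an assumption
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 2                    # needed so HEAD~1 exists for the diff below
      - name: Count changed Python files (Pyfiles-style check)
        id: pyfiles
        run: echo "::set-output name=count::$(git diff --name-only HEAD~1 | grep -c '\.py$' || true)"
      - name: Run the Pytest suite for this matrix cell
        if: steps.pyfiles.outputs.count != '0' || github.event_name != 'pull_request'
        run: ./scripts/ci/run_tests.sh      # hypothetical test entrypoint
        env:
          BACKEND: ${{ matrix.backend }}
          PYTHON_VERSION: ${{ matrix.python-version }}
```

The "Tag Repo nightly" job relies on a lightweight tag being force-moved on every scheduled run so that the DockerHub automated build notices the change; a sketch of such a step (the tag name follows the description above, the rest is an assumption):

```yaml
# Illustrative step: move the lightweight nightly tag so DockerHub rebuilds.
- name: Move the nightly tag to the current master
  if: github.event_name == 'schedule'
  run: |
    git tag -f master-nightly
    git push origin master-nightly --force
```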



Former GitLab proposal

This is an older version of the proposal, which chose GitLab CI. It turned out not to be workable because GitLab CI lacked the capability of running builds from a fork - it took them 8 months to discuss it and they have not rolled it out yet.


The proposal was to migrate to GitLab CI (Cloud version), running the jobs in a GKE auto-scaling Kubernetes cluster.


The architecture of the proposed solution is shown here:

The steps executed during the build:

1) Code is committed to GitHub and a PR is created (already done today)

2) Code from master commits is used to build the latest "master" image (already done today)

3) GitHub repo is mirrored to GitLab.org instance

4) GitLab CI uses Kubernetes Executor to run the jobs on GKE Kubernetes cluster

5) Each job has its own dind (Docker-In-Docker) engine

6) The dind (Docker-In-Docker) engine is used to build the latest Docker images including the latest sources (incrementally, using the master image from DockerHub as base)

7) The dind (Docker-In-Docker) engine is used to execute the tests 

8) Kind (Kubernetes-in-Docker) is used to run Kubernetes tests

9) GitLab reports build status back to GitHub.
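For the record, here is a minimal sketch of what a test job might have looked like in this setup; the runner tag, image versions and test entrypoint are illustrative assumptions:

```yaml
# Illustrative .gitlab-ci.yml fragment for the abandoned GitLab CI setup:
# a job on the Kubernetes executor with its own dind (Docker-In-Docker) engine.
test:
  image: docker:19.03
  services:
    - docker:19.03-dind          # each job gets its own Docker-In-Docker engine
  variables:
    DOCKER_HOST: tcp://docker:2375
    DOCKER_TLS_CERTDIR: ""       # plain TCP to the in-job dind service
  tags:
    - kubernetes                 # route the job to the GKE runners (assumed tag)
  script:
    - docker pull apache/airflow:master || true                      # incremental build on top of the master image
    - docker build --cache-from apache/airflow:master -t airflow-ci .
    - docker run airflow-ci ./scripts/ci/run_tests.sh                # hypothetical test entrypoint
```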


Since GitHub released its GitHub Actions, we finally decided to move to GitHub Actions. The migration is well in progress and it simplifies the setup a lot.



What problem does it solve?

  • Instability and speed of the current Travis CI
  • Lack of control over the resources used in Travis CI (queueing and machine size)
  • Being able to run a bigger matrix of builds

Why is it needed?

  • We need a stable and fast CI as it is an integral part of our workflow

Are there any downsides to this change?

  • Not really, apart from everyone switching to a different UI

Which users are affected by the change?

  • All contributors to Apache Airflow

How are users affected by the change? (e.g. DB upgrade required?)

  • They need to learn the new CI UI

Other considerations?

  • Being able to use a paid GCP account allows us to use other GCP services (storing and hosting artifacts, running the tests)

What defines this AIP as "done"?

  • We run the tests for several days using the new GitHub Actions setup