Status

State: Completed
JIRA: AIRFLOW-5029
In Release: 1.10.10


NOTE! An updated version of this architecture is kept in https://github.com/apache/airflow/blob/master/CI.rst



Motivation

We have recently started to experience a lot of problems with Travis CI. They are documented in https://lists.apache.org/thread.html/8bfee0f0a52b7b8b9ec63724c48f76a7002aeacf7c86c5a23b48f172@%3Cdev.airflow.apache.org%3E and in https://lists.apache.org/list.html?dev@airflow.apache.org

Considerations

  • We considered the cost of running the builds.
  • The system has to be easy to integrate with GitHub, including passing the status of the build back to GitHub.
  • The system should be self-maintainable, with as little special Development/Ops maintenance as possible.
  • Old Travis CI builds should keep working (contributors should be able to run builds from their own Travis CI forks as needed).

 What change do you propose to make?

GitHub Actions:

The architecture of the proposed solution is shown here (note that the image and the description below were updated on July 4, 2021 to reflect the latest changes after switching to ghcr.io, the GitHub Container Registry):

[Diagram: GitHub Actions CI]

The following components are part of the CI infrastructure:

  • Apache Airflow Code Repository - our code repository at https://github.com/apache/airflow
  • Apache Airflow Forks - forks of the Apache Airflow Code Repository from which contributors make Pull Requests
  • GitHub Actions (GA) - UI + execution engine for our jobs
  • GA CRON trigger - GitHub Actions CRON triggering our jobs
  • GA Workers - virtual machines running our jobs at GitHub Actions (max 20 in parallel)
  • GitHub Private Image Container Registry - image registry used as build cache for CI jobs. It is at ghcr.io
  • DockerHub Public Image Registry - publicly available image registry at DockerHub, used to pull base Python images and to keep the official released images of Airflow. It is at https://hub.docker.com/repository/docker/apache/airflow
  • DockerHub Build Workers - virtual machines running build jobs at DockerHub
  • Official Images (future) - these are official images that are prominently visible in DockerHub. We aim for our images to become official images so that you will be able to pull them with `docker pull apache-airflow`

CI run categories

The following CI job runs are currently executed for Apache Airflow, and each run has a different purpose and context.

- Pull Request Run - These runs result from PRs made from forks by contributors. Most builds for Apache Airflow fall into this category. They are executed in the context of the "Fork", not the main Airflow Code Repository, which means that they have only "read" permission to all the GitHub resources (container registry, code repository). This is necessary because the code in those PRs (including the CI job definition) might be modified by people who are not committers for the Apache Airflow Code Repository. The main purpose of those jobs is to check whether the PR builds cleanly, whether the tests run properly, and whether the PR is ready to review and merge. The runs use cached images from the Private GitHub Container Registry: CI and Production Images, as well as base Python images that are also cached in the Private GitHub Container Registry.

- Direct Push/Merge Run - These runs result from direct pushes done by the committers or from the merge of a Pull Request by the committers. They execute in the context of the Apache Airflow Code Repository and also have write permission for GitHub resources (container registry, code repository). The main purpose of the run is to check whether the code after merge still holds all the assertions, for example whether it still builds and all tests are green. This is needed because conflicting changes from multiple PRs might cause build and test failures after merge even if they do not fail in isolation. Also, the code in those runs is already reviewed and confirmed by the committers, so the runs can be used to do some housekeeping: for now they push the most recent image built in the PR to the GitHub Private Container Registry, which is our image cache for all the builds. Another purpose of those runs is to refresh the latest Python base images. Python base images are refreshed with varying frequency (usually once every few months, but sometimes several times per week) with the latest security and bug fixes. Those patch-level image releases can occasionally break Airflow builds (specifically Docker image builds based on those images), therefore in PRs we always use the latest "good" Python image that we store in the private GitHub cache.

The direct push/master builds do not use the registry cache to pull the Python images. Instead, they check DockerHub to see if there are newer Python images, so they try the latest images after they are released. If those images are fine (the CI Docker image builds and the tests pass), those jobs push the base images to the private GitHub Container Registry so that they can be used by subsequent PR runs.

- Scheduled Run - These runs result from a (nightly) triggered job, only for the main branch. The main purpose of the job is to check whether external dependency changes have affected the Apache Airflow code (for example, newly released transitive dependencies that fail the build). It also checks whether the Docker images can be built from scratch (again, to see whether some dependencies have changed, for example downloaded package releases). Another reason for the nightly build is that it tags the most recent master with the nightly-master tag so that the DockerHub build can pick up the moved tag and prepare a nightly public master build in the DockerHub registry. The v1-10-test branch images are built in DockerHub when pushing v1-10-stable manually.
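
The base image handling described above for Pull Request and Direct Push/Merge runs can be sketched roughly as follows. This is only an illustration of the decision logic, not the actual CI code: the function names, the `build_and_test` callable, and the image strings are hypothetical, and images are represented as plain strings.

```python
from typing import Callable, Optional


def select_base_image(is_pull_request: bool, cached_image: str, upstream_image: str) -> str:
    """Pick the Python base image a run builds on.

    Pull Request runs use the last known "good" image from the cache;
    Direct Push/Merge runs try whatever is currently published upstream.
    """
    if is_pull_request:
        return cached_image
    return upstream_image


def image_to_promote(
    cached_image: str,
    upstream_image: str,
    build_and_test: Callable[[str], bool],
) -> Optional[str]:
    """Return the image a Direct Push/Merge run should push to the cache, if any.

    `build_and_test` is a hypothetical callable that returns True when the
    CI image builds on top of the given base image and the tests pass.
    """
    if upstream_image == cached_image:
        return None  # nothing newer upstream, the cache is already current
    if build_and_test(upstream_image):
        return upstream_image  # new image is fine: make it the cached "good" image
    return None  # new image breaks the build: keep the old cached image


# Hypothetical image identifiers, for illustration only.
promoted = image_to_promote("python-base@old", "python-base@new", lambda image: True)
print(promoted)
```

The point of the split is that a broken upstream base image only affects Direct Push/Merge runs, while Pull Request runs keep building on the cached image until a newer one has passed.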

All runs consist of the same jobs, but the jobs behave slightly differently or are skipped in different runs. Here is a summary of the run types with regard to the jobs they run. Those jobs often have a matrix run strategy, which runs several variations of the job (for example, with a different Backend type, Python version, or type of tests to run).

| Job | Description | Pull Request Run | Direct Push/Merge Run | Scheduled Run (* Builds all images from scratch) |
|---|---|---|---|---|
| Static checks 1 | Performs first set of static checks | Yes | Yes | Yes * |
| Static checks 2 | Performs second set of static checks | Yes | Yes | Yes * |
| Docs | Builds documentation | Yes | Yes | Yes * |
| Build Prod Image | Builds production image | Yes | Yes | Yes * |
| Prepare Backport packages | Prepares Backport Packages for 1.10.* | Yes | Yes | Yes * |
| Pyfiles | Counts how many Python files changed in the change. Used to determine if tests should be run | Yes | Yes (but it is not used) | Yes (but it is not used) |
| Tests | Runs all the combinations of Pytest tests for Python code | Yes (if pyfiles count >0) | Yes | Yes * |
| Quarantined tests | Those are tests that are flaky and we need to fix them | Yes (if pyfiles count >0) | Yes | Yes * |
| Requirements | Checks if requirement constraints in the code are up-to-date | Yes (fails if missing requirement) | Yes. Fails if missing requirement | Yes *. Eager dependency upgrade. Does not fail for changed requirements |
| Pull Python from cache | Pulls Python base images from GitHub Private Image Registry to keep the last good Python image used in PRs | Yes | No | - |
| Push Python from cache | Pushes Python base images to GitHub Private Image Registry - checks if latest image is fine and pushes if so | - | Yes | - |
| Push Prod image | Pushes production images to GitHub Private Image Registry. This is to cache the built images for following runs. | - | Yes | - |
| Push CI image | Pushes CI images to GitHub Private Image Registry. This is to cache the built images for following runs. | - | Yes | - |
| Tag Repo nightly | Tags the repository with nightly tag. It is a lightweight tag that moves nightly | - | - | Yes. Triggers DockerHub build for public registry |
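
The summary above can also be read as a small lookup structure. The sketch below encodes it in Python purely for illustration; the constant names, the `JOBS` mapping, and the `jobs_for_run` helper are hypothetical and are not taken from the actual workflow definitions.

```python
from typing import Dict, List, Set

# Run categories, named as in the summary above.
PR = "Pull Request Run"
PUSH = "Direct Push/Merge Run"
SCHEDULED = "Scheduled Run"
ALL_RUNS: Set[str] = {PR, PUSH, SCHEDULED}

# Job name -> run categories in which the job is executed.
JOBS: Dict[str, Set[str]] = {
    "Static checks 1": ALL_RUNS,
    "Static checks 2": ALL_RUNS,
    "Docs": ALL_RUNS,
    "Build Prod Image": ALL_RUNS,
    "Prepare Backport packages": ALL_RUNS,
    "Pyfiles": ALL_RUNS,
    "Tests": ALL_RUNS,
    "Quarantined tests": ALL_RUNS,
    "Requirements": ALL_RUNS,
    "Pull Python from cache": {PR},
    "Push Python from cache": {PUSH},
    "Push Prod image": {PUSH},
    "Push CI image": {PUSH},
    "Tag Repo nightly": {SCHEDULED},
}

# Jobs that a Pull Request run skips when no Python files changed.
PYFILES_GATED: Set[str] = {"Tests", "Quarantined tests"}


def jobs_for_run(run_type: str, pyfiles_count: int) -> List[str]:
    """List the jobs executed for a run category, in summary order."""
    selected = []
    for job, run_types in JOBS.items():
        if run_type not in run_types:
            continue
        # The pyfiles count only matters for Pull Request runs.
        if run_type == PR and job in PYFILES_GATED and pyfiles_count == 0:
            continue
        selected.append(job)
    return selected


# Example: a documentation-only PR skips both test jobs.
print(jobs_for_run(PR, pyfiles_count=0))
```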

Note that, unlike in the previous architecture, we do not build/push images directly to DockerHub. The main reason is that we switched completely to ghcr.io for the cache, and the autobuild feature of DockerHub has been disabled due to abuse: https://www.docker.com/blog/changes-to-docker-hub-autobuilds/

The details about jobs and current state of the CI can be found in https://github.com/apache/airflow/blob/main/CI.rst

Former GitLab proposal

This is an old version of the proposal, which chose GitLab CI. It turned out not to be workable because GitLab lacked the capability of running builds from a fork. It took them 8 months to discuss it and they have not rolled it out yet.



The proposal is to migrate to GitLab CI (Cloud version), running the jobs in a GKE auto-scaling Kubernetes cluster.


The architecture of the proposed solution is shown here:

[Diagram: Gitlab CI builds architecture]

The steps executed during the build:

1) Code committed to GitHub, PR created (already done today)

2) Code from master commits is used to build latest "master" image (already done today)

3) GitHub repo is mirrored to GitLab.org instance

4) GitLab CI uses Kubernetes Executor to run the jobs on GKE Kubernetes cluster

5) Each job has its own dind (Docker-In-Docker) engine

6) The dind (Docker-In-Docker) engine is used to build latest Docker images including latest sources (incrementally, using master image from DockerHub as base)

7) The dind (Docker-In-Docker) engine is used to execute the tests 

8) Kind (Kubernetes-in-Docker) is used to run Kubernetes tests

9) GitLab reports build status back to GitHub.


Since GitHub released its GitHub Actions, we finally decided to move to GitHub Actions. It is well in progress and it simplifies the setup a lot.



What problem does it solve?

  • Instability and speed of current Travis CI
  • Lack of control we have over resources used in Travis CI (queueing and machine size)
  • Being able to run bigger matrix of builds

Why is it needed?

  • We need stable and fast CI as it is an integral part of our workflow

Are there any downsides to this change?

  • Not really, apart from everyone switching to a different UI

Which users are affected by the change?

  • All contributors to Apache Airflow

How are users affected by the change? (e.g. DB upgrade required?)

  • They need to learn a new CI UI

Other considerations?

  • We still need to work out a way to run the traffic for "external services - AIP-4 - kind of tests"

What defines this AIP as "done"?

  • We run the tests for several days using GitHub Actions