Status

State: Draft
Discussion Thread: https://lists.apache.org/thread.html/8bfee0f0a52b7b8b9ec63724c48f76a7002aeacf7c86c5a23b48f172@%3Cdev.airflow.apache.org%3E
JIRA: AIRFLOW-5029



Motivation

We have recently started to experience a lot of problems with Travis CI. They are documented in https://lists.apache.org/thread.html/8bfee0f0a52b7b8b9ec63724c48f76a7002aeacf7c86c5a23b48f172@%3Cdev.airflow.apache.org%3E and on the dev list at https://lists.apache.org/list.html?dev@airflow.apache.org . In summary:

  • Travis CI has little incentive to support OSS projects
  • Their service is deteriorating
  • We have little to no influence on fixing problems, even if we involve Apache Infrastructure
  • Apache Infrastructure actively encourages us to use our own solution and secure some funds (which we did)
  • Google initially donated 3000 USD for running the builds in GCP
  • Google is working on a long-term regular donation once we make it work and know how much funding we need
  • GitLab CI is open to supporting OSS projects with up to 50,000 minutes of builds per month. It is a Docker-first and Kubernetes-executor-capable CI system
  • We have great support from GitLab, including direct contacts with the GitLab CI team
  • The recent AIP-10 change (multi-layered and multi-stage official Airflow CI image) enabled Docker-first execution of all steps of our builds

Considerations

  • We considered the cost of running the builds (it seems that 3000 USD will be enough for several months).
    We can use preemptible instances and an auto-scaling Kubernetes cluster to handle our scenario. Details are to be worked out (we should focus on getting it up and running first, then optimise)
  • Google promised regular funding for the project
  • The system has to be easy to integrate with GitHub, including passing the status of the build back to GitHub
  • The system should be self-maintaining, with as little dedicated Development/Ops maintenance as possible.
  • Old Travis CI builds should keep working (contributors should be able to run builds from their own Travis CI forks or GitLab CI forks as needed).
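
To make the "passing the status of the build back to GitHub" requirement concrete, here is a minimal sketch of what such a status update looks like when sent to GitHub's commit status endpoint. It only builds the request and does not send it. The state mapping, the "ci/gitlab" context label and the function name are assumptions made for illustration; GitLab's own GitHub integration may already handle this.

```python
import json

GITHUB_API = "https://api.github.com"

# Hypothetical mapping from CI pipeline states to GitHub commit status states.
# GitHub accepts exactly four states: "error", "failure", "pending", "success".
STATE_MAP = {
    "pending": "pending",
    "running": "pending",
    "success": "success",
    "failed": "failure",
    "canceled": "error",
}


def build_status_request(owner, repo, sha, ci_state, pipeline_url):
    """Return (url, json_body) for a GitHub commit status update.

    The caller would POST json_body to url with an authorised token.
    """
    if ci_state not in STATE_MAP:
        raise ValueError(f"unknown CI state: {ci_state}")
    url = f"{GITHUB_API}/repos/{owner}/{repo}/statuses/{sha}"
    body = {
        "state": STATE_MAP[ci_state],
        "target_url": pipeline_url,            # link back to the CI pipeline page
        "description": f"Pipeline {ci_state}",
        "context": "ci/gitlab",                # assumed label shown in the PR checks
    }
    return url, json.dumps(body)
```

A status built this way is attached to a single commit SHA, so every push to a PR gets its own result in the GitHub UI.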

What change do you propose to make?

The proposal is to migrate to GitLab CI (Cloud version), running the jobs in a GKE auto-scaling Kubernetes cluster.

The architecture of the proposed solution is shown here:

[Diagram: Gitlab CI builds architecture]

The steps executed during the build:

1) Code is committed to GitHub and a PR is created (already done today)

2) Code from master commits is used to build the latest "master" image (already done today)

3) The GitHub repo is mirrored to the GitLab.org instance

4) GitLab CI uses the Kubernetes Executor to run the jobs on the GKE Kubernetes cluster

5) Each job has its own dind (Docker-In-Docker) engine

6) During the build, Kaniko (a tool developed by Google) is used to securely build the latest Docker images, including the latest sources (incrementally, using the master image from DockerHub as base)

Note: each commit will have its own image (identified by the commit SHA)

6) The images are stored in GCP Container Registry (very fast, nearly local, image push/pull, no need to push images outside of GCP)

7) The Kaniko-built image is used by build tasks to execute static code checks and tests

7) The dind (Docker-In-Docker) engine is used to execute the tests 

8) Kind (Kubernetes-in-Docker) is used to run Kubernetes tests

8) Kubernetes tests are executed using the Kubernetes Cluster from GKE, no minikube setup is needed

9) GitLab reports build status back to GitHub.
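
The per-commit image naming in the steps above can be sketched as follows. This is only an illustration: the project ID, image name and Dockerfile path are hypothetical placeholders, and the function names are made up for this example. The argument list uses Kaniko executor flags (`--context`, `--dockerfile`, `--destination`, `--cache`), but the exact invocation used by the real build may differ.

```python
import re


def image_for_commit(project_id, image_name, commit_sha, registry="gcr.io"):
    """Return a Container Registry image reference tagged with the commit SHA."""
    # Accept abbreviated or full lowercase hex SHAs only.
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError(f"not a commit SHA: {commit_sha!r}")
    return f"{registry}/{project_id}/{image_name}:{commit_sha}"


def kaniko_args(context_dir, destination, dockerfile="Dockerfile"):
    """Build an argument list for the Kaniko executor (sketch, not the real job)."""
    return [
        f"--context={context_dir}",
        f"--dockerfile={dockerfile}",
        f"--destination={destination}",
        "--cache=true",  # reuse cached layers so builds stay incremental
    ]


if __name__ == "__main__":
    # "my-gcp-project" and "airflow-ci" are placeholder names.
    image = image_for_commit("my-gcp-project", "airflow-ci", "0123abc")
    print(image)
    print(kaniko_args("/workspace", image))
```

Tagging by commit SHA means later jobs in the same pipeline can pull exactly the image that was built for the commit under test.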


What problem does it solve?

  • Instability and slowness of the current Travis CI builds
  • Lack of control over the resources used in Travis CI (queueing and machine size)
  • Inability to run a bigger matrix of builds

Why is it needed?

  • We need a stable and fast CI as it is an integral part of our workflow

Are there any downsides to this change?

  • Not really, apart from everyone having to switch to a different UI

Which users are affected by the change?

  • All contributors to Apache Airflow

How are users affected by the change? (e.g. DB upgrade required?)

  • They need to learn a new CI UI

Other considerations?

  • Being able to use a paid GCP account allows us to use other GCP services (storing and hosting artifacts, running the tests)

What defines this AIP as "done"?

  • We run the tests for several days using the GitLab + GKE setup