Status
Motivation
The current official Airflow image is rebuilt from scratch every time a new commit is pushed to the repo. It is a "mono-layered" image that uses neither Docker's layer caching nor multi-stage builds.
A mono-layered image means that builds after even small changes take as long as a full build, rather than utilising caching to rebuild only what is needed.
With a multi-layered approach and caching enabled in DockerHub, we can optimise the image so that only the layers that changed need to be downloaded. This enables users of the images to download only incremental changes, and opens up a number of options for how such an incremental build/download process can be utilised:
- Multi-layered images can be used as a base for AIP-7 Simplified development workflow - where locally downloaded images are used during development and are quickly updated incrementally with newly added dependencies.
- Multi-layered images, being part of the "airflow" project, can be used to run Travis CI integration tests (simplifying the idea described in Optimizing Docker Image Workflow). Having incremental builds will allow the DockerHub registry to be used as a source of base images (pulled before the build), so that the final image used for test execution can be built locally in an incremental way.
- While initially the images are not meant to be used in production, multi-staging, variable build arguments, and multiple layers can be used to produce a production-ready Airflow image with DAGs pre-baked into the image - thus bringing Airflow closer to being Kubernetes-native. This has been discussed as a potential future improvement in AIP-12 Persist DAG into DB.
- Ideally, both the Airflow and CI images should be maintained in a single place - a "source of truth" - to ease maintenance and development. Currently they are maintained in separate repositories and have potentially different dependencies and build processes. This also makes it difficult to add your own dependencies during development, as there is no regular, development-friendly process to update the CI image with new dependencies.
Considerations
In the PR https://github.com/apache/airflow/pull/4543 the current mono-layered Docker image has been reimplemented as a multi-layered one. The PR uses the "hooks/build" hook that the DockerHub build process invokes to control caching and the build itself. Thanks to that we can build different variants of the images (the main - slim - Airflow image, the CI image with more dependencies, and a wheel-cache image for efficient caching of PIP dependencies).
Assumptions
- There are two images to be built:
  - "Airflow" image - a slim image with only the necessary Airflow dependencies
  - "CI" image - a fat image with additional dependencies necessary for CI tests
- there are separate images for each Python version (currently 2.7, 3.5, 3.6)
- each image uses python-X.Y-slim as its base
- all stages are defined in a single multi-stage Dockerfile (see the sketch after this list)
- Standard Docker build: it is possible to build the main Airflow image by issuing the "docker build ." command. It is not optimised for DockerHub cache reuse, but it builds locally.
- Scripted Docker build: the hooks/build script builds the image utilising the DockerHub cache - it pulls the images from the registry and passes them to "docker build" as cache sources (via --cache-from). This is mainly useful for local development.
- binary/apt dependencies are built as separate stages - so that whole cached images with main/CI dependencies can be used as cache sources
- the builds are versioned - airflow 2.0.0.dev0 images are different than airflow 2.0.1.dev0 images
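A minimal sketch of how such a multi-stage Dockerfile can be structured. Stage names follow the table in "Stages of the image" below; the apt package lists and paths are illustrative assumptions, not the exact contents of the PR:

```dockerfile
# The base Python image is parameterised, so the same Dockerfile serves
# all supported Python versions (2.7, 3.5, 3.6).
ARG PYTHON_BASE_IMAGE="python:3.6-slim"
# Selects which apt-deps stage the main stage is built from:
# "airflow-apt-deps" for the slim Airflow image,
# "airflow-ci-apt-deps" for the fat CI image.
ARG APT_DEPS_IMAGE="airflow-apt-deps"

FROM ${PYTHON_BASE_IMAGE} AS airflow-apt-deps
# Vital apt dependencies needed to build and run Airflow (illustrative list).
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libssl-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

FROM airflow-apt-deps AS airflow-ci-apt-deps
# Additional dependencies needed only by CI tests (illustrative list).
RUN apt-get update \
    && apt-get install -y --no-install-recommends git openjdk-8-jdk-headless \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

FROM airflow-ci-apt-deps AS wheel-cache
# Pre-compile wheels for all PIP dependencies, so later stages (and later
# builds, which pull this image from DockerHub as "wheel-cache-master")
# can install them without downloading and compiling from scratch.
WORKDIR /opt/airflow
COPY setup.py setup.cfg /opt/airflow/
COPY airflow/version.py /opt/airflow/airflow/version.py
RUN pip wheel --wheel-dir=/cache .

FROM ${APT_DEPS_IMAGE} AS main
# Main stage used for both the Airflow and the CI image - which one is
# built is controlled by the APT_DEPS_IMAGE build argument.
COPY --from=wheel-cache /cache /cache
# ... source copying, npm build and pip install layers follow here
```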
Changes that trigger rebuilds
The changes below are described starting from the most frequent ones - i.e. starting from the end of the Dockerfile and going back towards its beginning.
- apt and pip dependencies: they are "upgraded" as the last part of the build (after sources are added) - thus an upgrade to the latest available versions is triggered every time the sources change (utilising the cache from previous installations).
- source changes do not invalidate previously installed packages from apt/pip/npm. They only trigger the pip/apt upgrades explained above.
- changes to the www sources trigger pre-compiling the web page for production (npm run prod) and everything above.
- changing package.json or package-lock.json triggers reinstallation of all npm packages (npm ci) and everything above.
- changing any of the setup.py-related files triggers reinstallation of all pip packages and everything above. In the case of a CI build, previously compiled wheel packages from the wheel image are used to install the dependencies (saving the time needed to download and compile packages).
- changing the wheel cache invalidates everything above
- for CI builds, changing the CI apt dependencies triggers reinstallation of those dependencies and everything above
- changing the Airflow apt dependencies triggers reinstallation of those dependencies and everything above
- the whole build process can be forced by changing one line in the Dockerfile (the FORCE_REINSTALL_ALL_DEPENDENCIES argument - see the sketch after this list)
- a new stable Python base image triggers a rebuild of the whole image
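A sketch of the corresponding layer ordering inside the main stage, which produces the invalidation behaviour described above. The paths and install commands are illustrative assumptions; the layer order matches the table in "Layers in the main image" below:

```dockerfile
FROM ${APT_DEPS_IMAGE} AS main

WORKDIR /opt/airflow

# Changing this argument's value in the Dockerfile invalidates every layer
# below it, forcing reinstallation of all dependencies.
ARG FORCE_REINSTALL_ALL_DEPENDENCIES="not-forced"

# Wheel cache from the master image (empty for Airflow builds).
COPY --from=wheel-cache /cache /cache

# PIP configuration first - the "pip install" below re-runs only when the
# setup.py-related files change.
COPY setup.py setup.cfg /opt/airflow/
COPY airflow/version.py /opt/airflow/airflow/version.py
RUN pip install --find-links=/cache -e .

# NPM configuration next - "npm ci" re-runs only when the locked
# dependency versions change.
COPY airflow/www/package.json airflow/www/package-lock.json /opt/airflow/airflow/www/
RUN cd airflow/www && npm ci

# www sources - production assets are recompiled only when they change.
COPY airflow/www /opt/airflow/airflow/www
RUN cd airflow/www && npm run prod

# All remaining sources - the most frequently changing layer.
COPY . /opt/airflow

# Upgrade apt and pip packages to the latest available versions on every
# source change, reusing the cache from the previous installation.
RUN apt-get update && apt-get upgrade -y \
    && apt-get clean && rm -rf /var/lib/apt/lists/* \
    && pip install --upgrade --find-links=/cache -e .
```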
Stages of the image
These are the stages of the image as defined in the Dockerfile. In the table below:
- X.Y - Python version (2.7, 3.5 or 3.6 currently)
- VERSION - Airflow version (e.g. v2.0.0.dev0)
No. | Stage | Description | Labels in DockerHub | Airflow build dependencies | CI build dependencies
---|---|---|---|---|---
1 | python | Base Python image | python-X.Y-slim | - | -
2 | airflow-apt-deps | Vital Airflow apt dependencies | latest-X.Y-apt-deps-VERSION | 1 | 1
3 | airflow-ci-apt-deps | Additional CI image dependencies | latest-X.Y-ci-apt-deps-VERSION | [Not used] | 2
4 | wheel-cache-master | Master wheel cache built on DockerHub from the latest master for faster PIP installs | latest-X.Y-wheelcache-VERSION | [Not used] | 3
5 | wheel-cache | Currently built wheel cache (for future builds) | latest-X.Y-wheelcache-VERSION | [Not used] | 3
6 | main | Main Airflow sources build. Used for both the Airflow and the CI build | Separate labels for Airflow builds and CI builds | 2 | 2, plus the /cache folder with wheels from image 4
Dependencies between stages
Effectively, the images we create have the dependencies described above. In case of Dockerfile changes, Docker's multi-stage mechanism takes care of rebuilding only those stages that need to be rebuilt - changes in a stage trigger rebuilds only in the stages that depend on it.
Layers in the main image
The main image has a number of layers that make the image rebuild incrementally, depending on what changed in the repository since the previous build. Docker's build mechanisms (context/cache invalidation) determine whether the subsequent layers should be invalidated and rebuilt.
No. | Layer | Description | Trigger for rebuild | Airflow build behaviour | CI build behaviour
---|---|---|---|---|---
1 | Wheel cache master | /cache folder with cached wheels from the previous build | Rebuild of the wheel cache source | Empty wheel cache used to minimise the size of the image | Wheel cache built in the latest DockerHub "master" image is used
2 | PIP configuration | setup.py and related files (version.py etc.) | Updated dependencies for PIP | Copy setup.py-related files to context | Copy setup.py-related files to context
3 | PIP install | PIP installation | Previous layer change | All PIP dependencies downloaded and installed | PIP dependencies installed from the wheel cache - new dependencies downloaded and installed
4 | NPM package configuration | package.json and package-lock.json | Updated dependencies for NPM | Copy package files to context | Copy package files to context
5 | npm ci | Installs locked dependencies from NPM | Previous layer change | All NPM dependencies downloaded and installed | All NPM dependencies downloaded and installed
6 | www files | airflow/www - all files | Update of any of the www files | Copy www files to context | Copy www files to context
7 | npm run prod | Prepares production javascript packaging for the webserver | Previous layer change | Javascript packages prepared | Javascript packages prepared
8 | airflow sources | Copy all sources to context | Any change in sources | Copy sources to context | Copy sources to context
9 | apt-get upgrade | Upgrading apt dependencies | Previous layer change | All apt packages upgraded to latest stable versions | All apt packages upgraded to latest stable versions
10 | pip install | Reinstalling PIP dependencies | Previous layer change | PIP packages are potentially upgraded | All PIP packages are upgraded
This layer structure results in the following behaviours:
- in case the wheel image is changed: PIP packages + NPM packages + NPM compile + sources are reinstalled for the CI build (nothing changes for the Airflow build)
- in case the PIP configuration is changed: PIP packages + NPM packages + NPM compile + sources are reinstalled. For the Airflow build, all PIP packages are downloaded and installed; for the CI build, the wheel cache is used as the base for installation (faster)
- in case the NPM configuration is changed: NPM packages + NPM compile + sources are reinstalled
- in case any of the www files are changed: NPM compile + sources are reinstalled
- in case of any other source change: sources are reinstalled
Different types of builds
The images for Airflow are built for several scenarios - the "hooks/build" script with accompanying environment variables controls which images are built in each of those scenarios. The last six columns of the table indicate which images are prepared during the build (controlled by environment variables):
Scenario | Trigger | Purpose | Cache | Frequency | Pull from DockerHub | Push to DockerHub | Apt deps | CI Apt deps | Master Wheelcache | Local wheelcache | Airflow | CI
---|---|---|---|---|---|---|---|---|---|---|---|---
DockerHub build for master branch | A commit is merged to "master" | Build and push reference images that are used as cache for subsequent builds | From master | Several times per day | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Local developer build | Triggered by the user | Build when a developer adds dependencies or downloads new code and prepares the development environment | From local images (pulled initially) unless cache is disabled | Once per day | First time or when requested | When requested and the user is logged in | Yes | Yes | Yes | Yes | |
Google Compute Engine Build Machine | Manual build | First manual build to populate the DockerHub registry faster | No cache | First build | No | Yes | Yes | Yes | Yes | Yes | Yes |
CI build | A commit is pushed to any branch | Builds the image that is used to execute CI tests for commits pushed by developers | From master | Several times an hour | Yes | No | Yes | Yes | Yes | | |
Build timings for different scenarios
These timings were measured during tests. Times are given as MM:SS.
The "Monolayer" rows show timings for the original mono-layered builds, for comparison with the incremental build times.
Where built | Images | No source change | Sources changed | WWW sources changed | NPM packages changed | PIP packages changed | CI Apt deps changed | Apt deps changed | Full build (from scratch) | Comments
---|---|---|---|---|---|---|---|---|---|---
DockerHub (includes pull of cache) | Airflow + CI | 8:20 | 8:40 | 11:01 | 13:40 | 33:30 | 38:45 | 44:00 | 44:00 | Delays on DockerHub
Travis CI | CI | 3:24 | 3:32 | 3:30 | 3:47 | 5:45 | 7:39 | 8:24 | 8:26 | Typical timing for CI builds
Cloud Build * (includes pull of cache) | CI | 2:53 | 3:00 | 3:07 | 3:31 | 4:40 | 6:44 | 8:33 | 9:35 |
Google Compute Engine ** (hooks/build) | Airflow + CI | 1:13 | 1:23 | 1:43 | 2:26 | 10:20 | 12:30 | 13:09 | 16:35 |
Google Compute Engine ** (only CI build using breeze) | CI | 0:10 (no rebuild) | 0:10 (no rebuild) | 0:10 (no rebuild) | 1:40 | 3:14 | 5:30 | 8:40 | 7:22 | More time is needed to pull the cache than to build from scratch
Google Compute Engine ** ("docker build . --build-arg APT_DEPS_IMAGE=airflow-ci-apt-deps") | CI | 0:02 | 0:13 | 0:23 | 1:00 | 4:36 | 6:05 | 7:50 | 10:28 |
Google Compute Engine ** ("docker build .") | Airflow | 0:02 | 0:13 | 0:23 | 1:00 | 4:35 | 5:10 | 7:42 | 8:22 |
Google Compute Engine ** (Monolayer, Cassandra fix ****) | Airflow | 0:01 | 4:23 | 4:23 | 4:23 | 4:23 | 4:23 | 4:23 | 5:30 | The Cassandra fix is the biggest improvement; the rest is more-or-less incremental. The biggest change is in PIP packages.
Google Compute Engine ** (Monolayer) | Airflow | 0:01 | 9:07 | 9:07 | 9:07 | 9:07 | 9:07 | 9:07 | 10:43 |
Local Machine *** (only CI build using breeze) | CI | 0:05 (no rebuild) | 0:05 (no rebuild) | 0:05 (no rebuild) | 1:36 | 4:20 | 7:13 | 8:07 | | Typical timing for local development
Local Machine *** ("docker build . --build-arg APT_DEPS_IMAGE=airflow-ci-apt-deps") | CI | 0:02 | 0:15 | 0:25 | 0:44 | 4:07 | 6:22 | 7:43 | 10:20 |
Local Machine *** ("docker build .") | Airflow | 0:09 | 0:20 | 0:29 | 0:56 | 4:28 | 3:30 | 8:09 | 10:18 |
Local Machine *** (Monolayer, Cassandra fix ****) | Airflow | 0:30 | 4:34 | 4:34 | 4:34 | 4:34 | 4:34 | 4:34 | 5:56 |
Local Machine *** (Monolayer) | Airflow | 0:27 | 8:26 | 8:26 | 8:26 | 8:26 | 8:26 | 8:26 | 9:26 |
* Cloud Build - M8 High CPU - 3 Python versions built in parallel on a single instance
** Google Compute Engine: custom (8 vCPUs, 31 GB memory)
*** Local Machine: MacBook Pro (15-inch, 2017), 2.9 GHz Intel Core i7, 4 cores. Using a MacBook impacts context-sending times - it takes significantly longer to send the build context to the Linux VM which Docker uses on Mac.
**** Cassandra fix - installing the cassandra driver takes a lot of time, because it compiles the Cython-based driver (which is good for performance). The Cassandra fix speeds up the build by skipping the Cython optimisations. The multi-layer images are built with the Cassandra fix.
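The fix boils down to setting an environment variable understood by the cassandra-driver build before the PIP dependencies are installed - a minimal Dockerfile sketch:

```dockerfile
# Cassandra fix: skip compilation of the Cython-optimised parts of
# cassandra-driver. The driver is slower at runtime, but "pip install"
# no longer spends minutes compiling C extensions.
ENV CASS_DRIVER_NO_CYTHON=1
RUN pip install cassandra-driver
```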
Image size comparison
Image | Size
---|---
Airflow monolayer image | 1.2GB
Airflow multi-layer image | 1.2GB
CI multi-layer image |
Appendices
The results of the initial measurements of layer image sizes are shown in the image size comparison above. They prove that the multi-layered image size is comparable to the mono-layered one, and that there are significant download-traffic savings in case of incremental builds.