Status
State: Draft
Discussion thread:
JIRA: AIRFLOW-3718
Motivation
The current official Airflow image is rebuilt from scratch every time a new commit is pushed to the repo. It is "mono-layered" and does not take advantage of Docker's multi-layer architecture.
This means that builds take longer and that users who download the image regularly always download the full image. With a multi-layered approach and caching enabled in Docker Hub, we can optimise this so that only the layers that changed are downloaded.
Considerations
In the PR https://github.com/apache/airflow/pull/4543 the current mono-layered Docker image has been reimplemented as a multi-layered one as a proof of concept. It has been used as the basis for the calculations.
A multi-layered image with caching enabled in Docker Hub might save a lot of build time and a lot of download time for users. Some details about the sizes of the mono-layered and multi-layered images are shown below.
Important assumption:
Caching must be enabled in Docker Hub
The tables below show when a given layer is rebuilt/downloaded.
Details for Mono-layered Docker image for Airflow
Implemented in https://github.com/apache/airflow/commit/e2c22fe70a488feea0cfecde890c20f8c984c09c
Available to pull at:
docker pull potiuk/airflow-monodocker:latest
Only significant layers are shown:
Layer | Size | When rebuilt/downloaded |
python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
Airflow Sources | 237 MB | After every commit |
Airflow installed binaries (all - apt and pip installed together) | 763 MB | After every commit |
Total: 1138 MB
Details for Multi-layered Docker image of Airflow
POC implemented in https://github.com/apache/airflow/pull/4543
Available to pull at:
docker pull potiuk/airflow-layereddocker:latest
Only significant layers are shown:
Layer | Size | When rebuilt/downloaded |
python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
apt-get install core build deps | 118 MB | Only when core dependencies change or when we force fresh build (extremely rare) |
apt-get install extra deps | 155 MB | Only when extra deps change (extremely rare) |
pip install deps (setup files only, no Airflow sources) | 523 MB | Only when setup.py changes (every few weeks usually) |
copy airflow sources | 176 MB | After every commit |
Install extra airflow deps just in case | 6 MB | After every commit |
Total: 1116 MB
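The two totals can be cross-checked with a short calculation. This is only an illustrative sketch: the dictionary and key names are hypothetical labels, and the sizes are copied from the two tables above.

```python
# Layer sizes in MB, copied from the two tables above.
# Dictionary and key names are illustrative labels, not taken from the PR.
MONO_LAYERS_MB = {
    "python:3.6-slim base layers": 138,
    "airflow sources": 237,
    "installed binaries (apt + pip)": 763,
}

MULTI_LAYERS_MB = {
    "python:3.6-slim base layers": 138,
    "apt core build deps": 118,
    "apt extra deps": 155,
    "pip install deps": 523,
    "airflow sources": 176,
    "extra airflow deps": 6,
}

mono_total = sum(MONO_LAYERS_MB.values())
multi_total = sum(MULTI_LAYERS_MB.values())

print(f"mono-layered:  {mono_total} MB")
print(f"multi-layered: {multi_total} MB")
print(f"difference:    {mono_total - multi_total} MB")
```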
It turns out that the multi-layered image is even slightly smaller than the mono-layered one. But size is not the only benefit of the multi-layered image. Users who download the mono-layered image semi-frequently have to download the sources and installed-binaries layers (about 1 GB) pretty much every time, whereas with the multi-layered approach they only need to pull the layers that changed. The size of that incremental download depends on whether setup.py dependencies were updated or whether all dependencies were forced to be rebuilt from scratch.
Simulation of downloads for a user that pulls the image regularly
Here is a simulation showing how much data users will download when pulling the Airflow image semi-frequently (twice a week).
Assumptions:
A user downloads a new image twice a week.
setup.py is updated every two weeks.
Commits happen daily.
A rebuild from scratch is forced every 4 weeks to account for changed dependencies.
Mono-layered downloads:
First download: 1.138 GB
All other downloads: 1 GB
Multi-layered downloads:
First download: 1.116 GB
Download if only sources changed (no setup.py change): 182 MB = 176 MB + 6 MB
Download if setup.py changed: 705 MB = 176 MB + 523 MB + 6 MB
Download if apt-get dependencies are forced to rebuild: about 1 GB (978 MB = 118 MB + 155 MB + 523 MB + 176 MB + 6 MB)
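The incremental sizes above follow from how Docker layer caching behaves: once a layer changes, that layer and every layer built after it must be rebuilt and pulled again. The sketch below illustrates this with the multi-layered sizes from the table. The layer names and the `download_mb` helper are hypothetical, chosen only for this example.

```python
# Ordered layers of the multi-layered image (MB), from the table above.
# Short names are illustrative; they do not appear in the Dockerfile.
LAYERS_MB = [
    ("base", 138),
    ("apt_core", 118),
    ("apt_extra", 155),
    ("pip_deps", 523),
    ("sources", 176),
    ("extra_deps", 6),
]


def download_mb(first_changed: str) -> int:
    """Sum the sizes of the first changed layer and every layer after it."""
    names = [name for name, _ in LAYERS_MB]
    start = names.index(first_changed)
    return sum(size for _, size in LAYERS_MB[start:])


sources_only = download_mb("sources")   # sources + extra deps
setup_py = download_mb("pip_deps")      # pip deps + sources + extra deps
forced_apt = download_mb("apt_core")    # everything except the base image

print(sources_only, setup_py, forced_apt)
```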
User download size pattern:
Week | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | Total (GB) |
Sources change | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |
Setup.py changes | x | | | | x | | | | x | | | | x | | | | |
Forced dependencies | x | | | | | | | | x | | | | | | | | |
Monolayered (GB) | 1.14 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 16.14 |
Multilayered (GB) | 1.12 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 1 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 5.7 |
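The download pattern can be reproduced with a small simulation. This is a sketch under the stated assumptions only: two pulls per week for 8 weeks, setup.py changing every two weeks, and a forced rebuild every four weeks. It additionally assumes that each of those changes is picked up on the first pull of the period, which matches the multi-layered row above. All constant and function names are hypothetical, and the per-download sizes are the rounded GB values from the table.

```python
# Per-download sizes in GB, rounded as in the table above.
MONO_FIRST, MONO_REPEAT = 1.14, 1.0
MULTI_FIRST, MULTI_SOURCES, MULTI_SETUP, MULTI_FORCED = 1.12, 0.18, 0.71, 1.0

WEEKS = 8
PULLS_PER_WEEK = 2
SETUP_EVERY_WEEKS = 2    # setup.py changes every two weeks
FORCED_EVERY_WEEKS = 4   # forced rebuild from scratch every four weeks


def multi_size(pull: int) -> float:
    """Size of the multi-layered download for the given zero-based pull."""
    if pull == 0:
        return MULTI_FIRST
    if pull % (FORCED_EVERY_WEEKS * PULLS_PER_WEEK) == 0:
        return MULTI_FORCED
    if pull % (SETUP_EVERY_WEEKS * PULLS_PER_WEEK) == 0:
        return MULTI_SETUP
    return MULTI_SOURCES


pulls = range(WEEKS * PULLS_PER_WEEK)
mono = [MONO_FIRST if p == 0 else MONO_REPEAT for p in pulls]
multi = [multi_size(p) for p in pulls]

mono_total = round(sum(mono), 2)
multi_total = round(sum(multi), 2)

print(f"mono-layered total:  {mono_total} GB")
print(f"multi-layered total: {multi_total} GB")
```

Changing `PULLS_PER_WEEK` or the two interval constants gives a quick way to see how the comparison shifts for users with a different pull frequency.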
Conclusions
- The multi-layered image is slightly smaller than the mono-layered one, so it is better even when you download it only once.
- Downloading the image regularly is far cheaper with the multi-layered image: for the simulated user it is 5.7 GB versus about 16.1 GB over the course of 8 weeks.
Sources for calculation
Mono-layered image:
docker history potiuk/airflow-monodocker:latest
IMAGE CREATED CREATED BY SIZE COMMENT
711f22148f14 28 minutes ago /bin/sh -c #(nop) CMD ["--help"] 0B
1106894d8e9c 28 minutes ago /bin/sh -c #(nop) ENTRYPOINT ["/entrypoint.… 0B
a65bc0db5cdb 28 minutes ago /bin/sh -c #(nop) COPY file:22d6c0f397f65528… 907B
2f96d1ca713e 28 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 0B
63056a97131b 28 minutes ago /bin/sh -c #(nop) WORKDIR /usr/local/airflow 0B
79d60b1912b3 28 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 763MB
b23c2a77ce0f 37 minutes ago /bin/sh -c #(nop) WORKDIR /opt/airflow 0B
23a64ce057ce 37 minutes ago /bin/sh -c #(nop) ARG APT_DEPS=freetds-dev … 0B
4a79499a26c0 37 minutes ago /bin/sh -c #(nop) ARG buildDeps=freetds-dev… 0B
8ac987ed4125 37 minutes ago /bin/sh -c #(nop) ARG PYTHON_DEPS= 0B
327f2bc94551 37 minutes ago /bin/sh -c #(nop) ARG AIRFLOW_DEPS=all 0B
b387dffb0fec 37 minutes ago /bin/sh -c #(nop) ARG AIRFLOW_HOME=/usr/loc… 0B
2ce682937e1d 37 minutes ago /bin/sh -c #(nop) COPY dir:b90f1d3f3c97b3b89… 237MB
81f4c012cf1f 2 weeks ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 weeks ago /bin/sh -c set -ex; savedAptMark="$(apt-ma… 7.13MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=18… 0B
<missing> 2 weeks ago /bin/sh -c cd /usr/local/bin && ln -s idle3… 32B
<missing> 2 weeks ago /bin/sh -c set -ex && savedAptMark="$(apt-… 69.2MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.6.8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 2 weeks ago /bin/sh -c apt-get update && apt-get install… 6.48MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:6d6f6f123e45697d3… 55.3MB
Multi-layered image:
docker history potiuk/airflow-layereddocker:latest
IMAGE CREATED CREATED BY SIZE COMMENT
964458a837c4 13 minutes ago /bin/bash -c #(nop) CMD ["--help"] 0B
29934849ea6d 13 minutes ago /bin/bash -c #(nop) ENTRYPOINT ["/entrypoin… 0B
25aa1e37139b 13 minutes ago /bin/bash -c #(nop) COPY file:22d6c0f397f655… 907B
303f291f5588 13 minutes ago |4 ADDITIONAL_PYTHON_DEPS= AIRFLOW_EXTRAS=al… 0B
2c9545351f2b 13 minutes ago /bin/bash -c #(nop) ARG ADDITIONAL_PYTHON_D… 0B
5c51286365ff 13 minutes ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 147kB
0956060680a5 13 minutes ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 6.04MB
a448582458c1 13 minutes ago /bin/bash -c #(nop) COPY dir:1c24e93ef026646… 176MB
19a926abcdaf 13 minutes ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 523MB
e6fdf3881500 18 minutes ago /bin/bash -c #(nop) WORKDIR /opt/airflow 0B
351c1ae84700 18 minutes ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_AIR… 0B
589a353b806b 18 minutes ago /bin/bash -c #(nop) COPY file:143db2e76b8f16… 1.26kB
8d6e37aae712 18 minutes ago /bin/bash -c #(nop) COPY file:590340f7066102… 3.04kB
4052dc91ecb1 18 minutes ago /bin/bash -c #(nop) COPY file:3e78814fb55a47… 838B
f6fd5aa7f53d 18 minutes ago /bin/bash -c #(nop) COPY file:53d0bc9002b31a… 29.6kB
53a367472930 18 minutes ago /bin/bash -c #(nop) COPY multi:8bb5ed331b460… 14.2kB
7d28e9f76918 18 minutes ago /bin/bash -c #(nop) ENV SLUGIFY_USES_TEXT_U… 0B
3c5cebbef413 18 minutes ago /bin/bash -c #(nop) ENV CASS_DRIVER_NO_CYTH… 0B
1e011ecb668f 18 minutes ago /bin/bash -c #(nop) ENV CASS_DRIVER_BUILD_C… 0B
a30c11b58de1 18 minutes ago /bin/bash -c #(nop) ARG CASS_DRIVER_NO_CYTH… 0B
0d44f2da664b 26 minutes ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_ALL… 0B
fd9c1b1ebf05 26 minutes ago /bin/bash -c #(nop) ARG AIRFLOW_EXTRAS=all 0B
3845cd179711 26 minutes ago |1 AIRFLOW_HOME=/usr/local/airflow /bin/bash… 0B
9b190f1b4d13 26 minutes ago /bin/bash -c #(nop) ARG AIRFLOW_HOME=/usr/l… 0B
c4870f9907a5 4 days ago /bin/bash -c apt-get update && apt-get i… 155MB
e2249492ef25 4 days ago /bin/bash -c apt-get update && apt-get i… 118MB
34bf4afb2b2c 4 days ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_APT… 0B
fa2782294cce 4 days ago /bin/bash -c #(nop) ENV DEBIAN_FRONTEND=non… 0B
435a88072613 4 days ago /bin/bash -c #(nop) SHELL [/bin/bash -c] 0B
81f4c012cf1f 2 weeks ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 weeks ago /bin/sh -c set -ex; savedAptMark="$(apt-ma… 7.13MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=18… 0B
<missing> 2 weeks ago /bin/sh -c cd /usr/local/bin && ln -s idle3… 32B
<missing> 2 weeks ago /bin/sh -c set -ex && savedAptMark="$(apt-… 69.2MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.6.8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 2 weeks ago /bin/sh -c apt-get update && apt-get install… 6.48MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:6d6f6f123e45697d3… 55.3MB
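The 138 MB figure used for the python:3.6-slim layers can be recovered from the SIZE column of the history output above. The sketch below is a hypothetical helper, not part of the PR. It assumes that `docker history` prints human-readable sizes in decimal units (1 kB = 1000 B), and it works on size strings copied by hand rather than calling Docker.

```python
import re

# Decimal multipliers; assumes docker prints SI units (1 kB = 1000 B).
UNITS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9}


def size_to_bytes(text: str) -> float:
    """Convert a size string such as '7.13MB' or '907B' to bytes."""
    match = re.fullmatch(r"([\d.]+)\s*(B|kB|MB|GB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return float(number) * UNITS[unit]


# SIZE column of the twelve python:3.6-slim layers, top to bottom,
# copied from the history output above.
BASE_LAYER_SIZES = [
    "0B", "7.13MB", "0B", "32B", "69.2MB", "0B",
    "0B", "6.48MB", "0B", "0B", "0B", "55.3MB",
]

base_mb = sum(size_to_bytes(s) for s in BASE_LAYER_SIZES) / 1e6
print(f"base image layers: {base_mb:.0f} MB")
```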