
Motivation

The current official Airflow image is rebuilt from scratch every time a new commit is pushed to the repo. It is "mono-layered" and does not take advantage of Docker's multi-layer architecture. Whether to use a mono-layered or a multi-layered image was discussed here: https://github.com/apache/airflow/pull/4483

A mono-layered image means that builds take longer (the image is always built from scratch) and that users who pull the image regularly always download the full image.

With a multi-layered approach and caching enabled in Docker Hub, we can optimise this so that only the layers that changed are downloaded. The disadvantage is that some layers become stale, i.e. they need to be refreshed from time to time (for example, binaries installed with apt-get).

Considerations

In the PR https://github.com/apache/airflow/pull/4543 the current mono-layered Docker image has been reimplemented as a multi-layered one, as a proof of concept. It is used as the basis for the calculations below.

A multi-layered image with caching enabled in Docker Hub might save a lot of build time, and a lot of download time for users. Details on the sizes of the mono-layered and multi-layered images are shown below.

Important assumption:

Caching must be enabled in Docker Hub

The tables below show when a given layer is rebuilt/downloaded.

The multi-layered Docker proposal features:

  • First significant layer: apt-get dependencies are installed. They can be force-rebuilt by increasing the value of the env variable FORCE_REINSTALL_APT_GET_DEPENDENCIES.
  • Second significant layer: only the PIP dependencies are installed, without the Airflow sources (just the setup* files are copied to the image and the dependencies are installed). It can be rebuilt by increasing the FORCE_REINSTALL_ALL_PIP_DEPENDENCIES variable.
  • Third significant layer: Airflow sources are added and the installation is repeated (this layer is rebuilt every time the Airflow sources change).
  • apt-get upgrade is run every time the sources change, to make sure that the latest APT packages (for example security fixes) are picked up.
  • You can disable the CYTHON version of the cassandra driver by setting CASS_DRIVER_NO_CYTHON_ARG to 1. This saves a few minutes of build time at the expense of losing the optimized (CYTHON-compiled) version of the cassandra driver.
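
The layering described above relies on how Docker's build cache works: cached layers are reused until the first invalidated one, and that layer plus every later layer is rebuilt. The sketch below is a simplified model of that rule and not the POC's build logic; the trigger names ("force_apt", "setup_files", "sources") are labels made up for this illustration.

```python
# Ordered layers of the proposed image and the kind of change that
# invalidates each one. The trigger names are hypothetical labels.
LAYERS = [
    ("apt-get dependencies", "force_apt"),
    ("pip dependencies (setup* files only)", "setup_files"),
    ("Airflow sources + reinstall", "sources"),
]


def rebuilt_layers(changes):
    """Return the names of the layers that have to be rebuilt.

    Docker reuses cached layers until it meets the first invalidated one;
    that layer and every layer after it are then rebuilt.
    """
    for index, (_, trigger) in enumerate(LAYERS):
        if trigger in changes:
            return [name for name, _ in LAYERS[index:]]
    return []


print(rebuilt_layers({"sources"}))                 # only the last layer
print(rebuilt_layers({"setup_files", "sources"}))  # the last two layers
print(rebuilt_layers({"force_apt"}))               # all three layers
```

In this model a commit that only touches the sources rebuilds just the last layer, which is why that layer is kept at the end of the build.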

Details for Mono-layered Docker image for Airflow

Implemented in https://github.com/apache/airflow/commit/e2c22fe70a488feea0cfecde890c20f8c984c09c 

Available to pull at: 

docker pull potiuk/airflow-monodocker:latest

Only significant layers are shown:

| Layer | Size | When rebuilt/downloaded |
|---|---|---|
| python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
| Airflow Sources | 237 MB | After every commit |
| Airflow installed binaries (all - apt and pip installed together) | 763 MB | After every commit |

Total: 1138 MB

Details for Multi-layered Docker image of Airflow

POC implemented in https://github.com/apache/airflow/pull/4543 

Available to pull at:

docker pull potiuk/airflow-layereddocker:latest

Only significant layers are shown:

| Layer | Size | When rebuilt/downloaded |
|---|---|---|
| python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
| apt-get install core build deps | 118 MB | Only when core dependencies change or when we force fresh build (extremely rare) |
| apt-get install extra deps | 155 MB | Only when extra deps change (extremely rare) |
| pip install deps (just setup no airflow sources) | 523 MB | Only when setup.py changes (every few weeks usually) |
| copy airflow sources | 176 MB | After every commit |
| Install extra airflow deps just in case | 6 MB | After every commit |

Total: 1116 MB


Note that the "Airflow sources + reinstall" layers will grow between forced reinstalls of all dependencies, because package upgrades are added on top of them. However, this growth should not be significant, and the size of these layers is reset whenever a full reinstall is done.

It turns out that the multi-layered image is even a bit smaller than the mono-layered one. But that is not the only benefit of the multi-layered image. Users who pull the mono-layered image semi-frequently have to download the whole image pretty much every time, whereas with the multi-layered approach they only need to pull the incremental changes. The size of those incremental changes depends on whether the setup.py dependencies were updated, or whether all dependencies were forced to be rebuilt from scratch.

Simulation of downloads for a user that pulls the image regularly

Here is a simulation showing how much data users download when they pull the Airflow image semi-frequently (twice a week).

Assumptions:

  • A user downloads a new image twice a week.

  • Setup.py is updated every two weeks.

  • Commits happen daily.

  • A rebuild from scratch is forced every 4 weeks, to account for changed dependencies.

Mono-layered downloads:

  • First download: 1.138 GB

  • All other downloads: 1 GB = 237 MB + 763 MB

Multi-layered downloads:

  • First download: 1.116 GB

  • Download if only sources changed (no setup.py): 182 MB = 176 MB + 6 MB

  • Download if setup.py changed: 705 MB = 176 MB + 523 MB + 6 MB

  • Download if apt-get dependencies are force-rebuilt: about 1 GB (978 MB = 118 MB + 155 MB + 523 MB + 176 MB + 6 MB)
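
The multi-layered figures can be reproduced from the layer sizes in the multi-layered table. This is a minimal sketch, assuming that a pull fetches the first changed layer and every layer built after it; the function name and the shortened layer labels are made up for this illustration.

```python
# Layer sizes in MB, in build order, copied from the multi-layered table.
MULTI_LAYERS_MB = [
    ("python:3.6-slim layers", 138),
    ("apt-get install core build deps", 118),
    ("apt-get install extra deps", 155),
    ("pip install deps", 523),
    ("copy airflow sources", 176),
    ("install extra airflow deps", 6),
]


def download_size_mb(first_changed_layer):
    """Sum the sizes of first_changed_layer and all layers after it."""
    names = [name for name, _ in MULTI_LAYERS_MB]
    start = names.index(first_changed_layer)
    return sum(size for _, size in MULTI_LAYERS_MB[start:])


print(download_size_mb("python:3.6-slim layers"))           # 1116 (first pull)
print(download_size_mb("copy airflow sources"))             # 182
print(download_size_mb("pip install deps"))                 # 705
print(download_size_mb("apt-get install core build deps"))  # 978
```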


User download size pattern:


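Each week has two columns, one per download.

| Week | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | Total (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sources change | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |
| Setup.py changes | x | | | | x | | | | x | | | | x | | | | |
| Forced dependencies | x | | | | | | | | x | | | | | | | | |
| Monolayered (GB) | 1.14 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 16.15 |
| Multilayered (GB) | 1.12 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 1 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 5.7 |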
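
The download pattern can be replayed with a few lines of code. This is a sketch under the stated assumptions (16 downloads, setup.py changing every 4th download, a forced rebuild every 8th); the per-download sizes are the unrounded figures derived from the layer tables, so the totals land within about 0.01 GB of the rounded ones in the table.

```python
# Per-download sizes in GB, derived from the layer tables.
MONO_FIRST, MONO_NEXT = 1.138, 1.0
MULTI_FIRST = 1.116    # all layers
MULTI_SOURCES = 0.182  # only sources changed
MULTI_SETUP = 0.705    # setup.py changed
MULTI_FORCED = 0.978   # forced rebuild of dependencies (shown as 1 above)

DOWNLOADS = 16         # twice a week for 8 weeks
SETUP_EVERY = 4        # setup.py changes every 2 weeks
FORCED_EVERY = 8       # forced rebuild every 4 weeks


def multi_download_gb(n):
    """Size of the n-th pull (0-based) of the multi-layered image."""
    if n == 0:
        return MULTI_FIRST
    if n % FORCED_EVERY == 0:
        return MULTI_FORCED
    if n % SETUP_EVERY == 0:
        return MULTI_SETUP
    return MULTI_SOURCES


mono_total = MONO_FIRST + MONO_NEXT * (DOWNLOADS - 1)
multi_total = sum(multi_download_gb(n) for n in range(DOWNLOADS))

print(f"mono-layered:  {mono_total:.2f} GB")
print(f"multi-layered: {multi_total:.2f} GB")
```

Changing DOWNLOADS, SETUP_EVERY or FORCED_EVERY shows how sensitive the comparison is to the assumed usage pattern.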



Conclusions

  • The multi-layered image is slightly smaller than the mono-layered one, so it is better even when you download it only once.
  • For users who pull the image regularly, the multi-layered image is far better: the simulated user who pulls the Airflow image twice a week downloads 5.7 GB (multi-layered) vs. 16.15 GB (mono-layered) over the course of 8 weeks.
  • The multi-layered image is the better choice.


Sources for calculation

Mono-layered image:

docker history potiuk/airflow-monodocker:latest
IMAGE CREATED CREATED BY SIZE COMMENT
711f22148f14 28 minutes ago /bin/sh -c #(nop) CMD ["--help"] 0B
1106894d8e9c 28 minutes ago /bin/sh -c #(nop) ENTRYPOINT ["/entrypoint.… 0B
a65bc0db5cdb 28 minutes ago /bin/sh -c #(nop) COPY file:22d6c0f397f65528… 907B
2f96d1ca713e 28 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 0B
63056a97131b 28 minutes ago /bin/sh -c #(nop) WORKDIR /usr/local/airflow 0B
79d60b1912b3 28 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 763MB
b23c2a77ce0f 37 minutes ago /bin/sh -c #(nop) WORKDIR /opt/airflow 0B
23a64ce057ce 37 minutes ago /bin/sh -c #(nop) ARG APT_DEPS=freetds-dev … 0B
4a79499a26c0 37 minutes ago /bin/sh -c #(nop) ARG buildDeps=freetds-dev… 0B
8ac987ed4125 37 minutes ago /bin/sh -c #(nop) ARG PYTHON_DEPS= 0B
327f2bc94551 37 minutes ago /bin/sh -c #(nop) ARG AIRFLOW_DEPS=all 0B
b387dffb0fec 37 minutes ago /bin/sh -c #(nop) ARG AIRFLOW_HOME=/usr/loc… 0B
2ce682937e1d 37 minutes ago /bin/sh -c #(nop) COPY dir:b90f1d3f3c97b3b89… 237MB
81f4c012cf1f 2 weeks ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 weeks ago /bin/sh -c set -ex; savedAptMark="$(apt-ma… 7.13MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=18… 0B
<missing> 2 weeks ago /bin/sh -c cd /usr/local/bin && ln -s idle3… 32B
<missing> 2 weeks ago /bin/sh -c set -ex && savedAptMark="$(apt-… 69.2MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.6.8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 2 weeks ago /bin/sh -c apt-get update && apt-get install… 6.48MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:6d6f6f123e45697d3… 55.3MB


Multi-layered image:


docker history potiuk/airflow-layereddocker:latest
IMAGE CREATED CREATED BY SIZE COMMENT
964458a837c4 13 minutes ago /bin/bash -c #(nop) CMD ["--help"] 0B
29934849ea6d 13 minutes ago /bin/bash -c #(nop) ENTRYPOINT ["/entrypoin… 0B
25aa1e37139b 13 minutes ago /bin/bash -c #(nop) COPY file:22d6c0f397f655… 907B
303f291f5588 13 minutes ago |4 ADDITIONAL_PYTHON_DEPS= AIRFLOW_EXTRAS=al… 0B
2c9545351f2b 13 minutes ago /bin/bash -c #(nop) ARG ADDITIONAL_PYTHON_D… 0B
5c51286365ff 13 minutes ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 147kB
0956060680a5 13 minutes ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 6.04MB
a448582458c1 13 minutes ago /bin/bash -c #(nop) COPY dir:1c24e93ef026646… 176MB
19a926abcdaf 13 minutes ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 523MB
e6fdf3881500 18 minutes ago /bin/bash -c #(nop) WORKDIR /opt/airflow 0B
351c1ae84700 18 minutes ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_AIR… 0B
589a353b806b 18 minutes ago /bin/bash -c #(nop) COPY file:143db2e76b8f16… 1.26kB
8d6e37aae712 18 minutes ago /bin/bash -c #(nop) COPY file:590340f7066102… 3.04kB
4052dc91ecb1 18 minutes ago /bin/bash -c #(nop) COPY file:3e78814fb55a47… 838B
f6fd5aa7f53d 18 minutes ago /bin/bash -c #(nop) COPY file:53d0bc9002b31a… 29.6kB
53a367472930 18 minutes ago /bin/bash -c #(nop) COPY multi:8bb5ed331b460… 14.2kB
7d28e9f76918 18 minutes ago /bin/bash -c #(nop) ENV SLUGIFY_USES_TEXT_U… 0B
3c5cebbef413 18 minutes ago /bin/bash -c #(nop) ENV CASS_DRIVER_NO_CYTH… 0B
1e011ecb668f 18 minutes ago /bin/bash -c #(nop) ENV CASS_DRIVER_BUILD_C… 0B
a30c11b58de1 18 minutes ago /bin/bash -c #(nop) ARG CASS_DRIVER_NO_CYTH… 0B
0d44f2da664b 26 minutes ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_ALL… 0B
fd9c1b1ebf05 26 minutes ago /bin/bash -c #(nop) ARG AIRFLOW_EXTRAS=all 0B
3845cd179711 26 minutes ago |1 AIRFLOW_HOME=/usr/local/airflow /bin/bash… 0B
9b190f1b4d13 26 minutes ago /bin/bash -c #(nop) ARG AIRFLOW_HOME=/usr/l… 0B
c4870f9907a5 4 days ago /bin/bash -c apt-get update && apt-get i… 155MB
e2249492ef25 4 days ago /bin/bash -c apt-get update && apt-get i… 118MB
34bf4afb2b2c 4 days ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_APT… 0B
fa2782294cce 4 days ago /bin/bash -c #(nop) ENV DEBIAN_FRONTEND=non… 0B
435a88072613 4 days ago /bin/bash -c #(nop) SHELL [/bin/bash -c] 0B
81f4c012cf1f 2 weeks ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 weeks ago /bin/sh -c set -ex; savedAptMark="$(apt-ma… 7.13MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=18… 0B
<missing> 2 weeks ago /bin/sh -c cd /usr/local/bin && ln -s idle3… 32B
<missing> 2 weeks ago /bin/sh -c set -ex && savedAptMark="$(apt-… 69.2MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.6.8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 2 weeks ago /bin/sh -c apt-get update && apt-get install… 6.48MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:6d6f6f123e45697d3… 55.3MB
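
The significant layer sizes in the tables can be picked out of these listings mechanically. This is a rough sketch and not part of the POC: it assumes that the size is the last token on each line (i.e. the COMMENT column is empty) and that the suffixes are decimal units; the helper names are made up for this illustration.

```python
import re

# Decimal multipliers for the size suffixes that appear in the listings.
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}
SIZE_AT_END = re.compile(r"(\d+(?:\.\d+)?)(B|kB|MB|GB)\s*$")


def layer_size_bytes(line):
    """Return the layer size in bytes, or None if the line has no size."""
    match = SIZE_AT_END.search(line)
    if match is None:
        return None
    return float(match.group(1)) * UNITS[match.group(2)]


def significant_layers(history_text, min_mb=1.0):
    """Return the sizes (in MB) of all layers of at least min_mb."""
    sizes = []
    for line in history_text.splitlines():
        size = layer_size_bytes(line)
        if size is not None and size >= min_mb * 10**6:
            sizes.append(round(size / 10**6, 2))
    return sizes


# A few lines copied from the mono-layered listing above.
SAMPLE = """\
IMAGE CREATED CREATED BY SIZE COMMENT
a65bc0db5cdb 28 minutes ago /bin/sh -c #(nop) COPY file:22d6c0f397f65528… 907B
79d60b1912b3 28 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 763MB
b23c2a77ce0f 37 minutes ago /bin/sh -c #(nop) WORKDIR /opt/airflow 0B
2ce682937e1d 37 minutes ago /bin/sh -c #(nop) COPY dir:b90f1d3f3c97b3b89… 237MB
"""

print(significant_layers(SAMPLE))  # [763.0, 237.0]
```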


