


Status

State: Draft

Discussion thread: https://lists.apache.org/thread.html/7af1a4faa4baa119a124cec0920c2d6e4b7b6c91d7fa5b7ce0d1c1d6@%3Cdev.airflow.apache.org%3E

JIRA: AIRFLOW-3718


Motivation

The current official Airflow image is rebuilt from scratch every time a new commit lands in the repo. It is a "mono-layered" image and does not take advantage of Docker's multi-layer architecture. Whether to use a mono-layered or a multi-layered image was discussed here: https://github.com/apache/airflow/pull/4483

A mono-layered image means that builds take longer (the image is always built from scratch) and that users who pull the image regularly download the full image every time.

With a multi-layered approach and caching enabled in Docker Hub, we can optimise this so that only the layers that changed are downloaded. The disadvantage is that some layers become stale, i.e. they need to be refreshed from time to time (for example the binaries installed with apt-get).

Considerations

In PR https://github.com/apache/airflow/pull/4543 the current mono-layered Docker image has been reimplemented as a multi-layered one as a proof of concept. It has been used as the basis for the calculations.

A multi-layered image with caching enabled in Docker Hub might save a lot of build time, and a lot of download time for users. Details on the sizes of the mono-layered and the multi-layered image are shown below.

Important assumption:

Caching must be enabled in Docker Hub

The tables below show when a given layer is rebuilt/downloaded.

The multi-layered Docker proposal features:

  • First significant layer: apt-get dependencies are installed (they can be force-rebuilt by increasing the value of the environment variable FORCE_REINSTALL_APT_GET_DEPENDENCIES).
  • Second significant layer: only the pip dependencies are installed, without the Airflow sources (just the setup* files are copied to the image and the dependencies are installed). It can be rebuilt by increasing the FORCE_REINSTALL_ALL_PIP_DEPENDENCIES variable.
  • Third significant layer: the Airflow sources are added and the installation is repeated (this layer is rebuilt every time the Airflow sources change).
  • apt-get upgrade is run every time the sources change, to make sure the latest versions of the APT dependencies are installed (for example security fixes).
  • You can disable the Cython version of the Cassandra driver by setting CASS_DRIVER_NO_CYTHON_ARG to 1. This saves a few minutes of build time at the expense of the optimized (Cython-compiled) version of the Cassandra driver.
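The rebuild rules above follow from how Docker's build cache works: once one layer is invalidated, every layer after it is rebuilt as well. The following is a minimal Python sketch of that rule, assuming the three significant layers are built in the order listed. The trigger labels `setup_files` and `sources` are hypothetical names used only for this illustration; they are not variables from the Dockerfile.

```python
# Ordered significant layers of the proposed image, paired with the changes
# that invalidate them. "setup_files" and "sources" are hypothetical labels.
LAYERS = [
    ("apt-get dependencies", {"FORCE_REINSTALL_APT_GET_DEPENDENCIES"}),
    ("pip dependencies (setup* files only)",
     {"FORCE_REINSTALL_ALL_PIP_DEPENDENCIES", "setup_files"}),
    ("Airflow sources + reinstall", {"sources"}),
]


def rebuilt_layers(changed):
    """Return the layers that must be rebuilt for the given set of changes.

    The first layer whose trigger is hit is rebuilt together with every
    layer that comes after it, mirroring Docker's cache invalidation.
    """
    changed = set(changed)
    for index, (_, triggers) in enumerate(LAYERS):
        if triggers & changed:
            return [name for name, _ in LAYERS[index:]]
    return []


if __name__ == "__main__":
    for change in ("sources", "setup_files", "FORCE_REINSTALL_APT_GET_DEPENDENCIES"):
        print(change, "->", rebuilt_layers({change}))
```

A change to the sources alone touches only the last layer, while forcing the apt-get reinstall rebuilds all three.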

Details for Mono-layered Docker image for Airflow

Implemented in https://github.com/apache/airflow/commit/e2c22fe70a488feea0cfecde890c20f8c984c09c 

Available to pull at: 

docker pull potiuk/airflow-monodocker:latest

Only significant layers are shown:

Layer | Size | When rebuilt/downloaded
python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built
Airflow Sources | 73 MB | After every commit
Airflow installed binaries (all - apt and pip installed together) | 765 MB | After every commit

Total: 976 MB


Example download time when tested (full download after removing the image and running docker system prune): 32.7 s. Note that this measurement was not rigorous and can be influenced by external factors.

time docker pull potiuk/airflow-monodocker:latest
latest: Pulling from potiuk/airflow-monodocker
177e7ef0df69: Pull complete
1dee839b70d8: Pull complete
aafb04a34d0d: Pull complete
9a36f2b2e390: Pull complete
51ac94058903: Pull complete
17105da27567: Pull complete
08903c354ddd: Pull complete
234eaa99bee5: Pull complete
8c3bd3e34c20: Pull complete
Digest: sha256:db5b707ddec35b5ceeb1caba9be5192965ad00ba34ec630fe5ee6b6d06c49b85
Status: Downloaded newer image for potiuk/airflow-monodocker:latest

real 0m32.744s
user 0m0.090s
sys 0m0.065s

Details for Multi-layered Docker image of Airflow

POC implemented in https://github.com/apache/airflow/pull/4543 

Available to pull at:

docker pull potiuk/airflow-layereddocker:latest

Only significant layers are shown:

Layer | Size | When rebuilt/downloaded
python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built
apt-get install core build deps | 118 MB | Only when core dependencies change or when we force fresh build (extremely rare)
apt-get install extra deps | 155 MB | Only when extra deps change (extremely rare)
pip install deps (just setup no airflow sources) | 523 MB | Only when setup.py changes (every few weeks usually)
copy airflow sources | 73 MB | After every commit
Install extra airflow deps just in case | 6 MB | After every commit

Total: 1007 MB

Example download time when tested (full download after removing the image and running docker system prune): 33.7 s. Note that this measurement was not rigorous and can be influenced by external factors.

time docker pull potiuk/airflow-layereddocker:latest
latest: Pulling from potiuk/airflow-layereddocker
177e7ef0df69: Pull complete
1dee839b70d8: Pull complete
aafb04a34d0d: Pull complete
9a36f2b2e390: Pull complete
51ac94058903: Pull complete
18b01857bb01: Pull complete
23ba9d802d8e: Pull complete
28157c14842b: Pull complete
8c6340a2c38d: Pull complete
a1b4c634dcbc: Pull complete
b0ce958037ac: Pull complete
c93f50ea89e5: Pull complete
939e3f06fc4b: Pull complete
ed1e854d5b96: Pull complete
918a0767c9ad: Pull complete
b207cdc2df35: Pull complete
99a53823ab76: Pull complete
8c3bd3e34c20: Pull complete
Digest: sha256:08a6e8ac7ae7b5c0de0b4d1c6cae3fbb8cb868f12ea3363dfb18374daa62b47a
Status: Downloaded newer image for potiuk/airflow-layereddocker:latest
real 0m33.761s
user 0m0.100s
sys 0m0.068s

Note that the "Airflow sources + reinstall" layers will grow between forced reinstalls of all dependencies, because package upgrades accumulate in them. However, this growth should not be significant, and a periodic full reinstall resets the size of these layers.

It turns out that the multi-layered image is only slightly bigger than the mono-layered one (1007 MB vs. 976 MB). Total size is not the main point, though. With the mono-layered image, users who download the image semi-frequently have to download the whole single layer pretty much every time, whereas with the multi-layered approach they only need to pull the incremental changes. The size of those incremental changes depends on whether the setup.py dependencies were updated, or whether all dependencies were forced to be rebuilt from scratch.
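Incremental pulls work because Docker skips any layer that is already present locally. As a rough illustration, the Python sketch below takes the layer IDs printed in the two pull outputs above and computes which layers of the multi-layered image would still have to be fetched. It assumes a hypothetical client that already holds every layer of the mono-layered image; that scenario is only an example and was not measured.

```python
# Layer IDs copied from the two "docker pull" outputs above.
MONO_LAYERS = [
    "177e7ef0df69", "1dee839b70d8", "aafb04a34d0d", "9a36f2b2e390",
    "51ac94058903", "17105da27567", "08903c354ddd", "234eaa99bee5",
    "8c3bd3e34c20",
]
LAYERED_LAYERS = [
    "177e7ef0df69", "1dee839b70d8", "aafb04a34d0d", "9a36f2b2e390",
    "51ac94058903", "18b01857bb01", "23ba9d802d8e", "28157c14842b",
    "8c6340a2c38d", "a1b4c634dcbc", "b0ce958037ac", "c93f50ea89e5",
    "939e3f06fc4b", "ed1e854d5b96", "918a0767c9ad", "b207cdc2df35",
    "99a53823ab76", "8c3bd3e34c20",
]


def layers_to_pull(image_layers, local_layers):
    """Return the layers of an image that are not yet present locally."""
    local = set(local_layers)
    return [layer for layer in image_layers if layer not in local]


if __name__ == "__main__":
    missing = layers_to_pull(LAYERED_LAYERS, MONO_LAYERS)
    shared = len(LAYERED_LAYERS) - len(missing)
    print(f"{shared} layers already present, {len(missing)} layers to pull")
```

The same set difference applies between two consecutive builds of the multi-layered image: only the layers whose IDs changed are transferred.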

Simulation of downloads for a user that pulls the image regularly

Here is a simulation showing how much data users will download when pulling the Airflow image semi-frequently (twice a week).

Assumptions:

  • A user downloads a new image twice a week.

  • setup.py is updated every two weeks.

  • Commits happen daily.

  • A rebuild from scratch is forced every 4 weeks, to account for changed dependencies.

Mono-layered downloads:

  • First download: 976 MB

  • All other downloads: 838 MB = 765 MB + 73 MB

Multi-layered downloads:

  • First download: 1007 MB

  • Download if only sources changed (no setup.py change): 73 MB

  • Download if setup.py changed: 757 MB = 155 MB + 523 MB + 73 MB + 6 MB

  • Download if apt-get dependencies are force-rebuilt: 869 MB = 1007 MB - 138 MB


User download size pattern:


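Week | Download | Sources change | Setup.py changes | Forced dependencies | Monolayered (MB) | Multilayered (MB)
1 | 1 | x | x | x | 976 | 1007
1 | 2 | x |  |  | 838 | 73
2 | 3 | x |  |  | 838 | 73
2 | 4 | x |  |  | 838 | 73
3 | 5 | x | x |  | 838 | 757
3 | 6 | x |  |  | 838 | 73
4 | 7 | x |  |  | 838 | 73
4 | 8 | x |  |  | 838 | 757
5 | 9 | x | x | x | 838 | 869
5 | 10 | x |  |  | 838 | 73
6 | 11 | x |  |  | 838 | 73
6 | 12 | x |  |  | 838 | 73
7 | 13 | x | x |  | 838 | 757
7 | 14 | x |  |  | 838 | 73
8 | 15 | x |  |  | 838 | 73
8 | 16 | x |  |  | 838 | 73
Total downloaded over the course of 8 weeks (MB) |  |  |  |  | 13546 | 4950 (36% of monolayered)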
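The totals in the table can be checked with a few lines of Python. This sketch simply sums the per-download values copied from the table as given (16 downloads, two per week); it does not re-derive them from the assumptions, and the variable names are chosen for this illustration only.

```python
# Per-download sizes in MB, copied from the table above (16 downloads).
MONO_MB = [976] + [838] * 15
MULTI_MB = [1007, 73, 73, 73, 757, 73, 73, 757,
            869, 73, 73, 73, 757, 73, 73, 73]


def summarize(mono, multi):
    """Return both totals and the multi-layered share of the mono-layered total."""
    total_mono = sum(mono)
    total_multi = sum(multi)
    return total_mono, total_multi, total_multi / total_mono


if __name__ == "__main__":
    total_mono, total_multi, ratio = summarize(MONO_MB, MULTI_MB)
    print(f"mono-layered:  {total_mono} MB")
    print(f"multi-layered: {total_multi} MB ({ratio:.1%} of mono-layered)")
```

Swapping in a different download pattern (for example weekly pulls) only requires changing the two lists.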



Conclusions

  • The multi-layered image is only slightly bigger than the mono-layered one (around 3% more in total: 1007 MB vs. 976 MB). The download time is also slightly longer, by 1 s (33.7 s vs. 32.7 s), which is 3% longer.
  • Regular downloads are much more efficient with the multi-layered image: for the simulated user, who downloads the Airflow image twice a week, the totals over the course of 8 weeks are 4950 MB (multi-layered) vs. 13546 MB (mono-layered), i.e. roughly 63% less data to download.
  • The multi-layered image therefore seems to be much better for users who download the image regularly.


Sources for calculation

Mono-layered image:

docker history potiuk/airflow-monodocker:latest
IMAGE CREATED CREATED BY SIZE COMMENT
725143eaf153 17 minutes ago /bin/sh -c #(nop) CMD ["--help"] 0B
<missing> 17 minutes ago /bin/sh -c #(nop) ENTRYPOINT ["/entrypoint.… 0B
<missing> 17 minutes ago /bin/sh -c #(nop) COPY file:22d6c0f397f65528… 907B
<missing> 17 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 0B
<missing> 17 minutes ago /bin/sh -c #(nop) WORKDIR /usr/local/airflow 0B
<missing> 17 minutes ago |5 AIRFLOW_DEPS=all AIRFLOW_HOME=/usr/local/… 765MB
<missing> 24 minutes ago /bin/sh -c #(nop) WORKDIR /opt/airflow 0B
<missing> 24 minutes ago /bin/sh -c #(nop) ARG APT_DEPS=freetds-dev … 0B
<missing> 24 minutes ago /bin/sh -c #(nop) ARG buildDeps=freetds-dev… 0B
<missing> 24 minutes ago /bin/sh -c #(nop) ARG PYTHON_DEPS= 0B
<missing> 24 minutes ago /bin/sh -c #(nop) ARG AIRFLOW_DEPS=all 0B
<missing> 24 minutes ago /bin/sh -c #(nop) ARG AIRFLOW_HOME=/usr/loc… 0B
<missing> 24 minutes ago /bin/sh -c #(nop) COPY dir:c08fa4a00d4740680… 72.8MB
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 weeks ago /bin/sh -c set -ex; savedAptMark="$(apt-ma… 7.13MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=18… 0B
<missing> 2 weeks ago /bin/sh -c cd /usr/local/bin && ln -s idle3… 32B
<missing> 2 weeks ago /bin/sh -c set -ex && savedAptMark="$(apt-… 69.2MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.6.8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 2 weeks ago /bin/sh -c apt-get update && apt-get install… 6.48MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:6d6f6f123e45697d3… 55.3MB


Multi-layered image:


docker history potiuk/airflow-layereddocker:latest
IMAGE CREATED CREATED BY SIZE COMMENT
055d0daee787 About an hour ago /bin/bash -c #(nop) CMD ["--help"] 0B
<missing> About an hour ago /bin/bash -c #(nop) ENTRYPOINT ["/entrypoin… 0B
<missing> About an hour ago /bin/bash -c #(nop) COPY file:22d6c0f397f655… 907B
<missing> About an hour ago |4 ADDITIONAL_PYTHON_DEPS= AIRFLOW_EXTRAS=al… 0B
<missing> About an hour ago /bin/bash -c #(nop) ARG ADDITIONAL_PYTHON_D… 0B
<missing> About an hour ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 128kB
<missing> About an hour ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 6.04MB
<missing> About an hour ago /bin/bash -c #(nop) COPY dir:5d6f5c2f0d7171e… 72.8MB
<missing> About an hour ago |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca… 523MB
<missing> About an hour ago /bin/bash -c #(nop) WORKDIR /opt/airflow 0B
<missing> 15 hours ago /bin/bash -c #(nop) COPY file:143db2e76b8f16… 1.26kB
<missing> 15 hours ago /bin/bash -c #(nop) COPY file:590340f7066102… 3.04kB
<missing> 15 hours ago /bin/bash -c #(nop) COPY file:3e78814fb55a47… 838B
<missing> 15 hours ago /bin/bash -c #(nop) COPY file:53d0bc9002b31a… 29.6kB
<missing> 15 hours ago /bin/bash -c #(nop) COPY multi:8bb5ed331b460… 14.2kB
<missing> 15 hours ago /bin/bash -c #(nop) ENV SLUGIFY_USES_TEXT_U… 0B
<missing> 15 hours ago /bin/bash -c #(nop) ENV CASS_DRIVER_NO_CYTH… 0B
<missing> 15 hours ago /bin/bash -c #(nop) ENV CASS_DRIVER_BUILD_C… 0B
<missing> 15 hours ago /bin/bash -c #(nop) ARG CASS_DRIVER_NO_CYTH… 0B
<missing> 15 hours ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_ALL… 0B
<missing> 15 hours ago /bin/bash -c #(nop) ARG AIRFLOW_EXTRAS=all 0B
<missing> 15 hours ago |1 AIRFLOW_HOME=/usr/local/airflow /bin/bash… 0B
<missing> 15 hours ago /bin/bash -c #(nop) ARG AIRFLOW_HOME=/usr/l… 0B
<missing> 5 days ago /bin/bash -c apt-get update && apt-get i… 155MB
<missing> 5 days ago /bin/bash -c apt-get update && apt-get i… 118MB
<missing> 5 days ago /bin/bash -c #(nop) ENV FORCE_REINSTALL_APT… 0B
<missing> 5 days ago /bin/bash -c #(nop) ENV DEBIAN_FRONTEND=non… 0B
<missing> 5 days ago /bin/bash -c #(nop) SHELL [/bin/bash -c] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 2 weeks ago /bin/sh -c set -ex; savedAptMark="$(apt-ma… 7.13MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=18… 0B
<missing> 2 weeks ago /bin/sh -c cd /usr/local/bin && ln -s idle3… 32B
<missing> 2 weeks ago /bin/sh -c set -ex && savedAptMark="$(apt-… 69.2MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.6.8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 2 weeks ago /bin/sh -c apt-get update && apt-get install… 6.48MB
<missing> 2 weeks ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 2 weeks ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 2 weeks ago /bin/sh -c #(nop) ADD file:6d6f6f123e45697d3… 55.3MB


