Status
State: Draft
Discussion thread:
JIRA: AIRFLOW-3718
Motivation
The current official Airflow image is rebuilt from scratch every time a new commit is pushed to the repo. It is a "mono-layered" image and does not take advantage of Docker's multi-layer architecture.
This means that builds take longer and that users who download the image regularly always download the full image. With a multi-layered approach and caching enabled in Docker Hub, users would download only the layers that changed.
Considerations
PR https://github.com/apache/airflow/pull/4543 re-implements the current mono-layered Docker image as a multi-layered one, as a proof of concept. It is the basis for the calculations below.
A multi-layered image with caching enabled in Docker Hub could save substantial build time, and substantial download time for users. The sizes of the mono-layered and multi-layered images are detailed below.
Important assumption:
Caching must be enabled in Docker Hub
The tables below show when a given layer is rebuilt/downloaded.
Details for Mono-layered Docker image for Airflow
Implemented in https://github.com/apache/airflow/commit/e2c22fe70a488feea0cfecde890c20f8c984c09c
Available to pull at:
docker pull potiuk/airflow-monodocker:latest
Only significant layers are shown:
Layer | Size | When rebuilt/downloaded |
python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
Airflow Sources | 237 MB | After every commit |
Airflow installed binaries (all - apt and pip installed together) | 763 MB | After every commit |
Total: 1138 MB
Details for Multi-layered Docker image of Airflow
POC implemented in https://github.com/apache/airflow/pull/4543
Available to pull at:
docker pull potiuk/airflow-layereddocker:latest
Only significant layers are shown:
Layer | Size | When rebuilt/downloaded |
python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
apt-get install core build deps | 118 MB | Only when core dependencies change or when we force fresh build (extremely rare) |
apt-get install extra deps | 155 MB | Only when extra deps change (extremely rare) |
pip install deps (setup.py only, no Airflow sources) | 523 MB | Only when setup.py changes (every few weeks usually) |
copy airflow sources | 176 MB | After every commit |
Install extra airflow deps just in case | 6 MB | After every commit |
Total: 1116 MB
The multi-layered image turns out to be slightly smaller than the mono-layered one (1116 MB vs 1138 MB), but size is not the main benefit. Users who download the image semi-frequently have to pull about 1 GB of the mono-layered image nearly every time, because its source and binary layers change with every commit. With the multi-layered image they only pull the layers that changed. The size of that incremental download depends on whether only the sources changed, whether setup.py dependencies were updated, or whether all dependencies were forcibly rebuilt from scratch.
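The incremental sizes follow from how Docker's layer cache behaves: once a layer changes, that layer and every layer built on top of it are rebuilt and pulled again, while the layers beneath it are reused. Below is a minimal Python sketch of that rule using the layer sizes from the multi-layered table. The short layer keys and the `download_size_mb` helper are hypothetical names introduced only for illustration, and the sketch assumes the user already has the unchanged lower layers cached locally.

```python
# Layers of the multi-layered image in build order, sizes in MB,
# copied from the table above. The short keys are illustrative names,
# not identifiers used by the actual Dockerfile.
LAYERS = [
    ("python-base", 138),
    ("apt-core-deps", 118),
    ("apt-extra-deps", 155),
    ("pip-deps", 523),
    ("airflow-sources", 176),
    ("extra-airflow-deps", 6),
]


def download_size_mb(first_changed_layer):
    """Return the MB to pull when `first_changed_layer` is rebuilt.

    A changed layer invalidates itself and every layer after it;
    layers before it stay cached and are not downloaded again.
    """
    names = [name for name, _ in LAYERS]
    start = names.index(first_changed_layer)  # ValueError if unknown
    return sum(size for _, size in LAYERS[start:])


if __name__ == "__main__":
    print(download_size_mb("airflow-sources"))  # only sources changed
    print(download_size_mb("pip-deps"))         # setup.py changed
    print(download_size_mb("apt-core-deps"))    # apt deps force-rebuilt
    print(download_size_mb("python-base"))      # nothing cached yet
```

Ordering the layers from least to most frequently changed is what keeps the common case (a source-only change) down to the last two layers.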
Simulation of downloads for a user that pulls the image regularly
The simulation below estimates how much data a user downloads when pulling the Airflow image semi-frequently (twice a week).
Assumptions:
A user downloads a new image twice a week.
setup.py is updated every two weeks.
Commits happen daily.
A rebuild from scratch is forced every 4 weeks to account for changed dependencies.
Mono-layered downloads:
First download: 1.138 GB
All other downloads: 1 GB (237 MB + 763 MB)
Multi-layered downloads:
First download: 1.116 GB
Download if only sources changed (no setup.py change): 182 MB = 176 MB + 6 MB
Download if setup.py changed: 705 MB = 523 MB + 176 MB + 6 MB
Download if apt-get dependencies are force-rebuilt: about 1 GB (978 MB = 118 MB + 155 MB + 523 MB + 176 MB + 6 MB)
User download size pattern:
Week | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | Total (GB) |
Sources change | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |
Setup.py changes | x | | | | x | | | | x | | | | x | | | | |
Forced dependencies | x | | | | | | | | x | | | | | | | | |
Monolayered (GB) | 1.14 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 16.14 |
Multilayered (GB) | 1.12 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 1 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 5.7 |
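The totals in the table can be re-derived with a short Python sketch of the simulation. It uses the rounded per-download sizes from the table and the stated assumptions (two downloads a week for 8 weeks, setup.py changing every two weeks, a forced rebuild every four weeks). Aligning those events to every 4th and 8th download is an assumption inferred from the Multilayered row, and all names in the code are illustrative.

```python
# Per-download sizes in GB, rounded as in the table above.
MONO_FIRST, MONO_REPEAT = 1.14, 1.0
MULTI_FIRST = 1.12
MULTI_SOURCES_ONLY = 0.18   # sources + extra airflow deps layers
MULTI_SETUP_CHANGED = 0.71  # pip deps + sources + extra airflow deps
MULTI_FORCED = 1.0          # everything above the python base image

DOWNLOADS = 16    # twice a week for 8 weeks
SETUP_EVERY = 4   # setup.py changes every 2 weeks = every 4th download
FORCED_EVERY = 8  # forced rebuild every 4 weeks = every 8th download


def mono_layered_size(index):
    """Size of the index-th download (0-based) of the mono-layered image."""
    return MONO_FIRST if index == 0 else MONO_REPEAT


def multi_layered_size(index):
    """Size of the index-th download (0-based) of the multi-layered image."""
    if index == 0:
        return MULTI_FIRST           # nothing cached yet
    if index % FORCED_EVERY == 0:
        return MULTI_FORCED          # forced rebuild outweighs other changes
    if index % SETUP_EVERY == 0:
        return MULTI_SETUP_CHANGED   # setup.py changed since last pull
    return MULTI_SOURCES_ONLY        # only sources changed


mono_total = round(sum(mono_layered_size(i) for i in range(DOWNLOADS)), 2)
multi_total = round(sum(multi_layered_size(i) for i in range(DOWNLOADS)), 2)

if __name__ == "__main__":
    print(f"mono-layered total:  {mono_total} GB")
    print(f"multi-layered total: {multi_total} GB")
```

Under these assumptions the multi-layered image costs the user roughly a third of the download volume of the mono-layered one over 8 weeks.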