


Status

State: Draft

Discussion thread:

JIRA: AIRFLOW-3718


Motivation

The current official Airflow image is rebuilt from scratch every time a new commit is pushed to the repo. It is "mono-layered" and does not take advantage of Docker's multi-layer architecture.

This means that builds take longer and that users who download the image regularly always download the full image. With a multi-layered approach and caching enabled in Docker Hub, we can optimise this so that only the layers that changed are downloaded.

Considerations

In PR https://github.com/apache/airflow/pull/4543 the current mono-layered Docker image has been reimplemented as a multi-layered one as a proof of concept. It has been used as the basis for the calculations below.

A multi-layered image with caching enabled in Docker Hub might save a lot of build time, and a lot of download time for users. Some details about the sizes of the mono-layered and multi-layered images are shown below.

Important assumption:

Caching must be enabled in Docker Hub

The tables below show when a given layer is rebuilt/downloaded.

Details for Mono-layered Docker image for Airflow

Implemented in https://github.com/apache/airflow/commit/e2c22fe70a488feea0cfecde890c20f8c984c09c 

Available to pull at: 

docker pull potiuk/airflow-monodocker:latest

Only significant layers are shown:

| Layer | Size | When rebuilt/downloaded |
|---|---|---|
| python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
| Airflow Sources | 237 MB | After every commit |
| Airflow installed binaries (all - apt and pip installed together) | 763 MB | After every commit |


Total: 1138 MB

Details for Multi-layered Docker image of Airflow

POC implemented in https://github.com/apache/airflow/pull/4543 

Available to pull at:

docker pull potiuk/airflow-layereddocker:latest

Only significant layers are shown:

| Layer | Size | When rebuilt/downloaded |
|---|---|---|
| python:3.6-slim layers (there are 12 layers) | 138 MB | Only the first time it is built |
| apt-get install core build deps | 118 MB | Only when core dependencies change or when we force fresh build (extremely rare) |
| apt-get install extra deps | 155 MB | Only when extra deps change (extremely rare) |
| pip install deps (just setup no airflow sources) | 523 MB | Only when setup.py changes (every few weeks usually) |
| copy airflow sources | 176 MB | After every commit |
| Install extra airflow deps just in case | 6 MB | After every commit |


Total: 1116 MB


It turns out that the multi-layered image is even a bit smaller than the mono-layered one. But size is not the only benefit of the multi-layered image. Users who download the image semi-frequently have to download pretty much the whole mono-layered image every time, whereas with the multi-layered approach they only need to pull the layers that changed. The size of that incremental download depends on whether the setup.py dependencies were updated, or whether all dependencies were forced to be rebuilt from scratch.
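The totals quoted above can be re-derived from the per-layer sizes. The snippet below is only a small arithmetic check. The dictionary names are hypothetical and the sizes are copied from the two tables.

```python
# Layer sizes in MB, copied from the two tables above.
# The names MONO_LAYERS and MULTI_LAYERS are illustrative only.
MONO_LAYERS = {
    "python:3.6-slim layers": 138,
    "Airflow Sources": 237,
    "Airflow installed binaries": 763,
}
MULTI_LAYERS = {
    "python:3.6-slim layers": 138,
    "apt-get install core build deps": 118,
    "apt-get install extra deps": 155,
    "pip install deps": 523,
    "copy airflow sources": 176,
    "Install extra airflow deps just in case": 6,
}

mono_total = sum(MONO_LAYERS.values())
multi_total = sum(MULTI_LAYERS.values())

print(f"mono-layered total:  {mono_total} MB")
print(f"multi-layered total: {multi_total} MB")
print(f"difference:          {mono_total - multi_total} MB")
```

The sums reproduce the 1138 MB and 1116 MB totals, so the multi-layered image is 22 MB smaller under these measurements.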

Simulation of downloads for a user that pulls the image regularly

Here is a simulation showing how large the downloads are for a user who pulls the Airflow image semi-frequently (twice a week).

Assumptions:

  • A user downloads a new image twice a week.

  • setup.py is updated every two weeks.

  • Commits happen daily.

  • A rebuild from scratch is forced every 4 weeks, to account for changed dependencies.

Mono-layered downloads:

  • First download: 1.138 GB

  • All other downloads: 1 GB (237 MB + 763 MB)

Multi-layered downloads:

  • First download: 1.116 GB

  • Download if only sources changed (no setup.py): 182 MB = 176 MB + 6 MB

  • Download if setup.py changed: 705 MB = 176 MB + 523 MB + 6 MB

  • Download if apt-get dependencies are force-rebuilt: about 1 GB (978 MB = 118 MB + 155 MB + 523 MB + 176 MB + 6 MB)
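The per-scenario sizes follow from how the Docker build cache works: once a layer is rebuilt, every layer after it is rebuilt as well, so a user has to pull the changed layer and everything stacked on top of it. A minimal sketch of that rule is shown below. The names `LAYERS` and `download_size_mb` are hypothetical, and the sizes are taken from the multi-layered table.

```python
# Ordered (name, size in MB) pairs for the multi-layered image.
# Order matters: a change in one layer invalidates all layers after it.
LAYERS = [
    ("python:3.6-slim layers", 138),
    ("apt-get install core build deps", 118),
    ("apt-get install extra deps", 155),
    ("pip install deps", 523),
    ("copy airflow sources", 176),
    ("install extra airflow deps", 6),
]


def download_size_mb(first_changed_layer: str) -> int:
    """Sum the sizes of the first changed layer and every layer after it."""
    names = [name for name, _ in LAYERS]
    start = names.index(first_changed_layer)
    return sum(size for _, size in LAYERS[start:])


print(download_size_mb("copy airflow sources"))             # only sources changed
print(download_size_mb("pip install deps"))                 # setup.py changed
print(download_size_mb("apt-get install core build deps"))  # forced rebuild of apt deps
```

Under this model the three calls give 182 MB, 705 MB and 978 MB respectively, which matches the figures in the list above.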


User download size pattern:


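| Week | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | Total (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sources change | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |
| Setup.py changes | x | | | | x | | | | x | | | | x | | | | |
| Forced dependencies | x | | | | | | | | x | | | | | | | | |
| Monolayered (GB) | 1.14 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 16.15 |
| Multilayered (GB) | 1.12 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 1 | 0.18 | 0.18 | 0.18 | 0.71 | 0.18 | 0.18 | 0.18 | 5.7 |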
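The download pattern above can be reproduced with a short simulation. This is a sketch under the stated assumptions (two pulls per week, setup.py changing every two weeks, a forced rebuild every four weeks). The function and constant names are hypothetical, and the per-download sizes are derived from the layer tables, so the exact sums may differ from the rounded table values in the last digit.

```python
# Per-download sizes in MB, derived from the layer tables.
MONO_FIRST, MONO_REPEAT = 1138, 1000  # full image, then sources + binaries
MULTI_FIRST = 1116                    # full multi-layered image
MULTI_SOURCES_ONLY = 182              # 176 + 6
MULTI_SETUP_PY = 705                  # 523 + 176 + 6
MULTI_FORCED = 978                    # 118 + 155 + 523 + 176 + 6


def simulate(weeks=8, pulls_per_week=2, setup_py_every_weeks=2, forced_every_weeks=4):
    """Return per-pull download sizes (MB) for both image variants."""
    mono, multi = [], []
    for pull in range(weeks * pulls_per_week):
        if pull == 0:
            # The very first pull downloads everything.
            mono.append(MONO_FIRST)
            multi.append(MULTI_FIRST)
            continue
        mono.append(MONO_REPEAT)
        if pull % (forced_every_weeks * pulls_per_week) == 0:
            multi.append(MULTI_FORCED)
        elif pull % (setup_py_every_weeks * pulls_per_week) == 0:
            multi.append(MULTI_SETUP_PY)
        else:
            multi.append(MULTI_SOURCES_ONLY)
    return mono, multi


mono, multi = simulate()
print(f"mono-layered total:  {sum(mono) / 1000:.2f} GB")
print(f"multi-layered total: {sum(multi) / 1000:.2f} GB")
```

With the default arguments the simulation covers 16 pulls and yields roughly 16.1 GB for the mono-layered image against roughly 5.7 GB for the multi-layered one, in line with the totals in the table.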










