
Status

State: Draft

Discussion Thread: https://lists.apache.org/thread/9kw13cxcg06p37shv57hsomnx6zognoc

Motivation

An Airflow worker host is a shared resource among all tasks running on it, so the host must provision the dependencies of every task, including system-level and Python application-level dependencies. This leads to a very fat runtime, and consequently to long host provisioning times and low elasticity of the worker resource.

The lack of runtime isolation makes operations such as adding or upgrading system and Python dependencies challenging and risky, and removing any dependency is almost impossible. It also adds significant operational cost for the team: users do not have permission to add or upgrade Python dependencies themselves, so every change requires coordination with us. And when packages have version conflicts, they cannot be installed directly on the host at all; users have to fall back to PythonVirtualenvOperator.

Considerations

To solve these problems, I propose introducing a Docker runtime for Airflow tasks and DAG parsing, using Docker containers as the task runtime environment. There are several benefits:

  1. Provides runtime isolation at the task level
  2. Allows a customized runtime for parsing DAG files
  3. Keeps the worker runtime lean, which enables high elasticity of the worker resource
  4. Makes the runtime immutable and portable
  5. Process isolation ensures that all subprocesses of a task are cleaned up after the Docker container exits


Note: this AIP does NOT force users to adopt Docker as the default runtime. It adds an option to parse DAG files in a Docker container and to run tasks in a Docker container.

What change do you propose to make?

Airflow Worker


The current Airflow worker runtime is shared by all tasks on a host. An Airflow worker is responsible for running an Airflow task.

The current process hierarchy is:

    airflow worker process

      → `airflow run local` process

        → `airflow run raw` process

          → potentially more processes spawned by the task

In the new design, the `airflow run local` and `airflow run raw` processes run inside a Docker container launched by the Airflow worker. In this way, the worker host runtime only needs the minimum requirements to run Airflow core and Docker.
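As a rough illustration of how the worker could launch that container, below is a minimal sketch using the Docker SDK for Python. The image name `airflow-task-runtime` and the concrete task arguments are placeholders, not something this AIP defines:

```python
import docker

client = docker.from_env()

# The worker replaces its local `airflow run local` fork with a container
# running the same command; `airflow run raw` is then spawned inside the
# container, exactly as it happens on the host today.
container = client.containers.run(
    image="airflow-task-runtime:latest",   # placeholder image name
    command=["airflow", "run", "--local",
             "example_dag", "example_task", "2021-01-01T00:00:00"],
    detach=True,                            # the worker keeps supervising it
)
result = container.wait()                   # worker reaps the container on exit
print(result["StatusCode"])
```

The following sections extend this sketch with resource constraints, volume mounts, and networking.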

Resource Constraints

Memory and CPU constraints can be easily configured via the Docker API.

The disk space soft constraint is delegated to AirflowAgent (running on each Airflow worker host). When launching a Docker container, the Airflow worker mounts a hostPath into the container using the naming convention `/tmp/airflow/<dag_id>/<task_id>/<execution_date>/`, bound to `/tmp/` inside the container. In this way, AirflowAgent can easily monitor the `/tmp` directory space used by Airflow tasks and react when disk utilization is high.
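In terms of the launch sketch above, these constraints and the hostPath mount map to a few extra arguments to `containers.run`; the limit values below are placeholders, not proposed defaults:

```python
# Extra arguments added to the containers.run(...) call in the earlier sketch.
dag_id, task_id, execution_date = "example_dag", "example_task", "2021-01-01T00:00:00"
host_tmp = f"/tmp/airflow/{dag_id}/{task_id}/{execution_date}/"  # hostPath naming convention

run_kwargs = dict(
    mem_limit="4g",          # placeholder memory cap, enforced by Docker
    nano_cpus=2 * 10**9,     # placeholder: 2 CPUs
    volumes={host_tmp: {"bind": "/tmp", "mode": "rw"}},  # monitored by AirflowAgent
)
```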

Networking

Airbnb uses client-side service discovery, which requires client-side setup in order to connect to other services (Synapse, HAProxy (or Envoy)). Since we are not aiming to provide network isolation, all containers will use the `host` network driver, meaning the containers on a host share the host's networking namespace.
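In the sketch above this is a single additional argument:

```python
# Share the host's networking namespace so client-side service discovery
# (Synapse / HAProxy / Envoy) keeps working without any in-container setup.
run_kwargs["network_mode"] = "host"
```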

Data Warehouse Layer

The data warehouse layer is shared by all tasks on a host. We do not allow or expect users to modify the runtime requirements of this layer. Containers access this layer via volume mounts: the Airflow worker mounts all the required interfaces (defined here) into the containers so that a task in a container can access the data warehouse clients' binaries and configurations.
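Continuing the sketch, these would be additional read-only entries in the container's volume map. The paths below are made-up examples; the real list is the set of interfaces referenced above:

```python
# Hypothetical data warehouse client paths, mounted read-only into the container.
run_kwargs["volumes"].update({
    "/usr/local/hive":  {"bind": "/usr/local/hive",  "mode": "ro"},
    "/etc/hadoop/conf": {"bind": "/etc/hadoop/conf", "mode": "ro"},
})
```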

User/Group Management Inside Containers 

Given that we rely on volume mounts, it is important to create users inside the containers with the correct uid, gid, and groups so that processes in the container can access the mounted files. In addition, files created inside the container must be accessible on the host with the same permissions.

The Airflow worker collects user information for 1) the Airflow infrastructure run-as user, 2) the Airflow task default impersonation user, and 3) the task impersonation user. This includes the username, uid, and the user's groups and gids. The worker passes this information to containers as environment variables, and the container entrypoint creates those groups and users.
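A minimal sketch of what the container entrypoint could do with that information. The environment variable names (`AIRFLOW_RUN_AS_USER`, etc.) and the single-user handling are illustrative assumptions; the real entrypoint would repeat this for the infra run-as user, the default impersonation user, and the task impersonation user:

```python
import os
import subprocess

def create_user(username: str, uid: str, gid: str, extra_groups: str) -> None:
    """Recreate a host user inside the container so files on the volume
    mounts keep the same ownership on both sides."""
    # Primary group, named after the user.
    subprocess.run(["groupadd", "--gid", gid, username], check=True)
    # Secondary groups, passed e.g. as "hive:1003,hadoop:1004".
    names = []
    for entry in filter(None, extra_groups.split(",")):
        name, group_gid = entry.split(":")
        subprocess.run(["groupadd", "--force", "--gid", group_gid, name], check=True)
        names.append(name)
    cmd = ["useradd", "--uid", uid, "--gid", gid, "--no-create-home", username]
    if names:
        cmd[1:1] = ["--groups", ",".join(names)]
    subprocess.run(cmd, check=True)

# Illustrative variable names set by the worker when launching the container.
create_user(
    os.environ["AIRFLOW_RUN_AS_USER"],
    os.environ["AIRFLOW_RUN_AS_UID"],
    os.environ["AIRFLOW_RUN_AS_GID"],
    os.environ.get("AIRFLOW_RUN_AS_GROUPS", ""),
)
```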

DockerOperator

Some tasks use the DockerOperator, which requires access to the Docker socket from within the task runtime. The Airflow worker provides this by mounting the Docker socket path into the container. Since the container launched by the DockerOperator is a separate container, the resource restrictions are not applied to it automatically, and run_as_user is controlled differently because that container has full privileges. These are caveats of the DockerOperator itself, so we are not targeting them in this project.
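In the launch sketch above, this is one more bind mount; the socket path is assumed to be the default `/var/run/docker.sock`:

```python
# Expose the host's Docker daemon socket inside the task container so the
# DockerOperator can launch its own (sibling) containers.
run_kwargs["volumes"]["/var/run/docker.sock"] = {
    "bind": "/var/run/docker.sock", "mode": "rw",
}
```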



Which users are affected by the change?

No user impact. This feature is controlled by a feature flag.

How are users affected by the change? (e.g. DB upgrade required?)

NA

What defines this AIP as "done"?

DAG files can be parsed in a Docker container, and Airflow tasks can run inside a Docker container.
