Status
State: Draft
Discussion thread:
JIRA:
Motivation
Airflow runs arbitrary code across different workers. Currently, every task has full access to the Airflow database, including connection details such as usernames and passwords. This makes it quite hard to deploy Airflow in multi-tenant or semi-multi-tenant environments. In addition, there is no mechanism in place that ensures that what the scheduler thinks it is scheduling is also what actually runs on the worker. This creates the additional operational risk of running an out-of-date task that does something other than expected, aside from the security risk of a malicious task.
Challenges
DAGs can be generated by DAG factories, which essentially create a kind of sub-DSL in which DAGs are defined. From Airflow's point of view this is a challenge, because it means DAGs are able to pull in arbitrary dependencies.
Envisioned workflow
DAG submission
The DAG directory should be fully managed by Airflow. DAGs should be submitted by authorized users, and ownership is determined by the submitting user. Group ownership determines whether other users can overwrite the DAG. DAGs can be submitted either as a single .py file or as a zip containing all required dependencies apart from system dependencies.
DAGs are submitted to a REST endpoint. The API server then verifies the submitted DAG and places it in the DAG directory (/var/lib/airflow/dags).
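A minimal sketch of what a submission call could look like, assuming a hypothetical /dags endpoint and bearer-token authentication; the proposal only specifies that submission happens over REST and that the server places verified DAGs in /var/lib/airflow/dags.

```python
# Sketch only: endpoint path, auth scheme and response format are assumptions.
import requests

def submit_dag(api_url, dag_path, auth_token):
    """Upload a single .py file or a zip bundle to the API server."""
    with open(dag_path, "rb") as f:
        response = requests.post(
            f"{api_url}/dags",                                   # hypothetical endpoint
            headers={"Authorization": f"Bearer {auth_token}"},
            files={"dag": (dag_path, f)},
        )
    response.raise_for_status()
    # The API server verifies the DAG and places it in /var/lib/airflow/dags.
    return response.json()
```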
DAG distribution
In the case of distributed workers, DAG distribution happens over the REST API. The worker checks whether the DAG is available in the local cache (/var/lib/cache/airflow/dags) and verifies its SHA512 hash. If the DAG is unavailable or the hash differs, the worker asks the API for the version with the specified hash and stores it in the local cache.
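A sketch of the worker-side cache lookup, assuming a hypothetical /dags/<dag_id>?hash=<sha512> endpoint; only the cache path and the use of SHA512 come from the proposal.

```python
import hashlib
import os
import requests

CACHE_DIR = "/var/lib/cache/airflow/dags"

def sha512_of(path):
    """Compute the SHA512 hash of a file in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def ensure_dag(api_url, dag_id, expected_hash):
    """Return the local path of the DAG, fetching it if missing or stale."""
    local_path = os.path.join(CACHE_DIR, dag_id)
    if not os.path.exists(local_path) or sha512_of(local_path) != expected_hash:
        # Ask the API for the version with the specified hash.
        resp = requests.get(f"{api_url}/dags/{dag_id}", params={"hash": expected_hash})
        resp.raise_for_status()
        with open(local_path, "wb") as f:
            f.write(resp.content)
    return local_path
```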
Task Launching
Scheduler
Once the scheduler has determined that a particular task is ready for launch, it checks which connections the task requires. It verifies the user that owns the task against those connections. If they match, the scheduler serializes the connection details into a metadata blob together with the SHA512 hash of the task. The metadata is then sent to the executor together with the task_instance information.
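A sketch of how the scheduler could assemble the metadata blob; the JSON layout and the ownership check shown here are illustrative assumptions, not part of the proposal.

```python
import hashlib
import json

def build_task_metadata(task_file, task_owner, connections):
    """Serialize connection details plus the SHA512 hash of the task."""
    with open(task_file, "rb") as f:
        task_hash = hashlib.sha512(f.read()).hexdigest()
    # Only connections the owning user is allowed to access end up in the blob.
    allowed = [c for c in connections if c["owner"] == task_owner]
    return json.dumps({
        "task_hash": task_hash,
        "connections": allowed,
    })
```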
Worker
The worker receives the metadata, calculates the hash of the task, and verifies it against the hash in the metadata. If they match, the worker continues; otherwise it raises a security exception.
The worker pipes (os.pipe()) the metadata to the task over a dedicated file descriptor, as sketched below.
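A sketch of the worker-side verification and hand-off, assuming the metadata blob is the JSON produced by the scheduler and the task runs as a subprocess; the exact command line and flag are assumptions.

```python
import hashlib
import json
import os
import subprocess

class SecurityError(Exception):
    """Raised when the task on disk does not match what the scheduler scheduled."""

def run_task(task_file, metadata_blob):
    metadata = json.loads(metadata_blob)
    with open(task_file, "rb") as f:
        local_hash = hashlib.sha512(f.read()).hexdigest()
    if local_hash != metadata["task_hash"]:
        raise SecurityError("task hash does not match scheduler metadata")
    # Hand the metadata to the task over a dedicated pipe file descriptor.
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        ["airflow", "run", task_file, "--metadata-fd", str(read_fd)],  # hypothetical CLI flag
        pass_fds=(read_fd,),
    )
    os.close(read_fd)                      # parent no longer needs the read end
    with os.fdopen(write_fd, "w") as w:
        w.write(metadata_blob)             # closing the write end signals EOF to the task
    return proc.wait()
```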
Task
The task parses the metadata and obtains its connection information from it, or, for backwards compatibility, from the environment.
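A sketch of the task side, assuming the metadata arrives as JSON on the file descriptor passed by the worker; the metadata layout is an assumption, while the AIRFLOW_CONN_* environment variables are the existing Airflow convention for connection URIs.

```python
import json
import os

def load_connections(metadata_fd=None):
    """Prefer connections from the metadata pipe, fall back to the environment."""
    if metadata_fd is not None:
        with os.fdopen(metadata_fd) as f:
            metadata = json.load(f)
        return {c["conn_id"]: c for c in metadata["connections"]}
    # Backwards compatibility: read AIRFLOW_CONN_<CONN_ID> environment variables.
    return {
        key[len("AIRFLOW_CONN_"):].lower(): value
        for key, value in os.environ.items()
        if key.startswith("AIRFLOW_CONN_")
    }
```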