Status

State: Draft
Discussion Thread:
JIRA: AIRFLOW-4138
Created:

Motivation

Currently, Airflow discovers DAGs by traversing all the files under $AIRFLOW_HOME/dags and looking for files that contain the strings "airflow" and "DAG" in their content, which is not efficient. We need a better way for Airflow to discover DAGs.

Considerations

Is there anything special to consider about this AIP? Downsides? Difficulty in implementation or rollout, etc.?

What change do you propose to make?

I am proposing to introduce a DAG manifest, an easier and more efficient way for Airflow to discover DAGs. The DAG manifest would be composed of manifest entries, where each entry represents a single DAG and contains information about where to find it.

Format:

Code Block
languagetext
dag_manifest_entry:
    dag_id: the DAG ID
    uri: where the DAG can be found; locations are given as URIs, e.g. s3://my-bucket/dag1.zip, local://dags/dag1.zip
    conn_id: connection ID to use to interact with the remote location
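
To make the entry shape concrete, it could be modeled in Python roughly as follows; the class name DagManifestEntry is an illustrative assumption, not part of this proposal:

Code Block
languagepy
from dataclasses import dataclass
from typing import Optional

@dataclass
class DagManifestEntry:
    """One manifest entry per DAG."""
    dag_id: str                    # the DAG ID
    uri: str                       # e.g. "s3://my-bucket/dag1.zip" or "local://dags/dag1.zip"
    conn_id: Optional[str] = None  # only needed for remote locations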

File-based DAG manifest

Airflow services will look at $AIRFLOW_HOME/manifest.json for the DAG manifest. The manifest.json contains all the DAG entries. We should expect a manifest.json like:

Code Block
languagetext
{
	"dag_1": {
		"uri": "local://dags/hello.py"
	},
	"dag_2": {
		"uri": "s3://dags/superhero.py"
	}
}
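
A minimal sketch of how a service could read this file, assuming the JSON object layout above; the function name load_dag_manifest is hypothetical:

Code Block
languagepy
import json
import os

def load_dag_manifest(airflow_home: str) -> dict:
    """Read $AIRFLOW_HOME/manifest.json and return a dag_id -> entry mapping."""
    with open(os.path.join(airflow_home, "manifest.json")) as f:
        return json.load(f)

# Usage: load_dag_manifest(os.environ["AIRFLOW_HOME"])["dag_1"]["uri"]
# returns "local://dags/hello.py"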

Custom DAG manifest

The manifest can also be generated by a callable supplied in airflow.cfg that returns a list of entries when called, e.g.:

Code Block
languagetext
[core]
# callable to fetch dag manifest list
dag_manifest_entries = my_config.get_dag_manifest_entries

The DAG manifest can, for example, be stored on S3, with my_config.get_dag_manifest_entries reading the manifest from S3.
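
As an illustration, such a callable might use boto3 to fetch the manifest; the bucket name and key here are hypothetical:

Code Block
languagepy
import json

import boto3

def get_dag_manifest_entries() -> dict:
    """Fetch and parse the DAG manifest stored on S3.

    The bucket and key are hypothetical; in practice they would come
    from configuration or an Airflow connection.
    """
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-dag-bucket", Key="manifest.json")
    return json.loads(obj["Body"].read())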

What problem does it solve?

An easier and more efficient approach to Airflow DAG discovery.

Why is it needed?

  • With the manifest, users can explicitly declare which DAGs Airflow should load.
  • Airflow no longer has to crawl through a directory and import arbitrary files, which can cause problems.
  • Users are not forced to provide a way to crawl various remote sources.
  • We can get rid of AIRFLOW-97, which requires strings such as "airflow" and "DAG" to be present in the DAG file.

Are there any downsides to this change?

  • An extra step is required to add a new DAG, i.e. adding an entry to the DAG manifest.
  • Migration is required for upgrade. We may need to provide a script that traverses all the files under $AIRFLOW_HOME/dags or remote storage (assuming we have a remote DAG fetcher) and populates entries in the DAG manifest; a sketch follows below.
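
A rough sketch of that migration script for the local DAGs folder, reusing the current content heuristic one last time; the helper name and the use of the file stem as the manifest key are assumptions:

Code Block
languagepy
import json
import os

def build_manifest_from_dags_folder(dags_folder: str, manifest_path: str) -> None:
    """Walk the DAGs folder and write a manifest entry per candidate DAG file."""
    entries = {}
    for root, _dirs, files in os.walk(dags_folder):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path) as f:
                content = f.read()
            # Reuse the existing discovery heuristic once, at migration time.
            if "airflow" in content and "DAG" in content:
                # Hypothetical: key the entry by file stem; real dag_ids would
                # require importing the file.
                entries[os.path.splitext(name)[0]] = {"uri": f"local://{path}"}
    with open(manifest_path, "w") as f:
        json.dump(entries, f, indent=4)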

Which users are affected by the change?

All users attempting to upgrade to the latest version.

How are users affected by the change? (e.g. DB upgrade required?)

Users need to run a migration script to populate the DAG manifest.

Other considerations?

N/A

What defines this AIP as "done"?

Airflow discovers DAGs by looking at the DAG manifest, without traversing all the files in the filesystem.