Status

State: Draft
Discussion Thread:
JIRA: AIRFLOW-4138
Created:

Motivation

Currently, Airflow discovers DAGs by traversing all the files under $AIRFLOW_HOME/dags and looking for files that contain the strings "airflow" and "DAG" in their content, which is not efficient. We need a better way for Airflow to discover DAGs.

Considerations

Is there anything special to consider about this AIP? Downsides? Difficulty in implementation or rollout, etc.?

What change do you propose to make?

I am proposing to introduce a DAG manifest, an easier and more efficient way for Airflow to discover DAGs. The DAG manifest would be composed of manifest entries, where each entry represents a single DAG and contains information about where to find it.

Format:

Code Block
languagetext
dag_manifest_entry:
    dag_id: the DAG ID
    uri: where the DAG can be found; locations are given as URIs, e.g. s3://my-bucket/dag1.zip, local://dags/dag1.zip
    conn_id: connection ID to use to interact with the remote location
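
To make the entry shape concrete, it could be modeled in Python roughly as follows; the class name DagManifestEntry is an illustrative assumption, not part of this proposal:

Code Block
languagepy
from dataclasses import dataclass
from typing import Optional

@dataclass
class DagManifestEntry:
    """One manifest entry per DAG."""
    dag_id: str                    # the DAG ID
    uri: str                       # e.g. "s3://my-bucket/dag1.zip" or "local://dags/dag1.zip"
    conn_id: Optional[str] = None  # only needed for remote locations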

File-based DAG manifest

Airflow services will look at $AIRFLOW_HOME/manifest.json for the DAG manifest. The manifest.json contains all the DAG entries. We should expect a manifest.json like:

Code Block
languagetext
{
	"dag_1": {
		"uri": "local://dags/hello.py"
	},
	"dag_2": {
		"uri": "s3://dags/superhero.py"
	}
}
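
A minimal sketch of how a service could read this file, assuming the JSON object layout above; the function name load_dag_manifest is hypothetical:

Code Block
languagepy
import json
import os

def load_dag_manifest(airflow_home: str) -> dict:
    """Read $AIRFLOW_HOME/manifest.json and return a dag_id -> entry mapping."""
    with open(os.path.join(airflow_home, "manifest.json")) as f:
        return json.load(f)

# Usage: load_dag_manifest(os.environ["AIRFLOW_HOME"])["dag_1"]["uri"]
# returns "local://dags/hello.py"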

Custom DAG manifest

The manifest can also be generated by a callable supplied in airflow.cfg that returns a list of entries when called, e.g.:

Code Block
languagetext
[core]
# callable to fetch dag manifest list
dag_manifest_entries = my_config.get_dag_manifest_entries

The DAG manifest can, for example, be stored on S3, with my_config.get_dag_manifest_entries reading the manifest from S3.
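
As an illustration, such a callable might use boto3 to fetch the manifest; the bucket name and key here are hypothetical:

Code Block
languagepy
import json

import boto3

def get_dag_manifest_entries() -> dict:
    """Fetch and parse the DAG manifest stored on S3.

    The bucket and key are hypothetical; in practice they would come
    from configuration or an Airflow connection.
    """
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-dag-bucket", Key="manifest.json")
    return json.loads(obj["Body"].read())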

What problem does it solve?

An easier and more efficient approach to Airflow DAG discovery.

Why is it needed?

  • With the manifest, users can explicitly declare which DAGs Airflow should load.
  • Airflow no longer has to crawl through a directory and import arbitrary files, which can cause problems.
  • Users are not forced to provide a way to crawl various remote sources.
  • We can get rid of AIRFLOW-97, which requires strings such as "airflow" and "DAG" to be present in the DAG file.

Are there any downsides to this change?

  • An extra step is required to add a new DAG, i.e. adding an entry to the DAG manifest.
  • Migration is required for upgrade. We may need to provide a script that traverses all the files under $AIRFLOW_HOME/dags or remote storage (assuming we have a remote DAG fetcher) and populates entries in the DAG manifest; a sketch follows below.
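
A rough sketch of that migration script for the local DAGs folder, reusing the current content heuristic one last time; the helper name and the use of the file stem as the manifest key are assumptions:

Code Block
languagepy
import json
import os

def build_manifest_from_dags_folder(dags_folder: str, manifest_path: str) -> None:
    """Walk the DAGs folder and write a manifest entry per candidate DAG file."""
    entries = {}
    for root, _dirs, files in os.walk(dags_folder):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path) as f:
                content = f.read()
            # Reuse the existing discovery heuristic once, at migration time.
            if "airflow" in content and "DAG" in content:
                # Hypothetical: key the entry by file stem; real dag_ids would
                # require importing the file.
                entries[os.path.splitext(name)[0]] = {"uri": f"local://{path}"}
    with open(manifest_path, "w") as f:
        json.dump(entries, f, indent=4)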

Which users are affected by the change?

All users attempting to upgrade to the latest version.

How are users affected by the change? (e.g. DB upgrade required?)

Users need to run a migration script to populate the DAG manifest.

Other considerations?

N/A

What defines this AIP as "done"?

Airflow discovers DAGs by looking at the DAG manifest, without traversing all the files in the filesystem.