
Status

State: Draft
Discussion Thread:
JIRA:


Motivation

Currently, Airflow discovers DAGs by traversing all the files under $AIRFLOW_HOME/dags and looking for files that contain the strings "airflow" and "DAG", which is not efficient. We need a better way for Airflow to discover DAGs.

Considerations

Is there anything special to consider about this AIP? Downsides? Difficulty in implementation or rollout, etc.?

What change do you propose to make?

I am proposing to introduce the DAG manifest, an easier and more efficient way for Airflow to discover DAGs. The DAG manifest would be composed of manifest entries, where each entry represents a single DAG and contains information about where to find it.

Format:

dag_manifest_entry:
    dag_id: the DAG ID
    uri: where the DAG can be found, given as a URI, e.g. s3://my-bucket/dag1.zip or local://dags/dag1.zip
    conn_id: connection ID to use to interact with the remote location
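
For example, a single entry for a DAG stored on S3 might look like the following (the aws_default connection ID is illustrative):

dag_manifest_entry:
    dag_id: hello_world
    uri: s3://my-bucket/dags/hello_world.py
    conn_id: aws_default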

File-based DAG manifest

Airflow services will look at $AIRFLOW_HOME/manifest.json for the DAG manifest. This file contains all the DAG entries. A manifest.json would look like the following:

{
	"dag_1": {
		"uri": "local://dags/hello.py"
	},
	"dag_2": {
		"uri": "s3://dags/superhero.py"
	}
}
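
As a rough sketch of how a service might consume this file (the function name and fallback path are illustrative, not part of this proposal):

import json
import os

def load_dag_manifest():
    # Locate the manifest under the Airflow home directory.
    airflow_home = os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow"))
    manifest_path = os.path.join(airflow_home, "manifest.json")
    # The file maps dag_id -> entry (uri, and optionally conn_id).
    with open(manifest_path) as f:
        return json.load(f)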

Custom DAG manifest

The manifest can also be generated by a callable supplied in airflow.cfg that returns a list of entries when called, e.g.:

[core]
# callable to fetch dag manifest list
dag_manifest_entries = my_config.get_dag_manifest_entries

For example, the DAG manifest can be stored on S3, and my_config.get_dag_manifest_entries will read the manifest from S3.
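
A minimal sketch of such a callable, assuming the manifest is a JSON object on S3 and boto3 is available (the bucket and key are placeholders):

import json

import boto3

def get_dag_manifest_entries():
    # Fetch the manifest object; in practice the bucket/key would come
    # from configuration rather than being hard-coded.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket", Key="manifest.json")
    # Parse and return the dag_id -> entry mapping.
    return json.loads(obj["Body"].read())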

What problem does it solve?

It provides an easier and more efficient approach to DAG discovery in Airflow.

Why is it needed?

  • With the manifest, users can state explicitly which DAGs Airflow should look at
  • Airflow no longer has to crawl through a directory and import arbitrary files, which can cause problems
  • Users are not forced to provide a way for Airflow to crawl various remote sources
  • We can get rid of the discovery heuristic that requires the strings "airflow" and "DAG" to be present in a DAG file

Are there any downsides to this change?

  • There is an extra step to add a new DAG: adding an entry to the DAG manifest.
  • Migration is required on upgrade. We may need to provide a script that traverses all the files under $AIRFLOW_HOME/dags or remote storage (assuming we have a remote DAG fetcher) and populates entries in the DAG manifest; a rough sketch follows this list.
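
A rough sketch of what such a migration script could look like for the local case (it uses the file stem as a stand-in for the DAG ID; a real script would import each module and read the actual dag_id):

import json
import os

def build_manifest(dags_folder, manifest_path):
    entries = {}
    for root, _dirs, files in os.walk(dags_folder):
        for name in files:
            if not name.endswith(".py"):
                continue
            # Stand-in for the real DAG ID, which lives inside the file.
            dag_id = os.path.splitext(name)[0]
            rel_path = os.path.relpath(os.path.join(root, name), dags_folder)
            entries[dag_id] = {"uri": "local://dags/" + rel_path}
    with open(manifest_path, "w") as f:
        json.dump(entries, f, indent=2)

if __name__ == "__main__":
    airflow_home = os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow"))
    build_manifest(os.path.join(airflow_home, "dags"),
                   os.path.join(airflow_home, "manifest.json"))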

Which users are affected by the change?

All users attempting to upgrade.

How are users affected by the change? (e.g. DB upgrade required?)

Users need to run a migration script to populate the DAG manifest.

Other considerations?

N/A

What defines this AIP as "done"?

Airflow discovers DAGs by looking at the DAG manifest instead of traversing all the files in the filesystem.
