Motivation

By default, users launch one scheduler instance for Airflow. This raises a few concerns, including:

  • High Availability: the single scheduler is a single point of failure; if it goes down, scheduling stops.
  • Scheduling Performance: the scheduling latency for each DAG may be long if there are many DAGs.


It would be ideal for Airflow to support multiple schedulers, to address these concerns.

Considerations

1. `scheduler_lock` already exists in DagModel, but it is not used in the current implementation of Airflow (as of https://github.com/apache/airflow/tree/45d24e79eab98589b1b0509e920811cbf778048b). We should leverage it and modify the scheduler code accordingly.
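One way such a lock column could be used is a conditional UPDATE that only succeeds when the lock is free. The sketch below is only an illustration under stated assumptions: the `dag` table layout, the integer `scheduler_lock` column, and the helpers `try_acquire` and `release` are hypothetical, it uses an in-memory SQLite database rather than Airflow's metadata database, and it is not Airflow's actual scheduler code.

```python
import sqlite3


def try_acquire(conn, dag_id):
    """Try to claim a DAG; return True only if this caller obtained the lock.

    The WHERE clause makes the check and the update a single statement,
    so two callers cannot both see the lock as free and both take it.
    """
    cur = conn.execute(
        "UPDATE dag SET scheduler_lock = 1 "
        "WHERE dag_id = ? AND scheduler_lock = 0",
        (dag_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def release(conn, dag_id):
    """Release the lock so another scheduler can pick the DAG up."""
    conn.execute("UPDATE dag SET scheduler_lock = 0 WHERE dag_id = ?", (dag_id,))
    conn.commit()


# Hypothetical, simplified stand-in for the DAG table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag ("
    "dag_id TEXT PRIMARY KEY, "
    "scheduler_lock INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO dag (dag_id) VALUES ('example_dag')")
conn.commit()

first = try_acquire(conn, "example_dag")   # lock is free, so this succeeds
second = try_acquire(conn, "example_dag")  # lock is held, so this is refused
release(conn, "example_dag")
third = try_acquire(conn, "example_dag")   # free again after release
```

A scheduler that fails to acquire the lock would simply skip that DAG and move on to the next one. A real implementation would also need to handle a scheduler that dies while holding a lock, which this sketch does not address.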

2. To avoid the leader-selection problem, we may not want to use a master-slave architecture for schedulers. Instead, we simply start multiple schedulers.

The probability of schedulers competing on the same DAG is easy to calculate, since it is a typical Birthday Problem: the probability that at least two schedulers are competing on the same DAG is 1 - m!/((m-n)! * m^n), where m is the number of DAGs and n is the number of schedulers. This probability is reasonably low as long as the ratio of DAGs to schedulers is not too low.

Let’s say we have 200 DAGs and we start 2 schedulers. At any moment, the probability that the schedulers are competing on the same DAG is only 0.5%. If we run 2 schedulers against 300 DAGs, this probability is only 0.33%. (https://lists.apache.org/thread.html/389287b628786c6144c0b8e6abf74a040890cd9410a5abe6e968eb55@%3Cdev.airflow.apache.org%3E)
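The figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes, as the Birthday Problem does, that each scheduler is working on one DAG chosen uniformly and independently at any given moment; the function name `collision_probability` is only illustrative.

```python
def collision_probability(m, n):
    """Probability that at least two of n schedulers are on the same DAG,
    given m DAGs: 1 - m! / ((m - n)! * m^n).

    The product form below avoids computing large factorials.
    """
    if n > m:
        return 1.0  # more schedulers than DAGs: a collision is certain
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (m - i) / m
    return 1.0 - p_all_distinct


p_200 = collision_probability(200, 2)
p_300 = collision_probability(300, 2)
print(f"{p_200:.2%}")
print(f"{p_300:.2%}")
```

For 2 schedulers the expression reduces to 1/m, which gives 0.5% for 200 DAGs and about 0.33% for 300 DAGs. The same function can be used to see how quickly the probability grows as more schedulers are added.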

3. To avoid "correlation" between schedulers, we may want to consider randomly shuffling the list of DAG files before it is passed to the scheduler process (https://lists.apache.org/thread.html/e21d028944092b588295112acb9a3e203c4aea7fae50978f288c2af1@%3Cdev.airflow.apache.org%3E).
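The shuffling idea can be sketched as follows. The helper `shuffled_dag_files` and the example paths are hypothetical and do not correspond to existing Airflow functions; the point is only that each scheduler walks the same set of files in its own random order, so schedulers started at the same time are less likely to move through the DAGs in lockstep.

```python
import random


def shuffled_dag_files(dag_file_paths, rng=None):
    """Return a randomly ordered copy of the DAG file list.

    The input list is left untouched; pass a seeded random.Random
    instance as rng to get a reproducible order in tests.
    """
    if rng is None:
        rng = random.Random()
    shuffled = list(dag_file_paths)
    rng.shuffle(shuffled)
    return shuffled


# Illustrative file names only.
paths = [f"dags/dag_{i}.py" for i in range(5)]
order = shuffled_dag_files(paths)
```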

4. An important part of this AIP's scope is to test intensively whether running multiple schedulers causes any issues (after all the concerns above are addressed).