This document is archived and replaced by: AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance

Looking at the original AIP-15, the author proposes to use locking to enable the use of multiple schedulers, which might introduce unnecessary complexity. Because of this, I propose to split the scheduler into a MainScheduler and a DagScheduler. This makes it possible to have multiple DagSchedulers running, submitted by the MainScheduler.


Benefits

Each DAG will get its own scheduler on demand. With multiple DAGs, multiple DagSchedulers can run at the same time. This reduces the load on the MainScheduler significantly, so the MainScheduler will no longer be the blocking process in Airflow.

Processes

MainScheduler

This process should always run, like the current scheduler.

This process can run in a master/failover setup, or this can be solved within Kubernetes as discussed in AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance.

Tasks

  • DAG syncing to database
    • This might also be separated into another daemonized process. Whether it is included in the MainScheduler should be configurable.
  • Submitting DagScheduler
    • Only the DagModel and DagRun tables should be required here. The DAG object itself is not required.
    • Conditions to submit a DagScheduler:
      • If there is a running DagRun
      • If a new DagRun should be scheduled
      • No DagScheduler is already active for this DAG
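The submit conditions above can be sketched as a small decision function. This is a hypothetical, simplified sketch: the names `DagRunRow` and `should_submit_dag_scheduler` are illustrative, and the real implementation would query the DagModel and DagRun tables in the metadata database rather than operate on in-memory objects.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a row from the DagRun table.
@dataclass
class DagRunRow:
    dag_id: str
    state: str  # e.g. "running", "success", "failed"

def should_submit_dag_scheduler(dag_id, dag_runs, new_run_due, active_dag_schedulers):
    """Decide whether the MainScheduler should submit a DagScheduler for a DAG.

    Conditions from the proposal:
      - there is a running DagRun, OR a new DagRun should be scheduled, AND
      - no DagScheduler is already active for this DAG.
    """
    if dag_id in active_dag_schedulers:
        return False  # one DagScheduler per DAG at a time
    has_running_run = any(r.dag_id == dag_id and r.state == "running" for r in dag_runs)
    return has_running_run or new_run_due

# Example: one running DagRun and no active DagScheduler -> submit one.
runs = [DagRunRow("example_dag", "running")]
print(should_submit_dag_scheduler("example_dag", runs,
                                  new_run_due=False,
                                  active_dag_schedulers=set()))  # True
```

Note that only DagRun state and the set of active DagSchedulers are consulted, matching the proposal's point that the DAG object itself is not needed in the MainScheduler.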

DagScheduler

This process is only executed on demand and is executed by an Airflow executor, for example on a Celery worker.

This DagScheduler should execute only a single DAG and a single cycle. When a cycle is done, the MainScheduler should schedule a new DagScheduler.

Tasks

  • Create a DagRun when required
  • Check running TaskInstances and submit new TaskInstances when required
  • Set the status of a DagRun to success or failed when required
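The three tasks above can be sketched as a single-cycle function. This is a hypothetical illustration, not the real Airflow API: task state is modeled as a plain dict, and "submitting" a TaskInstance is reduced to marking it queued. The key design point it shows is that the process does one pass and then returns, leaving it to the MainScheduler to submit the next cycle.

```python
# Hypothetical sketch of one DagScheduler cycle for a single DAG.
# dag_run is modeled as {"state": ..., "tasks": {task_id: state}} where a
# task state of None means the TaskInstance has not been submitted yet.

def run_single_cycle(dag_run):
    tasks = dag_run["tasks"]

    # 1. Create a DagRun when required (skipped here: we assume one exists).

    # 2. Check running TaskInstances and submit new ones that have no state yet.
    for task_id, state in tasks.items():
        if state is None:
            tasks[task_id] = "queued"  # hand over to the executor

    # 3. Set the DagRun state to success or failed when all tasks are done;
    #    otherwise leave it running for the next cycle.
    states = set(tasks.values())
    if states <= {"success"}:
        dag_run["state"] = "success"
    elif "failed" in states:
        dag_run["state"] = "failed"
    else:
        dag_run["state"] = "running"

    # One cycle only: return instead of looping, so the MainScheduler decides
    # whether to submit a new DagScheduler.
    return dag_run

run = run_single_cycle({"state": "running", "tasks": {"a": None, "b": "success"}})
print(run["tasks"]["a"], run["state"])  # queued running
```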

8 Comments

  1. So in large clusters it is not unheard of to have 100s or even 1000s of DAGs running. This approach doesn't seem like it would scale to that.

  2. Ash Berlin-Taylor I also have experience scheduling 5000+ DAGs at the same time, on a Celery cluster of 10 nodes with 100 slots per node. There the single scheduler process took so long that it needed 2-3+ hours to touch all DAGs, while the slots were mostly idle because the scheduling time was longer than the job durations.

    In the current proposal the scheduler load will be divided over these 1000 slots, which means scheduling will be done in a matter of minutes.

  3. Sorry for bugging (or perhaps misusing this area), but is there any progress on this? I've searched through the current open issues on `high availability` and only found old ones.


    There are third-party projects that attempt to bring a working solution, such as https://github.com/teamclairvoyant/airflow-scheduler-failover-controller, but they are lacking tests, and I think this is something that could be implemented by the core team in the package itself. The changelog https://github.com/apache/airflow/blob/master/CHANGELOG.txt doesn't seem to contain any information on new features or improvements in that direction.

  4. I don't think anyone is actively working on it yet. We (Astronomer) plan to work on it in the first half of 2020.

    Are you interested in working on it, by chance?

    1. Hello Ry!

      Thanks for responding.

      Being an Airflow newbie, I'm not quite convinced I can be much of a help here. But if you have a specific task, perhaps I could take a look at it (in my free time).

    2. Did you (Astronomer) start working on it? Is there any roadmap (time-wise)?

  5. Artur Barseghyan For now, it consists of small tasks aimed at improving the Scheduler's performance. Here is the link:


    https://issues.apache.org/jira/browse/AIRFLOW-5929?jql=text%20~%20%22performance%22%20and%20project%20%3D%20%22Apache%20Airflow%22%20

    We do not have any large activities in this area. For now, we're trying to improve executor/worker performance.

  6. As usual - happy to take part in it as well. I might also ask our customers if they are interested (we have a few customers that want to contribute or are already contributing to Airflow, and we are looking at even more performance improvements in that area).