...

The main reason for the issue is the limit of the GitHub Actions job queue that the ASF (like any organisation) has. The ASF has an agreement (which is great on its own) with GitHub under which the ASF organisation is treated as an "Enterprise Organisation" (for free - this is GitHub's donation to the ASF).
This means that ASF projects have 180 slots allocated in the GitHub Actions job queue, and no more than 180 GA jobs can run in parallel. This is far too small for the current use. It has already caused a number of problems in the past when too many jobs for too many projects were started at the same time. During the weekdays in January, in the EU daytime / US morning, we consistently experienced 5-6 hour queues for the jobs. This basically means that when you submit a PR, you have to wait 5-6 hours before it even STARTS running. This is unbearable and not sustainable. We've implemented a multitude of optimizations in Airflow, and we encouraged and helped other projects (such as Apache Beam, Apache SuperSet, Apache SkyWalking) to optimize their workflows - including a few custom actions ("Cancel Workflow Runs", for example).

Unfortunately, there are no tools or mechanisms that would give ASF Infra the possibility of limiting the use of Actions per project, and until this is solved any approach to limiting each project's usage is destined to fail. However much effort we put into optimizing workflows in one project, the savings are very quickly consumed by other projects using more (for example, Apache Airflow optimized its workflows and decreased their usage by roughly 70%). There is also an ongoing effort from other projects to decrease the strain - for example:

  • An issue and design doc where maintainers of Pulsar discuss ways of decreasing the strain (with some help from the Apache Airflow team, who have already implemented the savings).
  • Kamil Bregula from Apache Airflow opened a number of PRs to implement the "Cancel Workflow Runs" action (in Pulsar, Spark, and Pinot, for example).
  • The Apache SuperSet PR where they implemented their custom "cancel duplicates" Python script.
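The "cancel duplicates" idea behind these efforts can be illustrated with a minimal sketch (this is a hypothetical illustration, not the actual code from any of the projects above): among the queued or in-progress workflow runs of a repository, keep only the newest run per branch and mark the older, superseded ones for cancellation.

```python
from datetime import datetime

def runs_to_cancel(runs):
    """Return ids of queued/in-progress workflow runs that are
    superseded by a newer run on the same branch.

    Each run is a dict with 'id', 'branch', 'status', and
    'created_at' (an ISO-8601 timestamp) - a simplified shape of
    what the GitHub Actions "list workflow runs" API returns.
    """
    active = [r for r in runs if r["status"] in ("queued", "in_progress")]
    # Find the newest active run per branch.
    newest = {}
    for r in active:
        ts = datetime.fromisoformat(r["created_at"])
        if r["branch"] not in newest or ts > newest[r["branch"]][0]:
            newest[r["branch"]] = (ts, r["id"])
    # Everything else on that branch is a duplicate worth cancelling.
    return [r["id"] for r in active if newest[r["branch"]][1] != r["id"]]
```

A real script would then cancel each returned run via the GitHub REST API (`POST /repos/{owner}/{repo}/actions/runs/{run_id}/cancel`), freeing queue slots for the newest run of each branch.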

To be perfectly clear - this is not a complaint, just a statement of fact - those projects have no tool or mechanism to limit and monitor the usage of their workflows, and there is no mechanism for the ASF to enforce any limits per project. At the last "Build Infra" meeting, on the 14th of January, developer advocates from GitHub mentioned that there might be a way to increase the queue. The ASF - rightfully so - cannot really pay for the increase (this is totally understandable if they have no tools to manage and control it). I am not aware of the results of this yet. Such an increase will only help for a short while, though. This is the same story as with motorways - if you have traffic jams and you widen the roads, it only takes a short time for the traffic to reach capacity again as people start driving more.

...