This page describes the current status of the GitHub Actions for the Apache Software Foundation projects. This page is maintained by the community.

Summary of the GitHub Actions Status

TL;DR; summary Updated: 31.01.2020

If you are a Committer/PMC member of an ASF project and you are thinking about migrating to GitHub Actions, this is the current status:

  • You should hold off on switching to GitHub Actions until the performance and security problems that we are currently experiencing are solved at the ASF level.
  • If you want to use GitHub Actions, you can consider using your own self-hosted runner, but only if you can afford to build and maintain your own self-hosted infrastructure (this is not an easy task due to security limitations of the official GitHub Actions runners).
  • If you decide to use GitHub Actions, you need to be very careful to mitigate some of the security problems you might have if you follow the GA setup using the existing examples. Extra hardening is required in your workflows if you want to protect your project from 3rd-party dependencies having WRITE access to your project.


Overall status of GitHub Actions

...

for Apache Software Foundation projects

There are already quite a few projects using GitHub Actions. However, be aware that there are (still unresolved) problems with the performance of the ASF-wide GitHub Actions Enterprise account, and there are some potential security implications that you have to be aware of when starting to use GitHub Actions.

There are a few discussions that you can read at builds@apache.org about these issues:

The issues with GitHub Actions revolve around Billing, Performance/Scalability and Security. Billing is not a problem on its own, but it impacts the Performance/Scalability issue.

...

All public projects, resources, images, etc. on GitHub are generally free (not only Apache Software Foundation ones). No problem with that. You will not incur any costs as long as you do not create any "private" resources, so there is no way you can cause billing consequences.
However, there is an important caveat: the more projects use GitHub Actions, the more they all compete for a shared job queue (more on performance below). The Apache Software Foundation has "Enterprise" organisation status in GitHub.

Performance/Scalability

Status

Currently the "native" GitHub Runner Performance/Scalability is far below the requirements of ASF projects using it (due to the 180 parallel job limit all ASF projects have). 

If you are going to use GA you WILL experience severe and unacceptable performance issues, especially if your users and builds are predominantly in the EU/US time zones (this is the case for most of the biggest projects using GitHub Actions).
Tobiasz Kędzierski, a contributor to Apache Airflow, built a very crude dashboard showing the use of GA workflows (workflows, not jobs), which clearly shows the extent of the problem.

...

This chart is imperfect: we currently do not have "per-job" details because GitHub is unable to give us the data. We asked for it at the meeting organised by INFRA on 14 January 2021, with people from GitHub Actions present, but as far as I am aware we still do not have the data from GitHub in a usable form (neither as raw data nor as dashboards). The meeting notes and preparation are here (no meeting notes yet available though): ASF Build Infra Meeting 14th of Jan 2021.

The chart is built using the GitHub API, and due to API limitations (API call quota) we cannot drill down to the job level. But it's good enough to show the queues and numbers we are talking about.
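As an illustration of how workflow-level numbers can be collected within a small API quota, here is a minimal sketch. It is not the dashboard's actual code. It assumes the GitHub REST API "list workflow runs for a repository" endpoint with its `status` filter and `total_count` response field; the helper names (`runs_url`, `fetch_json`, `count_runs`) and the repository list are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterable

API_ROOT = "https://api.github.com"


def runs_url(owner: str, repo: str, status: str) -> str:
    """Build the 'list workflow runs' URL, filtered by run status."""
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?status={status}&per_page=1"


def fetch_json(url: str) -> dict:
    """Fetch one API page anonymously (unauthenticated calls have a small quota)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_runs(repos: Iterable[str], status: str,
               fetch: Callable[[str], dict] = fetch_json) -> dict:
    """Return {repo: number of workflow runs currently in the given status}."""
    counts = {}
    for repo in repos:
        page = fetch(runs_url("apache", repo, status))
        # 'total_count' covers all matching runs, so one tiny page per repo
        # is enough; drilling down to jobs would cost one more call per run.
        counts[repo] = page["total_count"]
    return counts
```

Calling `count_runs(["airflow", "pulsar"], "queued")` costs one request per repository, which is why workflow-level counts fit in the quota while per-job counts do not.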

Regularly, during the EU day/US morning, we now have 500-600 workflows in progress at a time from ASF projects (2 months ago it was 200-300, and that was pretty OK). We ran our workflows in Apache Airflow for ~8 months without problems, but the last 1.5 months have been really problematic and I'd discourage using GA until those problems are solved.

According to the dashboard we've built, the projects that seem to use GA most are: pulsar, spark, incubator-pinot, dubbo-samples, camel-k-runtime, netbeans, beam, airflow, incubator-daffodil, and commons-text.

The reason for the issue

The main reason for the issue is the limit on the job queue that the ASF (like any organisation) has. The ASF has an agreement with GitHub (which is great on its own) that gives the ASF organisation "Enterprise Organisation" status (for free: this is GitHub's donation to the ASF).

This means that ASF projects have 180 slots allocated in the GitHub Actions job queue, and no more than 180 GA jobs can run in parallel. This is far too small for the current demand. It has already caused a number of problems in the past, when too many jobs for too many projects were started at the same time. In the weeks of January 2021, on weekdays during the EU day/US morning, we consistently experienced 5-6 hour queues for the jobs. This basically means that when you submit a PR, you have to wait 5-6 hours before it even STARTS running. This is unbearable and not sustainable.
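To see why a fixed number of slots turns into multi-hour waits, here is a back-of-the-envelope sketch using a simple fluid queue model. Only the 180-slot limit comes from this page; the job duration and submission rate are hypothetical numbers chosen for illustration, not measurements.

```python
def backlog_after(hours: float, arrivals_per_hour: float,
                  slots: int, job_hours: float) -> float:
    """Jobs waiting after `hours` of sustained load, in a simple fluid model."""
    capacity_per_hour = slots / job_hours  # jobs the runners can finish per hour
    return max(0.0, (arrivals_per_hour - capacity_per_hour) * hours)


def wait_hours(backlog: float, slots: int, job_hours: float) -> float:
    """Time a newly submitted job waits for the backlog ahead of it to drain."""
    return backlog / (slots / job_hours)


# Hypothetical load: 180 slots (from the text), 30-minute jobs,
# 600 jobs submitted per hour, sustained for 4 hours.
SLOTS = 180
backlog = backlog_after(hours=4, arrivals_per_hour=600, slots=SLOTS, job_hours=0.5)
print(f"backlog: {backlog:.0f} jobs, wait: {wait_hours(backlog, SLOTS, 0.5):.1f} h")
```

In this model the wait keeps growing for as long as submissions exceed capacity, so even a moderate overload sustained through the working day produces queues of several hours.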

We've implemented a multitude of optimizations in Airflow, and we encouraged and helped other projects (such as Apache Beam, Apache SuperSet, and Apache SkyWalking) to optimize their workflows, including with a few custom actions (Cancel Workflow Runs, for example).

Unfortunately, there are no tools or mechanisms that would let ASF Infra limit the use of Actions per project, and until this is solved any approach to limiting the use of Actions for each project is destined to fail. Much of the effort we put into optimizing workflows in one project has been very quickly consumed by other projects using more (for example, Apache Airflow optimized its workflows and decreased its use by roughly 70%). There is also an ongoing effort from other projects to decrease the strain, for example:

  • An issue and design doc where maintainers of Pulsar discuss ways of decreasing the strain (with some help from the Apache Airflow team, who have already implemented the savings).
  • Kamil Bregula from Apache Airflow opened a number of PRs to implement a "Cancel Workflow Runs" action (in Spark and Pinot, for example).
  • The Apache SuperSet PR where they implemented their custom "cancel duplicates" Python script.
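The idea behind "cancel duplicates" can be sketched as follows. This is not the code of the Cancel Workflow Runs action or of the SuperSet script; it is a minimal illustration that assumes run dictionaries in a simplified shape loosely modelled on GitHub workflow-run objects (`id`, `workflow_id`, `head_branch`, `status`, `run_number`), plus a hypothetical flattened `head_repo` field that tells forks apart.

```python
from collections import defaultdict
from typing import Iterable


def runs_to_cancel(runs: Iterable[dict]) -> list:
    """Return ids of unfinished runs superseded by a newer run.

    Two runs are duplicates when they belong to the same workflow and come
    from the same branch of the same (possibly forked) repository.
    """
    groups = defaultdict(list)
    for run in runs:
        if run["status"] in ("queued", "in_progress"):
            key = (run["workflow_id"], run["head_repo"], run["head_branch"])
            groups[key].append(run)

    stale = []
    for group in groups.values():
        group.sort(key=lambda r: r["run_number"])
        stale.extend(r["id"] for r in group[:-1])  # keep only the newest run
    return sorted(stale)
```

Each returned id would then be cancelled through the API. The saving comes from contributors who push several times in a row: only the latest push keeps its slot in the queue.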

To be perfectly clear: this is not a complaint, just a statement of the facts. Those projects have no tool or mechanism to limit and monitor the usage of their workflows, and there is no mechanism for the ASF to enforce any limits per project.

At the last "Build Infra" meeting on 14 January 2021, developer advocates from GitHub mentioned that there might be a way to increase the queue. The ASF, rightfully so, cannot really pay for the increase (this is totally understandable if they have no tools to manage and control it). I am not aware of the results of this yet. Such an increase will only help for a short while, though. This is the same story as with motorways: if you have traffic jams and you widen the roads, it only takes a short time for the traffic to reach capacity again as people start using the roads more.

...

A potential solution

One of the solutions that might be sustainable is to deploy self-hosted runners, if your project has some infrastructure money (from stakeholders/sponsors) it can spend. We have money in Airflow (from the AWS Open-Source initiative and Astronomer; also, Google promised to donate some GCP time). This is, however, (currently) inherently insecure. With the "PRs from forks" approach of Apache projects, the current model of GitHub Runners is not secure by default. In fact, there is a recommendation from GitHub to NEVER use self-hosted runners for public repositories. The Apache Airflow team forked the Runner and we are working on hardening the self-hosted runners from GitHub, and we set up auto-scaling runners in our donated infrastructure (Ash Berlin-Taylor, a PMC member of Airflow, is working on it), but this is a big project on its own.

While Airflow has had some early successes and has a POC working, it is already taking a few weeks to secure and test it, even though the work is done together with a DevOps person to make it robust and secure. Ash Berlin-Taylor shared his early thoughts in the Self-hosted GitHub Runners document. This is a very rough description of what needs to be done; it has a lot of "security" disclaimers and lacks full context (and needs some updates after further learnings).

The solution Apache Airflow introduced is a bit brittle. It is based on what the Rust team has done, and it relies on dynamic patching of the runner from GitHub as soon as it is released, because GitHub has an aggressive policy of disabling old runners pretty much immediately after a new version is released. Thus it is prone to disruption of service if the patch does not apply cleanly. At the meeting on 14 January 2021 we learned that GitHub is not planning to improve the security of the self-hosted runners any time soon (we should certainly not expect anything until mid-2021). So we are on our own for quite a while.

...

There are a number of security problems you have to be aware of. 3rd-party Actions and 3rd-party dependencies are a huge security risk if not used appropriately (basically, if you use Actions as the examples suggest, you are open to easy exploitation by the Action authors). If you do not add the Actions "securely", you are open to any kind of uncontrolled "write" modifications to your repository (!) by 3rd-party Action owners AND (as we've learned recently) by 3rd-party dependencies you install in your build pipeline. One of the problems caused INFRA to disable the "direct" use of 3rd-party Actions at the organisation level (see the discussion), but there are many more risks that you have to be aware of.

There are two critical security vulnerability reports about GitHub Actions, opened by Jarek Potiuk on 30 December 2020; both have been triaged and are awaiting action on the GitHub side. GitHub Security Lab, who in December encouraged users to post their experiences, is engaged as well. Those issues can all be mitigated (Apache Airflow has implemented all the mitigations), but the mitigations are not what most projects do.

...

If you decide to use GitHub Actions, these are the recommendations (there are varying opinions on the use of submodules, though):

  • NEVER use 3rd-party actions directly in your workflows - use the "submodule" pattern. Tobiasz Kędzierski opened an example PR in SuperSet showing how this could be done. Also, ASF INFRA allow-listed some of the popular Actions out there, including my "cancel workflow" action, but there is no public list of those available. The nice things about submodules are that they do not bring action code to your repo, that they link to commit hashes of the Actions, and that they integrate well with the GitHub review process, so that committers have a better chance to review the changes before they are merged. By using submodules, you are automatically following the GitHub recommendations for hardening of security for 3rd-party actions.
  • ALWAYS add "persist-credentials: false" to all your checkout actions. This is not done by default and is a huge security risk because it leaves your repository (and hundreds of thousands of others) open to modification by 3rd-party dependencies (!) if you have any kind of "master" builds enabled. This is a "hidden" feature of the checkout action that is not at all obvious, but it leaves write access to your repository wide open to any code that you install during the build process. This is a very dangerous default.
  • NEVER directly run code that might come with "forked" PRs in your workflows. There are certain exotic (but useful) workflows that are dangerous, for example the "workflow_run" workflows that you might need to cancel duplicate workflows. Those workflows by default run with "master" code, but sometimes you might need to check out the incoming PR code for those. The host environment can have access (in various ways) to the "WRITE" GITHUB_TOKEN that has permission to modify your repository WITHOUT RESTRICTION OR NOTIFICATION. NEVER run the code that is checked out from the PR in your host environment. If you need to, run it in a Docker container to provide isolation from the host environment, to avoid "write" access leaking to users who prepare such a PR from their fork.
  • NEVER install and run 3rd-party dependencies on the host of your build workflow. Again, there are ways those dependencies can obtain the "WRITE" GITHUB_TOKEN and change anything in your repository without your knowledge. There are very common "schedule" and "push" workflows that are especially prone to such abuse. Those run with "WRITE" access, and again, there are ways for these Actions and the code that runs in your workflow to obtain the GitHub token. If you execute any 3rd-party code, run it in Docker containers to keep it isolated from your "build" host environment, to avoid "write" access leaking to those 3rd parties.
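The first two recommendations can be checked mechanically. The sketch below is a rough, line-based audit of a workflow file, not a YAML parser and not an ASF or GitHub tool. It assumes that local paths ("./...", where submodule actions usually live) and the "actions/" organisation are acceptable, and that "with:" follows "uses:" within a step; the function name `audit_workflow` is hypothetical.

```python
import re

USES = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)")
PERSIST_OFF = re.compile(r"^\s*persist-credentials:\s*false\b")
STEP_START = re.compile(r"^\s*-\s")


def audit_workflow(text: str) -> list:
    """Return (line_number, message) findings for one workflow file."""
    findings = []
    lines = text.splitlines()
    for index, line in enumerate(lines):
        match = USES.match(line)
        if not match:
            continue
        action = match.group(1)
        # Assumption: only local (submodule) actions and GitHub's own
        # 'actions/' organisation count as acceptable here.
        if not action.startswith(("./", "actions/")):
            findings.append((index + 1, f"3rd-party action used directly: {action}"))
        if action.startswith("actions/checkout@"):
            # Collect the rest of this step: it ends where the next list item starts.
            step = []
            for following in lines[index + 1:]:
                if STEP_START.match(following):
                    break
                step.append(following)
            if not any(PERSIST_OFF.match(s) for s in step):
                findings.append((index + 1, "checkout without persist-credentials: false"))
    return findings
```

Running it over every file in `.github/workflows` gives a quick list of steps to review. It cannot judge the last two recommendations, which depend on what the workflow actually executes on the host.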