
This page describes the current status of GitHub Actions for Apache Software Foundation projects. This page is maintained by the community.

Summary of the GitHub Actions Status

TL;DR summary. Updated: 31.01.2021

If you are a Committer/PMC member of an ASF project and are considering migrating to GitHub Actions, this is the current status:

  • You should hold off on switching to GitHub Actions until the performance and security problems we are currently experiencing are solved at the ASF level.
  • If you want to use GitHub Actions, you can consider using your own self-hosted runner, but only if you can afford to build and maintain self-hosted infrastructure (this is not an easy task due to security limitations of the official GitHub Actions runners).
  • If you decide to use GitHub Actions, you need to be very careful to mitigate some of the security problems you will have if you follow the GA setup examples. Extra hardening is required in your workflows if you want to protect your project from 3rd-party dependencies gaining WRITE access to your repository.


Overall status of GitHub Actions among the Apache Software Foundation projects

There are already quite a few projects using GitHub Actions. However, be aware that there are (still unresolved) problems with the performance of the ASF-wide GitHub Actions Enterprise account, and there are some potential security implications you have to be aware of when starting to use GitHub Actions.

There are a few discussions about this that you can read at builds@apache.org.

The issues with GitHub Actions revolve around Billing, Performance/Scalability, and Security. Billing is not a problem on its own, but it impacts the Performance/Scalability issue.

Detailed status

Billing

All public projects, resources, images, etc. on GitHub are generally free (not only Apache Software Foundation ones). No problem with that. You will not incur any costs; as long as you do not create any "private" resources, there are no billing consequences.
However, there is an important caveat: the more projects use GitHub Actions, the more they all compete for a shared job queue (more on performance below). The Apache Software Foundation has an "Enterprise" organisation status in GitHub.

Performance/Scalability

Status

Currently the "native" GitHub Runner Performance/Scalability is far below the requirements of the ASF projects using it (due to the 180-parallel-job limit shared by all ASF projects).

If you are going to use GA, you WILL experience severe and unbearable performance issues, especially if your users and builds are predominantly in the EU/US time zones (this is the case for most of the biggest projects using GitHub Actions).
Tobiasz Kędzierski, a contributor to Apache Airflow, built a very crude dashboard showing the use of GA workflows (workflows, not jobs), which clearly shows the extent of the problem.

This chart is imperfect - we currently do not have "per-job" details because GitHub is unable to give us the data. We asked for it at the last meeting organised by INFRA, where we had people from GitHub Actions (14th of January), but as far as I am aware GitHub has not yet made the data available in a usable way (neither as raw data nor as dashboards). The meeting preparation notes are here (no meeting minutes available yet): ASF Build Infra Meeting 14th of Jan 2021. The chart is built using the GitHub API, and due to API limitations (call quota) we cannot drill down to the job level, but it is good enough to show the queues and the numbers we are talking about. During the EU day/US morning we now regularly have 500-600 workflows in progress at a time from ASF projects (2 months ago it was 200-300, and then it was pretty OK). We have run our workflows in Apache Airflow for ~8 months without problems, but the last 1.5 months have been really problematic, and I'd discourage using GA until those problems are solved. From the dashboard we built, the projects that seem to use GA the most are: pulsar, spark, incubator-pinot, dubbo-samples, camel-k-runtime, netbeans, beam, airflow, incubator-daffodil, commons-text.
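For reference, the kind of data point such a dashboard is built on can be collected with a short script against the GitHub API. This is a minimal sketch of the approach, not the actual dashboard code; the repository names and the token placeholder in the usage comment are made-up examples:

```python
import json
import urllib.request
from collections import Counter


def count_by_status(runs):
    """Count workflow runs per status ('queued', 'in_progress', 'completed')."""
    return Counter(run["status"] for run in runs)


def fetch_runs(repo, token):
    """Fetch the 100 most recent workflow runs for one repository."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/actions/runs?per_page=100",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["workflow_runs"]


# Usage (requires a real token; repo names are just examples):
#   totals = Counter()
#   for repo in ["apache/airflow", "apache/beam"]:
#       totals += count_by_status(fetch_runs(repo, token))
```

Note the `per_page=100` cap: drilling deeper than this per call is exactly where the API quota limitations mentioned above start to bite.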

The reason for the issue

The main reason for the issue is the limit on the job queue that the ASF (like any organisation) has. The ASF has an agreement (which is great on its own) with GitHub that the ASF organisation has "Enterprise Organisation" status (for free - this is GitHub's donation to the ASF).
This means that ASF projects have 180 slots allocated in the GitHub Actions jobs queue, and no more than 180 GA jobs can run in parallel. This is far too small for the current use. It has already caused a number of problems in the past when too many jobs for too many projects were started at the same time. During the January weekdays, in the EU day/US morning, we have consistently experienced 5-6 hour queues for the jobs. This basically means that when you submit a PR, you have to wait 5-6 hours before it even STARTS running. This is unbearable and not sustainable. We've implemented a multitude of optimizations in Airflow, and we encouraged and helped other projects (such as Apache Beam, Apache SuperSet, Apache SkyWalking) to optimize their workflows - including a few custom actions (Cancel Workflow Runs, for example).
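A back-of-envelope calculation shows how those queue times arise. All numbers below are illustrative assumptions (average job duration and jobs per workflow are not measured values), but they land in the same range as the waits we observe:

```python
def expected_wait_minutes(jobs_waiting, slots, avg_job_minutes):
    """Rough steady-state estimate: with `slots` parallel runners and
    jobs taking `avg_job_minutes` each, the queue drains at
    slots / avg_job_minutes jobs per minute."""
    drain_rate_per_minute = slots / avg_job_minutes
    return jobs_waiting / drain_rate_per_minute


# Assumed numbers: 500 workflows of ~4 jobs each waiting, 180 parallel
# slots, ~30 minutes per job on average.
wait = expected_wait_minutes(jobs_waiting=500 * 4, slots=180, avg_job_minutes=30)
print(f"~{wait / 60:.1f} hours in the queue")  # roughly the 5-6 hour range
```

This also illustrates why a modest increase of the 180-slot limit only helps briefly: the wait scales linearly with the backlog, and the backlog keeps growing as more projects adopt GA.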

Unfortunately, there are no tools or mechanisms that could give ASF Infra the possibility of limiting the use of the actions per project, and until this is solved, any approach to limit the use of actions for each project is destined to fail. However much effort we put into optimizing workflows in one project, the savings are very quickly consumed by other projects using more (for example, Apache Airflow optimized its workflows and decreased its usage by roughly 70%). There is also an ongoing effort from other projects to decrease the strain, for example:

  • An issue and design doc where the maintainers of Pulsar discuss ways of decreasing the strain (with some help from the Apache Airflow team, who have already implemented the savings).
  • Kamil Bregula from Apache Airflow opened a number of PRs to implement the "Cancel Workflow Runs" action (in Pulsar, Spark, and Pinot, for example).
  • The Apache SuperSet PR where they implemented their custom "cancel duplicates" Python script.
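The general idea behind these "cancel duplicates" tools can be sketched as a small selection function over the run list returned by the GitHub API. This is a simplified illustration of the approach, not the code from any of the PRs above; the selected run IDs would then be cancelled via `POST /repos/{owner}/{repo}/actions/runs/{run_id}/cancel`:

```python
from itertools import groupby


def runs_to_cancel(runs):
    """Given workflow runs (dicts with 'id', 'head_branch', 'status',
    'created_at'), select every active run that is NOT the newest one
    on its branch - those are the duplicates worth cancelling."""
    active = [r for r in runs if r["status"] in ("queued", "in_progress")]
    # Sort by branch first so groupby sees each branch as one group,
    # then by creation time so the newest run is last in each group.
    active.sort(key=lambda r: (r["head_branch"], r["created_at"]))
    stale = []
    for _, group in groupby(active, key=lambda r: r["head_branch"]):
        group = list(group)
        stale.extend(group[:-1])  # keep only the newest run per branch
    return stale
```

The payoff is direct: every cancelled duplicate frees one of the 180 shared slots for another ASF project.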

To be perfectly clear - this is not a complaint, just a statement of fact - those projects have no tool or mechanism to limit and monitor the usage of their workflows, and there is no mechanism for the ASF to enforce any limits per project. At the last "Build Infra" meeting on 14th of Jan, developer advocates from GitHub mentioned that there might be a way to increase the queue. The ASF - rightfully so - cannot really pay for the increase (this is totally understandable when there are no tools to manage and control it). I am not aware of the results of this yet. Such an increase would only help for a short while anyway. This is the same story as with motorways - if you have traffic jams and you widen the roads, it only takes a short time for the traffic to reach capacity again as people start using the roads more.

The potential solution

One of the solutions that might be sustainable is to deploy self-hosted runners, if your project has some infrastructure money (from stakeholders/sponsors) it can spend. We have money in Airflow (from the AWS open-source initiative and Astronomer; Google has also promised to donate some GCP time). This is, however, (currently) inherently insecure. With the "PRs from forks" approach of Apache projects, the current model of GitHub Runners is not secure by default. In fact, there is a recommendation from GitHub to NEVER use self-hosted runners for public repositories. The Apache Airflow team forked the GitHub Runner and is working on hardening the self-hosted runners, and we set up auto-scaling runners in our donated infrastructure (Ash Berlin-Taylor, a PMC member of Airflow, is working on it), but this is a big project on its own. While Airflow has had some early successes and a working POC, securing and testing it has already taken a few weeks, even though it is done together with a DevOps person to make it robust and secure. Ash Berlin-Taylor shared his early thoughts in the Self-hosted GitHub Runners document. This is a very rough description of what needs to be done; it has a lot of "security" disclaimers, lacks full context, and needs some updates after further learnings.

The solution Apache Airflow is introducing is a bit brittle. It is based on what the Rust team has done, and it relies on dynamically patching the runner from GitHub as soon as a new version is released, because GitHub has an aggressive policy of disabling old runners pretty much immediately after a new version comes out. It is thus prone to disruption of service if the patch does not apply cleanly. At the meeting on 14th of January, we learned that GitHub is not planning to improve the security of the self-hosted runners any time soon (we should not expect anything before mid-2021). So we are on our own for quite a while.
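For completeness: pointing a job at self-hosted runners is only a matter of labels in the workflow file - the hard part is everything described above. A hypothetical fragment (the label names and script path are made-up examples, not Airflow's actual configuration):

```yaml
jobs:
  tests:
    # Labels must match those registered with your own runners;
    # "airflow-runner" here is an invented example label.
    runs-on: [self-hosted, linux, airflow-runner]
    steps:
      - uses: actions/checkout@v2
      - run: ./scripts/ci/run_tests.sh   # placeholder script name
```

Any fork can open a PR that changes what runs in these steps, which is exactly why GitHub recommends against self-hosted runners on public repositories without the hardening work described above.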

Security

There are a number of security problems you have to be aware of. 3rd-party actions and 3rd-party dependencies are a huge security risk if not used appropriately (basically, if you use them as the examples suggest, you are open to easy exploitation by the action authors). If you do not add actions "securely", you are exposed to uncontrolled "write" modifications to your repository (!) by 3rd-party action owners AND (as we've learned recently) by 3rd-party dependencies you install in your build pipeline. One of these problems caused INFRA to disable the "direct" use of 3rd-party actions at the organisation level (see the discussion), but there are many more risks to be aware of. There are two critical security vulnerability reports opened by Jarek Potiuk on 30th of December against GitHub Actions - both of them triaged and awaiting action on the GitHub side. GitHub Security Lab, which in December encouraged users to post their experiences, is engaged as well. All of those issues can be mitigated (Apache Airflow has implemented all the mitigations), but the mitigations are not what most projects do.

Mitigations

If you decide to use GitHub Actions, these are the recommendations (there are varying opinions on the use of submodules, though):

  • NEVER use 3rd-party actions directly in your workflows - use the "submodule" pattern. An example PR Tobiasz Kędzierski opened in SuperSet shows how this can be done. ASF INFRA has also allow-listed some of the popular actions out there - including my "cancel workflow" action - but there is no public list of those available. The nice things about submodules are that they do not bring the action code into your repo, they pin to commit hashes of the actions, and they integrate well with the GitHub review process, so committers have a bigger chance to review the changes before they are merged. By using submodules, you are automatically following the GitHub recommendations for security hardening of 3rd-party actions.
  • ALWAYS add "persist-credentials: false" to all your checkout actions. This is not the default, and the default is a huge security risk: if you have any kind of "master" builds enabled, it leaves your repository (and hundreds of thousands of others) open to modification (!) by 3rd-party dependencies. This is a "hidden" feature of the checkout action that is not at all obvious, but it leaves write access to your repository wide open to any code that you install during the build process. This is a very dangerous default.
  • NEVER run code that comes with "forked" PRs directly in your workflows. There are certain exotic (but useful) workflow types that are dangerous - for example "workflow_run", which you might need in order to cancel duplicate workflows. Those workflows run with "master" code by default, but sometimes you might need to check out the incoming PR code in them. The host environment can have access (in various ways) to a "WRITE" GITHUB_TOKEN that has permission to modify your repository WITHOUT RESTRICTION OR NOTIFICATION. NEVER run code checked out from the PR in your host environment. If you need to run it, run it in a Docker container to provide isolation from the host environment, so that the "write" access does not leak to users who prepared such a PR from their fork.
  • NEVER install and run 3rd-party dependencies on the host in your build workflow. Again, there are ways those dependencies can obtain the "WRITE" GITHUB_TOKEN and change anything in your repository without your knowledge. The very common "schedule" and "push" workflows are especially prone to such abuse. Those run with "WRITE" access and, again, there are ways for the actions and code that run in your workflow to obtain the GitHub token. If you execute any 3rd-party code, run it in Docker containers to keep it isolated from your "build" host environment, so that the "write" access does not leak to those 3rd parties.
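Put together, a "hardened" workflow applying the recommendations above could look roughly like this sketch (the vendored action path, container image, and script name are hypothetical examples, not a definitive setup):

```yaml
name: CI
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
        with:
          persist-credentials: false   # do not leave a WRITE token on disk
      # A 3rd-party action vendored as a git submodule pinned to a commit,
      # instead of "uses: someone/some-action@master"
      - uses: ./.github/actions/some-vendored-action
      # Untrusted PR code and 3rd-party dependencies run inside a
      # container, isolated from the host and its GITHUB_TOKEN
      - run: |
          docker run --rm -v "$PWD:/workspace" -w /workspace \
            python:3.8-slim ./scripts/ci/run_tests.sh
```

The point of the Docker step is the isolation boundary: dependencies installed and executed inside the container cannot read credentials or tokens that live only on the host runner.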



