Goals

The goal is to reduce the cost of building Apache MXNet PRs by at least 50% with the help of an MXNet bot. The bot lets developers trigger builds only when their PRs are ready, and also lets them trigger a build for a specific job. It solves the problem of PR Authors being unable to trigger CI and removes their dependence on Jenkins Admins for re-triggering jobs.

Problem

  1. PR Authors who aren’t Jenkins Admins or MXNet Committers can’t trigger CI.
  2. The automated CI trigger is expensive and unnecessary.

Stakeholders

The MXNet developer community at large, specifically:

  1. PR Authors (developers who created a Pull Request)
  2. Jenkins Admins (people with Admin access to CI System)
  3. MXNet Committers

Design

How to use the Bot?

Detailed Video : Apache MXNet CI Bot Demo (Youtube Link)

An authorized user can comment on the PR with one of the following two commands:

  1. Trigger a specific job : @mxnet-bot run ci [unix-gpu]
  2. Trigger all jobs : @mxnet-bot run ci [all]
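
The two commands above could be recognized with a small parser such as the sketch below. This is only an illustration, not the bot's actual code: the function name parse_jobs, the regular expression, and the support for several comma-separated jobs inside the brackets are assumptions.

```python
import re

# Hypothetical pattern inferred from the two commands shown above.
COMMAND_RE = re.compile(r"@mxnet-bot\s+run\s+ci\s+\[([^\]]*)\]", re.IGNORECASE)


def parse_jobs(comment_body):
    """Return the job names requested in a comment, or [] if it is not a command."""
    match = COMMAND_RE.search(comment_body)
    if match is None:
        return []
    # Assumption: several jobs may be listed, separated by commas or spaces.
    return [job for job in re.split(r"[,\s]+", match.group(1).strip()) if job]


print(parse_jobs("@mxnet-bot run ci [unix-gpu]"))
print(parse_jobs("@mxnet-bot run ci [all]"))
print(parse_jobs("Looks good to me"))
```

A comment that does not contain the command yields an empty list, so the caller can ignore ordinary review comments without treating them as errors.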

Code

Work in Progress PR tracked here : https://github.com/MXNetEdge/mxnet-infrastructure/pull/72 (private repo)

Implementation Details

Workflow

Deployment Framework : Serverless

  1. The GitHub webhook pushes incoming comments to Lambda 1
  2. Lambda 1 pushes the request to SQS
  3. SQS then forwards the request to Lambda 2
  4. The request payload contains info about the PR author and the comment author
  5. Lambda 2 retrieves the token for accessing the GitHub API from Secrets Manager
  6. Call the GitHub API to retrieve the list of mxnet-committers from teams [using committer’s credentials]
  7. Parse jobs from the input comment
  8. Verify that the comment author is authorized [PR author/committer]
    1. If verified : Trigger CI
      1. If this is the first time the job is triggered, the branch needs to be scanned
        1. Using the trigger job token (from Secrets Manager), scan the specific job for the new branch with the Multi-branch Scan WebHook Trigger plugin.
      2. Else, trigger the build using the Jenkins API
    2. Else, comment that access is unauthorized
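
The decision logic in steps 4 to 8 can be sketched as pure functions, leaving out the actual AWS, GitHub, and Jenkins calls. This is a minimal sketch under stated assumptions, not the bot's implementation: the function names, the action labels, and the scanned_jobs set are hypothetical. The payload fields used (issue.user.login, comment.user.login, comment.body) are those of GitHub's issue_comment webhook event.

```python
def extract_request(payload):
    """Pick the fields the bot needs out of a GitHub issue_comment payload."""
    return {
        "pr_number": payload["issue"]["number"],
        "pr_author": payload["issue"]["user"]["login"],
        "comment_author": payload["comment"]["user"]["login"],
        "comment_body": payload["comment"]["body"],
    }


def is_authorized(comment_author, pr_author, committers):
    """The PR author and any committer may trigger CI (logins compared case-insensitively)."""
    author = comment_author.lower()
    return author == pr_author.lower() or author in {c.lower() for c in committers}


def plan_actions(request, committers, jobs, scanned_jobs):
    """Return (action, job) tuples describing what the bot would do.

    scanned_jobs is a hypothetical set of jobs whose multi-branch scan has
    already picked up this PR branch; anything else is a first-time trigger.
    """
    if not is_authorized(request["comment_author"], request["pr_author"], committers):
        return [("comment_unauthorized", None)]
    actions = []
    for job in jobs:
        if job not in scanned_jobs:
            actions.append(("scan_branch", job))    # first trigger: scan the branch
        else:
            actions.append(("trigger_build", job))  # later triggers: Jenkins API
    return actions


payload = {
    "issue": {"number": 1, "user": {"login": "alice"}},
    "comment": {"user": {"login": "alice"}, "body": "@mxnet-bot run ci [unix-gpu]"},
}
print(plan_actions(extract_request(payload), {"bob"}, ["unix-gpu"], set()))
```

Keeping the decision separate from the side effects makes the authorization and first-trigger rules easy to unit test without Jenkins or GitHub access.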

What changes to the existing system?

  1. Disable the existing GitHub WebHook
    1. apache/incubator-mxnet Github account
  2. Add a new GitHub WebHook
    1. Points to the API Gateway POST endpoint
  3. Create required Secrets in test account
  4. Add the Multi-branch Scan WebHook Trigger plugin and configure tokens for 8 jobs

Design Considerations

  • Editing a previously made comment
    • Should the bot retrigger Jenkins?
      • Current : Retrigger Jenkins
  • Deleting a previously made comment
    • How should it be handled?
      • Current : Ignore
  • Error handling
    • If the request is bad (CI trigger fails), how should it be handled gracefully?
      • Current : No retry. Catch the exception.
  • Triggering a currently running job
    • If a job is already running and a verified user retriggers it
      • Current : Retrigger
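
The current choices above can be expressed in a few lines. The sketch below is illustrative only: should_trigger and trigger_once are hypothetical names, and trigger_fn stands in for whatever call actually starts the Jenkins job. It relies on the action field of GitHub's issue_comment event, which is "created", "edited", or "deleted".

```python
def should_trigger(action):
    """New and edited comments (re)trigger CI; deleted comments are ignored."""
    return action in ("created", "edited")


def trigger_once(trigger_fn, job):
    """Call trigger_fn(job) exactly once; on failure, catch the exception and do not retry.

    Returns (True, None) on success or (False, error_message) on failure.
    """
    try:
        trigger_fn(job)
        return True, None
    except Exception as exc:  # broad on purpose: any failure is reported, never retried
        return False, str(exc)


def always_fails(job):
    # Stand-in for a Jenkins call that rejects the request.
    raise RuntimeError("bad request for " + job)


print(should_trigger("edited"))
print(trigger_once(always_fails, "unix-gpu"))
```

Returning the error message instead of raising would let the caller post it back to the PR as a comment, which is one way to fail gracefully without retrying.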

Upcoming Tasks

  1. Once approved, configure CI Prod Infra (AWS setup)
  2. With the permission of Apache Infra, configure GitHub WebHooks for the public Apache MXNet repository.

Troubleshooting

  1. How to fix issues related to the bot?
    1. First view the logs [Isengard → Prod/Dev account → CloudWatch Logs]
    2. Check if Lambda functions are getting triggered
      1. /send_to_sqs
      2. /jenkins
    3. Check if comment author is being verified
    4. Check if Jenkins Job is being triggered [Verify on Jenkins URL if needed]
    5. Check if the messages are being posted via the GitHub API
  2. Send To SQS Lambda isn’t triggered
    1. Check if the GitHub WebHook is working correctly
      1. For the Prod account, Apache Infra team access is needed
      2. For the Dev account, this was tested on my personal fork
        1. https://github.com/ChaiBapchya/incubator-mxnet/settings/hooks/187804995
    2. Check CW Logs for any errors
      1. Dev account : CW → Log Groups → /aws/lambda/mxnet-ci-bot-test-send
  3. Jenkins Lambda isn’t triggered
  4. Problem within Jenkins Lambda function
    1. Check CW Logs for any errors
      1. Dev account : CW → Log Groups → /aws/lambda/mxnet-ci-bot-test-jenkins
  5. For issues related to Jenkins - GitHub communication, refer to : Troubleshooting