Goals
The goal is to reduce the cost of building Apache MXNet PRs by at least 50% by introducing an MXNet bot. The bot lets developers trigger builds only when their PRs are ready, and lets them trigger a build of a specific job. It also lets PR authors trigger CI themselves, which removes their dependence on Jenkins admins for re-triggering jobs.
Problem
- PR authors who aren’t Jenkins admins or MXNet committers can’t trigger CI.
- Automatically triggered CI runs are expensive and unnecessary.
Stakeholders
MXNet Developer community at large.
Specifically,
- PR Authors (developers who created a Pull Request)
- Jenkins Admins (people with Admin access to CI System)
- MXNet Committers
Design
How to use the Bot?
Detailed video: Apache MXNet CI Bot Demo (YouTube link)
An authorized user can comment on the PR with one of the following two commands:
- Trigger a specific job:
@mxnet-bot run ci [unix-gpu]
- Trigger all jobs:
@mxnet-bot run ci [all]
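As an illustration of how such a comment might be parsed, here is a minimal sketch. The regular expression, the function name `parse_jobs`, and the support for comma-separated job lists are assumptions made for this example; the actual bot may parse comments differently.

```python
import re

# Hypothetical pattern for "@mxnet-bot run ci [<jobs>]"; not taken from the bot's code.
COMMAND_RE = re.compile(r"@mxnet-bot\s+run\s+ci\s+\[([^\]]*)\]", re.IGNORECASE)


def parse_jobs(comment_body):
    """Return the job names requested in a PR comment, or [] if there is no command."""
    match = COMMAND_RE.search(comment_body)
    if match is None:
        return []
    # Assumption: several jobs may be listed separated by commas.
    # The document itself only shows a single name or "all".
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


print(parse_jobs("@mxnet-bot run ci [unix-gpu]"))
print(parse_jobs("@mxnet-bot run ci [all]"))
```

A comment without the command yields an empty list, so the caller can ignore ordinary review comments.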
Code
The work-in-progress PR is tracked here: https://github.com/MXNetEdge/mxnet-infrastructure/pull/72 (private repo)
Implementation Details
Workflow
Deployment Framework : Serverless
- The GitHub webhook pushes incoming comments to Lambda 1
- Lambda 1 pushes the request to SQS
- SQS then forwards the request to Lambda 2
  - The request payload contains info about the PR author and the comment author
- Lambda 2 retrieves the token for accessing the GitHub API from Secrets Manager
- Call the GitHub API to retrieve the list of mxnet-committers from teams [using a committer’s credentials]
- Parse the jobs from the input comment
- Verify that the comment author is authorized [PR author/committer]
  - If verified: trigger CI
    - If this is the first time the job is triggered, the branch needs to be scanned
      - Using the trigger job token (from Secrets Manager), scan the specific job for the new branch with the Multi-branch Scan WebHook Trigger plugin.
    - Else, trigger the build using the Jenkins API
  - Else, comment that access is unauthorized
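The authorization and trigger decision in the steps above can be sketched as pure functions, independent of AWS and Jenkins. This is only an illustration: the names `is_authorized` and `plan_actions`, the action labels, and the `scanned_jobs` set (jobs whose PR branch Jenkins already knows about) are hypothetical and not taken from the bot's code.

```python
def is_authorized(comment_author, pr_author, committers):
    """True if the commenter is the PR author or an MXNet committer."""
    # GitHub logins are case-insensitive, so compare in lower case.
    author = comment_author.lower()
    return author == pr_author.lower() or author in {c.lower() for c in committers}


def plan_actions(comment_author, pr_author, committers, jobs, scanned_jobs):
    """Return a list of (action, job) tuples describing what Lambda 2 would do.

    scanned_jobs is a hypothetical set of jobs for which the PR branch
    has already been scanned by Jenkins.
    """
    if not is_authorized(comment_author, pr_author, committers):
        return [("comment_unauthorized", None)]
    actions = []
    for job in jobs:
        if job in scanned_jobs:
            # Branch already known to Jenkins: trigger the build via the Jenkins API.
            actions.append(("build", job))
        else:
            # First trigger for this job: the branch has to be scanned first.
            actions.append(("scan", job))
    return actions


print(plan_actions("alice", "alice", [], ["unix-gpu"], set()))
```

Keeping the decision separate from the network calls makes it easy to test the authorization rule without touching GitHub or Jenkins.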
What changes in the existing system?
- Disable the existing GitHub webhook
  - apache/incubator-mxnet GitHub account
- Add a new GitHub webhook
  - Points to the API Gateway POST endpoint
- Create the required secrets in the test account
- Add the Multi-branch Scan WebHook Trigger plugin and configure tokens for 8 jobs
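For orientation, the sketch below builds the URL that a per-job token would be sent to. The `/multibranch-webhook-trigger/invoke?token=...` path is the endpoint as described in the plugin's documentation to the best of my knowledge; treat it as an assumption and verify it against the installed plugin version. The function name, the Jenkins host, and the token value are placeholders.

```python
from urllib.parse import urlencode


def scan_trigger_url(jenkins_url, token):
    """Build the scan-trigger URL for one job's token (endpoint path is an assumption)."""
    base = jenkins_url.rstrip("/")
    return f"{base}/multibranch-webhook-trigger/invoke?{urlencode({'token': token})}"


# Placeholder host and token, for illustration only.
print(scan_trigger_url("https://jenkins.example.org/", "unix-gpu-token"))
```

With one token per job, the bot can ask Jenkins to scan only the job named in the comment rather than all 8.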
Design Considerations
- Editing a previously made comment
  - Should Jenkins be retriggered?
  - Current: retrigger Jenkins
- Deleting a previously made comment
  - How should this be handled?
  - Current: ignore
- Error handling
  - If the request is bad (the CI trigger fails), how should it be handled gracefully?
  - Current: no retry; catch the exception.
- Triggering a currently running job
  - What if a job is already running and a verified user retriggers it?
  - Current: retrigger
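The current choices above could look roughly like the following sketch. It relies on the `action` field of GitHub's `issue_comment` webhook payload, which takes the values `created`, `edited`, and `deleted`; the function names and the `trigger` callable are hypothetical.

```python
def should_trigger(event):
    """Decide whether a GitHub issue_comment event should reach the CI trigger."""
    # New and edited comments both (re)trigger; deleted comments are ignored.
    return event.get("action") in ("created", "edited")


def trigger_safely(trigger, job):
    """Call trigger(job) once; on failure, log the error and do not retry."""
    try:
        trigger(job)
        return True
    except Exception as exc:
        # Current behaviour per the design notes: catch the exception, no retry.
        print(f"CI trigger for {job} failed: {exc}")
        return False


print(should_trigger({"action": "deleted"}))
```

Note that nothing here checks whether a job is already running, which matches the current decision to retrigger regardless.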
Upcoming Tasks
- Once approved, configure CI Prod Infra (AWS setup)
- With the permission of Apache Infra, configure GitHub webhooks for the public Apache MXNet repository.
Troubleshooting
- How to fix issues related to the bot?
  - First view the logs [Isengard → Prod/Dev account → CloudWatch Logs]
  - Check if the Lambda functions are getting triggered
    - /send_to_sqs
    - /jenkins
  - Check if the comment author is being verified
  - Check if the Jenkins job is being triggered [verify on the Jenkins URL if needed]
  - Check if the messages are being posted via the GitHub API
- Send To SQS Lambda isn’t triggered
  - Check if the GitHub webhook is working correctly
    - For the Prod account, this needs Apache Infra team access
    - For the Dev account, this was tested on my personal fork
  - Check the CW Logs for any errors
    - Dev account: CW → Log Groups → /aws/lambda/mxnet-ci-bot-test-send
- Jenkins Lambda isn’t triggered
- Problem within the Jenkins Lambda function
  - Check the CW Logs for any errors
    - Dev account: CW → Log Groups → /aws/lambda/mxnet-ci-bot-test-jenkins
- For issues related to Jenkins-GitHub communication, refer to: Troubleshooting