Table of Contents

  • Problem

  • Goal
    • Part I - Email Bot
    • Part II - Label Bot
    • Part III - Determine labels automatically
  • Approach
    • Part I - Email Bot
    • Part II - Label Bot
    • Part III - Determine labels automatically
  • Technical Challenges
  • References

This is the initial design of the ML-based GitHub bot.

1. Problem

There are currently many issues on the Incubator-MXNet repo. Labeling issues helps contributors who know a particular area pick up relevant issues and help users. However, issues are currently all labelled manually, which is time-consuming, and each time maintainers need to @-mention a committer to add labels. This bot will help automate and simplify the issue labeling process.

2. Goal

  • Part I - Email Bot
    Create weekly email to dev@mxnet.incubator.apache.org:
    (Instead of sending emails directly to dev@, another option is to create a separate email alias and ask people who are interested in weekly reports to join.)
    Send daily GitHub issue reports to the mailing list:
    • Count of newly opened issues and closed issues in last 7 days
    • Average and worst response time for all new issues
    • List of non-responded new issues with links
    • List of non-responded issues outside SLA
  • Part II - Label Bot
    Create a bot to add labels for incubator-mxnet issues
    • Predictions of unlabeled issues
    • Create weekly email to internal team members:
    • Count of newly opened issues and closed issues in last 7 days
    • List of non-labelled issues
    • List of non-responded issues
    • Pie chart with top 10 labels for all issues
    • Pie chart with top 10 labels for newly opened issues in last 7 days. (Add "unlabelled" as a segment)
    • A line/bar graph with week over week statistics of the number of issues closed and the number of issues opened
    • Generate a spreadsheet with detailed information of non-labelled issues. Every team member should have access to view and fill in labels to it.
    • Read filled-in labels and add labels to corresponding issue.
  • Part II - Predict labels automatically for unlabeled issues
    • Build a web server that can respond to GET/POST requests and maintain itself:
      • Predict labels: once it receives a GET/POST request with an issue ID, it sends predictions back.
      • Self-maintenance: it retrains the Machine Learning models every 24 hours.
  • Part III - Label Bot:
    This bot serves to help non-committers add labels to GitHub issues.
    • Recognize people's commands, e.g. "@mxnet-label-bot, please add labels: [A, B]".
    • Be able to add labels for incubator-mxnet issues using a committer's credentials.
    Part III - Determine labels automatically from GitHub issues:
    • Identify the programming language the issue relates to (e.g. Python, C/C++, Scala)
    • Multi-label classification

3. Approach

  • Part I - Email Bot

    An Amazon CloudWatch event triggers a Lambda function at a fixed frequency (e.g. 9am every Monday). Once the Lambda function runs, it generates the issue report and sends it to the mailing list. Figure 1 shows the email bot architecture and Figure 2 shows demo email content.

Figure 1 Email Bot Design

Figure 2 Demo Email Content
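
The statistics listed for the email report could be computed along the following lines. This is only a minimal sketch: the record keys (number, created_at, closed_at, first_response_at), the function name weekly_report and the 2-day SLA default are assumptions made for illustration, not part of the design.

```python
from datetime import datetime, timedelta


def weekly_report(issues, now, sla=timedelta(days=2)):
    """Summarize issues for the report.

    `issues` is a list of dicts with hypothetical keys:
    number, created_at, closed_at, first_response_at (datetime or None).
    The SLA default is an assumed value for illustration only.
    """
    week_ago = now - timedelta(days=7)
    new = [i for i in issues if i["created_at"] >= week_ago]
    closed = [i for i in issues if i["closed_at"] and i["closed_at"] >= week_ago]
    # A response time only exists for new issues that received a reply.
    times = [i["first_response_at"] - i["created_at"]
             for i in new if i["first_response_at"]]
    return {
        "opened": len(new),
        "closed": len(closed),
        "avg_response": sum(times, timedelta()) / len(times) if times else None,
        "worst_response": max(times) if times else None,
        "unanswered": [i["number"] for i in new if not i["first_response_at"]],
        # Any issue, new or old, with no response for longer than the SLA.
        "outside_sla": [i["number"] for i in issues
                        if not i["first_response_at"]
                        and now - i["created_at"] > sla],
    }
```

The Lambda function would format this dictionary into the email body; how the issue records are fetched is left out here.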


  • Part II - Label Bot

    Amazon CloudWatch event (a) triggers Lambda function (a) at 9am every Monday. Lambda function (a) then generates an email and writes the data for non-labelled issues into a Google sheet. Every team member has access to view the sheet and fill in labels. 12 hours later, a second Lambda function (b) runs and adds the filled-in labels to the corresponding issues. This bot should have restricted permissions to avoid unexpected operations. Figure 3 shows the bot design, Figure 4 shows the demo email content and Figure 5 shows the demo Google sheet content.

Figure 3 Label Bot Design

Sample Issue Report

...

  • Predict labels automatically for unlabeled issues

    This part will use Machine Learning models to predict labels and send them by email. Figure 3 shows the architecture.

Figure 3 Lambda with Elastic Beanstalk 


  • Part III - Label Bot

    This label bot helps non-committers add labels. A contributor can comment "@mxnet-label-bot, please add labels: [A, B]" on an issue. The bot then recognizes the notification and adds the requested labels.

    All code runs in a Lambda function. A CloudWatch event triggers this Lambda function every 5 minutes. Once the Lambda function runs, it reads valid notifications, extracts the label information from the comments, and then adds the labels. Figure 5 shows the architecture.

Figure 5 Label Bot Design
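
Extracting the labels from a comment of the form shown above might look roughly like this. It is a sketch only: the regular expression and the helper name extract_labels are assumptions, and the real bot may accept a different or looser syntax.

```python
import re

BOT_NAME = "@mxnet-label-bot"

# Matches e.g. "@mxnet-label-bot, please add labels: [A, B]".
# The exact accepted wording is an assumption based on the example above.
PATTERN = re.compile(
    re.escape(BOT_NAME) + r",?\s*please add labels\s*:\s*\[([^\]]*)\]",
    re.IGNORECASE,
)


def extract_labels(comment):
    """Return the list of labels requested in a comment, or [] if none."""
    match = PATTERN.search(comment)
    if match is None:
        return []
    return [label.strip() for label in match.group(1).split(",") if label.strip()]
```

Comments that do not address the bot yield an empty list, so the Lambda function can skip them without further checks.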

4. Multi-label classification

Each instance can be assigned multiple categories, so this type of problem is known as a multi-label classification problem, where we have a set of target labels. Multi-label classification problems are very common in the real world, for example in audio categorization, image categorization, and bioinformatics. Our project mainly focuses on text categorization because labels are learned from the issue title and issue description.

Steps to achieve it

Step 1: Retrieve Data
Extract data from GitHub issues into JSON format.

Step 2: Data Cleaning
Data cleaning is important because it keeps the valuable information, such as keywords, and reduces the noise.
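
A cleaning step could be sketched as follows. The specific rules (lowercasing, dropping URLs, a tiny stop-word list) and the name clean_text are illustrative assumptions; the design does not specify which cleaning rules are applied.

```python
import re

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "i", "it", "when"}


def clean_text(text):
    """Lowercase the text, drop URLs and stop words, and return keyword tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs add noise, not keywords
    # Keep tokens such as "c++" or "c#" intact, since language names matter here.
    tokens = re.findall(r"[a-z][a-z0-9+#_]*", text)
    return [token for token in tokens if token not in STOP_WORDS]
```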

Step 3: Vector Representation
Classifiers and learning algorithms cannot directly process the text documents in their original form. During a preprocessing step, the documents are converted into a more manageable representation. Typically, the documents are represented by feature vectors.
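
As a rough illustration of such a feature vector, the sketch below uses a plain bag-of-words count. The choice of bag-of-words and the helper names are assumptions; the design does not state which representation is used, and weighted schemes such as TF-IDF are a common alternative.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map every distinct token in the tokenized documents to a column index."""
    tokens = sorted({token for doc in docs for token in doc})
    return {token: index for index, token in enumerate(tokens)}


def vectorize(tokens, vocab):
    """Return a count vector for one tokenized document, ordered by vocab index."""
    counts = Counter(tokens)
    # vocab was built in index order, so iterating it yields columns in order.
    return [counts.get(token, 0) for token in vocab]
```

Tokens that are not in the vocabulary are ignored, which is how unseen words in a new issue would be handled at prediction time under this scheme.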

...

  • Problem Transformation
    • Binary Relevance
      This is the simplest technique, which treats each label as a separate binary classification problem.
    • Classifier Chains
      The first classifier is trained only on the input data, and each subsequent classifier is trained on the input space plus the predictions of all previous classifiers in the chain.
    • Label Powerset
      Transform the problem into a multi-class problem in which one multi-class classifier is trained on all unique label combinations found in the training data.
  • Algorithm adaptation
    Manual:
    rule-based
    Automatic:
    • Vector space model based
      • Prototype-based
      • K-nearest neighbor
      • Decision-tree
      • Neural Networks
      • Support Vector Machines
    • Probabilistic or generative model based
      • Naive Bayes classifier
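
To make two of the problem transformation techniques above concrete, the sketch below shows how the training targets would be derived for Binary Relevance and Label Powerset. The function names are hypothetical, and only the target transformation is shown, not the classifiers themselves.

```python
def binary_relevance_targets(label_sets):
    """One 0/1 target column per label: each label becomes its own binary problem."""
    labels = sorted(set().union(*label_sets))
    return {label: [1 if label in labels_of_issue else 0
                    for labels_of_issue in label_sets]
            for label in labels}


def label_powerset_targets(label_sets):
    """One class id per unique label combination: a single multi-class problem."""
    classes = {}
    targets = []
    for labels_of_issue in label_sets:
        key = frozenset(labels_of_issue)
        # Assign the next free class id the first time a combination is seen.
        targets.append(classes.setdefault(key, len(classes)))
    return targets, classes
```

With limited training data, Label Powerset can end up with many classes that have only one or two examples, which is one reason to compare it against Binary Relevance.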

...

5. Technical Challenges

  • Restrict permissions of this bot to avoid unexpected operations.
  • Training data is limited.

...

6. Reference

7. Design Upgrade of Label Bot 

Issue:

The current label bot can only label unlabelled issues. Key functionality still to be implemented includes re-labelling already labelled issues and streamlining the process of updating and removing labels. The current implementation is also fairly inefficient in how it automatically labels issues and pull requests. It is based on a pull model: every 5 minutes we trigger the bot to pull all issues and pull requests, which we then label, and we retrain our model every 24 hours. In addition, GitHub limits users to 5000 HTTP requests per hour, so we want to make as few requests as possible.


Proposed Design Decision:

The bot would be more efficient if it were redesigned around a push model: as soon as an issue or pull request is made to the repository, we trigger the label bot to label it appropriately. The Lambda bot will also be able to update and delete labels, not only add them.

Implementation:

Using a GitHub webhook, we can trigger the bot when an issue or pull request is made to the repository (we specify which events to subscribe to). The trigger is then handled by the Lambda function, which decides on the appropriate action to take on a GitHub label.

AWS Services: API Gateway receives the POST notification from the GitHub webhook and passes it to a Lambda function, which forwards it to SQS. SQS manages the multiple messages that are received and delivers them to our Lambda bot. The Lambda bot reads the payload received from SQS and takes the appropriate action on a GitHub label.
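
The last step of this pipeline, reading the webhook payloads out of an SQS-triggered Lambda event, could be sketched as below. It assumes the first Lambda function forwards the raw GitHub webhook JSON unchanged as the SQS message body; the function name parse_sqs_event is hypothetical.

```python
import json


def parse_sqs_event(event):
    """Return (kind, number, action) for each GitHub payload in an SQS Lambda event.

    Assumes each SQS record body is the unmodified webhook JSON.
    """
    items = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        if "pull_request" in payload:
            kind, number = "pull_request", payload["pull_request"]["number"]
        elif "issue" in payload:
            kind, number = "issue", payload["issue"]["number"]
        else:
            continue  # not an event the label bot acts on
        items.append((kind, number, payload.get("action")))
    return items
```

The bot would then decide, per item, whether to predict labels (for example on an "opened" action) or to process a command from a comment.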

Current Proposed Design Implementation:

Image Added



Usage:

Add functionality adds the specified labels to the list of labels:

@mxnet-label-bot add [label1, label2]

Remove functionality removes labels specified from the list of labels:

@mxnet-label-bot remove [label1, label2]

Update functionality updates the labels of the issue to only the labels specified in the list:

@mxnet-label-bot update [label1, label2]
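
The semantics of the three commands can be summarized with set operations. The following is a sketch under the assumption that labels are treated as a set of names; the regular expression and the name apply_command are illustrative, not the bot's actual code.

```python
import re

# Matches "@mxnet-label-bot add [label1, label2]" and the remove/update forms.
COMMAND = re.compile(
    r"@mxnet-label-bot\s+(add|remove|update)\s*\[([^\]]*)\]", re.IGNORECASE
)


def apply_command(current_labels, comment):
    """Return the issue's label set after applying the command in `comment`."""
    current = set(current_labels)
    match = COMMAND.search(comment)
    if match is None:
        return current  # no command found: labels stay unchanged
    action = match.group(1).lower()
    requested = {name.strip() for name in match.group(2).split(",") if name.strip()}
    if action == "add":
        return current | requested
    if action == "remove":
        return current - requested
    return requested  # update: keep only the labels listed
```

Computing the final label set first means the bot can compare it with the current labels and skip the GitHub request when nothing changes, which helps with the request limit mentioned above.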