Deep Learning Based GitHub Label Bot

Problem
Goal

Part I - Email Bot
Part II - Label Bot
Part III - Determine labels automatically

Approach

Part I - Email Bot
Part II - Label Bot
Part III - Determine labels automatically

Technical Challenges
References

1. Problem

The Incubator-MXNet repo currently has many issues. Labeling them helps contributors who know a particular area pick up the relevant issues and help users. However, all issues are currently labelled manually, which is time consuming, and each time maintainers need to @-mention a committer to add the labels. This bot will help automate and simplify the issue labeling process.

2. Goal

Part I - Email Bot
Create weekly email to dev@mxnet.incubator.apache.org:
(Instead of sending emails directly to dev@, another option is to create a separate email alias and ask people who are interested in the weekly reports to join.)

Count of newly opened issues and closed issues in last 7 days
Average and worst response time for all new issues
List of non-responded new issues with links
List of non-responded issues outside SLA

Part II - Label Bot
Create a bot to add labels for incubator-mxnet issues

Create weekly email to internal team members:

Count of newly opened issues and closed issues in last 7 days
List of non-labelled issues
List of non-responded issues
Pie chart with top 10 labels for all issues
Pie chart with top 10 labels for newly opened issues in last 7 days (add "unlabelled" as a segment)
A line/bar graph with week over week statistics of the number of issues closed and the number of issues opened

Generate a spreadsheet with detailed information of non-labelled issues. Every team member should have access to view and fill in labels to it.
Read filled-in labels and add labels to corresponding issue.

Part III - Determine labels automatically from GitHub issues:
- Identify the programming language the issue relates to (ex: Python, C/C++, Scala)
- Multi-label classification

3. Approach

Part I - Email Bot
An Amazon CloudWatch event will trigger a Lambda function at a fixed frequency (ex: 9am every Monday). Once the Lambda function is executed, the issue report will be generated and sent to the mailing list. Figure 1 shows the bot design and Figure 2 shows demo email content.

Figure 1 Email Bot Design

Figure 2 Demo Email Content
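The statistics in the weekly report could be computed along the following lines. This is a minimal sketch, not the bot's actual code: it assumes the issues have already been fetched into dictionaries carrying GitHub's `created_at`, `closed_at` and `html_url` fields, plus a hypothetical `first_response_at` field that the bot would have to derive from issue comments. The 2-day SLA is likewise an assumed value.

```python
from datetime import datetime, timedelta, timezone

def parse_time(value):
    """Parse a GitHub-style UTC timestamp such as '2018-06-04T09:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def weekly_summary(issues, now, sla=timedelta(days=2)):
    """Summarize issue activity for the 7 days before `now`.

    `first_response_at` is a hypothetical field (not provided by GitHub);
    the default SLA of 2 days is an assumption for illustration.
    """
    week_ago = now - timedelta(days=7)
    opened = [i for i in issues if parse_time(i["created_at"]) >= week_ago]
    closed = [i for i in issues
              if i.get("closed_at") and parse_time(i["closed_at"]) >= week_ago]
    # Response time in hours for new issues that already received a response.
    hours = [
        (parse_time(i["first_response_at"]) - parse_time(i["created_at"])).total_seconds() / 3600
        for i in opened if i.get("first_response_at")
    ]
    # New issues without any response yet.
    unanswered = [i["html_url"] for i in opened if not i.get("first_response_at")]
    # Open issues of any age that have waited longer than the SLA without a response.
    outside_sla = [
        i["html_url"] for i in issues
        if not i.get("first_response_at") and not i.get("closed_at")
        and now - parse_time(i["created_at"]) > sla
    ]
    return {
        "opened": len(opened),
        "closed": len(closed),
        "avg_response_hours": sum(hours) / len(hours) if hours else None,
        "worst_response_hours": max(hours) if hours else None,
        "unanswered": unanswered,
        "outside_sla": outside_sla,
    }
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the summary reproducible when the report is re-run or tested.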

Part II - Label Bot
Amazon CloudWatch event (a) will trigger Lambda function (a) at 9am every Monday. At that time, Lambda function (a) will generate an email and write the non-labelled issues' data into a Google sheet. Every team member has access to view the sheet and fill in labels. 12 hours later, a second Lambda function, Lambda function (b), will be executed to read the filled-in labels and add them to the corresponding issues. This bot should have restricted permissions to avoid unexpected operations. Figure 3 shows the bot design, Figure 4 shows the demo email content and Figure 5 shows the demo Google sheet content.

Figure 3 Label Bot Design

Figure 4 Demo Email Content (updating)

Figure 5 Demo Google Sheet Content
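The second Lambda function's job of turning filled-in sheet rows into label updates might look roughly like this. The sketch assumes the sheet has already been read into dictionaries with hypothetical `issue_number` and `labels` columns (labels comma-separated), and that the repository lives at `apache/incubator-mxnet`. It only builds the request for GitHub's "add labels to an issue" REST endpoint and does not send it. Filtering against an allow-list is one possible way to limit what the bot can do, not a described requirement.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"
REPO = "apache/incubator-mxnet"  # assumed owner/repository name

def parse_sheet_rows(rows, allowed_labels):
    """Map issue number -> labels, keeping only labels on the allow-list.

    The column names 'issue_number' and 'labels' are hypothetical.
    """
    result = {}
    for row in rows:
        labels = [name.strip() for name in row.get("labels", "").split(",")]
        labels = [name for name in labels if name in allowed_labels]
        if labels:
            result[int(row["issue_number"])] = labels
    return result

def build_label_request(issue_number, labels, token):
    """Build (but do not send) a POST request that adds labels to one issue."""
    url = f"{API_ROOT}/repos/{REPO}/issues/{issue_number}/labels"
    body = json.dumps({"labels": labels}).encode("utf-8")
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Sending the request would be a call to `urllib.request.urlopen(request)`. Keeping request construction separate makes it possible to review or log exactly what the bot is about to change.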

Part III - Determine labels automatically
Each issue can be assigned multiple labels, so this is a multi-label classification problem, where each instance has a set of target labels rather than a single one. Multi-label classification problems are very common in the real world, for example in audio categorization, image categorization and bioinformatics. Our project mainly focuses on text categorization, because labels are learned from the issue title and issue description.

Steps to achieve it

Step 1: Retrieve Data
Extract data from GitHub issues into JSON format.

Step 2: Data Cleaning
Data cleaning is important for keeping the valuable information (such as keywords) and reducing the noise.
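As an illustration, a cleaning pass over an issue's title and description might look like the sketch below. The specific choices here (dropping fenced code blocks and URLs, lowercasing, keeping `+` and `#` so that tokens such as `c++` survive, and the very short stop-word list) are assumptions made for the example, not decisions taken in this design.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "and", "when", "it"}

# Markdown code-fence marker, built from single backticks.
FENCE = "`" * 3

def clean_issue_text(title, body):
    """Return a list of lowercase tokens from an issue title and body."""
    text = title + "\n" + body
    # Drop fenced code blocks, which are usually logs or snippets.
    text = re.sub(re.escape(FENCE) + r".*?" + re.escape(FENCE), " ", text, flags=re.DOTALL)
    # Drop URLs.
    text = re.sub(r"https?://\S+", " ", text)
    # Lowercase and replace everything except letters, digits, '+' and '#'.
    text = re.sub(r"[^a-z0-9+#]+", " ", text.lower())
    return [token for token in text.split() if token not in STOP_WORDS]
```

Whether code blocks should be removed is debatable: stack traces can carry strong hints about the right label, so a real pipeline might extract identifiers from them instead.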

Step 3: Vector Representation
Classifiers and learning algorithms cannot directly process the text documents in their original form. During a preprocessing step, the documents are converted into a more manageable representation. Typically, the documents are represented by feature vectors.

Bag-of-words model uses all words in a document as the features, and thus the dimension of the feature space is equal to the number of distinct words across all of the documents.
Binary, in which the feature weight is one if the corresponding word is present in the document and zero otherwise.
TF-IDF scheme gives the word w in the document d the weight
TF-IDF Weight(w, d) = TermFreq(w, d) * log(N / DocFreq(w))
where N is the total number of documents and DocFreq(w) is the number of documents that contain w.
Word2Vec, a two-layer neural net that learns vector representations of words from text.
Doc2Vec, an extension of Word2Vec that learns to correlate labels and words, producing vectors for whole documents.
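The binary and TF-IDF weightings above can be written out directly. The sketch below follows the TF-IDF formula as stated, with N taken as the number of documents and the natural logarithm assumed. The function names are illustrative only.

```python
import math
from collections import Counter

def binary_vector(tokens, vocabulary):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokens)
    return [int(word in present) for word in vocabulary]

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # DocFreq(w): number of documents that contain w at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)  # TermFreq(w, d)
        weights.append({
            word: tf * math.log(n_docs / doc_freq[word])
            for word, tf in term_freq.items()
        })
    return weights
```

A word that occurs in every document gets weight zero, because log(N / N) = 0. This is how TF-IDF suppresses terms that do not help tell issues apart.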

Step 4: Feature Extraction
Map the original high-dimensional data onto a lower-dimensional space by removing non-informative terms (irrelevant words) from documents. This improves classification effectiveness and reduces computational complexity.
Feature selection methods:

Document Frequency Threshold (DF) uses the number of documents in which a feature occurs as a measure of its relevance. Features whose document frequency falls below a threshold are removed.
Information Gain(IG) measures the number of bits of information obtained for the prediction of categories by the presence or absence in a document of the feature f.
Chi-square measures the maximal strength of dependence between the feature and the categories.
Mutual Information(MI)
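Two of these criteria can be combined in a short sketch: a document frequency threshold first, then ranking the surviving terms by their maximal chi-square score over the labels. It assumes the standard 2x2 contingency-table form of the chi-square statistic. The `min_df` and `top_k` parameters are hypothetical tuning knobs, not values fixed by this design.

```python
from collections import Counter

def chi_square(term, label, docs):
    """Chi-square statistic for one term and one label.

    docs: list of (token_set, label_set) pairs.
    """
    a = b = c = d = 0
    for tokens, labels in docs:
        has_term, has_label = term in tokens, label in labels
        if has_term and has_label:
            a += 1  # term present, label present
        elif has_term:
            b += 1  # term present, label absent
        elif has_label:
            c += 1  # term absent, label present
        else:
            d += 1  # term absent, label absent
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def select_features(docs, labels, min_df=2, top_k=100):
    """Apply a DF threshold, then keep the top_k terms by maximal chi-square."""
    doc_freq = Counter()
    for tokens, _ in docs:
        doc_freq.update(tokens)
    candidates = [term for term, df in doc_freq.items() if df >= min_df]
    score = {t: max(chi_square(t, label, docs) for label in labels) for t in candidates}
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(candidates, key=lambda t: (-score[t], t))[:top_k]
```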

Step 5: Multi-Label Classification
Two different approaches can be used for multi-label classification. Problem transformation methods transform the multi-label problem into one or more single-label (binary or multi-class) classification problems. Algorithm adaptation methods adapt single-label algorithms so that they can be applied directly to multi-label data. We will pick the top 10 labels for classification at the beginning.

Problem Transformation

Binary Relevance
This is the simplest technique, which treats each label as a separate binary classification problem.
Classifier Chains
The first classifier is trained just on the input data, and each subsequent classifier is trained on the input data plus the label outputs of all the previous classifiers in the chain.
Label Powerset
Transform the problem into a multi-class problem in which one multi-class classifier is trained on all unique label combinations found in the training data.
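The three transformations differ mainly in how they reshape the training targets, which the following sketch makes concrete. It only builds the targets and inputs and leaves the choice of base classifier open. For classifier chains it assumes the true labels of earlier links are used as extra features at training time, which is a common choice but an assumption here. All function names are illustrative.

```python
def binary_relevance_targets(label_sets, labels):
    """Binary Relevance: one independent 0/1 target vector per label."""
    return {label: [int(label in s) for s in label_sets] for label in labels}

def chain_inputs(features, label_sets, chain):
    """Classifier Chains: extend each feature row with the earlier labels in the chain.

    Assumption: true labels of earlier links are used during training.
    """
    inputs = {}
    for position, label in enumerate(chain):
        earlier = chain[:position]
        inputs[label] = [
            row + [int(prev in s) for prev in earlier]
            for row, s in zip(features, label_sets)
        ]
    return inputs

def label_powerset_targets(label_sets):
    """Label Powerset: map each distinct label combination to one class id."""
    class_ids = {}
    targets = []
    for s in label_sets:
        key = tuple(sorted(s))
        targets.append(class_ids.setdefault(key, len(class_ids)))
    return targets, class_ids
```

With 10 labels, Label Powerset can in principle produce up to 2^10 classes. With limited training data, many combinations will have few or no examples, which is one reason to compare it against the other two transformations.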

Algorithm adaptation
Manual: rule-based
Automatic:

Vector space model based

Prototype-based
K-nearest neighbor
Decision-tree
Neural Networks
Support Vector Machines

Probabilistic or generative model based

Naive Bayes classifier
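To make one of the listed classifiers concrete, here is a small multinomial Naive Bayes with add-one smoothing, trained once per label. Note that this pairs Naive Bayes with the Binary Relevance transformation rather than adapting the algorithm itself. The pairing, the smoothing constant and all class and function names are assumptions of the sketch.

```python
import math
from collections import Counter

class BinaryNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing for a single yes/no label."""

    def fit(self, docs, targets):
        self.vocab = {word for tokens in docs for word in tokens}
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(targets)
        for tokens, y in zip(docs, targets):
            self.word_counts[y].update(tokens)
        return self

    def log_score(self, tokens, y):
        """Unnormalized log posterior of class y given the tokens."""
        n_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[y] + 1) / (n_docs + 2))  # smoothed prior
        total = sum(self.word_counts[y].values()) + len(self.vocab)
        for word in tokens:
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[y][word] + 1) / total)
        return score

    def predict(self, tokens):
        return int(self.log_score(tokens, 1) > self.log_score(tokens, 0))

def fit_per_label(docs, label_sets, labels):
    """Train one binary classifier per label."""
    return {
        label: BinaryNaiveBayes().fit(docs, [int(label in s) for s in label_sets])
        for label in labels
    }

def predict_labels(models, tokens):
    """Return every label whose classifier votes yes, in sorted order."""
    return sorted(label for label, model in models.items() if model.predict(tokens))
```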

4. Technical Challenges

Restrict permissions of this bot to avoid unexpected operations.
Training data is limited.
