Machine Learning Based GitHub Bot

This is the initial design of the ML-based GitHub bot.

1. Problem

The Incubator-MXNet repository currently has many open issues. Labeling them helps contributors who know a particular area pick up relevant issues and help users. However, all issues are currently labeled manually, which is time-consuming, and each time maintainers need to @-mention a committer to add the labels. This bot will help automate and simplify the issue labeling process.

2. Goal

Part I - Email Bot
Send daily GitHub issue reports to the mailing list:

Count of newly opened issues and closed issues in last 7 days
Average and worst response time for all new issues
List of non-responded new issues with links
List of non-responded issues outside SLA
Predictions of unlabeled issues
Pie chart with top 10 labels for all issues

Part II - Predict labels automatically for unlabeled issues
- Build a web server that can respond to GET/POST requests and maintain itself:
  - Predict labels: when it receives a GET/POST request with an issue ID, it sends predictions back.
  - Self-maintenance: it re-trains the machine learning models every 24 hours.
Part III - Label Bot:
This bot serves to help non-committers add labels to GitHub issues.
- Recognize people's commands, e.g. "@mxnet-label-bot, please add labels: [A, B]".
- Be able to add labels for incubator-mxnet issues using a committer's credentials.

3. Approach

Part I - Email Bot
An Amazon CloudWatch event triggers a Lambda function on a fixed schedule (e.g. 9am every Monday). When the Lambda function runs, it generates the issue report and sends it to the mailing list. Figure 1 shows the email bot architecture and Figure 2 shows a demo of the email content.

Figure 1 Email Bot Design

Figure 2 Demo Email Content
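
The figures in the report could be computed along the following lines. This is only a sketch: the issue record layout, the function name summarize_issues, and the 48-hour SLA are assumptions made for illustration, not values taken from the bot.

```python
from datetime import datetime, timedelta


def summarize_issues(issues, now, window_days=7, sla_hours=48):
    """Compute report figures from a list of issue dicts.

    Assumed (hypothetical) record layout:
      {"number": int, "created_at": datetime,
       "closed_at": datetime or None, "first_response_at": datetime or None}
    sla_hours is a placeholder; the real SLA is not stated in this document.
    """
    window_start = now - timedelta(days=window_days)
    opened = [i for i in issues if i["created_at"] >= window_start]
    closed = [i for i in issues
              if i["closed_at"] and i["closed_at"] >= window_start]

    # Response time in hours for new issues that already received a reply.
    hours = [
        (i["first_response_at"] - i["created_at"]).total_seconds() / 3600
        for i in opened if i["first_response_at"]
    ]
    # New issues nobody has replied to yet.
    unanswered = [i["number"] for i in opened if not i["first_response_at"]]
    # Open issues without a reply that have waited longer than the SLA.
    outside_sla = [
        i["number"] for i in issues
        if not i["first_response_at"] and not i["closed_at"]
        and (now - i["created_at"]).total_seconds() / 3600 > sla_hours
    ]
    return {
        "opened": len(opened),
        "closed": len(closed),
        "avg_response_hours": sum(hours) / len(hours) if hours else None,
        "worst_response_hours": max(hours) if hours else None,
        "unanswered": unanswered,
        "outside_sla": outside_sla,
    }
```

The resulting dictionary can then be rendered into the email body by whatever templating the Lambda function uses.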

Part II - Predict labels automatically for unlabeled issues
This part uses machine learning models to predict labels and sends the predictions by email. Figure 3 shows the architecture.

Figure 3 Lambda with Elastic Beanstalk

Part III - Label Bot
This label bot helps non-committers add labels. A contributor can mention the bot in a comment such as "@mxnet-label-bot, please add labels: [A, B]". The bot will then recognize the notification and add the requested labels.
All code runs in a Lambda function. A CloudWatch event triggers this Lambda function every 5 minutes. When the Lambda function runs, it reads valid notifications, extracts the label information from the comments, and then adds the labels. Figure 5 shows the architecture.
Figure 5 Label Bot Design
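
Extracting the labels from a comment could look roughly like this. The regular expression and the function name extract_labels are illustrative assumptions; the bot's actual parsing rules may differ.

```python
import re

BOT_NAME = "@mxnet-label-bot"

# Matches e.g. "@mxnet-label-bot, please add labels: [A, B]".
# The exact accepted phrasing is an assumption for this sketch.
COMMAND_RE = re.compile(
    re.escape(BOT_NAME) + r"\s*,?\s*please\s+add\s+labels\s*:\s*\[([^\]]*)\]",
    re.IGNORECASE,
)


def extract_labels(comment):
    """Return the labels requested in a comment, or [] if there is no command."""
    match = COMMAND_RE.search(comment)
    if not match:
        return []
    # Split the bracketed list on commas and drop surrounding whitespace.
    return [part.strip() for part in match.group(1).split(",") if part.strip()]
```

Comments that merely mention the bot without a well-formed command yield an empty list, so the caller can skip them.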

4. Multi-label classification

Each instance can be assigned multiple categories, so this type of problem is known as a multi-label classification problem, where we have a set of target labels. Multi-label classification problems are very common in the real world, for example in audio categorization, image categorization, and bioinformatics. Our project focuses mainly on text categorization because labels are learned from the issue title and issue description.

Step 1: Retrieve Data
Extract data from GitHub issues into JSON format.
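
A minimal sketch of this step using the GitHub REST API is shown below. The repository path apache/incubator-mxnet, the helper names, and the choice of fields to keep are assumptions for illustration. Note that the issues endpoint also returns pull requests, which carry a "pull_request" key.

```python
import json
import urllib.request

# Assumed repository path; adjust if the repository lives elsewhere.
API_URL = "https://api.github.com/repos/apache/incubator-mxnet/issues"


def fetch_page(page, per_page=100):
    """Fetch one page of issues (open and closed) as parsed JSON."""
    url = f"{API_URL}?state=all&per_page={per_page}&page={page}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def is_pull_request(issue):
    """The issues endpoint returns pull requests too; they have this key."""
    return "pull_request" in issue


def to_record(issue):
    """Keep only the fields used for training (a hypothetical selection)."""
    return {
        "id": issue["number"],
        "title": issue["title"],
        "body": issue.get("body") or "",
        "labels": [label["name"] for label in issue.get("labels", [])],
    }
```

Unauthenticated requests are rate limited much more strictly than authenticated ones, so a real run would likely send a token as well.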

Step 2: Data Cleaning
Data cleaning is important: it keeps the valuable information, such as keywords, and reduces noise.
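
As an illustration, a cleaning pass might strip fenced code and logs, drop URLs, lowercase the text, and remove stop words. The stop-word list and the tokenization rule below are assumptions chosen to keep the sketch short.

```python
import re

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it", "i",
              "this", "when"}


def clean_text(text):
    """Turn raw issue text into a list of lowercase keyword tokens."""
    # Fenced code blocks and pasted logs are mostly noise for label prediction.
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    # URLs rarely carry label information.
    text = re.sub(r"https?://\S+", " ", text)
    text = text.lower()
    # Tokens start with a letter and may contain digits, "_" or "+".
    tokens = re.findall(r"[a-z][a-z0-9_+]*", text)
    return [token for token in tokens if token not in STOP_WORDS]
```

Whether code blocks should be discarded entirely is a design choice; error messages inside them can sometimes be informative.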

Step 3: Vector Representation
Classifiers and learning algorithms cannot directly process the text documents in their original form. During a preprocessing step, the documents are converted into a more manageable representation. Typically, the documents are represented by feature vectors.

Bag-of-words model uses all words in a document as features, so the dimension of the feature space equals the number of distinct words across all documents.
Binary weighting, in which the feature weight is one if the corresponding word is present in the document and zero otherwise.
TF-IDF scheme gives the word w in the document d the weight
TF-IDF Weight(w, d) = TermFreq(w, d) * log(N / DocFreq(w))
where N is the total number of documents and DocFreq(w) is the number of documents that contain w.
Word2Vec, a two-layer neural network that processes text and produces a vector for each word.
Doc2Vec, an extension of Word2Vec that learns to correlate labels and words.
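
The TF-IDF weighting above can be worked through in a few lines. This sketch uses raw term counts and the natural logarithm; the function name tf_idf and the sparse dictionary representation are choices made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. The weight follows the formula
    TermFreq(w, d) * log(N / DocFreq(w)) with a natural logarithm.
    """
    n_docs = len(docs)
    # DocFreq(w): number of documents that contain w at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors
```

A word that occurs in every document gets weight zero, since log(N / N) = 0, which is how the scheme discounts uninformative terms.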

Step 4: Feature Extraction
Map the original high-dimensional data onto a lower-dimensional space. Remove non-informative terms (irrelevant words) from documents. This improves classification effectiveness and reduces computational complexity.
Feature selection methods:

Document Frequency Threshold (DF): the document frequency of a feature is the number of documents in which it occurs; features whose document frequency falls below a chosen threshold are removed.
Information Gain (IG) measures the number of bits of information obtained for the prediction of categories by knowing the presence or absence of the feature f in a document.
Chi-square measures the maximal strength of dependence between the feature and the categories.
Mutual Information (MI) measures how much the presence of a feature tells us about membership in a category.
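
As one example of these methods, the chi-square score of a term can be computed from a 2x2 contingency table of term presence against category membership, taking the maximum over categories. The function names and the set-based document representation are assumptions of this sketch.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between one term and one category.

    docs: list of token sets; labels: list of label sets (same order).
    """
    a = b = c = d = 0
    for tokens, doc_labels in zip(docs, labels):
        has_term = term in tokens
        in_category = category in doc_labels
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document not in category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document not in category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is constant: no measurable dependence
    return n * (a * d - c * b) ** 2 / denominator


def max_chi_square(docs, labels, term):
    """Strongest dependence between the term and any category."""
    categories = set().union(*labels) if labels else set()
    return max((chi_square(docs, labels, term, cat) for cat in categories),
               default=0.0)
```

Terms can then be ranked by max_chi_square and only the highest-scoring ones kept as features.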

Step 5: Multi-Label Classification
Two different approaches are used for multi-label classification. Problem transformation methods transform the multi-label problem into one or more single-label (binary or multi-class) classification problems. Algorithm adaptation methods adapt single-label algorithms so that they can be applied directly to the multi-label problem. At the beginning, only the top 10 labels are used for classification.

Problem Transformation

Binary Relevance
This is the simplest technique: it treats each label as a separate binary classification problem.
Classifier Chains
The first classifier is trained on the input data only; each subsequent classifier is trained on the input data plus the predictions of all previous classifiers in the chain.
Label Powerset
Transform the problem into a multi-class problem in which one multi-class classifier is trained on all unique label combinations found in the training data.
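
The three transformations differ only in how the training targets (and, for chains, the inputs) are built. The sketch below shows that step without any particular classifier; the function names are illustrative, and the chain variant uses the true labels of earlier positions at training time, which is a common convention assumed here.

```python
def binary_relevance_targets(label_sets, all_labels):
    """One independent 0/1 target vector per label."""
    return {label: [int(label in labels) for labels in label_sets]
            for label in all_labels}


def classifier_chain_inputs(features, label_sets, order):
    """Inputs for each classifier in the chain.

    The classifier for order[k] sees the original features plus the
    0/1 values of order[0..k-1] appended as extra features.
    """
    inputs = {}
    for position, label in enumerate(order):
        earlier = order[:position]
        inputs[label] = [
            row + [int(e in labels) for e in earlier]
            for row, labels in zip(features, label_sets)
        ]
    return inputs


def label_powerset_targets(label_sets):
    """Map every distinct label combination to one multi-class id."""
    classes = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in classes:
            classes[key] = len(classes)
        targets.append(classes[key])
    return targets, classes
```

Label Powerset can only predict combinations seen during training, which matters when training data is limited.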

Algorithm adaptation
Manual: rule-based
Automatic:

Vector space model based

Prototype-based
K-nearest neighbor
Decision-tree
Neural Networks
Support Vector Machines

Probabilistic or generative model based

Naive Bayes classifier

5. Technical Challenges

Restrict permissions of this bot to avoid unexpected operations.
Training data is limited.

6. Reference

7. Design Upgrade of Label Bot

Issue:

The current label bot can only label unlabeled issues. Key functionality still to be implemented includes re-labeling already labeled issues and streamlining the process of updating and removing labels. The current implementation is also fairly inefficient in the way it automatically labels issues and pull requests. The design is based on a pull model: every 5 minutes the bot is triggered to pull all issues and pull requests, which are then labeled appropriately, and the model is retrained every 24 hours. In addition, GitHub limits users to 5000 HTTP requests per hour, so the bot should make as few requests as possible.

Proposed Design Decision:

The bot would be more efficient if it were redesigned around a push model: as soon as an issue or pull request is created in the repository, the label bot is triggered to label it appropriately. The Lambda bot will also be able to not only add labels but also update and delete them.

Implementation:

Using a GitHub webhook, we can trigger the bot when an issue or pull request is created in the repository (we specify this by choosing the events to subscribe to). The trigger is then handled by the Lambda function, which decides on the appropriate action to take on a GitHub label.

AWS Services: API Gateway receives the POST notification from the GitHub webhook and passes it to a Lambda function, which forwards it to SQS. SQS manages the multiple messages that are received and sends the data to the Lambda bot. The Lambda bot reads the payload received from SQS and takes the appropriate action on a GitHub label.
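
The last stage of this pipeline, the Lambda bot reading messages delivered by SQS, could be sketched as follows. It assumes the first Lambda forwards the raw GitHub webhook JSON unchanged as the SQS message body; the helper name parse_sqs_event and the returned fields are hypothetical.

```python
import json


def parse_sqs_event(event):
    """Extract issue number and text from an SQS-triggered Lambda event.

    Assumption: each SQS message body is the unmodified GitHub webhook
    payload (an issue or issue comment event) as a JSON string.
    """
    results = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        issue = payload.get("issue")
        if not issue:
            continue  # not an issue-related event, nothing to label
        comment = payload.get("comment") or {}
        # Prefer the comment text; fall back to the issue description.
        text = comment.get("body") or issue.get("body") or ""
        results.append({
            "action": payload.get("action"),
            "issue": issue["number"],
            "text": text,
        })
    return results


def lambda_handler(event, context):
    """Entry point: returns what would be acted on (labeling is omitted)."""
    return parse_sqs_event(event)
```

The actual labeling call to GitHub is left out here; it would consume each returned item and decide which label operation to perform.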

Current Proposed Design Implementation:

Usage:

Add functionality: adds the specified labels to the issue's list of labels:

@mxnet-label-bot add [label1, label2]

Remove functionality: removes the specified labels from the issue's list of labels:

@mxnet-label-bot remove [label1, label2]

Update functionality: replaces the issue's labels with exactly the labels specified in the list:

@mxnet-label-bot update [label1, label2]
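
The three commands correspond to simple set operations on an issue's labels. The following sketch parses a command and returns the resulting label set; the regular expression and the function name apply_command are assumptions, and the real bot may validate labels or permissions first.

```python
import re

# Matches "@mxnet-label-bot add|remove|update [label1, label2]".
COMMAND_RE = re.compile(
    r"@mxnet-label-bot\s+(add|remove|update)\s*\[([^\]]*)\]",
    re.IGNORECASE,
)


def apply_command(current_labels, comment):
    """Return the label set an issue should have after the command."""
    current = set(current_labels)
    match = COMMAND_RE.search(comment)
    if not match:
        return current  # no recognizable command: leave labels unchanged
    action = match.group(1).lower()
    requested = {part.strip() for part in match.group(2).split(",")
                 if part.strip()}
    if action == "add":
        return current | requested   # union: keep existing, add new
    if action == "remove":
        return current - requested   # difference: drop the listed labels
    return requested                 # update: replace with exactly these
```

Computing the target set first makes it possible to send a single request per issue, which helps with the hourly request limit mentioned earlier.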
