This is the initial design of the ML-based GitHub bot.

1. Problem

The Incubator-MXNet repo currently has many issues. Labeling them helps contributors who know a particular area pick up relevant issues and help users. However, all issues are currently labelled manually, which is time consuming, and a maintainer has to @ a committer every time labels need to be added. This bot will automate and simplify the issue labeling process.

2. Goal

  • Part I - Email Bot
    Send daily GitHub issue reports to the mailing list:
    • Count of newly opened issues and closed issues in last 7 days
    • Average and worst response time for all new issues
    • List of non-responded new issues with links
    • List of non-responded issues outside SLA
    • Predictions of unlabeled issues
    • Pie chart with top 10 labels for all issues
  • Part II - Predict labels automatically for unlabeled issues
    • Build a web server that can respond to GET/POST requests and maintain itself:
      • Predict labels: when it receives a GET/POST request with an issue ID, it sends the predictions back.
      • Self-maintenance: it re-trains the Machine Learning models every 24 hours.
  • Part III - Label Bot:
    This bot serves to help non-committers add labels to GitHub issues.
    • Recognize people's commands, e.g. "@mxnet-label-bot, please add labels: [A, B]".
    • Be able to add labels for incubator-mxnet issues using a committer's credentials.

3. Approach

  • Part I - Email Bot 

    An Amazon CloudWatch event will trigger a lambda function at a fixed frequency (e.g., 9am every Monday). When the lambda function runs, it generates the issue report and sends it to the mailing list. Figure 1 shows the email bot architecture and Figure 2 shows demo email content.


Figure 1 Email Bot Design




Figure 2 Demo Email Content
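
The statistics in the report can be computed from issue timestamps. The following is only a minimal sketch: it assumes each issue is a dict with the hypothetical keys `number`, `created_at`, `closed_at`, and `first_response_at` (the last two may be `None`), and the 5-day `sla` default is an illustrative value, since this design does not state the actual SLA.

```python
from datetime import datetime, timedelta

def issue_report(issues, now, window=timedelta(days=7), sla=timedelta(days=5)):
    """Summarize issue activity for the `window` before `now`.

    The dict keys and the default `sla` are assumptions for illustration.
    """
    start = now - window
    opened = [i for i in issues if i["created_at"] >= start]
    closed = [i for i in issues
              if i["closed_at"] is not None and i["closed_at"] >= start]
    # Response time is only defined for new issues that received a response.
    times = [i["first_response_at"] - i["created_at"]
             for i in opened if i["first_response_at"] is not None]
    return {
        "opened": len(opened),
        "closed": len(closed),
        "avg_response": sum(times, timedelta()) / len(times) if times else None,
        "worst_response": max(times) if times else None,
        # New issues that nobody has responded to yet.
        "unanswered": [i["number"] for i in opened
                       if i["first_response_at"] is None],
        # Any issue, new or old, still waiting longer than the SLA.
        "outside_sla": [i["number"] for i in issues
                        if i["first_response_at"] is None
                        and now - i["created_at"] > sla],
    }
```

The returned dict maps directly onto the report items: counts, average and worst response time, and the two lists of non-responded issues.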


  • Part II - Predict labels automatically for unlabeled issues

    This part will use Machine Learning models to predict labels and send the predictions by email. Figure 3 shows the architecture.



Figure 3 Lambda with Elastic Beanstalk 


  • Part III - Label Bot

    This label bot helps non-committers add labels. A contributor can mention the bot in a comment such as "@mxnet-label-bot, please add labels: [A, B]". The bot will then recognize the notification and add the requested labels.

    All code runs in a lambda function. A CloudWatch event will trigger this lambda function every 5 minutes. When the lambda function runs, it reads valid notifications, extracts the label information from the comments, and then adds the labels. Figure 5 shows the architecture.
    Figure 5 Label Bot Design
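
Extracting labels from a comment could look roughly like the sketch below. The regular expression and the function name `extract_labels` are assumptions made for illustration; the exact grammar the bot accepts is not specified in this design.

```python
import re

# Matches e.g. "@mxnet-label-bot, please add labels: [A, B]".
# The pattern is an illustrative guess at the accepted command format.
COMMAND = re.compile(
    r"@mxnet-label-bot\b[^\[\]]*?\badd\s+labels?\s*:\s*\[([^\]]*)\]",
    re.IGNORECASE,
)

def extract_labels(comment):
    """Return the labels requested in a comment, or [] if it has no command."""
    match = COMMAND.search(comment)
    if match is None:
        return []
    # Split the bracketed list on commas and drop empty entries.
    labels = [part.strip() for part in match.group(1).split(",")]
    return [label for label in labels if label]
```

Comments that mention the bot without a well-formed label list yield an empty list, so the lambda function can skip them.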

4. Multi-label classification

Each instance can be assigned multiple categories, so this type of problem is known as a multi-label classification problem, in which we have a set of target labels. Multi-label classification problems are very common in the real world, for example audio categorization, image categorization, and bioinformatics. Our project mainly focuses on text categorization because labels are learned from the issue title and issue description.

Step 1: Retrieve Data
Extract data from GitHub issues into JSON format.
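
As a sketch of this step, the function below converts issue objects, as returned by the GitHub REST API, into JSON training records. The record layout (`id`, `title`, `body`, `labels`) is an assumption chosen for illustration. Pull requests are skipped because the issues endpoint also lists pull requests, which carry a `pull_request` key.

```python
import json

def to_record(issue):
    """Convert one GitHub API issue object into a training record.

    Returns None for pull requests. The output keys are illustrative.
    """
    if "pull_request" in issue:
        return None
    return {
        "id": issue["number"],
        "title": issue["title"],
        "body": issue.get("body") or "",  # the body may be null
        "labels": [label["name"] for label in issue.get("labels", [])],
    }

def to_json(issues):
    """Serialize all real issues (not pull requests) as a JSON array."""
    records = [r for r in map(to_record, issues) if r is not None]
    return json.dumps(records, indent=2)
```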

Step 2: Data Cleaning
Data cleaning keeps the valuable information, such as keywords, and reduces the noise.
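
One possible cleaning pass is sketched below. The specific choices (dropping fenced code blocks and URLs, lowercasing, and the tiny `STOP_WORDS` set) are assumptions for illustration, not the rules this project settled on; code blocks, for instance, may contain error messages worth keeping.

```python
import re

# A deliberately small, hypothetical stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "to", "and", "of", "i", "it"}

def clean_text(text):
    """Return the list of informative lowercase tokens in an issue text."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"https?://\S+", " ", text)                # URLs
    text = re.sub(r"[^a-z0-9_\s]", " ", text.lower())        # punctuation
    return [word for word in text.split() if word not in STOP_WORDS]
```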

Step 3: Vector Representation
Classifiers and learning algorithms cannot directly process the text documents in their original form. During a preprocessing step, the documents are converted into a more manageable representation. Typically, the documents are represented by feature vectors.

  • The bag-of-words model uses all words in a document as features, so the dimension of the feature space equals the number of distinct words across all documents.
  • Binary weighting, in which the feature weight is one if the corresponding word is present in the document and zero otherwise.
  • The TF-IDF scheme gives the word w in the document d the weight
    TF-IDF Weight(w, d) = TermFreq(w, d) * log(N / DocFreq(w))
    where N is the total number of documents and DocFreq(w) is the number of documents that contain w.
  • Word2Vec, a two-layer neural network that learns a vector for each word from the contexts in which it appears.
  • Doc2Vec, an extension of Word2Vec that learns to correlate labels and words.
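
The TF-IDF weight above can be computed directly from tokenized documents. This is a minimal sketch that assumes the raw count as TermFreq and the natural logarithm; the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # DocFreq(w): number of documents that contain w at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # TermFreq(w, d): raw count of w in d
        weights.append({w: tf * math.log(n_docs / doc_freq[w])
                        for w, tf in term_freq.items()})
    return weights
```

A word that appears in every document gets weight zero, because log(N / N) = 0; this is how the scheme discounts words that do not help distinguish documents.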

Step 4: Feature Extraction
Map the original high-dimensional data onto a lower-dimensional space and remove non-informative terms (irrelevant words) from documents. This improves classification effectiveness and reduces computational complexity.
Feature selection methods:

  • Document Frequency Threshold (DF) removes features whose document frequency (the number of documents containing the feature) falls below a threshold, treating rare features as uninformative.
  • Information Gain (IG) measures the number of bits of information obtained for the prediction of categories by the presence or absence in a document of the feature f.
  • Chi-square measures the maximal strength of dependence between the feature and the categories.
  • Mutual Information (MI)
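
As an illustration of one of these methods, the chi-square statistic for a single feature and category can be computed from a 2x2 contingency table of document counts. The sketch below uses the standard closed form N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)); the function name and the input layout (token lists plus label sets) are assumptions.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between a term and a category.

    docs: list of token lists; labels: list of label sets in the same order.
    """
    a = b = c = d = 0
    for tokens, doc_labels in zip(docs, labels):
        has_term = term in tokens
        in_category = category in doc_labels
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document not in category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document not in category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or category never varies.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

Taking the maximum of this value over all categories gives one score per feature, which can then be thresholded or ranked.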

Step 5: Multi-Label Classification
We consider two different approaches to multi-label classification. Problem transformation methods transform the multi-label problem into one or more single-label (binary or multi-class) classification problems. Algorithm adaptation methods adapt single-label algorithms so that they can handle multi-label data directly. At the beginning, we pick the top 10 labels to classify.

  • Problem Transformation
    • Binary Relevance
      This is the simplest technique: it treats each label as a separate binary classification problem.
    • Classifier Chains
      The first classifier is trained on the input data only; each subsequent classifier is trained on the input data plus the labels from all previous classifiers in the chain.
    • Label Powerset
      Transform the problem into a multi-class problem in which one multi-class classifier is trained on all unique label combinations found in the training data.
  • Algorithm adaptation
    Manual:
    rule-based
    Automatic:
    • Vector space model based
      • Prototype-based
      • K-nearest neighbor
      • Decision-tree
      • Neural Networks
      • Support Vector Machines
    • Probabilistic or generative model based
      • Naive Bayes classifier
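
To make the problem transformation methods concrete, the sketch below shows how the training targets are built for Binary Relevance and Label Powerset. It only prepares targets and trains nothing; the function names are illustrative.

```python
def binary_relevance_targets(label_sets, all_labels):
    """One 0/1 target vector per label; each trains its own binary classifier."""
    return {label: [int(label in labels) for labels in label_sets]
            for label in all_labels}

def label_powerset_targets(label_sets):
    """Map each distinct label combination to one class id."""
    class_ids = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        # Assign the next free id the first time a combination is seen.
        class_ids.setdefault(key, len(class_ids))
        targets.append(class_ids[key])
    return targets, class_ids
```

Binary Relevance yields as many binary problems as there are labels, while Label Powerset yields a single multi-class problem whose class count grows with the number of distinct label combinations, which can be a concern when training data is limited.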

5. Technical Challenges

  • Restrict permissions of this bot to avoid unexpected operations.
  • Training data is limited.

6. Reference

7. Design Upgrade of Label Bot 

Issue:

The current label bot implementation can only label unlabelled issues. Key functionality to be implemented includes re-labelling already labelled issues and streamlining the process of updating and removing labels. The current implementation is also fairly inefficient in the way it automatically labels issues and pull requests. It is based on a pull model: every 5 minutes we trigger the bot to pull all issues and pull requests, which we then label appropriately, and we retrain the model every 24 hours. In addition, GitHub limits each user to 5000 HTTP requests per hour, so we want to make as few requests as possible.


Proposed Design Decision:

The bot can be made more efficient by redesigning it around a push model: as soon as an issue or pull request is opened on the repository, we trigger the label bot to label it appropriately. The lambda bot will also be able to update and delete labels, not only add them.

Implementation:

Using a GitHub WebHook, we can trigger the bot when an issue or pull request is opened on the repository (we specify this by choosing the events we want to subscribe to). The trigger is then handled by the lambda function, which decides on the appropriate action to take on the GitHub labels.

AWS Services: API Gateway receives the POST notification from the GitHub WebHook and forwards it to a lambda function, which sends it to SQS. SQS queues the incoming messages and delivers them to the lambda bot. The lambda bot reads the payload received from SQS and takes the appropriate action on the GitHub labels.
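
The first lambda function could decide which webhook deliveries are worth queueing along the lines of the sketch below. GitHub sends the event name in the `X-GitHub-Event` header and the action in the payload's `action` field; the message layout and the `predict`/`command` types are hypothetical, and the actual call that sends the message to SQS is left out.

```python
import json

def to_queue_message(headers, body):
    """Return the message to enqueue for a webhook delivery, or None to skip it."""
    # Header names may arrive in any letter case, so normalize them.
    lowered = {key.lower(): value for key, value in headers.items()}
    event = lowered.get("x-github-event")
    payload = json.loads(body)
    action = payload.get("action")
    if event == "issues" and action == "opened":
        return {"type": "predict", "number": payload["issue"]["number"]}
    if event == "pull_request" and action == "opened":
        return {"type": "predict", "number": payload["pull_request"]["number"]}
    if event == "issue_comment" and action == "created":
        # Comments may carry a label command for the bot.
        return {"type": "command",
                "number": payload["issue"]["number"],
                "comment": payload["comment"]["body"]}
    return None
```

Filtering before the queue keeps irrelevant deliveries (for example edits or closed events) from ever reaching the lambda bot.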

Current Proposed Design Implementation:



Usage:

Add functionality adds the specified labels to the issue's list of labels:

@mxnet-label-bot add [label1, label2]

Remove functionality removes the specified labels from the issue's list of labels:

@mxnet-label-bot remove [label1, label2]

Update functionality replaces the issue's labels with exactly the labels specified in the list:

@mxnet-label-bot update [label1, label2] 
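
The three commands map naturally onto set operations. The following is a minimal sketch under the assumption that commands look exactly like the usage lines above; `apply_command` is a hypothetical helper that only computes the resulting label set and makes no GitHub API call.

```python
import re

# Assumed command grammar: "@mxnet-label-bot <action> [label1, label2]".
COMMAND = re.compile(
    r"@mxnet-label-bot\s+(add|remove|update)\s*\[([^\]]*)\]",
    re.IGNORECASE,
)

def apply_command(current, comment):
    """Return the label set after applying the command found in `comment`."""
    match = COMMAND.search(comment)
    if match is None:
        return set(current)  # no command: labels stay unchanged
    action = match.group(1).lower()
    requested = {part.strip() for part in match.group(2).split(",")
                 if part.strip()}
    if action == "add":
        return set(current) | requested
    if action == "remove":
        return set(current) - requested
    return requested  # update: keep only the requested labels
```

For example, applying `@mxnet-label-bot update [Doc]` to an issue labelled Bug and Python leaves only Doc.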






5 Comments

  1. Feature idea: close issues according to identified text for specific flags plus a time interval. The following events would close an issue.

    • when a contributor responds with "Is this issue resolved?", or "If this issue is resolved could you please close the issue?", or similarly detected phrases, and there's no response for a week
    • when a contributor comments with a special flag like `github-bot-close`

    The bot would also supply some nice text about reopening the issue if it is still a problem.

    1. Agree with Aaron. Having the ability to close the issue would make the bot super useful for everyone in the community. 

  2. Couple of comments:

    1. Why do we need API Gateway to trigger SQS? Why not subscribe SQS directly to the webhook? Is there any limitation there?
    2. Please add documentation on usage of AWS secret manager

    1. It also raises a concern: anybody outside who knows the URL of the SQS service could start flooding it with junk data. How do we handle this? Is there any restriction we can apply here?