Description:

Currently, within the incubator-mxnet repo there are over 800 open issues, with new ones created every day. With such a large influx, many issues go unlabelled, and labelling them by hand is cumbersome. Labelling matters because it lets MXNet contributors filter issues and offer help on the problems our users face. It can also help bring in new contributors: for example, a Scala expert may know how to handle an issue posted on the MXNet repo regarding the Scala API, can assess it, and can easily become a contributor. Today we employ the label bot to ease the issue and pull request labelling process. The data from previously labelled issues and pull requests opens up an interesting use case: based on this data, we can predict labels for new issues and pull requests. Overall, this provides a better experience for the community, as we will be able to address issues in a more efficient manner.

Proposal:

The label bot will predict labels for issues and pull requests. We will gather accuracy metrics, and given a certain accuracy threshold (e.g. >90%), the bot can apply a label directly when its prediction exceeds that threshold. We want to ensure, to the best degree possible, that every issue the bot labels carries the correct labels. If a prediction does not meet the threshold, the bot can instead post its predicted labels as a recommendation to the user. Note: labelling is not permanent and labels can always be removed, but we would strive to keep such corrections to a minimum.
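As an illustration of the intended flow, below is a minimal sketch of the thresholding logic; the function and variable names are placeholders, not the bot's actual interface:

    # Placeholder names; `predictions` is assumed to map label -> confidence for one issue.
    CONFIDENCE_THRESHOLD = 0.90   # e.g. only auto-apply labels the model is >90% confident about

    def triage(predictions):
        """Split predictions into labels to apply directly and labels to merely recommend."""
        auto_apply = [label for label, score in predictions.items() if score >= CONFIDENCE_THRESHOLD]
        recommend = [label for label, score in predictions.items() if score < CONFIDENCE_THRESHOLD]
        return auto_apply, recommend

    auto_apply, recommend = triage({"Bug": 0.96, "CI": 0.41})
    # -> apply "Bug" directly, surface "CI" only as a suggestion to the user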

Data Analysis

Our dataset consists of all issues on the repository, both open and closed. To drive label prediction, we gather the titles, descriptions, and labels of the issues on the MXNet repository. We will retrain the model on new issues and pull requests every 24 hours so that the dataset stays up to date and is ready to predict labels for new issues. We will choose specific target labels that we are interested in (e.g. feature request, doc, bug) and predict those labels on new issues.
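As an illustration, the data collection described above could look roughly like the following sketch, which assumes the PyGithub client is used; the actual bot may gather the data differently:

    # Sketch: collect titles, descriptions, and labels for all issues (open and closed),
    # so the model can be retrained on a fresh snapshot, e.g. every 24 hours.
    from github import Github  # PyGithub

    gh = Github("<personal-access-token>")           # placeholder token
    repo = gh.get_repo("apache/incubator-mxnet")

    dataset = []
    for issue in repo.get_issues(state="all"):
        if issue.pull_request is not None:           # skip pull requests if only issues are wanted
            continue
        dataset.append({
            "title": issue.title,
            "body": issue.body or "",
            "labels": [label.name for label in issue.labels],
        })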

Note: Training data here is limited (~13,000 issues, both closed and open), and after the data cleaning process we expect this number to be reduced considerably. We also have to consider that not all issues have been labelled, and that a labelled issue may not carry every label that applies to it. However, for the word-embedding step we can draw on data from other sources (e.g. TensorFlow and PyTorch issues).

Metrics:

Multi-label Classification:

Prediction of at least one correct label (among our target labels) per issue, measured across all issues: ~87%

Prediction of every label on an issue (i.e. an exact match of the full label set), measured across all issues: ~20%


How the label accuracy was measured:

The labels below were the initial set chosen for prediction by the model. Only issues relevant to these labels were tested: for each label, an issue was included if either the model predicted that label or the label was actually on the issue. The accuracy shown below is the fraction of cases where the model predicted a label and that label was among the issue's actual labels in the repo.

Target Labels: [Performance, Test, Clojure, Java, Python, C++, Scala, Question, Doc, Installation, Example, Bug, Build, ONNX, Gluon, Flaky, Feature, CI, Cuda]


*** The accuracy metric was collected using sklearn's accuracy_score method ***

(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
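As a toy illustration (not the actual evaluation code), the two figures above can be computed roughly as follows with sklearn's MultiLabelBinarizer and accuracy_score; the label sets here are made up:

    # Toy illustration: exact-match accuracy vs. "at least one correct label per issue".
    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import MultiLabelBinarizer

    true_labels = [["Bug", "CI"], ["Doc"], ["Gluon", "Flaky"]]   # made-up issues
    pred_labels = [["Bug"], ["Doc"], ["Performance"]]            # made-up predictions

    mlb = MultiLabelBinarizer()
    mlb.fit(true_labels + pred_labels)
    y_true = mlb.transform(true_labels)
    y_pred = mlb.transform(pred_labels)

    # Exact match: every label on the issue must be predicted, with nothing extra (~20% above).
    exact_match = accuracy_score(y_true, y_pred)                 # 1/3 on this toy data

    # At least one predicted label is among the issue's actual labels (~87% above).
    at_least_one = np.mean((y_true & y_pred).sum(axis=1) > 0)    # 2/3 on this toy data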

Classification Accuracy:

Label          Accuracy    Issue Count
Performance    100%        87
Test           99.59%      245
Clojure        98.90%      12 (Test set: 1000)
Java           98.50%      2 (Test set: 1000)
Python         98.30%      170 (Test set: 1000)
C++            97.20%      2 (Test set: 1000)
Scala          96.30%      40 (Test set: 1000)
Question       97.02%      302
Doc            90.32%      155
Installation   84.07%      113
Example        80.81%      99
Bug            78.66%      389
Build          69.87%      156
onnx           69.57%      23
gluon          44.38%      160
flaky          42.78%      194
Feature        32.24%      335
ci             28.30%      53
Cuda           22.09%      86



*** In-depth analysis with precision, recall, and F1 ***

Classification report with precision, recall, and f1 score

Label          Precision   Recall     F1 Score   Count
Performance    100%        100%       100%       87
Test           99.59%      100%       99.8%      245
Clojure        98.31%      98.90%     98.61%     12 (Test set: 1000)
Python         98.70%      98.30%     98.50%     170 (Test set: 1000)
Question       100%        97.02%     98.49%     302
Java           97.24%      98.50%     97.87%     2 (Test set: 1000)
C++            98.28%      97.20%     97.74%     2 (Test set: 1000)
Scala          97.37%      96.30%     96.84%     40 (Test set: 1000)
Doc            100%        90.32%     94.92%     155
Installation   100%        84.07%     91.35%     113
Example        100%        80.81%     89.39%     99
Bug            100%        78.66%     88.06%     389
Build          100%        69.87%     82.26%     156
onnx           80%         84.21%     82.05%     23
gluon          62.28%      60.68%     61.47%     160
flaky          96.51%      43.46%     59.93%     194
Feature        32.43%      98.18%     48.76%     335
ci             48.39%      40.54%     44.12%     53
Cuda           22.09%      100%       36.19%     86

The test sets noted for the language labels are held-out sets of code snippets taken from files in those languages (covered in more detail below).

Precision here represents how often the classifier was correct when it predicted a label, out of all the times it predicted that label.

Recall here represents how often the classifier correctly predicted a label, out of all the times an issue actually had that label.

The F1 score balances the precision and recall scores (their harmonic mean).


                           Label was actually on the issue                                    Label was not on the issue
Label was predicted        Desired outcome                                                    False positive - a high precision value means that this is reduced
Label was not predicted    False negative - a high recall value means that this is reduced    Desired outcome
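To make the table and definitions above concrete, here is a small sketch (a hypothetical helper, not part of the bot) relating the counts to the three scores:

    # Hypothetical helper relating the counts in the table above to the three scores.
    def precision_recall_f1(true_positives, false_positives, false_negatives):
        precision = true_positives / (true_positives + false_positives)  # of all predictions of the label, how many were right
        recall = true_positives / (true_positives + false_negatives)     # of all issues that had the label, how many were found
        f1 = 2 * precision * recall / (precision + recall)               # harmonic mean of precision and recall
        return precision, recall, f1

    # Toy numbers: 8 correct predictions, 2 wrong predictions, 2 missed labels
    print(precision_recall_f1(true_positives=8, false_positives=2, false_negatives=2))  # (0.8, 0.8, 0.8)

A row like flaky above (precision ~96.5%, recall ~43.5%) therefore means the label is rarely suggested incorrectly but is often missed.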

The programming-language classifier was trained on large amounts of data pulled from a wide array of repositories, which is why we are able to deliver these high metrics for the language labels. It makes use of MXNet for deep learning to learn the characteristics of the languages we consider (the programming languages present in the repo). Specifically, it was trained on snippets of the files available here: https://github.com/aliostad/deep-learning-lang-detection/tree/master/data. We therefore believe this accuracy can be maintained when predicting on new issues that contain code snippets. Training was done with a 6-layer model in Keras-MXNet, using the 2000 files available for the languages we are interested in from the repository data above (split into snippets). For inference we use pure MXNet, with the model served by Model Server for Apache MXNet (MMS), a flexible and easy-to-use tool for serving MXNet deep learning models (https://github.com/awslabs/mxnet-model-server).
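For illustration, a rough sketch of what a 6-layer Keras-MXNet MLP for snippet classification could look like is shown below; the layer sizes, input featurization, and hyperparameters are assumptions, not the model's actual configuration:

    # Rough sketch of a 6-layer MLP for classifying code snippets by language,
    # in the spirit of the Keras-MXNet model described above.
    from keras.models import Sequential   # keras-mxnet: Keras API on the MXNet backend
    from keras.layers import Dense, Dropout

    NUM_FEATURES = 10000   # e.g. bag-of-tokens features extracted from a snippet (placeholder)
    NUM_LANGUAGES = 7      # Clojure, Java, Python, C++, Scala, ...

    model = Sequential([
        Dense(512, activation="relu", input_shape=(NUM_FEATURES,)),
        Dropout(0.3),
        Dense(256, activation="relu"),
        Dropout(0.3),
        Dense(128, activation="relu"),
        Dense(NUM_LANGUAGES, activation="softmax"),   # one probability per language label
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(snippet_features, language_labels, epochs=..., batch_size=...)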


Two models, each covering a specific group of labels, combine to deliver this capability:

TF-IDF vectorizer with LinearSVC - used for the more generic labels (e.g. Performance, Bug, Test, ...); a minimal sketch follows this list.

MLP (multilayer perceptron) - used for the programming-language labels (e.g. Clojure, Python, Java, ...), made possible by access to a larger dataset.
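As referenced in the first item above, a minimal sketch of the TF-IDF + LinearSVC model might look as follows; wrapping it in a one-vs-rest classifier for the multi-label case is an assumption on our part:

    # Minimal sketch of the TF-IDF + LinearSVC model for the generic labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    generic_label_model = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", max_features=20000)),
        ("clf", OneVsRestClassifier(LinearSVC())),    # one binary classifier per target label
    ])
    # generic_label_model.fit(issue_texts, binarized_labels)     # titles + bodies, multi-label targets
    # predicted = generic_label_model.predict(new_issue_texts)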

Motivations/Conclusion:

We do notice potential cases of overfitting here, especially for the Performance label. However, looking further into the issues labelled as Performance, we see that similar words and phrases recur across them (in most cases the word itself, plus words like "speed"). The word embeddings our model is trained on give these kinds of results because of word2vec, which yields high cosine similarity for such words; we can speculate that these common words were grouped together and hence the model was able to predict these labels with high accuracy. Given this data, we can see which labels the model predicts accurately. Given a certain accuracy threshold, the bot can apply a label whenever its prediction surpasses that value. As a result, we would be able to accurately provide some labels on new issues. Overall, the mxnet-label-bot will provide an improved experience for MXNet developers.
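As a toy illustration of the word2vec intuition above (the corpus and parameters here are made up, not the embeddings the model actually uses):

    # Toy word2vec illustration (gensim >= 4): words used in similar contexts, such as
    # "performance" and "speed", tend to end up close together in the embedding space.
    from gensim.models import Word2Vec

    sentences = [
        ["training", "performance", "is", "slow"],
        ["inference", "speed", "is", "slow"],
        ["improve", "performance", "and", "speed"],
    ]  # placeholder corpus; the real embeddings are trained on issue text
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    print(model.wv.similarity("performance", "speed"))   # cosine similarity of the two word vectors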




