...
Classification Accuracy:

| Label | Accuracy | Issue Count |
| --- | --- | --- |
| Performance | 100% | 87 |
| Test | 99.59% | 245 |
| Clojure | 98.90% | 12 (Test set: 1000) |
| Java | 98.50% | 2 (Test set: 1000) |
| Python | 98.30% | 170 (Test set: 1000) |
| C++ | 97.20% | 2 (Test set: 1000) |
| Scala | 96.30% | 40 (Test set: 1000) |
| Question | 97.02% | 302 |
| Doc | 90.32% | 155 |
| Installation | 84.07% | 113 |
| Example | 80.81% | 99 |
| Bug | 78.66% | 389 |
| Build | 69.87% | 156 |
| onnx | 69.57% | 23 |
| gluon | 44.38% | 160 |
| flaky | 42.78% | 194 |
| Feature | 32.24% | 335 |
| ci | 28.30% | 53 |
| Cuda | 22.09% | 86 |
Language Detection from Code Snippets in Issues:
*** In-depth analysis: classification report with precision, recall, and F1 score ***
| Label | Precision | Recall | F1 Score | Count |
| --- | --- | --- | --- | --- |
| Performance | 100% | 100% | 100% | 87 |
| Test | 99.59% | 100% | 99.80% | 245 |
| Clojure | 98.31% | 98.90% | 98.61% | 12 (Test set: 1000) |
| Python | 98.70% | 98.30% | 98.50% | 170 (Test set: 1000) |
| Question | 100% | 97.02% | 98.49% | 302 |
| Java | 97.24% | 98.50% | 97.87% | 2 (Test set: 1000) |
| C++ | 98.28% | 97.20% | 97.74% | 2 (Test set: 1000) |
| Scala | 97.37% | 96.30% | 96.84% | 40 (Test set: 1000) |
| Doc | 100% | 90.32% | 94.92% | 155 |
| Installation | 100% | 84.07% | 91.35% | 113 |
| Example | 100% | 80.81% | 89.39% | 99 |
| Bug | 100% | 78.66% | 88.06% | 389 |
| Build | 100% | 69.87% | 82.26% | 156 |
| onnx | 80% | 84.21% | 82.05% | 23 |
| gluon | 62.28% | 60.68% | 61.47% | 160 |
| flaky | 96.51% | 43.46% | 59.93% | 194 |
| Feature | 32.43% | 98.18% | 48.76% | 335 |
| ci | 48.39% | 40.54% | 44.12% | 53 |
| Cuda | 22.09% | 100% | 36.19% | 86 |
...
Precision
...
Precision here represents how often the classifier was correct when it predicted a given label, i.e. the fraction of correct predictions out of all the times it predicted that label.
...
The F1 score balances precision and recall: it is their harmonic mean, so it is high only when both are high.
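As a quick sanity check against the report above, the F1 for the flaky label can be reproduced directly from its precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
# Values taken from the 'flaky' row of the report above.
precision, recall = 0.9651, 0.4346
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # -> F1 = 59.93%, matching the table
```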
The language-detection model was trained on large amounts of data pulled from a wide array of repositories, which is why the metrics are especially strong for programming languages. We use MXNet for deep learning to learn the distinguishing features of the languages we consider (the programming languages present in the repo). Specifically, the model was trained on code snippets from the data files available here: https://github.com/aliostad/deep-learning-lang-detection/tree/master/data.
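As a rough illustration of that approach, here is a minimal sketch of a character-level language classifier in MXNet Gluon. It is not the bot's actual architecture; the snippet encoding, layer sizes, and label ids below are all assumptions.

```python
# Minimal character-level language classifier sketch (illustrative only).
import mxnet as mx
from mxnet import autograd, gluon, nd
from mxnet.gluon import nn

NUM_LANGS = 5   # e.g. Clojure, Java, Python, C++, Scala
VOCAB = 128     # raw ASCII byte values
SEQ_LEN = 256   # snippets clipped or zero-padded to this length

net = nn.Sequential()
net.add(nn.Embedding(VOCAB, 32),           # chars -> 32-d vectors
        nn.Dense(128, activation='relu'),  # Dense flattens (batch, seq, 32)
        nn.Dense(NUM_LANGS))               # one logit per language
net.initialize(mx.init.Xavier())

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})

def encode(snippet):
    """Map a code snippet to a fixed-length vector of ASCII codes."""
    ids = [min(ord(c), VOCAB - 1) for c in snippet[:SEQ_LEN]]
    return nd.array(ids + [0] * (SEQ_LEN - len(ids)))

# One illustrative training step on a tiny fake batch.
x = nd.stack(encode("def f(x): return x"), encode("(defn f [x] x)"))
y = nd.array([2, 0])  # hypothetical ids: 2 = Python, 0 = Clojure
with autograd.record():
    loss = loss_fn(net(x), y)
loss.backward()
trainer.step(batch_size=2)
```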
Motivations/Conclusion:
We do notice possible overfitting here, particularly for the Performance label. However, looking further into the issues labeled Performance, we see that similar words and phrases recur across them (in most cases the label word itself, plus words like "speed"). Given these results, we can see which labels the model predicts accurately. By setting an accuracy threshold, the bot can apply a label to a new issue only when the model's prediction surpasses that value. As a result, we would be able to accurately provide labels to new issues.
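A minimal sketch of how such a threshold could gate the bot's labeling, assuming the model exposes per-label probabilities (the function name and scores below are hypothetical):

```python
# Hypothetical gating logic: the bot applies a label only when the
# model's predicted probability clears a confidence threshold.
THRESHOLD = 0.9  # assumed cutoff; in practice it could be tuned per label

def labels_to_apply(label_probs, threshold=THRESHOLD):
    """label_probs: dict mapping label name -> predicted probability."""
    return [label for label, p in label_probs.items() if p >= threshold]

# Example with made-up scores for a new issue:
print(labels_to_apply({"Performance": 0.97, "Feature": 0.41, "Bug": 0.93}))
# -> ['Performance', 'Bug']
```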
...