A Novel Two-Step Classification Approach for Runtime Performance Improvement of Duplicate Bug Report Detection

Document Type : Machine learning-Sadoghi

Authors

Department of Software Engineering, University of Kashan, Kashan, Iran.

Abstract

Duplicate Bug Report Detection (DBRD) is a well-known problem in software triage systems such as Bugzilla. There are two main approaches to this problem: information retrieval and machine learning. The latter achieves better validation performance. Duplicate detection requires feature extraction, which is a time-consuming process. Both approaches suffer from long runtimes because each new bug report must be compared against every bug report in the repository, and feature extraction and duplicate detection take a long time. This study proposes a new two-step classification approach that reduces the search space of the bug repository in the first step and then performs duplicate detection using textual features in the second step. The Mozilla and Eclipse datasets are used for experimental evaluation. The results show that, on average, the approach achieves a validation performance of 87.70% accuracy and 89.01% F1-measure. Moreover, 95.85% and 87.65% of bug reports can be classified very quickly in step one for the Eclipse and Mozilla datasets, respectively; the remaining reports require textual feature extraction before they can be checked by the traditional DBRD approach. The proposed method achieves an average runtime improvement of 90%.
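
The two-step idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which features step one uses, so the sketch assumes cheap categorical fields (product and component) for step one and a simple Jaccard token overlap as a stand-in for the textual features of step two. The names `BugReport`, `cheap_score`, `textual_score`, and `classify_pair`, and both thresholds, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugReport:
    """Hypothetical minimal bug report; real trackers store many more fields."""
    report_id: int
    product: str
    component: str
    summary: str


def cheap_score(a: BugReport, b: BugReport) -> float:
    """Step one (assumed): score from categorical fields only, no text processing."""
    score = 0.0
    if a.product == b.product:
        score += 0.5
    if a.component == b.component:
        score += 0.5
    return score


def textual_score(a: BugReport, b: BugReport) -> float:
    """Step two (assumed): Jaccard overlap of lowercased summary tokens."""
    tokens_a = set(a.summary.lower().split())
    tokens_b = set(b.summary.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def classify_pair(new: BugReport, old: BugReport,
                  reject_below: float = 0.5,
                  text_threshold: float = 0.5) -> tuple[bool, int]:
    """Return (is_duplicate, step) where step is the stage that decided.

    Pairs that step one rejects never reach the expensive textual comparison.
    """
    if cheap_score(new, old) < reject_below:
        return False, 1  # decided quickly in step one
    return textual_score(new, old) >= text_threshold, 2


# Tiny illustrative repository (made-up reports, not from Mozilla or Eclipse).
new_report = BugReport(1, "Firefox", "Tabs", "Crash when closing tab")
repository = [
    BugReport(2, "Firefox", "Tabs", "Browser crash when closing a tab"),
    BugReport(3, "Thunderbird", "Mail", "Crash when closing tab"),
    BugReport(4, "Firefox", "Tabs", "Bookmark icon missing"),
]

decisions = {old.report_id: classify_pair(new_report, old) for old in repository}
for report_id, (is_duplicate, step) in decisions.items():
    print(f"report {report_id}: duplicate={is_duplicate}, decided in step {step}")
```

In this sketch, only candidates that survive the cheap filter pay the cost of text processing, which is the source of the runtime saving the abstract reports; the actual saving depends on how many pairs step one can settle.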

Keywords

