An Investigation Towards Resampling Techniques and Classification Algorithms on CM1 NASA PROMISE Dataset for Software Defect Prediction
Abstract
Software defect prediction is a practical approach to improving the quality and efficiency of software testing processes. However, building robust and trustworthy defect prediction models is challenging because of the limited historical data that most development teams are able to collect, and because most software defect datasets are inherently imbalanced. Insight into how to properly construct defect prediction models on a small yet imbalanced dataset is therefore needed. The objective of this study is to provide that insight by investigating and comparing a number of resampling techniques, classification algorithms, and evaluation metrics for building defect prediction models on the CM1 NASA PROMISE dataset as a representative small yet imbalanced dataset. This study is comparative descriptive research following a positivist (quantitative) approach. Data were collected by observing experiments that combined four categories of resampling techniques (oversampling, undersampling, ensemble, and combined) with three categories of machine learning classification algorithms (traditional, ensemble, and neural network) to predict defective software modules in the CM1 dataset. Training was carried out twice, once with 5-fold cross-validation and once with a 70% training / 30% testing (holdout) split. Our results show that the combined and oversampling techniques have a positive effect on model performance. Among the classifiers, ensemble algorithms that extend the decision tree mechanism, such as Random Forest and eXtreme Gradient Boosting, achieved sufficiently good performance for predicting defective software modules. Regarding the evaluation metrics, the combined and rank-based performance metrics yielded modest variance and are therefore deemed suitable for evaluating model performance in this context.
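To make the described pipeline concrete, the sketch below pairs one oversampling technique (SMOTE) with one tree-based ensemble classifier (Random Forest) and evaluates it with both 5-fold cross-validation and a 70/30 holdout split, reporting rank-based (AUC) and combined (F1, MCC) metrics. This is a minimal illustration, not the authors' implementation: it assumes a Python environment with pandas, scikit-learn, and imbalanced-learn, and the CSV path and "defects" label column are placeholders for the CM1 NASA PROMISE data.

```python
# Minimal sketch (assumed stack: pandas, scikit-learn, imbalanced-learn).
# "cm1.csv" and the "defects" column are placeholders, not taken from the paper.
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

# Load CM1: static code metrics as features, a binary defect label as the target.
data = pd.read_csv("cm1.csv")                      # placeholder path
X = data.drop(columns=["defects"])
y = data["defects"].astype(int)                    # assumes a boolean/0-1 label

# Put resampling inside the pipeline so SMOTE is fitted only on each
# training fold and never sees the validation or test data.
model = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 5-fold stratified cross-validation scored with a rank-based metric (ROC AUC).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")

# 70/30 holdout split, evaluated with combined (F1, MCC) and rank-based (AUC) metrics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]
print("Holdout F1 :", round(f1_score(y_te, y_pred), 3))
print("Holdout MCC:", round(matthews_corrcoef(y_te, y_pred), 3))
print("Holdout AUC:", round(roc_auc_score(y_te, y_prob), 3))
```

Keeping the resampler inside the cross-validation pipeline is deliberate: applying SMOTE before splitting would leak synthetic minority samples into the evaluation folds and inflate the reported scores.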