Kmeans-SMOTE Integration for Handling Imbalance Data in Classifying Financial Distress Companies using SVM and Naïve Bayes
Abstract
Imbalanced data presents significant challenges in machine learning, leading to biased classification outcomes that favor the majority class. This issue is especially pronounced in the classification of financial distress, where imbalance is common because distressed companies are scarce in real-world datasets. This study aims to mitigate class imbalance in financial distress data using the Kmeans-SMOTE method, which combines Kmeans clustering with the synthetic minority oversampling technique (SMOTE). Classification approaches, including Naïve Bayes and support vector machine (SVM), are applied to a Kaggle financial distress dataset to evaluate the effectiveness of Kmeans-SMOTE. Experimental results show that SVM outperforms Naïve Bayes, achieving an accuracy of 99.1%, an F1-score of 99.1%, an area under the precision-recall curve (AUPRC) of 99.1%, and a geometric mean (Gmean) of 98.1%. On the basis of these results, Kmeans-SMOTE can balance the data effectively, leading to a substantial improvement in classification performance.
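To make the idea of combining Kmeans clustering with SMOTE concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation and it simplifies the method: clusters are kept only if at least half of their members belong to the minority class, and synthetic samples are spread evenly over the kept clusters, whereas the full method is generally described as weighting clusters by how sparse their minority samples are. The function names `kmeans` and `kmeans_smote`, the thresholds, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def kmeans_smote(X, y, minority=1, k=4, n_neighbors=3, seed=0):
    """Simplified Kmeans-SMOTE sketch (hypothetical helper, see lead-in)."""
    rng = np.random.default_rng(seed)
    labels = kmeans(X, k, rng)
    n_needed = int((y != minority).sum() - (y == minority).sum())

    # Keep the minority points of clusters dominated by the minority class.
    pools = []
    for j in range(k):
        in_cluster = labels == j
        mins = X[in_cluster & (y == minority)]
        if len(mins) >= 2 and len(mins) >= 0.5 * in_cluster.sum():
            pools.append(mins)
    if not pools or n_needed <= 0:
        return X, y  # nothing safe to oversample; data returned unchanged

    synthetic = []
    for i in range(n_needed):
        pool = pools[i % len(pools)]           # even allocation (assumption)
        a = pool[rng.integers(len(pool))]      # random minority seed point
        d = ((pool - a) ** 2).sum(axis=1)
        nn = np.argsort(d)[1:n_neighbors + 1]  # nearest neighbours, skipping a
        b = pool[rng.choice(nn)]
        # SMOTE step: interpolate between the seed and one of its neighbours.
        synthetic.append(a + rng.random() * (b - a))

    X_new = np.vstack([X, np.array(synthetic)])
    y_new = np.concatenate([y, np.full(n_needed, minority)])
    return X_new, y_new

# Toy data standing in for a financial distress dataset (purely illustrative).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 0.5, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
X_new, y_new = kmeans_smote(X, y)
print("before:", np.bincount(y), "after:", np.bincount(y_new))
```

Because interpolation happens only between minority points that share a cluster, the synthetic samples stay inside regions already occupied by the minority class, which is the motivation for clustering before oversampling. In a classification experiment the resampling would be applied to the training split only, so that the test data keep their original class distribution.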
Copyright (c) 2024 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.