Enhancing Stroke Prediction with Logistic Regression and Support Vector Machine Using Oversampling Techniques

  • Syamsul Risal Universitas Teknologi Akba Makassar
  • Fajar Apriyadi Universitas Teknologi Akba Makassar
  • A. Sumardin Universitas Teknologi Akba Makassar
  • Andini Dani Achmad Universitas Hasanuddin
  • Annisa Nurul Puteri Politeknik Negeri Ujung Pandang
Keywords: grid search cross-validation, logistic regression, machine learning, stroke disease, support vector machine

Abstract

Stroke is a major health concern that can cause death and disability, making early identification of risk factors crucial. Previous studies on stroke prediction have been limited by inadequate handling of class imbalance and by a lack of comprehensive feature selection and parameter optimization, with accuracy rates usually below 80%. This study compares the performance of Logistic Regression (LR) and Support Vector Machine (SVM) algorithms on a stroke prediction dataset, each combined with one of five resampling methods: SMOTE, Borderline-SMOTE, ADASYN, Random Over Sampling (ROS), and Random Under Sampling (RUS). Correlation-based feature selection identified age, hypertension, and heart disease as significant predictors. GridSearchCV with 10-fold cross-validation was used for hyperparameter optimization, and performance was evaluated using precision, recall, accuracy, and ROC curves. The results showed that SVM significantly outperformed LR across all sampling methods. SVM + ROS achieved the highest performance, with perfect recall (100%), precision of 97.18%, and accuracy of 98.56% (AUC: 0.9857), whereas SVM + Borderline-SMOTE offered balanced performance, with a recall of 94.99%, precision of 95.06%, and accuracy of 95.17% (AUC: 0.9512). Among the LR configurations, LR + Borderline-SMOTE performed best, with an accuracy of 84.98% (AUC: 0.8503), which is still higher than the accuracy typically reported in previous studies. This improved accuracy suggests a meaningful clinical benefit: it could reduce missed stroke diagnoses by identifying thousands of additional at-risk patients in large-scale screening programs. Healthcare providers should consider implementing SVM with ROS in critical care settings, where missed stroke cases have severe consequences. By contrast, SVM with Borderline-SMOTE may be more appropriate for resource-constrained environments.
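To make two of the ingredients named in the abstract concrete, the following is a minimal standard-library sketch of Random Over Sampling and of the precision, recall, and accuracy calculations. It is not the authors' implementation: the function names (`random_oversample`, `binary_metrics`), the toy rows, and the toy predictions are all hypothetical and chosen only for illustration. The study's own pipeline (SVM or LR tuned with GridSearchCV) is not reproduced here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (Random Over Sampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy, made-up data (assumption): 8 "no stroke" rows (0), 2 "stroke" rows (1).
toy_rows = [[age] for age in (34, 41, 29, 55, 47, 38, 62, 50, 71, 79)]
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels, seed=42)
class_counts = Counter(balanced_labels)

# Hypothetical predictions, used only to exercise the metric function.
toy_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
precision, recall, accuracy = binary_metrics(toy_labels, toy_pred)
```

On this toy input both classes end up with 8 rows, and the hypothetical predictions give a recall of 1.0 with a precision of 2/3 and an accuracy of 0.9. This shows how perfect recall can coexist with lower precision, the same qualitative pattern the abstract reports for SVM + ROS. In practice, resampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.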

Published
2025-06-22
How to Cite
Risal, S., Fajar Apriyadi, A. Sumardin, Andini Dani Achmad, & Annisa Nurul Puteri. (2025). Enhancing Stroke Prediction with Logistic Regression and Support Vector Machine Using Oversampling Techniques. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 9(3), 562–574. https://doi.org/10.29207/resti.v9i3.6431
Section
Artificial Intelligence