Impact of Adaptive Synthetic on Naïve Bayes Accuracy in Imbalanced Anemia Detection Datasets

Muhammad Khahfi Zuhanda; Lisya Permata; Hartono; Erianto Ongko; Desniarti

doi:10.29207/resti.v9i1.6031

Muhammad Khahfi Zuhanda, Universitas Medan Area
Lisya Permata, Universitas Islam Sumatera Utara
Hartono, Universitas Medan Area
Erianto Ongko, Institut Modern Arsitektur dan Teknologi
Desniarti, Universitas Muslim Nusantara Al Washliyah

Keywords: ADASYN, Oversampling, Naïve Bayes, Class Imbalance, Machine Learning

Abstract

This research analyzes the impact of the Adaptive Synthetic (ADASYN) oversampling technique on the performance of the Naïve Bayes classification algorithm on a dataset with class imbalance. Class imbalance is a common problem in machine learning that can bias predictions, especially for minority classes. ADASYN is an oversampling method that adaptively synthesizes new data for minority classes. In this study, the performance of the Naïve Bayes algorithm was tested on the Anemia Diagnosis dataset before and after applying ADASYN. The dataset contains 104 instances, 5 attributes, and 2 classes, and has an imbalance ratio of 3. The evaluation compared accuracy, confusion matrix, precision, recall, and F1-score to obtain a more comprehensive picture of the effectiveness of ADASYN in improving Naïve Bayes. The results indicate that the performance of an oversampling method depends on the imbalance ratio, so it is important to ensure that oversampling does not cause overfitting; ADASYN addresses this by generating synthetic data only from selected neighbors. ADASYN increased accuracy from 0.57 to 0.78, precision from 0.17 to 0.74, recall from 0.20 to 0.88, and F1-score from 0.18 to 0.80. We also compared ADASYN and SMOTE when applied with the Naïve Bayes algorithm. ADASYN outperformed SMOTE on all key metrics (accuracy, precision, recall, and F1-score), and the accuracy improvement was statistically significant (p-value = 0.00903).
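
The adaptive allocation that gives ADASYN its name can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state which tool or neighbor count was used, so the function name `adasyn_sketch`, the parameters `k` and `beta`, and the toy data (built with the same 3:1 class ratio as the abstract reports) are all assumptions made for illustration. The sketch follows the generally published ADASYN idea: minority samples surrounded by more majority-class neighbors receive more synthetic points.

```python
import math
import random


def adasyn_sketch(X, y, minority_label, k=3, beta=1.0, seed=0):
    """Illustrative ADASYN-style oversampling (hypothetical helper).

    Returns (synthetic_points, per_sample_counts) for the minority class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_min = len(minority_idx)
    n_maj = len(X) - n_min
    if n_min < 2:
        return [], [0] * n_min  # no minority neighbour to interpolate with

    # Total synthetic points wanted; beta=1.0 aims for fully balanced classes.
    total_needed = (n_maj - n_min) * beta

    def k_nearest(i, candidates, k):
        # Indices of the k candidates closest to sample i, excluding i itself.
        ranked = sorted((math.dist(X[i], X[j]), j) for j in candidates if j != i)
        return [j for _, j in ranked[:k]]

    # Step 1: difficulty of each minority sample = share of majority-class
    # points among its k nearest neighbours in the whole data set.
    ratios = []
    for i in minority_idx:
        neighbours = k_nearest(i, range(len(X)), k)
        ratios.append(sum(y[j] != minority_label for j in neighbours) / k)

    ratio_sum = sum(ratios)
    if ratio_sum == 0:
        return [], [0] * n_min  # no minority sample borders the majority class

    # Step 2: normalise the ratios into per-sample synthetic counts.
    counts = [round(r / ratio_sum * total_needed) for r in ratios]

    # Step 3: interpolate between each sample and a random minority neighbour.
    k_min = min(k, n_min - 1)
    synthetic = []
    for i, count in zip(minority_idx, counts):
        neighbours = k_nearest(i, minority_idx, k_min)
        for _ in range(count):
            z = X[rng.choice(neighbours)]
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(X[i], z)))
    return synthetic, counts


# Toy data (invented): 12 majority points on a grid, 4 minority points.
majority = [(float(i), float(j)) for i in range(4) for j in range(3)]
minority = [(2.5, 1.5), (3.5, 2.5), (4.0, 3.0), (4.5, 2.5)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

synthetic, counts = adasyn_sketch(X, y, minority_label=1, k=3)
print("synthetic points per minority sample:", counts)
print("total synthetic points:", len(synthetic))
```

In this toy set the first minority point sits inside the majority grid, so it receives the largest share of synthetic points. In practice a maintained implementation such as the `ADASYN` class in the imbalanced-learn package would normally be used in place of a hand-written version.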
