The Exploring feature selection techniques on Classification Algorithms for Predicting Type 2 Diabetes at Early Stage

  • Mila Desi Anasanti, University College London
  • Khairunisa Hilyati, Universitas Nusa Mandiri
  • Annisa Novtariany, Universitas Nusa Mandiri
Keywords: Type 2 diabetes, machine learning, feature selection, feature importance


Predicting Type 2 diabetes (T2D) at an early stage is critical for improving care and patient outcomes. Accurate and efficient T2D prediction relies on relevant, unbiased features. In this study, we searched for important predictors of T2D by integrating machine learning (ML) models for feature selection and classification, using data from 520 individuals who were newly diagnosed with diabetes or who will go on to develop it. We used standard ML classifiers: logistic regression (LR), Gaussian naive Bayes (NB), decision tree (DT), random forest (RF), support vector machine (SVM) with linear basis function, and k-nearest neighbors (KNN). We systematically explored one representative feature selection method from each family of techniques: a statistical filter method (F-score), an entropy-based filter method (mutual information), an ensemble-based filter method (random forest importance), and a stochastic optimization method (simultaneous perturbation feature selection and ranking, SpFSR). We used stratified 10-fold cross-validation and assessed discrimination, calibration, and clinical utility. RF attained the highest accuracy, 98%, with the full set of 16 features, so we then used RF as the wrapper classifier for selecting the important features. The combination of SpFSR and RF was the best model: its accuracy did not differ significantly from that of the full feature set (P-value = 0.26, above the 0.05 threshold). The study's findings support the efficiency and usefulness of the suggested method for choosing the most important features of diabetes data: polyuria, gender, polydipsia, age, itching, sudden weight loss, delayed healing, and alopecia.
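To make two of the steps named in the abstract concrete, the following is a minimal sketch of a statistical filter ranking and a stratified fold split, written in pure Python. It assumes that "F-score" refers to the one-way ANOVA F-statistic, which the abstract does not confirm. The function names and the toy data are hypothetical and are not taken from the study or its dataset.

```python
import random
from collections import defaultdict
from statistics import fmean


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of one feature against the class labels.

    Assumption: this stands in for the "F-score" filter; the study may
    have used a different definition.
    """
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if df_between == 0:
        return 0.0  # a single class carries no discriminative signal
    grand_mean = fmean(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def rank_features(rows, labels, names):
    """Return (names sorted from highest to lowest score, score per name)."""
    scores = {
        name: anova_f_score([row[j] for row in rows], labels)
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda n: scores[n], reverse=True), scores


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)  # deal each class round-robin
    return folds


# Synthetic toy data (NOT the study's data): 10 positive, 10 negative cases.
labels = [1] * 10 + [0] * 10
polyuria = [1] * 9 + [0] + [1] + [0] * 9   # strongly associated with the label
itching = [1, 0] * 10                      # identical rate in both classes
rows = list(zip(polyuria, itching))

order, scores = rank_features(rows, labels, ["polyuria", "itching"])
folds = stratified_folds(labels, k=5)
print(order, scores)
print([sorted(fold) for fold in folds])
```

In a full pipeline, the ranking would be recomputed on the training folds only, so that the held-out fold does not influence which features are selected.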




How to Cite
Mila Desi Anasanti, Khairunisa Hilyati, & Annisa Novtariany. (2022). The Exploring feature selection techniques on Classification Algorithms for Predicting Type 2 Diabetes at Early Stage. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 6(5), 832 - 839.