Diabetes Risk Prediction using Feature Importance Extreme Gradient Boosting (XGBoost)

  • Kartina Diah Kusuma Wardani Politeknik Caltex Riau
  • Memen Akbar Politeknik Caltex Riau
Keywords: diabetes, prediction, machine learning, xgboost


Diabetes results from impaired function of the pancreas, which produces the hormones insulin and glucagon that regulate blood glucose levels. Diabetes no longer affects only adults: pre-diabetes has also been identified in children and adolescents. Early prediction of diabetes makes it easier for doctors and patients to intervene as soon as possible, reducing the risk of complications. One use of medical data from diabetes patients is to build a model that medical personnel can use to predict and identify diabetes in patients. Various techniques, including machine learning, are used to predict diabetes as early as possible from the symptoms that patients experience. Machine learning builds a model from historical data on diabetic patients, and the model is then used to make predictions. In this study, extreme gradient boosting (XGBoost) with feature importance is the machine learning technique used to predict diabetes. The dataset is the Early Stage Diabetes Risk Prediction dataset published by the UCI Machine Learning Repository, which has 520 records and 16 attributes. The XGBoost diabetes prediction model is displayed as a tree. The model achieved a precision of 98.71% and an F1 score of 98.18%. The accuracy obtained with the 10 best attributes, selected by XGBoost feature importance, is 98.72%.
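As a rough illustration of the two ideas named in the abstract, the sketch below ranks features by an importance score and keeps the top k, then computes accuracy, precision, and F1 from binary predictions. It is not the authors' implementation: the function names, the feature names `f0` to `f4`, the importance values, and the toy labels are all hypothetical, and in practice the scores would come from a trained XGBoost model rather than a hand-written dictionary.

```python
def top_k_features(importances, k):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical importance scores (assumption: stand-ins for model output).
toy_importances = {"f0": 0.40, "f1": 0.05, "f2": 0.30, "f3": 0.10, "f4": 0.15}
selected = top_k_features(toy_importances, k=3)

# Hypothetical true labels and predictions (1 = diabetic, 0 = not diabetic).
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 1]
scores = binary_metrics(toy_true, toy_pred)
```

In a real pipeline, the model would be retrained on the selected columns only and the metrics recomputed on held-out data, which is the kind of comparison the reported top-10 accuracy reflects.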






How to Cite
Kartina Diah Kusuma Wardani, & Memen Akbar. (2023). Diabetes Risk Prediction using Feature Importance Extreme Gradient Boosting (XGBoost). Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 7(4), 824 - 831. https://doi.org/10.29207/resti.v7i4.4651
Information Systems Engineering Articles