XGBoost Algorithm for Cervical Cancer Risk Prediction: Multi-dimensional Feature Analysis

Sudi Suryadi; Masrizal

doi:10.29207/resti.v9i3.6587

Sudi Suryadi, Universitas Labuhanbatu
Masrizal, Universitas Labuhanbatu

Keywords: cervical cancer screening, computational oncology, machine learning, risk stratification, XGBoost

Abstract

Cervical cancer remains a significant global health challenge, and early detection is the cornerstone of effective intervention. This study, situated at the intersection of clinical oncology and computational intelligence, explores the potential of gradient-boosting algorithms to overcome the limitations of conventional screening methods. An XGBoost model incorporating demographic, behavioral, and clinical parameters was developed to predict cervical cancer risk, using data from 858 patients at the Hospital Universitario de Caracas. The preprocessing pipeline was designed to address the complexities inherent in medical data, including the handling of missing values and the standardization of heterogeneous features. The model achieved an overall accuracy of 96.3%, with a sensitivity of 66.7% and a specificity of 97.6%. This performance profile reflects a balance between missed diagnoses and unnecessary interventions. Feature importance analysis revealed a multifaceted risk landscape in which screening test results contributed substantial predictive power (approximately 60%), complemented by demographic and behavioral factors, including age, reproductive history, and contraceptive use. Confusion matrix analysis clarified the clinical implications of the model's predictions, showing a promising positive predictive value of 55.0% despite the pronounced class imbalance. These findings suggest that ensemble learning approaches can synthesize diverse patient data into meaningful risk assessments, potentially improving screening efficiency through personalized stratification. Future research directions include prospective validation across diverse populations, integration of longitudinal data, and further exploration of explainable AI techniques to bridge the gap between algorithmic predictions and clinical implementation.
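The four figures quoted in the abstract (accuracy, sensitivity, specificity, and positive predictive value) are all ratios of confusion-matrix counts. The sketch below shows how they relate. The function name `screening_metrics` and the counts passed to it are assumptions made for illustration: the counts were chosen only because they reproduce the rounded percentages above, and they are not taken from the study's confusion matrix.

```python
def screening_metrics(tp, fn, fp, tn):
    """Compute standard screening metrics from confusion-matrix counts.

    tp: true positives, fn: false negatives,
    fp: false positives, tn: true negatives.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of actual positives detected
        "specificity": tn / (tn + fp),     # share of actual negatives cleared
        "ppv": tp / (tp + fp),             # share of positive calls that are correct
    }


# Hypothetical counts (an assumption, NOT reported in the paper).
metrics = screening_metrics(tp=22, fn=11, fp=18, tn=732)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With a rare positive class, accuracy is dominated by the true negatives, which is why a model can reach high accuracy while sensitivity and positive predictive value stay much lower.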
