Comparing Correlation-Based Feature Selection and Symmetrical Uncertainty for Student Dropout Prediction
Abstract
Predicting student dropout is essential for universities facing high attrition rates. This study compares two feature selection (FS) methods, correlation-based feature selection (CFS) and symmetrical uncertainty (SU), in educational data mining for dropout prediction. We evaluate both methods with three classification algorithms: decision tree (DT), support vector machine (SVM), and naive Bayes (NB). Results show that SU outperforms CFS overall, with SVM achieving the highest accuracy (98.16%) when combined with SU. Moreover, the study identifies total credits in the fourth semester, cumulative GPA, gender, and student domicile as key predictors of dropout. These findings show that feature selection can improve the accuracy of dropout prediction and thereby help educational institutions retain students.
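For readers unfamiliar with symmetrical uncertainty, the following is a minimal sketch of how the measure can be computed for discrete features, using the standard definition SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)). It is not the authors' implementation. The toy data and the feature names low_credits and gender are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 0 means independent, 1 means fully dependent."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)


# Hypothetical toy data (assumption, not taken from the study): 1 = dropped out.
dropout = [1, 1, 0, 0, 1, 0, 0, 1]
features = {
    "low_credits": [1, 1, 0, 0, 1, 0, 0, 1],  # mirrors the label exactly
    "gender": [0, 1, 0, 1, 0, 1, 0, 1],       # statistically independent of the label here
}

# Score every feature against the class label and rank from most to least informative.
scores = {name: symmetrical_uncertainty(col, dropout) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

A filter of this kind scores each feature against the class label independently of any classifier, so the same ranking can be passed to DT, SVM, or NB. CFS differs in that it evaluates subsets of features, rewarding correlation with the class while penalizing redundancy among the selected features.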
Copyright (c) 2024 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author.
- The author acknowledges that the RESTI Journal (System Engineering and Information Technology) is the first publisher of the work, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate arrangements for the non-exclusive distribution of the published version of the manuscript (e.g., depositing it in the author's institutional repository or publishing it in a book), provided they acknowledge that the manuscript was first published in the RESTI (Rekayasa Sistem dan Teknologi Informasi) journal.