A Comparative Study of CatBoost and Double Random Forest for Multi-class Classification

  • Annisarahmi Nur Aini Aldania, IPB University
  • Agus Mohamad Soleh, IPB University
  • Khairil Anwar Notodiputro, IPB University
Keywords: CatBoost, double random forest, multi-class classification

Abstract

Multi-class classification poses challenges beyond those of binary classification, chiefly because the interactions between the explanatory variables and the response grow more complex as the number of classes increases. Ensemble methods such as boosting and random forest (RF) have proven effective for classification problems. This research studies multi-class classification using CatBoost, a method built on gradient boosting, and double random forest (DRF), an extension of RF suited to situations where the resulting RF model underfits. The analysis was carried out on both simulated and empirical data. In the simulation study, we generated data with high, medium, and low distances between classes. The empirical data are industrial classification codes (KBLI). At the high class distance, both CatBoost and DRF solve the multi-class classification problem perfectly, with balanced accuracy scores of 100%. At the medium distance, CatBoost and DRF achieve balanced accuracy scores of 99.25% and 97.54%, respectively, and at the low distance, 32.37% and 23.97%. On the empirical data, CatBoost outperforms DRF by 4.27%. All differences are statistically significant according to t-tests. We also use LIME to explain individual CatBoost predictions and to identify the words that contribute most to the prediction of an example class.
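
To make the comparison concrete, the sketch below shows a minimal Python pipeline, assuming the standard catboost and scikit-learn packages: simulated multi-class data whose class separation can be varied, and balanced accuracy (the unweighted mean of per-class recall) as the metric. Double random forest has no scikit-learn implementation, so a plain RandomForestClassifier stands in for DRF here; this illustrates the workflow under those assumptions and is not the authors' code.

    from catboost import CatBoostClassifier
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    # class_sep loosely mirrors the high/medium/low class-distance scenarios:
    # larger values spread the class centroids further apart.
    X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                               n_classes=3, class_sep=1.0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    models = {
        "CatBoost": CatBoostClassifier(verbose=0, random_state=42),
        "RF (stand-in for DRF)": RandomForestClassifier(random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        score = balanced_accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: balanced accuracy = {score:.4f}")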
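The LIME step can be sketched in the same spirit. The snippet below builds a small text pipeline (a TF-IDF vectorizer followed by CatBoost) and explains one prediction with lime's LimeTextExplainer; the toy English corpus and class names are hypothetical placeholders, since the paper's KBLI descriptions are Indonesian text that is not reproduced here.

    from catboost import CatBoostClassifier
    from lime.lime_text import LimeTextExplainer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    # Toy stand-in corpus and labels (hypothetical, not the KBLI data).
    texts = ["retail sale of food", "retail sale of clothing",
             "freight transport by road", "passenger transport by road",
             "software development services", "data processing services"] * 10
    labels = [0, 0, 1, 1, 2, 2] * 10
    class_names = ["trade", "transport", "ict"]

    pipeline = make_pipeline(TfidfVectorizer(),
                             CatBoostClassifier(iterations=200, verbose=0,
                                                random_state=42))
    pipeline.fit(texts, labels)

    explainer = LimeTextExplainer(class_names=class_names)
    explanation = explainer.explain_instance(
        "retail sale of food and beverages",   # one unseen description
        pipeline.predict_proba, num_features=5, top_labels=1)
    # Words with the largest contribution to the top predicted class.
    print(explanation.as_list(label=explanation.available_labels()[0]))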




Published
2023-02-02
How to Cite
Aldania, A. N. A., Soleh, A. M., & Notodiputro, K. A. (2023). A Comparative Study of CatBoost and Double Random Forest for Multi-class Classification. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 7(1), 129-137. https://doi.org/10.29207/resti.v7i1.4766
Section
Information Systems Engineering Articles