The Effect of Text Normalization with Text Expansion on Detecting Spam Comments on YouTube

  • Imam Thoib, Universitas Amikom Yogyakarta
  • Arief Setyanto, Universitas Amikom Yogyakarta
  • Suwanto Raharjo, Institut Sains & Teknologi AKPRIND Yogyakarta
Keywords: spam detection, text normalization, text expansion, youtube spam comments


The popularity of YouTube as the largest video-sharing website in the world gives spammers the opportunity to profit from the platform in illicit ways by posting spam comments on YouTube videos. Spam comments are a serious nuisance for channel owners, and their variants are becoming harder to detect. One variant uses abbreviations, symbols, terms, or misspelled words to evade detection. This research evaluates several classification techniques and applies a text normalization method called TextExpansion to address this problem. It uses the YouTube Spam Collection dataset from the UCI Machine Learning Repository, which is composed of five datasets, each containing text comments extracted from a YouTube video (Psy, Katy Perry, LMFAO, Eminem, and Shakira). The evaluation results show that the highest accuracy achieved with TextExpansion is 90.23%. To determine the impact of applying TextExpansion, a t-test was conducted for each dataset. For every dataset the two-tailed p-value, P(T<=t), is below 0.05, which indicates that text normalization with TextExpansion has a statistically significant effect.
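
To make the idea of normalizing obfuscated comments concrete, the following is a minimal sketch of dictionary-based normalization in Python. It is not the TextExpansion method evaluated in this research. The lookup table `LINGO`, its entries, and the function `normalize` are hypothetical names chosen purely for illustration.

```python
import re

# Hypothetical lookup table mapping abbreviations and misspellings to
# canonical words. The entries are illustrative and not taken from the paper.
LINGO = {
    "u": "you",
    "ur": "your",
    "plz": "please",
    "pls": "please",
    "chk": "check",
    "vid": "video",
    "subscr1be": "subscribe",
    "4": "for",
}

# A token is a run of lowercase letters, digits, or apostrophes;
# symbols and punctuation are dropped.
TOKEN = re.compile(r"[a-z0-9']+")


def normalize(comment: str) -> str:
    """Lowercase a comment, tokenize it, and expand known lingo tokens."""
    tokens = TOKEN.findall(comment.lower())
    return " ".join(LINGO.get(tok, tok) for tok in tokens)


print(normalize("Plz chk my vid & SUBSCR1BE 4 more!!"))
# prints: please check my video subscribe for more
```

After this kind of step, variants such as "plz" and "please" collapse to the same token before being passed to a classifier. A realistic normalizer would need a far larger dictionary and some handling of ambiguous tokens, for example "4" used as a number.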


Technology Information Article