Feature Extraction for Improvement Text Classification of Spam YouTube Video Comment using Deep Learning
Abstract
This study proposes combining Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF) with a Data Augmentation Technique (DAT). DAT integrates spam YouTube video comments into the traditional TF-IDF algorithm and generates a weighted word vector, which is fed into the BiLSTM-CRF to capture context information effectively. The outcome is a new classification model for spam YouTube video comments, intended to improve classification performance. Two experiments were conducted: the first using BiLSTM-CRF without DAT and the second using BiLSTM-CRF with DAT. In the experiments, BiLSTM-CRF with DAT performed strongly on text classification of spam YouTube video comments, with accuracy = 83.3%, precision = 83.6%, recall = 83.3%, and F-measure = 83.3%. The combination of BiLSTM-CRF and the Data Augmentation Technique can therefore be used to increase the accuracy of text classification for spam YouTube video comments.
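As a rough illustration of the TF-IDF weighting step mentioned in the abstract, the sketch below computes a per-token weight for each comment using the textbook formulation (term frequency times the log of inverse document frequency). The abstract does not specify how DAT modifies TF-IDF or how the weights are combined with word vectors before the BiLSTM-CRF, so none of that is shown here. The function name `tfidf_weights`, the whitespace tokenizer, and the sample comments are hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def tfidf_weights(comments):
    """Return one {token: weight} dict per comment using plain TF-IDF.

    Assumption: whitespace tokenization and the unsmoothed formula
    tf(t, d) * log(N / df(t)); the paper's exact variant is not stated.
    """
    docs = [comment.lower().split() for comment in comments]
    n_docs = len(docs)

    # Document frequency: in how many comments each token appears.
    df = Counter(token for doc in docs for token in set(doc))

    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return weighted


# Hypothetical sample comments, not taken from the paper's dataset.
comments = [
    "check out my channel",
    "great song",
    "subscribe to my channel now",
]
weights = tfidf_weights(comments)
```

Tokens that occur in fewer comments receive larger weights, so in the first sample comment "check" is weighted above "my", which also appears in the third comment. A token that appears in every comment would get a weight of zero under this unsmoothed formula.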
Copyright (c) 2023 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author.
- The author acknowledges that the RESTI Journal (System Engineering and Information Technology) is the first publisher of the article, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate arrangements for the non-exclusive distribution of the version of the manuscript published in this journal (e.g., depositing it in the author's institutional repository or publishing it in a book), provided they acknowledge that the manuscript was first published in the RESTI (Rekayasa Sistem dan Teknologi Informasi) journal.