Hate Speech Detection on Twitter in Indonesia with Feature Expansion Using GloVe
Abstract
Twitter is a popular social media platform for voicing opinions in the form of criticism and suggestions. Criticism can become a form of hate speech if it attacks an individual, race, or group. Because a tweet is limited to 280 characters, users often abbreviate words, which causes vocabulary mismatch; word embeddings can address this. This study uses feature expansion with Global Vectors (GloVe) to reduce vocabulary mismatch in Indonesian-language hate speech on Twitter. Feature selection to find the best model is carried out using the Logistic Regression (LR), Random Forest (RF), and Artificial Neural Network (ANN) algorithms. The results show that the Random Forest model with 5,000 features and a combination of TF-IDF and a tweet corpus built with GloVe achieves the best accuracy among the models, averaging 88.59%, which is 1.25% higher than the predetermined baseline. The number of features used is shown to improve the performance of the system.
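The feature expansion idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity table `SIMILAR_WORDS`, the function `expand_features`, the `top_k` parameter, and the example tweet are all hypothetical, and binary presence values stand in for TF-IDF weights. The assumption is that a feature absent from a tweet is switched on when one of its nearest neighbours in the GloVe-built corpus appears in the tweet.

```python
from typing import Dict, List

# Hypothetical similarity corpus: each feature word maps to its nearest
# neighbours, ranked by embedding similarity. In the study this ranking would
# be derived from GloVe vectors; these entries are invented for illustration.
SIMILAR_WORDS: Dict[str, List[str]] = {
    "tidak": ["tdk", "gak", "nggak"],
    "dengan": ["dgn", "sama"],
    "orang": ["org", "manusia"],
}


def expand_features(
    tokens: List[str],
    features: List[str],
    similar: Dict[str, List[str]],
    top_k: int = 3,
) -> List[int]:
    """Return a binary feature vector with feature expansion applied.

    A feature is 1 if the word itself occurs in the tweet, or if any of its
    top_k most similar words occurs in the tweet; otherwise it stays 0.
    """
    present = set(tokens)
    vector = []
    for feature in features:
        neighbours = similar.get(feature, [])[:top_k]
        hit = feature in present or any(word in present for word in neighbours)
        vector.append(1 if hit else 0)
    return vector


features = ["tidak", "dengan", "orang"]
tweet = "org itu tdk datang".split()

# Without expansion, the abbreviations "org" and "tdk" match no feature.
baseline = [1 if feature in set(tweet) else 0 for feature in features]
expanded = expand_features(tweet, features, SIMILAR_WORDS)

print(baseline)  # [0, 0, 0]
print(expanded)  # [1, 0, 1]
```

In this toy case the unexpanded vector is all zeros because the tweet uses abbreviations, while the expanded vector recovers two of the three features. The resulting vectors could then be passed to any of the classifiers named in the abstract.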
Copyright (c) 2021 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author
- The author acknowledges that the RESTI Journal (System Engineering and Information Technology) is the first publisher of the work, under a Creative Commons Attribution 4.0 International License.
- Authors may separately arrange non-exclusive distribution of the manuscript published in this journal in other versions (e.g., deposit in the author's institutional repository, publication in a book, etc.), provided they acknowledge that the manuscript was first published in the RESTI (Rekayasa Sistem dan Teknologi Informasi) journal.