Indonesian Hate Speech Detection Using Bidirectional Long Short-Term Memory (Bi-LSTM)

Aditya Perwira Joan Dwitama; Dhomas Hatta Fudholi; Syarif Hidayat

doi:10.29207/resti.v7i2.4642

Aditya Perwira Joan Dwitama, Universitas Islam Indonesia
Dhomas Hatta Fudholi, Universitas Islam Indonesia
Syarif Hidayat, Universitas Islam Indonesia

Keywords: hate speech, Bi-LSTM, IndoBERT, text, tweet

Abstract

Social media platforms allow users to express themselves freely, which includes spreading hate speech. The Indonesian government has issued regulations under the UU ITE (the Electronic Information and Transactions Law) to handle and prevent hate speech on social media. Previous research has used Bi-LSTM to classify text as hate speech or not, and other research has used Bi-GRU to detect hate speech and its categories. However, the Bi-GRU model still performs below the Bi-LSTM model, with accuracies of 86.44% and 96.44%, respectively. Therefore, this study aims to build a model that can detect both hate speech and its categories. The study proposes Bi-LSTM as the classification model and IndoBERT as the tokenization model. The dataset is a public dataset containing 13 thousand tweets. The best model was obtained with 20 epochs, a batch size of 192, one Bi-LSTM layer with 40 nodes, and class weighting applied during optimization. The pre-trained IndoBERT model used to support classification is "indobenchmark/indobert-large-p2". The proposed model achieves an average accuracy, precision, and recall of 97.66%, 96.50%, and 85.25%, respectively.
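The abstract does not say how the class weights were computed, so the following is only a minimal sketch of one common scheme: inverse-frequency ("balanced") weights applied to a binary cross-entropy loss. The function names `balanced_class_weights` and `weighted_binary_cross_entropy` and the toy label distribution are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    ASSUMPTION: this "balanced" heuristic is illustrative; the study may
    have used a different weighting scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_binary_cross_entropy(y_true, y_prob, weights, eps=1e-7):
    """Mean binary cross-entropy where each sample is scaled by the
    weight of its true class, so errors on rare classes cost more."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * sample_loss
    return total / len(y_true)


# Toy, made-up label distribution: 8 non-hate tweets (0), 2 hate tweets (1).
toy_labels = [0] * 8 + [1] * 2
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)  # {0: 0.625, 1: 2.5}
```

With these weights, a confident mistake on the minority class contributes four times as much to the loss as an equally confident mistake on the majority class. Weighting of this kind is typically used to improve recall on rare labels, the metric that lags accuracy and precision in the reported results.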
