Hate Speech Detection on Twitter in Indonesia with Feature Expansion Using GloVe

Twitter is one of the most popular social media platforms for channeling opinions in the form of criticism and suggestions. Criticism can become hate speech if it implies an attack on something (an individual, race, or group). With the limit of 280 characters per tweet, vocabulary mismatch often arises from abbreviations, which can be addressed with word embedding. This study uses feature expansion with Global Vectors (GloVe) to reduce vocabulary mismatches in hate speech detection on Indonesian-language Twitter. Feature selection for the best model is carried out using the Logistic Regression (LR), Random Forest (RF), and Artificial Neural Network (ANN) algorithms. The results show that the Random Forest model with 5.000 features and a combination of TF-IDF and a Tweet corpus built with GloVe produces the best accuracy among the models, with an average accuracy of 88,59%, which is 1,25% higher than the predetermined Baseline. Feature selection (limiting the number of features used) is shown to improve the performance of the system.


Introduction
Indonesia is one of the top five countries most invested in social media in general, and in Twitter in particular [1]. Many netizens use the Twitter platform as a channel for opinions in the form of criticism and suggestions. However, netizens often confuse criticism with hate speech. Criticism can become hate speech if it implies an attack on something (an individual, race, or group) [2]. Hate speech is a crime under the ITE Law Number 11 of 2008, Article 45 Paragraph 2 [3].
In the detection process, the use of non-standard vocabulary makes tweets challenging to understand without context [4], a problem that word embedding can help overcome. Word embedding finds the vector of a word and its context in the corpus so that it can be matched against specific criteria. Word2vec was used for feature expansion in a previous study [4]. Feature expansion can also be carried out using Global Vectors for word representation (GloVe). GloVe, developed at Stanford University, is a global log-bilinear regression model for unsupervised learning of word representations that is reported to outperform other models on word analogy, word similarity, and named entity recognition tasks. It is described as an efficient and effective method for learning vector representations of words. In this study, GloVe was chosen as the word embedding method because it is reported to consistently outperform word2vec: it achieves better results faster, and it also obtains the best results irrespective of speed [5].
In [6], hate speech detection on a dataset of 16K annotated tweets was the first work to use a deep learning architecture to learn semantic word embeddings to handle this complexity, outperforming word n-gram methods by ∼18 F1 points. Research on detecting Indonesian hate speech has also been conducted previously [7]-[9]. In [9], Random Forest Decision Tree (RFDT) with Label Power-set (LP) as the transformation method gave the best accuracy with fast computational time in general. The research in [8] used the Latent Dirichlet Allocation (LDA) algorithm and achieved an F-measure of 93,5% when using the word n-gram feature with Random Forest. Word n-grams outperformed character n-grams in [7].
Hate speech detection using GloVe has been carried out with the Deep Belief Network (DBN) algorithm [10], which weights the GloVe features before classification to improve accuracy, reaching 86% accuracy and an 85,42% F1-Score. The advantage of a newly trained GloVe model was also demonstrated in [11], where it outperformed the pre-trained word embedding model (69,13% compared to 63,2%, about 5,9% higher).
Several studies on feature expansion have previously been carried out using word2vec, for topic classification [4] and Twitter sentiment analysis [12]. In [4], feature expansion with Google News datasets improved performance consistently when using LR. In [12], LR with feature expansion reached an accuracy of 98,81%, compared to Naïve Bayes (82,4%) and SVM (92,1%).
The main objective and focus of this research is to implement feature expansion using GloVe to reduce vocabulary mismatches in hate speech on Indonesian-language Twitter. The steps include implementing feature extraction using Boolean features and TF-IDF, expanding features with GloVe, and selecting features for the best model using the Logistic Regression (LR), Random Forest (RF), and Artificial Neural Network (ANN) algorithms. The scope of this study is limited to Indonesian tweet data. Harsh words directed at an individual or at oneself are included in this study's definition of hate speech.

Research Method
The system design for hate speech detection is shown in Figure 1.
Hate speech covers all actions, direct or indirect, that are motivated by hatred of specific groups or that incite hatred against individuals or groups through various means [13]. In Indonesia, hate speech crimes are covered by UU ITE Number 11 of 2008, Article 45 Paragraph 2, which carries a penalty of imprisonment for a maximum of 6 years or a fine of one billion rupiah [3]. Based on the Circular Letter of the Chief of Police Number: SE/6/X/2015 section 2f, hate speech is a criminal act in the form of insults, defamation, blasphemy, unpleasant actions, provocation, incitement, and the spreading of false news or hoaxes [13].
Descriptive research on Facebook found that, in posts, the most common form of hate speech was blasphemy, while in the comments column, insults took the form of reproach [14]. Social media networks are one of the mediums used to express hate speech. With the rapid circulation of data and information on social networks, it is easier for individuals to push specific issues and spread hate speech, which causes commotion among netizens. In addition, the anonymity and mobility facilitated by the Internet have made harassment and hate speech easy to express in an abstract landscape that lies beyond the control of law enforcement systems. By combining legal intervention with technology and regulatory mechanisms, the harm caused by online hate speech could be reduced [15].

Data Crawling
The dataset is obtained by crawling Indonesian-language Twitter using the Application Programming Interface (API) key provided by Twitter Developer. In the crawling process, tweets are retrieved by keyword, based on topics. The tweet topics are determined from trending topics during the data crawling period (October 2020 - June 2021), such as Omnibus Law, Religion, and Controversial Figures. Explicit words are also used as topics, since harsh words are one of the characteristics of hate speech. The crawl collected 20.601 tweets.

Data Preprocessing
Data preprocessing is one of the methods in data processing. The quality of the data is improved through a series of "cleaning" methods so that the results of the classification process can be more accurate [16]. Tweet data from crawling is unstructured text that usually contains significant noise (data that carries no valuable information for the analysis). In this research, data preprocessing consists of six stages. (1) Data cleaning eliminates noise, including emoticons, punctuation marks, and numbers, to reduce unnecessary information. (2) Case folding converts capital letters to lowercase letters [17]. (3) Stop word removal filters out words that are unimportant or irrelevant to the classification process [18]. (4) Normalization standardizes words that have almost the same meaning. (5) Stemming reduces a word to its base form by removing affixes. (6) Tokenization breaks sentences into words, phrases, and symbols called tokens, which assist in parsing and processing the data [19].

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical approach that weights a word according to its level of importance in the document corpus [21]. In the feature extraction stage, tweets retrieved by keyword are weighted with a TF-IDF score. After the weights W are calculated for each document, they are sorted to determine the degree of similarity between the documents and the keywords (the greater the W value, the higher the similarity). In this study, TF-IDF is used as the second vectorization method. The weight W in the TF-IDF calculation is formulated in equation 1, $W_{ij} = tf_{ij} \times idf_j$ (1), with $tf_{ij}$ as the number of occurrences of the searched word j in document i and $idf_j$ as the Inverse Document Frequency.
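The TF-IDF weighting can be illustrated with a minimal sketch. It assumes the common definition idf_j = log(N / df_j), where N is the number of documents and df_j the number of documents containing term j; the exact IDF variant used in the study is not stated, and the function name and toy tweets below are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes W_ij = tf_ij * idf_j with idf_j = log(N / df_j).
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy, already-preprocessed tweets (hypothetical data).
docs = [
    ["omnibus", "law", "tolak"],
    ["omnibus", "law", "dukung"],
    ["agama", "tolak", "tolak"],
]
w = tf_idf(docs)
```

A term that occurs in few documents (such as "agama" here) receives a larger weight than one that occurs in most of them, which is the behaviour the weighting is meant to capture.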

Global Vectors (GloVe)
Global Vectors (GloVe), developed at Stanford University, is a global log-bilinear regression model for unsupervised learning of word representations that is reported to outperform other models on word analogy, word similarity, and named entity recognition tasks. GloVe takes advantage of count data while simultaneously capturing the meaningful linear substructures common in log-bilinear prediction-based methods. Compared to word2vec (with the same corpus, vocabulary, window size, and training time), GloVe is reported to consistently outperform word2vec: it achieves better results faster, and it also obtains the best results irrespective of speed [5]. In a study comparing GloVe with other word embedding models such as Continuous Bag-of-Words, Skip-gram, and Hellinger PCA, GloVe was found to be the best model because it scales to a large corpus while still performing well on a small corpus, improving the quality of the learned representation through sum normalization and log smoothing [22].
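For reference, the weighted least-squares objective usually given for GloVe can be written as below. It is stated here from the general GloVe literature rather than from this study: $X_{ij}$ is the co-occurrence count of words i and j, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are biases, V is the vocabulary size, and $x_{\max}$ and $\alpha$ are hyperparameters whose values in this study are not reported.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The logarithm of the counts and the weighting function f correspond to the log smoothing and the use of count data mentioned above.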

Building Similarity Corpus
The corpus is created from every word contained in the tweet data. For each of these words, a similarity corpus is built with GloVe. To find the similarity between words, three sources are used: the tweet data, IndoNews, and a combination of the two. IndoNews data was previously used in [4]. Table 3 shows examples of vocabulary similar to "LGBT" in the similarity corpus built from the IndoNews dataset.
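A similarity corpus of this kind can be sketched as a mapping from each word to its ranked list of nearest neighbours. The sketch below assumes cosine similarity over word vectors, which is a common choice but is not confirmed by the text; the function names and the 2-dimensional toy vectors are hypothetical stand-ins for trained GloVe embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_corpus(vectors, top_k=10):
    """Map each word to its top_k most similar other words, best first."""
    corpus = {}
    for word, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != word
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        corpus[word] = [other for _, other in scored[:top_k]]
    return corpus


# Toy vectors standing in for trained embeddings (hypothetical values).
toy_vectors = {
    "hukum": [0.9, 0.1],
    "undang": [0.8, 0.2],
    "agama": [0.1, 0.9],
    "iman": [0.2, 0.8],
}
similarity = build_similarity_corpus(toy_vectors, top_k=2)
```

Truncating each ranked list to its first 1, 5, or 10 entries gives the Top 1, Top 5, and Top 10 variants used in the experiments.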

Feature Expansion
Feature expansion has been used to address the data distribution problem in corpus-based supervised Word Sense Disambiguation. It can effectively fix the low retrieval efficiency caused by word ambiguity in short queries [23]. The concept of feature expansion is to identify words missing from the tweet representation and substitute them with semantically related words [4]. The implementation of feature expansion in this research is based on [4], [12], and its steps follow the algorithms of those prior studies.
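The idea can be illustrated with a minimal sketch, which is an assumption about the general approach and not the exact algorithm of [4], [12]. It assumes Boolean features and a similarity corpus stored as a dictionary of ranked word lists; the function name, vocabulary, and toy data are hypothetical.

```python
def expand_features(tokens, vocabulary, similarity, top_k=1):
    """Return a Boolean vector over `vocabulary` for one tokenized tweet.

    A feature that is absent from the tweet is switched on when one of its
    top_k similar words (from the similarity corpus) occurs in the tweet.
    """
    present = set(tokens)
    vector = []
    for feature in vocabulary:
        if feature in present:
            vector.append(1)
        else:
            similar = similarity.get(feature, [])[:top_k]
            vector.append(1 if any(word in present for word in similar) else 0)
    return vector


vocabulary = ["hukum", "agama", "tolak"]
similarity = {"hukum": ["undang", "aturan"], "agama": ["iman"]}
tweet = ["tolak", "undang"]

# Plain Boolean representation versus the expanded one.
baseline = [1 if feature in tweet else 0 for feature in vocabulary]
expanded = expand_features(tweet, vocabulary, similarity, top_k=1)
```

In this toy case the feature "hukum" is absent from the tweet but is activated because its most similar word, "undang", is present, which is how expansion reduces vocabulary mismatch.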

Logistic Regression (LR)
Logistic Regression (LR) is a type of regression that relates independent variables to a categorical dependent variable. LR and ANN are currently the most widely used models in biomedicine (based on the number of publications indexed on MEDLINE: 28,500 for LR, 8,500 for ANN). The two come from different communities (statistics and computer science) but have much in common [24]. Like Linear Regression, LR predicts the presence of a characteristic or outcome from the values of a set of predictor variables, and it is suited to models with a dichotomous dependent variable (nominal data scale with two categories) [24]. The class membership probability for one of the two categories in the LR model is given below, with P as the logistic function value and x as the input data value.
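A standard form of this probability, assuming a single linear predictor with intercept $\beta_0$ and coefficient $\beta_1$ (symbols introduced here for illustration, not taken from the original), is:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```

The probability of the other category is $1 - P(y = 1 \mid x)$, so a threshold on P (commonly 0,5) yields the two-class decision.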
As for parameters, we ran a trial of the LR model with C = 1.0, a maximum of 100 iterations, newton-cg as the solver, and multinomial logistic regression.

Random Forest (RF)
Random Forest is a combination of tree predictors. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [25]. Random Forest was introduced by Ho (1995) and combines many trees built on the training data to produce a high level of accuracy [26]. The starting point of a tree is the root node, while the point where a chain ends is called a leaf node. A node represents a particular characteristic, whereas a branch represents a range of values [27]. In the RF partition, we divide the dataset into test and training sets. Each tree forms in-bag data from a subset of the training data and out-of-bag data from the remainder [28].
The implementation uses an optimized version of the CART algorithm to build decision trees. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node. As for the parameters used in this study, we ran a trial without setting a maximum tree depth; thus, nodes were expanded until they contained fewer than the minimum number of samples required to split an internal node. We used bootstrap samples when building trees.
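The in-bag and out-of-bag partition produced by bootstrap sampling can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the sample size of 10 are hypothetical, and the sketch does not reproduce the tree-building step.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_split(10, rng)
```

Because sampling is done with replacement, some training rows appear more than once in a tree's in-bag data and others not at all; the rows never drawn form that tree's out-of-bag set.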

Artificial Neural Network (ANN)
Artificial Neural Network, commonly referred to as ANN, is a neural network model and a branch of artificial intelligence, consisting of many interconnected simple processors (neurons) that work in parallel in the network [29]. Rather than being explicitly programmed for a specific task, an ANN is taught to perform it. The teaching mode can be either supervised or unsupervised. Neural networks can learn even in the presence of noise [30]. In this study, the Multi-layer Perceptron (MLP) model is used as a class of ANN [31]. An MLP consists of three or more layers (input and output layers with one or more hidden layers) of nonlinearly activating nodes. Each node in one layer connects with a certain weight to every node in the next layer.
As for parameters, we ran a trial of the ANN model with hidden layer sizes = (8, 8, 8) (three hidden layers of 8 units each), alpha = 1e-5, and the stochastic gradient-based optimizer proposed by Kingma and Ba (Adam) as the solver for weight optimization.
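The layer structure described above can be made concrete with a forward pass through an MLP with three hidden layers of 8 units. This is only a sketch under stated assumptions: the ReLU hidden activation, the sigmoid output, the 5-feature toy input, and the random untrained weights are all illustrative choices not given in the text, and training (the Adam solver and the alpha penalty) is not shown.

```python
import math
import random


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Affine layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Hidden layers use ReLU; the single output unit uses a sigmoid."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    (logit,) = dense(x, weights, biases)
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
sizes = [5, 8, 8, 8, 1]  # toy input width, three hidden layers of 8, one output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
prob = forward([1, 0, 1, 0, 0], layers)
```

With a real vocabulary the input width would equal the number of features (for example 5.000), while the hidden layer sizes stay at (8, 8, 8).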

Performance Evaluation
A confusion matrix represents how often a behavior is correctly detected and classified as a class [32]. In the confusion matrix, a result correctly classified into the positive class is called a True Positive (TP), and one correctly classified into the negative class a True Negative (TN). Meanwhile, a positive instance classified as negative is a False Negative (FN), and a negative instance classified as positive is a False Positive (FP). From the frequency of these four components, an indicator of the classifier's performance in detecting a given class can be obtained by calculating accuracy, precision, recall, and F1-Score. In this study, accuracy and F1-Score were obtained by averaging the results of five program executions. The equations for accuracy, precision, recall, and F1-Score are as follows.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
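As a worked example, the four metrics can be computed directly from the confusion matrix counts. The function name and the counts below are invented for illustration and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 100 test tweets.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and the F1-Score is 80/95. Averaging such values over five runs gives the figures reported in the experiments.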

Results
In the first scenario, feature extraction is performed using Boolean features as the Baseline and TF-IDF. Table 5 shows the performance of the Boolean features on the RF, LR, and ANN models for test size ratios of 0,1, 0,2, and 0,5 with 19.370 features. The highest accuracy is obtained at a test size ratio of 0,1, or 10% of the overall tweet data. The next step is to determine the optimal n-gram for the Baseline. The evaluation is limited to a sample of 5.000 features to keep the fairly large runtime memory usage manageable. Table 6 reports the performance of the n-grams compared to the Baseline with a test size of 10% and a training size of 90%. Unigram, which gave the highest accuracy for LR, RF, and ANN, is applied with a test ratio of 0,1 as the Baseline for the following scenarios. Furthermore, in feature extraction, TF-IDF weighting is applied to the baseline vector, with the experimental results in Table 7. Table 7 shows that applying TF-IDF increases accuracy for LR (0,24%) and RF (0,69%), while accuracy decreases by 1,42% for ANN.
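For reference, the unigram, bigram, and trigram features compared in the first scenario can be sketched as follows. The function name and the toy tokens are hypothetical; the study's own n-gram extraction code is not shown in the text.

```python
def word_ngrams(tokens, n):
    """Return the word n-grams of a token list, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["tolak", "omnibus", "law"]  # one preprocessed toy tweet
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
```

Higher-order n-grams produce a larger and sparser vocabulary over short tweets, which is one plausible reason why unigrams performed best here.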
The second scenario applies feature expansion with a similarity corpus of three types (the Tweet corpus, IndoNews, and a combination of the two) and their subcombinations using the Top 1, Top 5, and Top 10 similarity between words.
Table 8 shows the performance of GloVe on the LR classifier. Accuracy decreases when using the entire Tweet corpus, Top 5 and Top 10 on the IndoNews corpus, and the entire combination of the Tweet and IndoNews corpus. The highest increase of 0,07% was achieved by Top 1 with the IndoNews corpus. Table 9 shows the performance of GloVe on the RF classifier. Accuracy decreases when using the Top 10 in the IndoNews corpus and in the combination of the Tweet and IndoNews corpus. The highest increase of 0,44% was achieved by Top 10 with the Tweet corpus. Table 10 shows the performance of GloVe on the ANN classifier. Accuracy decreases only when using Top 1 in the IndoNews corpus. The highest increase of 2,37% was achieved by Top 10 with the combination of the Tweet and IndoNews corpus.
Table 11 shows the performance of GloVe with TF-IDF on the LR classifier. Accuracy decreases when using the Tweet corpus in the Top 5 and the IndoNews corpus in the Top 1. The highest increase of 0,41% was achieved by Top 1 with the Tweet corpus. Table 12 shows the performance of GloVe with TF-IDF on the RF classifier. Accuracy decreases when using the IndoNews corpus. The highest increase of 0,85% was achieved by the Top 10 of the combined Tweet and IndoNews corpus. Table 13 shows the performance of GloVe with TF-IDF on the ANN classifier. Accuracy decreases when using the IndoNews corpus, the Top 5 of the combined Tweet and IndoNews corpus, and the Top 5 and Top 10 of the Tweet corpus. The highest increase of 0,88% was achieved by the Top 10 of the combined Tweet and IndoNews corpus.
The third scenario applies feature selection, comparing data with 5.000, 10.000, 15.000, and 19.370 TF-IDF feature vectors using RF.
After that, feature expansion is applied to the number of features with the highest accuracy. Table 16 shows the performance of GloVe with TF-IDF on the LR classifier. Accuracy decreases when using the Top 1 of the IndoNews corpus and the Top 5 of the combined Tweet and IndoNews corpus. The highest increase of 0,56% was achieved by Top 1 with the combination of the Tweet and IndoNews corpus. Table 17 shows the performance of GloVe with TF-IDF on the RF classifier. There is no increase in accuracy when using the IndoNews corpus. The highest gain in accuracy of 1,25% was achieved by Top 1 with the Tweet corpus. Table 18 shows the performance of GloVe with TF-IDF on the ANN classifier. There is no increase in accuracy when using the Top 1 and Top 10 of the IndoNews corpus and the Top 1 and Top 5 of the Tweet corpus, IndoNews. The highest increase of 0,6% was achieved by Top 1 with the IndoNews corpus.

Discussion
Based on the test results, the RF and ANN classifiers most often show an increase in accuracy after feature expansion, with 16 increases compared to 15 for LR. RF achieves higher accuracy than the other classifier models. The highest increase in accuracy from feature expansion on the ANN model occurred in the combination of 19.370 features on Baseline + IndoNews, with an accuracy of 84,73% and an increase of 2,37%. Meanwhile, the highest accuracy was achieved by RF in Top 1 with the GloVe Tweet corpus, TF-IDF, and 5.000 features, at 88,59%. Therefore, the classification algorithms that worked better and were most influenced by feature expansion are RF and ANN. This result indicates that feature selection and TF-IDF weighting are responsible for the RF model achieving its best accuracy, compared to the Baseline with the same similarity corpus and rank (87,07%). In contrast, ANN performs better without them.
The combination of the Tweet and IndoNews corpus is the similarity corpus with the most increases in accuracy compared to the Baseline (19 increases). Meanwhile, the Tweet corpus has the most accuracy increases against the Baseline with 5.000 features (7 improvements), followed by the combined corpus (6 improvements) and the IndoNews corpus (5 improvements). Thus, the similarity corpus that has the most influence on the overall Baseline is the combination of the Tweet and IndoNews corpus.
After feature selection was implemented, the most influential corpus in increasing accuracy is the Tweet similarity corpus. Feature selection is shown to improve the system's performance, as measured by accuracy relative to the Baseline.

Conclusion
In this study, hate speech detection on Indonesian-language Twitter was carried out. Feature expansion using the word embedding Global Vectors (GloVe) was applied to overcome vocabulary mismatches, using a collection of 20.601 Indonesian tweets. A similarity corpus, needed in the feature expansion process, was built with GloVe from the tweet and IndoNews data. Feature extraction used Boolean features and TF-IDF. After that, feature expansion was performed and features were selected for the best model using the Logistic Regression (LR), Random Forest (RF), and Artificial Neural Network (ANN) algorithms.
The results show that the Random Forest model with 5.000 features and a combination of TF-IDF and a Tweet corpus built with GloVe produces the best accuracy among the models, with an average accuracy of 88,59%, which is 1,25% higher than the predetermined Baseline. The highest increase in average accuracy was obtained by ANN with 84,73% accuracy, gaining 2,37% accuracy with the Top 10 from the combination of the Tweet and IndoNews similarity corpus built with GloVe compared to the Baseline. The corpus that has the most influence on the overall Baseline is the combination of the Tweet and IndoNews corpus. Meanwhile, after feature selection, the most influential corpus in increasing accuracy is the Tweet similarity corpus. Based on the research results, the RF and ANN classifiers show the most accuracy increases after feature expansion, with RF achieving higher accuracy than the others. Limiting the number of features used is shown to improve the performance of the system.

Febiana Anistya, Erwin Budi Setiawan. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), Vol. 5 No. 6 (2021), 1044-1051. DOI: https://doi.org/10.29207/resti.v5i6.3521. Creative Commons Attribution 4.0 International License (CC BY 4.0).

Figure 2. Word Cloud of Preprocessed Data

Feature Extraction
In the feature extraction stage, each tweet is converted into a feature representation. The representation of tweets in this study uses a fixed-length Boolean feature vector, with each feature indicating the presence or absence of a word in the tweet [4].

Result and Discussion
This research is divided into three test scenarios for each classification model (LR, RF, and ANN). Accuracy results are obtained by averaging the results of five program executions. The first scenario implements feature extraction using Boolean features and TF-IDF. The second scenario applies feature expansion with a similarity corpus built with GloVe. The similarity corpus consists of three types (Tweet, IndoNews, and a combination of the two) and their subcombinations using ranks of one (Top 1), five (Top 5), and ten (Top 10) in the ranking of similarity between words. The third scenario applies feature selection to compare data with 5.000, 10.000, 15.000, and 19.370 feature vectors.

Table 3 explains that the ranking is obtained from the similarity value generated by GloVe, from Rank-1 with the highest value to Rank-10 with the lowest value. Table 4 shows the number of vocabulary in each corpus that has been

Table 4. Number of Vocabulary in Corpus

Table 5. Baseline Ratio Performance Value

From Table 6, it can be concluded that Unigram with a test size of 0,1 proved to have the highest accuracy in each classifier compared to Bigram and Trigram: 86,94% for LR, 87,34% for RF, and 83,34% for ANN.

Table 8. GloVe Performance with Baseline on LR

Table 9. GloVe Performance with Baseline on RF

Table 10. GloVe Performance with Baseline on ANN

Table 11. GloVe Performance with Baseline, TF-IDF on LR

Table 12. GloVe Performance with Baseline, TF-IDF on RF

Table 13. GloVe Performance with Baseline, TF-IDF on ANN

Table 14. Performance Comparison on Number of Features

From Table 14, it can be concluded that the combination of TF-IDF with 5.000 features has the highest accuracy. Table 15 describes the performance values of the Baseline with 5.000 features.

Table 15. Baseline with 5.000 Features Performance