Comparison of LSTM and IndoBERT Method in Identifying Hoax on Twitter

In recent years, the number of social media users has increased significantly. In January 2022, social media users in Indonesia reached 191 million, an increase of 12.35% from 170 million in the previous year. With this massive yearly increase, more and more people tend to seek and consume information through social media. Despite the many advantages of social media, the quality of information there is lower than in traditional news media, and a lot of hoax information spreads. The harm caused by hoax information has motivated much research on detecting hoaxes on social media, especially those widely spread on Twitter. Several previous studies have used various machine learning and deep learning models to detect hoaxes. Deep learning performs very well on several text classification tasks, especially hoax detection. The aim of this paper is to compare the LSTM and IndoBERT methods in detecting hoaxes using a dataset taken from Twitter. Two experiments are conducted, one with LSTM and one with IndoBERT. The experimental results are average values obtained using 10-fold cross-validation. The IndoBERT model shows good performance with an average accuracy of 92.07%, while the LSTM model achieves an average accuracy of 87.54%. The IndoBERT model therefore outperforms the LSTM model and provides the best average accuracy in this study.


Introduction
In recent years, the number of social media users has increased significantly. In January 2022, social media users in Indonesia reached 191 million, an increase of 12.35% from the previous year [1]. With this massive yearly increase, more and more people tend to seek and consume information through social media [2], because it is easier and faster than getting information through traditional media such as newspapers. Despite these advantages, the quality of information on social media is lower than in traditional news media, and a lot of hoax information spreads because of easy access to, and the lack of control over, the internet [3]. According to data from the Ministry of Communication and Information of the Republic of Indonesia (KEMKOMINFO), as many as 9,546 cases of hoaxes spreading on social media have occurred in the last three years. Hoax information is intentionally written to mislead readers [4] and is spread by irresponsible people for various purposes, such as serving the interests of a group or an individual, which can have a serious negative impact on society [2]. The harm caused by hoax information has motivated much research on detecting hoaxes on social media, especially those widely spread on Twitter [5].
Several previous studies have used various machine learning models [6]-[8] and deep learning models [9]-[11] to detect hoaxes. Deep learning performs very well on several text classification tasks, especially hoax detection [12]. One commonly used model is the Recurrent Neural Network (RNN), which was developed to handle sequential data but has limitations in capturing long-term dependencies. Long Short-Term Memory (LSTM) is an RNN variant modified to overcome this limitation by remembering long-term information [13]. In other words, the long-term dependencies that limit RNN are not a problem in LSTM, so LSTM is more efficient in processing, predicting, and classifying data. In research [13], which detected hoaxes in the Indonesian language using LSTM, the average precision, recall, and F1 score over several experiments were 0.819, 0.809, and 0.807. In another study [9], which analyzed hoaxes in Indonesian news using several deep learning models, the LSTM model produced a high accuracy of 95.6% without dropout; with dropout, the accuracy is reported to have increased slightly, to 95.6%.
Recent studies have shown that models pre-trained on a large corpus can successfully accomplish various tasks through transfer learning. One of the pre-trained models that can provide the best results for specific tasks is the BERT architecture [14]. IndoBERT is a transformer-based model built on BERT [14]. There are currently two kinds of IndoBERT, developed by IndoNLU and IndoLEM. IndoBERT from IndoNLU is trained on datasets collected from various sources such as social media, blogs, news, and websites [15]. IndoBERT from IndoLEM is trained on datasets from Wikipedia Indonesia, Indonesian news articles, and the Indonesian Corpus website [16]. In research [14], which used IndoBERT models trained on an Indonesian language corpus, the fine-tuned IndoBERT models from IndoNLU and IndoLEM gave the same performance, with an accuracy of 97.67. IndoBERT outperformed the other fine-tuned models and the uncased multilingual BERT model.
LSTM and IndoBERT are both designed to process sequential input data such as natural language. Both have often been used in text classification tasks, especially hoax detection, where they have shown good accuracy. For this reason, this study compares the accuracy of the LSTM and IndoBERT methods in detecting hoaxes using a dataset taken from Twitter.

System Flowchart
The design of the system in this study, shown in Figure 1, begins with crawling a dataset from the tweets of Twitter users. The tweets are then labeled manually to determine whether each one is a hoax or not, and pre-processing corrects inconsistent sentence structure. After pre-processing, the LSTM pipeline trains Word2Vec and then splits the data into training and test data using K-fold cross-validation. The IndoBERT pipeline does not need the Word2Vec step and splits the data directly using K-fold cross-validation. Training and testing are then carried out. Finally, the accuracy results obtained from the LSTM and IndoBERT models in detecting hoaxes are compared.

Data
The dataset used in this study was crawled from Twitter using the snscrape library, which can collect Twitter data with the various features needed for this research. The data consists of Indonesian tweets containing the keywords "covid-19" and "corona", collected over the period from March 2020 to May 2020. Table 1 gives an overview of the features of the data taken from Twitter.

Pre-processing
Pre-processing turns the dataset into information that is more efficient and useful for the classification process. Datasets crawled from Twitter often contain errors, missing values, and inconsistencies. Pre-processing consists of several stages. First, emoji removal cleans emoji and emoticons from the tweet data. Next, case folding converts all letters to lowercase and accepts only the letters 'a' to 'z', apart from delimiters. Then, normalization corrects abbreviated or unclear words by matching them against a normalization dictionary.
The clean tweet stage then eliminates elements that are not important and can affect the results, such as punctuation, hashtags, and links. After that, stemming converts each word into its base form. The last step is filtering, which removes words that are considered unimportant.
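The pre-processing stages can be illustrated with a minimal Python sketch. It is not the authors' implementation. The lookup tables `NORMALIZATION` and `STOPWORDS` are tiny hypothetical examples, `stem` is a placeholder where an Indonesian stemmer would go, and link and hashtag removal is assumed to run before the letters-only filter so that URLs are removed whole.

```python
import re

# Hypothetical lookup tables; the dictionaries used in the study are not given.
NORMALIZATION = {"yg": "yang", "gak": "tidak", "dgn": "dengan"}
STOPWORDS = {"yang", "di", "dan", "ini", "itu"}

# Common emoji and symbol code point ranges (an assumption, not exhaustive).
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]+"
)


def remove_emoji(text):
    """Replace emoji with a space."""
    return EMOJI_PATTERN.sub(" ", text)


def clean_tweet(text):
    """Remove links and hashtags."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    return re.sub(r"#\w+", " ", text)


def case_fold(text):
    """Lowercase the text and keep only the letters a-z and whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def normalize(tokens):
    """Map abbreviated words to their dictionary form."""
    return [NORMALIZATION.get(token, token) for token in tokens]


def stem(tokens):
    """Placeholder: a real Indonesian stemmer would reduce words to base forms."""
    return tokens


def filter_stopwords(tokens):
    """Drop words considered unimportant."""
    return [token for token in tokens if token not in STOPWORDS]


def preprocess(tweet):
    text = case_fold(clean_tweet(remove_emoji(tweet)))
    return filter_stopwords(stem(normalize(text.split())))


example = "Virus CORONA yg menyebar di Jakarta \U0001F637 #covid19 https://t.co/abc123!!"
tokens = preprocess(example)
print(tokens)
```

Keeping each stage as a separate function makes it easy to reorder the stages or to swap the placeholder stemmer for a real one.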

K-Fold Cross Validation
The training and evaluation process of this study uses K-fold cross-validation to test the robustness of the model in classifying unseen data [17]. The value of k used is 10. The dataset is randomly divided into 10 equal parts (folds), and training and testing are repeated ten times. In each iteration, 90% of the dataset (nine folds) is used for training and 10% (one fold) for testing. Shifting the test fold at each iteration generates new training and testing data.
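The fold rotation can be sketched in plain Python as follows. The function name `k_fold_indices`, the fixed seed, and the sample count of 3000 are illustrative assumptions, not details taken from the study's code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division into folds

    # Distribute any remainder so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves as the test set exactly once.
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(3000, k=10))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

With 3000 samples and k = 10, every iteration trains on 2700 samples and tests on the remaining 300, which matches the 90%/10% split described above.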
Figure 2 shows the distribution of data in training and testing using 10-fold cross-validation.

Word2Vec
Word2Vec is a word vector representation. Two architectures are used in Word2Vec modeling to represent word vectors: continuous bag-of-words (CBOW) and skip-gram. This research uses the skip-gram architecture, which takes the current word as input and predicts its neighboring words.
Figure 3 shows the three processes involved in building the Word2Vec model: the vocabulary builder, the context builder, and the neural network (skip-gram architecture) [11]. The first part, the vocabulary builder, builds the vocabulary of the text corpus. It collects all the unique words from the corpus and uses them to build a dictionary. The result is a dictionary of words with a word index and an occurrence count for each word [11].
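As a rough illustration, the vocabulary builder and the (input, neighbor) word pairs that a skip-gram window produces can be sketched in plain Python. The function names, the toy corpus, and the window size are assumptions made for this example. The study's actual Word2Vec settings are not stated here.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Map each unique word to an index and its occurrence count."""
    counts = Counter(word for sentence in corpus for word in sentence)
    # Most frequent words first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: {"index": i, "count": counts[word]} for i, word in enumerate(ordered)}


def skip_gram_pairs(sentence, window=2):
    """Slide a window over the sentence and emit (input word, neighbor word) pairs."""
    pairs = []
    for position, center in enumerate(sentence):
        lo = max(0, position - window)
        hi = min(len(sentence), position + window + 1)
        for context_position in range(lo, hi):
            if context_position != position:
                pairs.append((center, sentence[context_position]))
    return pairs


corpus = [
    ["virus", "corona", "menyebar", "cepat"],
    ["vaksin", "corona", "aman"],
]
vocab = build_vocabulary(corpus)
pairs = skip_gram_pairs(corpus[0], window=1)
print(vocab["corona"])
print(pairs)
```

Each pair is one training example for the skip-gram network: the first word is the input and the second is the neighbor to be predicted.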
The context builder finds the relation between the occurrence of one word and the words around it using the concept of a context window, commonly called a sliding window [11]. Word2Vec training in this study uses the skip-gram architecture, which predicts the context words (output) around the current word (input), bounded by a window. The window is used to obtain the input and target words, and it is moved from the beginning to the end of the word sequence. An illustration of the window can be seen in Figure 4. This process is used for training so that each word can be represented by a vector. Word2Vec uses an artificial neural network with three layers (input layer, hidden layer, and output layer) [11] formed from the skip-gram architecture. Figure 5 shows the skip-gram architecture used to generate Word2Vec vectors.

LSTM
One commonly used modification of RNN is Long Short-Term Memory (LSTM). LSTM was modified to address the shortcomings of RNNs, which cannot predict words based on information stored over long periods of time. LSTM can remember such information and delete data that is no longer needed [7]. LSTM has several parts that control the use and update of previous information: the input gate, the forget gate, the output gate, and a memory cell [13]. The memory cell and the three gates are designed to read, store, and update past information. Figure 6 shows the LSTM architecture.
The LSTM model processes input data in several steps. The initial stage passes through the forget gate ($f_t$), where parts that are not needed or carry less meaning are removed. The forget gate value is calculated using equation (1). In the next step, the information is processed through the input gate ($i_t$) using equation (2). This step determines which information will be updated in the cell state, and it also creates a new candidate vector ($\tilde{C}_t$), computed with equation (3), which will then be added to the cell state.
After that, the old cell state ($C_{t-1}$) is updated to the new cell state ($C_t$) using equation (4).
The last step is the output gate, where the sigmoid and tanh output values are produced and then multiplied before going to the next step, using equations (5) and (6). After all the calculations in the LSTM model are complete, the model produces a classification value.
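Equations (1) to (6) are not reproduced in this text. In the standard LSTM formulation, which is assumed here to correspond to that numbering, they read as follows. Here $\sigma$ is the sigmoid function, $x_t$ is the current input, $h_{t-1}$ is the previous hidden state, $W$ and $b$ are the weights and biases of each gate, $\tilde{C}_t$ is the candidate vector, $C_{t-1}$ and $C_t$ are the old and new cell states, and $\odot$ is element-wise multiplication.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{3} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{5} \\
h_t &= o_t \odot \tanh\left(C_t\right) \tag{6}
\end{align}
```

In equation (4), the forget gate scales the old cell state and the input gate scales the candidate vector, so the new cell state combines retained past information with newly admitted information.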

IndoBERT
IndoBERT is a modification of BERT-Base that follows the settings of BERT-Base (uncased) [18].
IndoBERT uses the transformer mechanism, which learns the relationships between the words in a sentence. The transformer comprises two mechanisms, an encoder that reads the input and a decoder that generates predictions [18]; BERT-based models such as IndoBERT rely on the encoder. Unlike other language models that can only read input text sequentially, from left to right or from right to left, BERT reads the entire sequence of words at once [19]. Two steps are used in IndoBERT [20], pre-training and fine-tuning, and the IndoNLU variant is trained on data collected from various public sources such as social media, blogs, news, and websites [15].

System Performance Measure
System performance in this study is measured using a confusion matrix. The confusion matrix contains information about the actual classes and the predictions made by the classifier, and is used to analyze the impact of each scenario on model performance [21]. Table 4 shows the confusion matrix. The measure derived from the confusion matrix in this study is accuracy, that is, the proportion of data correctly predicted in the classification process. Accuracy is calculated using equation (7).
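Equation (7) is not reproduced in this text. Under the usual definition, accuracy is (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative counts. The sketch below computes it together with precision and recall, which are also reported in the results. The label encoding (1 for hoax, 0 for non-hoax) and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted hoaxes that are actual hoaxes."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual hoaxes that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels: 1 = hoax, 0 = non-hoax (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```

In a 10-fold setting, these values would be computed once per fold and then averaged.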

Results and Discussions
This section presents the performance of the LSTM and IndoBERT methods implemented in a system for detecting hoaxes in a dataset taken from Twitter, and compares the average results obtained from the two methods.

LSTM
The results of applying the LSTM method to the dataset can be seen in Table 5. The highest accuracy, 90.69%, is obtained in fold 1, and the lowest, 85.66%, in folds 7 and 8. The precision is 84.33%, the recall is 86.66%, and the average accuracy is 87.54%.

IndoBERT
The results of applying the IndoBERT method to the dataset can be seen in Table 6. The highest accuracy, 98.66%, is obtained in fold 7, and the lowest, 53.33%, in fold 10. The precision is 93.33%, the recall is 97.22%, and the average accuracy is 92.07%.
Figure 7 shows the performance of the LSTM and IndoBERT models in this study. The values in the graph are averages obtained from experiments using 10-fold cross-validation. The IndoBERT model shows good performance with an average accuracy of 92.07%, while the LSTM model achieves an average accuracy of 87.54%.

Comparison of Evaluation Results
Figure 8 shows the average loss of the LSTM and IndoBERT models. The average loss of the LSTM model is higher, which is consistent with its accuracy being lower than that of the IndoBERT model.

Figure 4. Illustration of the Window

Figure 6. LSTM Architecture

The two steps used in IndoBERT [20] are pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. During fine-tuning, the BERT model is initialized with the pre-trained parameters, which are then fine-tuned with labeled data from the given task. Each task has a separate fine-tuned model, although all of them are initialized with the same pre-trained parameters. Currently, there are two types of IndoBERT trained on different corpora, one of which is IndoNLU [14]. IndoNLU includes 12 tasks and is trained on a large, clean Indonesian dataset called Indo4B.

Muhammad Ikram Kaer Sinapoy, Yuliant Sibaroni, Sri Suryani Prasetyowati. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. 7 No. 3 (2023). DOI: https://doi.org/10.29207/resti.v7i3.4830 Creative Commons Attribution 4.0 International License (CC BY 4.0)

Figure 7. Comparison of Accuracy Result

Figure 8. Comparison of Loss Result

Table 1. Description of Features in the Data

The data is labeled as containing hoax information or not, and the labeling is done manually by the author. The author labeled about 3000 tweets from all of the collected data. The comparison of hoax and non-hoax labels in the dataset is shown in Table 2.

Table 5. Accuracy of Each Fold Using LSTM

Table 6. Accuracy of Each Fold Using IndoBERT