Implementation of BERT, IndoBERT, and CNN-LSTM in Classifying Public Opinion about COVID-19 Vaccine in Indonesia

COVID-19 was classified as a pandemic in March 2020, and by July 2021 variants of the virus had spread all over the world, including Indonesia. Its detrimental effects are hard to avoid because the virus carries a high risk of transmission during daily activities. To protect themselves from COVID-19, people need to be vaccinated. In response to the vaccine, Indonesian citizens have been expressive, sharing their opinions by, for example, posting text on Twitter. These expressions can be studied with deep learning frameworks, namely BERT (in its Indonesian variant, IndoBERT), CNN-LSTM, and IndoBERTweet, to learn about negative speech categories such as anxiety, panic, and emotion, and about positive speech such as whether the vaccines worked well. All three methods succeeded in predicting sentiment about vaccination from tweets posted between January 2021 and March 2022. For instance, from a total of 2020 tweets, IndoBERT classified around 80% as positive, IndoBERTweet 68%, and CNN-LSTM 53%. From these results, the Indonesian government or authorities can draw lessons for continued improvement in ending the COVID-19 pandemic.


Introduction
On March 11, 2020, WHO declared the coronavirus disease, COVID-19, a pandemic. This decision was based on the number of positive cases outside China, which had increased thirteenfold across 114 countries [1]. The pandemic has not yet ended; data covering the three years from 2019 to 2021 show that the coronavirus has undergone several mutations, producing variants such as delta and omicron [3], [4], [10]. These variants already exist in more than 104 countries, including Indonesia, and are widely discussed because they cause sharp spikes in positive cases and deaths, thus requiring further study, particularly in Indonesia, which has a huge population [4], [6], [23].
According to data (No.146/U66/099/COVID-19/BNPB/11062020) from the Acceleration Task Force for handling COVID-19 in Indonesia, approximately 450,000 Indonesians had been exposed to COVID-19 as of November 2020. Vaccination is the measure with the greatest influence on preventing the sustained spread of the COVID-19 virus. The purpose of vaccination is to provide immunity against disease [5]. References [4], [7], [9], [10] state that the reactions of vaccine recipients can also be monitored through social media such as Twitter [12], [19], [26]. Furthermore, references [13]-[18] state that machine learning and/or deep learning algorithms can handle the massive and rapid responses of vaccine recipients. Indonesia's population is dominated by Generation Z and millennials, so reactions or opinions regarding sentiment, effectiveness, and the administration of vaccine types can be captured quickly by studying tweet data from Twitter [16], [17], [20], [25], especially since Twitter is one of the popular social media platforms in Indonesia, with 19.5 million active users.
Previous related studies include that of Nasiba [18], who in May 2021 monitored reactions to vaccine progress using four machine learning algorithms, namely Decision Tree, K-nearest neighbors (KNN), Random Tree, and Naive Bayes, and obtained a maximum accuracy of 99% in determining reactions to the COVID-19 vaccine. Furthermore, Kazi [8] reported in December 2021 that an LSTM architecture achieved an accuracy of 90.59%, while a Bi-LSTM achieved 90.83%, for sentiment analysis of COVID-19 vaccine responses in Twitter data.
The difference between this study and others is that its output is a predicted classification of public opinion regarding the COVID-19 vaccine in Indonesia. The novelty lies in the model that was built, which focuses on three deep learning frameworks, namely IndoBERT, IndoBERTweet, and CNN-LSTM, for classifying Indonesian people's opinions regarding the reaction to, effectiveness of, and types of COVID-19 vaccine through posts on Twitter. Data were collected in two ways: by using data previously studied by IndoLEM and Indo SMSA, and by combining them with tweets crawled from Twitter based on the keywords that define the classification topics.
Building on previous research, this study aims to: (1) describe the deep learning framework with the best accuracy in studying sentiment toward the COVID-19 vaccine in Indonesia; (2) classify the opinion of the Indonesian people on the effectiveness of the COVID-19 vaccine; and (3) classify the opinion of the Indonesian people on the provision of free and paid COVID-19 vaccines in Indonesia.

Research Methods
This study began with data collection, followed by pre-processing. The output of this step was used as input for dividing the data into labeled and unlabeled sets. The whole data set was then processed by the three deep learning frameworks to be classified as negative or positive. This outcome was further processed in three experiment scenarios. Finally, the positive or negative responses were classified, with accuracy as the indicator of system performance. The complete stages are illustrated in Figure 1.
The scenarios of this study are, first, to calculate the accuracy of the three deep learning frameworks (IndoBERTweet, BERT, and CNN-LSTM) in classifying Twitter users' opinions about the response to vaccine use in Indonesia; second, to observe which vaccine types are received positively, negatively, or neutrally by Indonesians; and third, to measure the response to the use of free or paid vaccines in preventing COVID-19.

Data Collection
The dataset was assembled by merging several sources, with the aim of training the system to predict public sentiment well. The first source is IndoLEM, available at https://github.com/indolem/indolem/tree/main/sentiment/data [13]. The second source is Indo SMSA, available at https://github.com/indobenchmark/indonlu/tree/master/dataset/smsa_doc-sentiment-prosa [14]. Some of these data are shown in Table 1.

After all data had been obtained from Twitter, we manually labeled a random sample of it. The labeling used a voting method with an agreement value of 0.95 (almost perfect) [3]. In total we labeled 536 tweets, balanced between labels 0 and 1 with 268 tweets each.
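The voting and agreement steps can be illustrated with a small sketch. The text does not name the agreement statistic; Cohen's kappa is assumed here only because "almost perfect" is the conventional description of kappa values above 0.80. The number of annotators, the votes, and the function names are all hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most annotators for one tweet."""
    return Counter(votes).most_common(1)[0][0]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items (assumed statistic)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(labels_a) | set(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical votes from three annotators for four tweets (0 = negative, 1 = positive).
votes_per_tweet = [(1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
final_labels = [majority_vote(v) for v in votes_per_tweet]

annotator_a = [v[0] for v in votes_per_tweet]
annotator_b = [v[1] for v in votes_per_tweet]
kappa = cohen_kappa(annotator_a, annotator_b)
```

With an odd number of annotators and two labels, the majority vote never ties, which is one reason such a setup is convenient for binary labeling.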
The dataset is divided into training and testing sets. The training data consist of all data developed by IndoLEM and Indo SMSA, plus 50% of the data we labeled ourselves. The testing data are the remaining 50% of our labeled data, which are not included in training.
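A minimal sketch of this split is given below. It assumes the 50/50 division is done per label so that both halves stay balanced; the text does not say how the halves were drawn, and the seed, function name, and placeholder tweets are hypothetical.

```python
import random


def split_manual_labels(items, train_fraction=0.5, seed=42):
    """Split (text, label) pairs per class so both parts keep the label balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({lab for _, lab in items}):
        group = [item for item in items if item[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Placeholder stand-ins for the 536 manually labeled tweets (268 per label).
manual = [(f"tweet {i}", i % 2) for i in range(536)]
extra_train, test_set = split_manual_labels(manual)
# extra_train would be appended to the IndoLEM and Indo SMSA training data;
# test_set is held out for evaluation only.
```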

Data Pre-Processing
Data pre-processing is the process of cleaning the data and deleting non-textual content [8]. The pre-processing techniques include the following steps: special character removal and lowercasing, correction of mistyped and slang words, stop word removal, stemming using Sastrawi, and tokenization. The result of this stage is shown in Table 3. The dataset used in this study has a balanced proportion of positive and negative labels, which avoids biased measurement.
In this study, IndoBERT was used as the BERT base. In other words, IndoBERT is a BERT model that has been trained using Masked Language Modeling [16], [19]. Because the experiments classify into two categories, one dense layer with softmax activation was added. The architecture is shown in Figure 2.
IndoBERTweet is an extension of the existing IndoBERT. This enhancement makes IndoBERTweet more specialized, for instance through the use of more than 26 million tweets. Given the large amount of data in this study, we added a layer on top of this model. It was used as the domain-specific model for Twitter.
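The pre-processing steps listed at the start of this section can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the slang and stop word lexicons are tiny made-up samples, the stemmer is a placeholder standing in for Sastrawi so that the sketch has no third-party dependency, the removal of URLs and mentions is an assumed part of special character removal, and tokenization is done earlier than in the listed order for convenience.

```python
import re

# Tiny illustrative lexicons; real slang and stop word lists are far larger.
SLANG = {"gk": "tidak", "yg": "yang", "bgt": "banget"}
STOP_WORDS = {"yang", "dan", "di", "ini", "itu"}


def stem(word):
    """Placeholder stemmer that returns the word unchanged.

    The paper uses Sastrawi for this step; it is omitted here on purpose.
    """
    return word


def preprocess(text):
    text = text.lower()                                   # lowercasing
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)     # drop URLs, mentions, hashtags (assumed)
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # remove special characters
    tokens = text.split()                                 # whitespace tokenization
    tokens = [SLANG.get(t, t) for t in tokens]            # normalize slang and mistyped words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [stem(t) for t in tokens]


tokens = preprocess("Vaksin ini gk sakit yg penting sehat!! https://t.co/abc @user")
```

Negation words such as "tidak" are deliberately kept out of the sample stop word list, since removing them can flip the sentiment of a tweet; whether the original stop word list did the same is not stated.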

IndoBERTweet
We set up this environment to analyze the overall performance of the model. The architecture is visualized in Figure 3.
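Both BERT-based models are described as having a layer added on top of the pretrained encoder; for IndoBERT this is one dense layer with softmax activation over the two classes. The arithmetic of such a head can be sketched in plain Python. The 4-dimensional sentence vector, the weights, and the bias are made-up values, and the encoder itself is not reproduced here.

```python
import math


def dense(vector, weights, bias):
    """One fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 4-dimensional sentence vector standing in for the encoder output
# (a real BERT sentence vector is much larger), with made-up weights.
sentence_vector = [0.2, -0.1, 0.4, 0.3]
weights = [[0.5, -0.3, 0.1, 0.0],   # row for class 0 (negative)
           [-0.2, 0.4, 0.6, 0.1]]   # row for class 1 (positive)
bias = [0.0, 0.1]

probabilities = softmax(dense(sentence_vector, weights, bias))
predicted_label = probabilities.index(max(probabilities))
```

During fine-tuning, the weights and bias of this head would be learned together with the encoder parameters rather than fixed by hand as they are here.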
In this model, the architecture consists of a 128-unit layer, a dropout rate of 0.5, and a sigmoid activation function. This scheme was chosen because the method processes a sentence by first passing it through a tokenization layer, which divides the sentence into tokens with 128 output dimensions so that they can pass through the embedding layer of the Keras library. The process then continues to a one-dimensional convolutional layer using ReLU as an activation function. The settings are given in Table 4 and were chosen to get the best result during classification. We used the same parameters in all experiment scenarios to obtain stable results.
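The one-dimensional convolution with ReLU and the sigmoid output mentioned above can be illustrated without Keras. The sketch below covers only those two operations on a made-up single-channel sequence; the embedding, dropout, and LSTM layers are omitted, and the kernel values are hypothetical. As in deep learning libraries, the "convolution" is computed as a sliding dot product without flipping the kernel.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_relu(sequence, kernel, bias=0.0):
    """Valid 1-D convolution (stride 1, single filter) followed by ReLU."""
    width = len(kernel)
    return [
        relu(sum(k * s for k, s in zip(kernel, sequence[i:i + width])) + bias)
        for i in range(len(sequence) - width + 1)
    ]


# Hypothetical one-channel sequence standing in for embedded tokens.
sequence = [1.0, -2.0, 3.0, 0.5, -1.0]
kernel = [0.5, 1.0, -0.5]  # made-up filter of width 3

feature_map = conv1d_relu(sequence, kernel)
# A real model would pass the feature map on to the LSTM layer; here its
# maximum is fed to the sigmoid only to show the final activation.
probability_positive = sigmoid(max(feature_map))
```

A sigmoid output gives a single probability for the positive class, which suits a two-label task; a threshold such as 0.5 would then turn it into a label.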

Result
This study classified public opinion about the COVID-19 vaccine based on previously defined keywords, using data both with and without sentiment labels. In line with the aims of this study, the results are given in Table 5. They indicate that all the frameworks succeed in classifying opinion as negative, positive, or neutral, with performance of around 60% to 70% in both accuracy and F1-measure. IndoBERTweet ranks first in terms of accuracy, with an F1 score of 0.73, followed by IndoBERT in second place with an accuracy of around 0.64 and an F1 score of 0.68. CNN-LSTM is in third place, with an accuracy of 0.66 and an F1 score of 0.61.
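For reference, the two reported measures can be computed as in the sketch below. It assumes a binary F1 score with label 1 (positive) as the positive class; the text does not state whether binary or averaged F1 was used, and the labels and predictions are made-up examples.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for eight test tweets.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

Accuracy and F1 can rank models differently, because F1 ignores true negatives while accuracy counts them.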
These measurements are explained in more detail in the following sections: Section 3.2 covers Indonesian public opinion on the effectiveness of the COVID-19 vaccine, Section 3.3 covers Indonesian public opinion on the provision of free and paid vaccines, and Section 3.4 covers Indonesian public opinion regarding the types of COVID-19 vaccine in Indonesia.
The comparison of our study with previous research is shown in Table 6. Using Bi-LSTM and LSTM, reference [8] obtained an accuracy of about 0.90 for both models. This comparison helps to identify the best model among those we used. In considering this result, note that imbalanced training data had been used, which affected how well the three categories of positive, negative, and neutral were recognized.
Table 6 (recovered rows). Model, accuracy: Bi-LSTM [8], 0.90; LSTM [8], 0.90.

Indonesian Public Opinion about the Effectiveness of the usage of COVID-19 Vaccine
Effectiveness is a condition that affects efficacy or success [12]. The effectiveness of the vaccine can therefore be interpreted as the efficacy or success of using the COVID-19 vaccine. Trust in the effectiveness of vaccines is very important because vaccines are a form of prevention against the COVID-19 virus. This experiment was therefore carried out to learn about their effects.
In Indonesia, vaccine distribution is relatively recent, because vaccination began to be administered in early 2021. In response, many people have used their Twitter accounts to share opinions about the effectiveness of the vaccines they received.
The public opinion used for sentiment analysis was taken from Twitter with keywords such as "vaksin", "vaksinasi", "vaksinisasi", "covid", "corona", "covid-19", "covid19", "berguna", "bermanfaat", "berhasil", "menghidupkan", "gagal", "mematikan", "tidak berguna", and "merugikan". The classification results (positive or negative opinion) of the three deep learning frameworks are shown in Table 7. According to Figure 5, people tend to have positive sentiments about the effectiveness of the COVID-19 vaccine in Indonesia. This is supported by the percentages from a total of 2020 tweets: CNN-LSTM gives 47% negative and 53% positive, IndoBERT 20% negative and 80% positive, and IndoBERTweet 32% negative and 68% positive.
The question of free and paid COVID-19 vaccines is quite important because differences in the use of vaccine types can cause new social inequalities and social classes. In addition, it may have an impact on the speed of the vaccination process itself.
These differences invite public opinion, which is often posted on Twitter. Tweets on this topic were collected with keywords such as "vaksin", "vaksinasi", "vaksinisasi", "covid", "corona", "covid-19", "covid19", "gratis", "murah", "mahal", "bayar", "berbayar", and "tanpa biaya". The negative or positive classification results are shown in Table 8, while Figure 6 shows how each of the three algorithms predicts the ratio for paid and free vaccines. Based on these results, Indonesian tweets showed negative sentiment toward paid vaccines. This means that people refuse to pay for their vaccine, as they regard it as part of the government's duty.
This can be seen from the percentages obtained from a total of 1127 tweets: CNN-LSTM gives 80% negative and 20% positive, IndoBERT 78% negative and 22% positive, and IndoBERTweet 74% negative and 26% positive.
On the other hand, although IndoBERT shows a 4% difference in positive responses to the free vaccine, the responses were relatively negative overall, particularly as detected by CNN-LSTM and IndoBERTweet. Perhaps people did not receive the information that the vaccine is free.
This can be seen from the percentages obtained from a total of 2975 tweets: CNN-LSTM gives 59% negative and 41% positive, whilst IndoBERT gives 53% negative and 47% positive.
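The keyword-based selection of tweets for this topic can be sketched as follows. The term lists are those quoted in this section, but the matching rule is an assumption: a tweet is kept only if it contains at least one general vaccine term and at least one cost term. The text does not say how keywords were combined during crawling, and the example tweets are made up.

```python
GENERAL_TERMS = {"vaksin", "vaksinasi", "vaksinisasi", "covid", "corona", "covid-19", "covid19"}
COST_TERMS = {"gratis", "murah", "mahal", "bayar", "berbayar", "tanpa biaya"}


def mentions(text, terms):
    """True if any term occurs in the text as a whole word or phrase."""
    # Normalize whitespace and pad with spaces so only whole words match.
    padded = f" {' '.join(text.lower().split())} "
    return any(f" {term} " in padded for term in terms)


def select_cost_tweets(tweets):
    """Keep tweets that mention both a vaccine term and a cost term (assumed rule)."""
    return [t for t in tweets if mentions(t, GENERAL_TERMS) and mentions(t, COST_TERMS)]


# Made-up example tweets.
tweets = [
    "Vaksin gratis untuk semua warga",
    "Vaksinasi berbayar terlalu mahal",
    "Hari ini hujan deras",
    "Harga cabai mahal",
]
selected = select_cost_tweets(tweets)
```

The matcher expects text whose punctuation has already been removed, so it would be applied after pre-processing; a term followed directly by punctuation would otherwise be missed.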

Indonesian Public Opinion Regarding the Type of COVID-19 Vaccine in Indonesia
During the pandemic, several types of vaccine appeared. In Indonesia, Sinovac, AstraZeneca, Sinopharm, and Pfizer are among the choices available for injection [11]. In general, the availability of a variety of vaccines makes it confusing for people to decide which one is best. People expressed this through their Twitter accounts. The terms used were "vaksin", "vaksinasi", "vaksinisasi", "covid", "corona", "covid-19", "covid19", "sinovac", "astrazeneca", "sinopharm", and "pfizer". Using these posts, the three deep learning frameworks classified opinions as negative or positive. Their output is shown in Table 9. Furthermore, the predicted percentages for each vaccine brand are shown in Figure 7. It can be inferred that people are inclined to have positive sentiments toward all types of vaccine.
Looking at each vaccine brand, for the Sinovac vaccine, from 4704 tweets, the CNN-LSTM method found the share of positive responses to be 4% higher, while IndoBERT gave the highest positive share at 85%, and IndoBERTweet classified only 68% as positive.
For the Pfizer vaccine, predictions on a total of 2091 tweets gave 42% negative and 58% positive with CNN-LSTM, 18% negative and 82% positive with IndoBERT, and 28% negative and 72% positive with IndoBERTweet.
Meanwhile, for the AstraZeneca vaccine, using 1467 tweets, the CNN-LSTM algorithm predicted 63% positive, while IndoBERTweet and IndoBERT reached 85% and 80% positive, respectively.
For the Sinopharm vaccine, predictions on 633 tweets showed highly positive responses: IndoBERT classified 93% as positive, 88% of Twitter users said it suited their body, and 76% of responses were positive under the CNN-LSTM classification.
Finally, for the Moderna vaccine, using 1288 tweets, CNN-LSTM gave 39% negative and 61% positive, IndoBERT 21% negative and 79% positive, and IndoBERTweet 26% negative and 74% positive.
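The per-brand percentages reported above are shares of predicted labels among the tweets that mention each brand. A minimal sketch of that aggregation is shown below, assuming simple substring matching on the brand name and rounding to whole percentages; the tweets and predictions are made up, and the function name is hypothetical.

```python
from collections import defaultdict

BRANDS = ("sinovac", "astrazeneca", "sinopharm", "pfizer", "moderna")


def positive_share_by_brand(tweets_with_labels):
    """Return {brand: percent of tweets predicted positive} for brands mentioned."""
    counts = defaultdict(lambda: [0, 0])  # brand -> [positive count, total count]
    for text, label in tweets_with_labels:
        lowered = text.lower()
        for brand in BRANDS:
            if brand in lowered:
                counts[brand][0] += label
                counts[brand][1] += 1
    return {b: round(100 * pos / total) for b, (pos, total) in counts.items()}


# Made-up tweets paired with hypothetical model predictions (1 = positive, 0 = negative).
predictions = [
    ("Sudah vaksin Sinovac, aman", 1),
    ("Sinovac bikin demam", 0),
    ("Pfizer dosis kedua lancar", 1),
    ("Booster Moderna dan Pfizer tersedia", 1),
]
shares = positive_share_by_brand(predictions)
```

A tweet that names two brands is counted once for each, so the per-brand totals can add up to more than the number of distinct tweets.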

Discussion
Based on the sentiment analysis of public opinion about the COVID-19 vaccine, Indonesian citizens tend to respond positively. This means that the vaccine has received a good response and is well accepted. Their posts also show that the community prefers free vaccines for preventing COVID-19.
Up to the end of this study, people said they were comfortable with the Sinopharm vaccine because it did not trigger other reactions, such as fever or catching a cold, after vaccination. Nevertheless, the other types of vaccine are also well accepted by Indonesians.

Conclusion
To reduce the positive case and mortality rates due to COVID-19, Indonesia has implemented several policies, such as social distancing, PPKM, hygiene education, and the recommendation of vaccination. One of the best ways to prevent infection by the virus is to get vaccinated. The public's reaction after being vaccinated is important to understand, because it illustrates how individuals feel.
Public opinion is usually expressed through the social media platform Twitter, which has around 19.5 million users and is one of the applications most frequently used by the Indonesian people, so posts on Twitter can serve as data for sentiment analysis of public opinion on COVID-19 vaccination. The experiments indicated that IndoBERTweet is the model that provides the best accuracy; in other words, this framework has better capabilities than IndoBERT and CNN-LSTM. According to the analysis of tweets posted from 19 January 2021 to 25 March 2022, the public tends to have positive sentiments toward the COVID-19 vaccine. This is reflected in the three classifications carried out, namely the effectiveness of vaccines, free and paid vaccines, and vaccine brands. Across these three classification topics, the effectiveness of vaccines and all vaccine brands receive mostly positive sentiments, because each model gives positive shares exceeding 50%. Meanwhile, sentiment on free and paid vaccination tends to be negative, as shown in Figure 6; only free vaccines under the IndoBERT model show a different pattern. Furthermore, the responses about the type or brand of vaccine reveal that the positive tendency is highest for the Sinopharm vaccine, at more than 80%, with AstraZeneca the second most popular.
Although accuracy did not reach 100%, we found a variety of new responses in opinions on the COVID-19 vaccine expressed in the Indonesian language. This setting can be used in other cases involving the Indonesian language, or in other languages, by increasing the use of data validation. As a suggestion for future work, more data could be labeled manually, and a data filtering approach could be added while crawling data from Twitter.