Application of Naïve Bayes Algorithm Variations On Indonesian General Analysis Dataset for Sentiment Analysis

The Indonesian General Analysis Dataset is a dataset sourced from the social media platform Twitter using conjunctions as search keywords, so that the collected tweets are not focused on a particular topic. An Indonesian-language dataset with general topics can be used to test the accuracy of classification models and thereby provide an additional reference for choosing suitable methods and parameters for sentiment analysis. One algorithm that achieves the highest accuracy in several studies is naïve Bayes, which has several variations. This study aims to identify the naïve Bayes variation with the best accuracy by setting the minimum and maximum document frequency parameters on the Indonesian General Analysis Dataset for sentiment analysis. The naïve Bayes classifier variations used are Bernoulli naïve Bayes, Gaussian naïve Bayes, complement naïve Bayes and multinomial naïve Bayes. The research begins with downloading the dataset. The next stage is preprocessing, which consists of tokenizing, stemming, converting abbreviations and removing conjunctions (stop words). Features are then extracted from the preprocessed data by converting the dataset into vectors and applying the TF-IDF method before the sentiment classification stage. Tests were carried out by applying a range of minimum document frequency (min-df) and maximum document frequency (max-df) values to each naïve Bayes variation to find the appropriate parameters. K-fold cross validation is used to divide the dataset into training data and test data. A confusion matrix is then built to evaluate the accuracy.


Introduction
Sentiment analysis has become one of the main technologies for obtaining information from social media. The field has grown and now spans areas such as marketing, health, banking and politics [1]. Machine learning is a commonly used approach in the application of sentiment analysis [2]. There are two types of machine learning, namely supervised learning and unsupervised learning. Several studies have shown that supervised learning approaches such as the support vector machine and naïve Bayes algorithms are the most often used and have the highest accuracy [3] [4]. Naïve Bayes and the support vector machine are often compared for accuracy in sentiment analysis, and several studies show that naïve Bayes has the higher accuracy [5] [6] [7]. Naïve Bayes has several classic variants, namely multinomial, Bernoulli and Gaussian [8]. In addition, the classic variants have been developed further; one such development is complement naïve Bayes, which is an adaptation of multinomial naïve Bayes [9].
Applying naïve Bayes to sentiment analysis requires a dataset that already carries sentiment labels, which serves as training data for learning the patterns used to classify and predict the sentiment of a text [10]. These datasets are generally obtained from social media, and the labels are assigned subjectively rather than from a concrete value. The labels are based on human opinion, which can differ from one person to another, and this tends to make the machine learning training process difficult [2]. In addition, most of the datasets that are available or used in research and publications cover only certain topics, and their labels reflect the author's subjective opinion on those topics, so accuracy suffers when such a dataset is used to classify or predict the sentiment of texts on other topics. Datasets on general topics in languages other than English are still scarce, especially datasets in Indonesian. One study that addresses the creation of a general-topic Indonesian dataset is by Ferdiana, Redi et al., who named it the Indonesian General Sentiment Analysis Dataset [11]. The dataset contains text from Twitter, a total of 10,806 tweets, labeled as positive, negative or neutral. The dataset was tested by comparing its accuracy with that of a comparison dataset using the support vector machine, k-nearest neighbors and stochastic gradient descent algorithms. The results show that the accuracy of the Indonesian General Sentiment Analysis Dataset is comparable to that of the comparison dataset.
The use of an Indonesian-language dataset with general topics needs to be studied with several methods and parameters to obtain a model that is effective in practice. With an appropriate model, this dataset can be used as training data for sentiment analysis on various topics on social media.
To encourage the use of this Indonesian-language dataset and to contribute to the development of Indonesian sentiment analysis, this study applies variations of the naïve Bayes algorithm to the Indonesian General Sentiment Analysis Dataset, setting the minimum document frequency (min-df) and maximum document frequency (max-df) parameters in order to determine and compare the resulting accuracy. Accuracy is validated with 10-fold cross validation over the training data and test data. The results of this study are expected to provide an additional reference for choosing an appropriate method or approach when the Indonesian General Sentiment Analysis Dataset is used as training data in future sentiment analysis.

Research methods
To achieve the research objectives, the authors designed and implemented the research stages shown in Figure 1.
The research begins with preparing the downloaded dataset, followed by preprocessing, which aims to reduce noise during classification. Feature extraction from the preprocessed text is carried out with a count vectorizer, the selection of min-df and max-df, and weighting with TF-IDF. The resulting term-frequency representation of the tweets is then divided into training data and test data using 10-fold cross validation and tested on the naïve Bayes variations.
The following is a description of each stage of the research.

Indonesian General Analysis Datasets
The primary data used in this study is an Indonesian-language dataset. The dataset comes from the research by Ferdiana, Redi et al. and totals 10,408 tweets with general topics, downloaded via the link http://ugm.id/idsadataset [11]. The downloaded dataset has been labeled with 0 for neutral, 1 for positive and -1 for negative, as shown in Table 1, with a ratio of 2:1:1. The dataset provided has gone through several preprocessing stages such as cleaning symbols and disturbing characters (noise), stemming and deleting conjunctions or stop words. Further preprocessing is still carried out in this study. This preprocessing is divided into several stages as follows:

Tokenizing
Tokenizing is the stage that breaks a sentence into the words that compose it [12]. It makes the subsequent word-oriented preprocessing stages easier to carry out.
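As an illustration of this stage, the following minimal sketch splits a sentence into word tokens. The regular expression, the lowercasing step and the example sentence are assumptions made for illustration; the paper does not specify which tokenizer was used.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens (illustrative rule only)."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # such as commas and exclamation marks is dropped.
    return re.findall(r"\w+", sentence.lower())


# Hypothetical example tweet, not taken from the dataset.
tokens = tokenize("Saya suka makan nasi goreng, tapi tidak suka pedas!")
print(tokens)
```

Each token can then be passed individually to the later word-oriented stages such as stemming and abbreviation conversion.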

Stemming
Stemming is the process of changing affixed words into their root words. This process is often used in research related to text mining. Stemming can increase the accuracy of sentiment analysis [13]. One method that can be used for stemming Indonesian is the Nazief and Adriani algorithm. In several studies, this algorithm has the highest accuracy compared to other stemming algorithms [14] [15] [16]. In this study, the author implements the Nazief and Adriani algorithm using the Sastrawi Python library.

Abbreviation Conversion
Twitter limits uploaded messages to 280 characters, which makes users tend to use abbreviations. A list of commonly used abbreviations, such as those in Table 2, is used to convert these abbreviations into standard words.

Stop Words Removal
Another method in text preprocessing is the removal of conjunctions, or stop words removal. Words that are considered general and have little effect on text analysis are removed. Applying stop words removal in preprocessing can improve the accuracy and performance of sentiment analysis classification [17]. In this study, the author uses the stop word remover function from the Sastrawi Python library to remove conjunctions at the preprocessing stage. Table 3 shows an example of removing conjunctions in this study.
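The two steps above, abbreviation conversion and stop word removal, can be sketched with plain dictionary and set lookups. The abbreviation table and the stop word list below are hypothetical stand-ins: the actual entries are those of Table 2 and of the stop word list used in the study, neither of which is reproduced here.

```python
# Hypothetical abbreviation table (the real entries are in Table 2).
ABBREVIATIONS = {"yg": "yang", "tdk": "tidak", "dgn": "dengan", "krn": "karena"}

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"yang", "dengan", "dan", "di", "karena"}


def expand_abbreviations(tokens):
    """Replace each known abbreviation with its standard word form."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["film", "yg", "bagus", "dgn", "cerita", "tdk", "bosan"]
expanded = expand_abbreviations(tokens)
cleaned = remove_stop_words(expanded)
print(expanded)
print(cleaned)
```

Running the conversion before the removal lets abbreviated stop words such as "yg" be caught by the stop word list after they have been expanded.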

Feature Extraction
Classification in sentiment analysis requires features that act as indicators for determining the class of a sentence. These features are obtained through feature extraction, which begins with tokenizing. Tokenizing changes sentences into simpler forms, which in this study are words or terms. The tokenizing results are placed in an array as shown in Table 4. Once the tokenizing results are obtained, the next steps in feature extraction are the count vectorizer and term frequency-inverse document frequency (TF-IDF).
Each step is described below:

Countvectorizer
This stage obtains the frequency of occurrence of each word in a sentence and places it in a vector [18]. In the count vectorizer, words that rarely appear in tweets tend to be overshadowed even when they are important for sentiment labeling; in feature extraction this is generally handled using TF-IDF [19]. An example of the count vectorizer output in this study can be seen in Table 5.
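To make the role of min-df and max-df concrete, the sketch below builds count vectors in plain Python and keeps only the terms whose document frequency, expressed as a proportion of all documents, lies between the two thresholds. This is an illustrative reimplementation under that assumption and not the vectorizer used in the study; the function names and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document-frequency proportion lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items() if min_df <= df / n_docs <= max_df
    )


def count_vector(tokens, vocabulary):
    """Number of occurrences of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Toy tokenized documents, not taken from the dataset.
docs = [
    ["film", "bagus", "bagus"],
    ["film", "buruk"],
    ["film", "bagus", "cerita"],
    ["film", "lambat"],
]

# "film" occurs in every document (proportion 1.0), so max_df=0.9 removes it.
vocab = build_vocabulary(docs, min_df=0.25, max_df=0.9)
vectors = [count_vector(tokens, vocab) for tokens in docs]

# Raising min_df to 0.5 also removes terms found in only one of four documents.
strict_vocab = build_vocabulary(docs, min_df=0.5, max_df=0.9)
print(vocab, vectors, strict_vocab)
```

A higher min-df discards rare terms and a lower max-df discards near-universal terms, which is why the two parameters change both the size of the feature space and the resulting accuracy.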

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a method that weights each word by how often it is used in a document relative to how common it is across documents. Term frequency is the number of times a word occurs in the word vector of a sentence divided by the total number of words in that vector. Inverse document frequency reduces the weight of words that appear in most or all documents [20]. In several studies, the use of TF-IDF in sentiment analysis shows an increase in accuracy [11] [21]. As with the count vectorizer, the implementation of TF-IDF in this study uses the sklearn Python library, specifically its TfidfTransformer function.
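The definitions above can be worked through on a small example. The sketch below uses the textbook form, with term frequency as count divided by document length and inverse document frequency as log(N / df); this is an assumption made for illustration. The TfidfTransformer in sklearn by default uses a smoothed inverse document frequency and normalizes each vector, so its numbers differ from the ones computed here.

```python
import math


def tf_idf(docs):
    """Textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    vocabulary = sorted({term for tokens in docs for term in tokens})
    # Document frequency: number of documents that contain the term.
    doc_freq = {t: sum(1 for tokens in docs if t in tokens) for t in vocabulary}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    weights = []
    for tokens in docs:
        total = len(tokens)
        weights.append({t: (tokens.count(t) / total) * idf[t] for t in set(tokens)})
    return weights


# Toy tokenized documents, not taken from the dataset.
docs = [["film", "bagus"], ["film", "buruk"], ["film", "bagus", "sekali"]]
weights = tf_idf(docs)
print(weights[0])
```

In this example "film" appears in every document, so its inverse document frequency is log(1) = 0 and it carries no weight, while "buruk", which appears in only one document, receives the largest weight.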

10 Fold Cross Validation
The dataset used as primary data in this research amounts to 10,408 tweets that have gone through the preprocessing and feature extraction stages. Measuring the accuracy of the classification models requires a validation method; K-fold cross validation is chosen because the dataset is fairly large. The data is tested on four variations of naïve Bayes, namely Bernoulli, Gaussian, complement and multinomial naïve Bayes. K-fold cross validation is a testing method that divides the entire data into training data and testing data [22], and it is used here to split the dataset accordingly. To observe the performance of each naïve Bayes variation, the author sets K=10, giving 10 iterations of training and testing.
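The splitting scheme can be sketched as follows for the 10,408 tweets and K=10. Shuffling with a fixed seed is an assumption made here for reproducibility; the paper does not state whether the data were shuffled or stratified before splitting.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed shuffle, fixed seed
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10408, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every tweet is used for testing exactly once and for training in the other nine iterations, so the accuracy averaged over the folds reflects the whole dataset.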
The test results from each iteration of 10-fold cross validation on the naïve Bayes variations are then evaluated. The evaluation method used in this research is the confusion matrix, from which the accuracy value is obtained.
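Accuracy is the sum of the diagonal of the confusion matrix divided by the total number of predictions. The sketch below shows this for the three labels used in the dataset (-1, 0 and 1); the true and predicted labels are made-up values for illustration, not results from the study.

```python
from collections import Counter

LABELS = [-1, 0, 1]  # negative, neutral, positive


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only.
y_true = [0, 0, 1, -1, 0, 1, -1, 0]
y_pred = [0, 1, 1, -1, 0, 0, 0, 0]

matrix = confusion_matrix(y_true, y_pred)
acc = accuracy(matrix)
print(matrix, acc)
```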

Results and Discussion
The dataset, which has gone through preprocessing and feature extraction with TF-IDF, is tested on each naïve Bayes variation.
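The experiment for a single (min-df, max-df) pair can be sketched with scikit-learn as below. This is a minimal sketch under stated assumptions rather than the authors' code: the `evaluate` helper, the toy texts and the use of `cv=2` are hypothetical, and passing an integer `cv` makes scikit-learn use stratified folds for classifiers, which the paper does not specify. The study itself uses 10 folds on the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(matrix):
    """GaussianNB does not accept sparse input, so densify the features."""
    return matrix.toarray()


def evaluate(texts, labels, min_df, max_df, cv=10):
    """Mean cross-validated accuracy of each naive Bayes variation."""
    classifiers = {
        "bernoulli": BernoulliNB(),
        "gaussian": GaussianNB(),
        "complement": ComplementNB(),
        "multinomial": MultinomialNB(),
    }
    scores = {}
    for name, classifier in classifiers.items():
        model = make_pipeline(
            CountVectorizer(min_df=min_df, max_df=max_df),
            TfidfTransformer(),
            FunctionTransformer(to_dense),
            classifier,
        )
        fold_scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = float(fold_scores.mean())
    return scores


# Toy preprocessed texts, not taken from the dataset.
texts = [
    "film bagus sekali", "cerita bagus menarik", "suka film menarik", "pemain bagus suka",
    "film buruk sekali", "cerita buruk bosan", "kecewa film bosan", "pemain buruk kecewa",
    "film tayang hari", "jadwal tayang bioskop", "film baru bioskop", "pemain baru jadwal",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

scores = evaluate(texts, labels, min_df=0.0, max_df=0.5, cv=2)
print(scores)
```

Placing the vectorizer inside the pipeline means the vocabulary and the min-df and max-df pruning are refit on the training portion of each fold, so no information from the test fold leaks into feature extraction. Repeating the call over a grid of min-df and max-df values gives a table of accuracies per variation.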

Bernoulli Naive Bayes
The results of applying Bernoulli naïve Bayes to the dataset can be seen in Table 6. This naïve Bayes variation produces its highest accuracy, 0.6337, at min-df 0.001 and max-df 0.1. The lowest accuracy, 0.5967, was obtained at min-df 0 and 0.00001 with max-df 0.1.

Gaussian Naive Bayes
The results of applying Gaussian naïve Bayes to the dataset can be seen in Table 7. Gaussian naïve Bayes produces its highest accuracy, 0.5051, at min-df 0.0025 and max-df 0.5. The lowest accuracy is 0.4154 at min-df 0.0005 and max-df 0.5.

Complement Naive Bayes
The results of evaluating complement naïve Bayes on the dataset can be seen in the following table.

Multinomial Naive Bayes
The results of the evaluation of multinomial naïve Bayes can be seen in Table 9. The highest accuracy from the application of multinomial naïve Bayes was obtained at min-df 0.0005 and max-df 0.5, with an accuracy of 0.6374. The lowest accuracy is 0.6105 at min-df 0.0005 and max-df 0.5.

Comparison of Evaluation Results
Across the naïve Bayes variations, the highest accuracy was obtained with multinomial naïve Bayes, at 0.6374 with min-df 0.0005 and max-df 0.5. The next highest accuracy was obtained with Bernoulli naïve Bayes at 0.6337, followed by complement naïve Bayes at 0.6235, and the lowest was Gaussian naïve Bayes at 0.5051. The following graph compares the highest accuracy obtained by each variation from the selection of min-df and max-df.

Conclusion
Based on the results of applying the naïve Bayes variations to the Indonesian General Analysis Dataset with the selection of min-df and max-df, the highest accuracy is obtained with multinomial naïve Bayes. On average, the values that give high accuracy are 0.0005 for min-df and 0.5 for max-df. For further research, the preprocessing can be optimized by normalizing spelling errors and slang words in the Indonesian General Analysis Dataset.