Buzzer Detection on Indonesian Twitter using SVM and Account Property Feature Extension

The rapid growth in the use of Twitter in recent years has accelerated the spread of disinformation, which can be very harmful to its followers. Detecting disinformation is therefore important. It can be done manually through in-depth analysis of the information, but given the huge volume of information, this approach is not very effective. A more effective approach is to use machine learning. Several studies on machine learning-based hoax detection have been carried out: some analyze the content of a tweet, while others analyze hashtags, which form the context of a tweet. The features usually used to analyze hashtag sentiment data are the properties of the creator's account. Accounts that create disinformation are called buzzer accounts. This research proposes an expansion of the account property features of buzzer accounts, combined with the SVM classifier, which has performed very well in several previous similar studies, to detect buzzer hashtags. The experimental results show that the proposed feature expansion increases SVM's performance in detecting buzzer hashtags by more than 24% compared to using the baseline features, and that the average F1 score obtained with the combined method is 84%.


Introduction
The internet is growing rapidly in Indonesia: data from [1] show that in January 2021 around 170 million people in Indonesia had accessed the internet and social media applications. Social media applications are among the most effective media for disseminating information widely. In 2020, the five most used social media platforms in Indonesia were YouTube, WhatsApp, Facebook, Instagram, and Twitter [2].
When modern online social media such as Facebook (2004), Twitter (2006), and Instagram (2010) first appeared, hoax information on social media was not yet an important issue. Now that the internet is growing rapidly and most people have personal social media accounts, handling hoax information on social media has become very important.
Fake news and hoax information on social media platforms have attracted many researchers, with studies conducted by [3] on Facebook, [4] on Instagram, and [5] on Twitter. Most studies using social media data are done on Twitter. One reason is that crawling research data from Twitter is generally easier for the public than crawling data from other social media. Another reason is that the level of hoax information on Twitter is higher than on other social media. Research conducted by [6] shows that from 2015 to 2018 the presence of fake news sites on Twitter increased, while on Facebook it declined after an initial rise at the beginning of that period. Fake news and hoax information are generally produced by buzzers and by fake accounts or robots that are intended to generate invalid information.
The term buzzer originally came from the marketing field, namely buzz marketing, a technique for marketing goods or services that generates business by spreading information by word of mouth [7]. In online social media, a buzzer was originally defined as a social media account whose job is to disseminate, campaign for, and broadcast a message or content, amplifying it so that the content becomes public opinion [8]. More recently, the term has shifted to mean a social media account that has a large number of followers and participates in political campaigns by spreading hoax news and hate speech against the opposing party [4], [9], [10]. In a study conducted by Bradshaw [11], buzzers in Indonesia have been categorized as a low cyber troop army. This means that buzzers in Indonesia often carry out disinformation and spread it widely through hashtags that they create. The effect of buzzers in Indonesia has become so severe that it can divide society [9].
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol.
Research on the detection of bot accounts, or robots, on Twitter was conducted by [12]. That research proposes a bot account classification system built from an entropy component, a spam detection component, an account properties component, and a decision-maker-based classification. The features used are extracted from unknown users. More recently, account detection on Twitter has moved beyond bot detection toward the detection of buzzer accounts [13], [14]. Several similar Twitter studies address buzzer detection in elections [5], [10], [13]-[15]. Important findings from these studies include the following: bot accounts have a lower ratio of the number of followers or followings than real accounts, bot accounts post at more regular times, and bot accounts are generally younger than original accounts. Buzzer account detection generally uses original user metadata and activity features, both on Instagram [4] and on Twitter [5], [16]. However, because buzzer accounts do not emerge only during election periods, research on detecting buzzer accounts in other periods is also worthwhile. Another aspect that can be developed further is the set of features used so far, namely original user metadata and activity features.
In this paper, buzzer hashtag classification is carried out based on the properties of the buzzer account. Data containing buzzer hashtags are not retrieved only from the election period. The classifier used is the Support Vector Machine (SVM). SVM was chosen because it has shown fairly good performance in text classification in several previous studies [17]-[19]. SVM has several kernel models and parameter variations, such as C and epsilon, so many experiments are needed to obtain optimal results. In this research, optimization (tuning) of the SVM kernel parameters is carried out. Tuning techniques are also widely used for data mining classifiers in other research, such as [20]-[23].
The SVM model tuning technique in this research is implemented in Python using several classes of the scikit-learn library, namely SVC, NuSVC, and LinearSVC (LSVC).
The first contribution of this research is a property feature expansion model for buzzer accounts on Twitter based on descriptive statistical measures. The second contribution is an assessment of the effectiveness of SVM in detecting buzzer accounts on Twitter. Buzzer accounts are detected by classifying the hashtags they create as natural (non-buzzer) hashtags or buzzer hashtags.

Research Methods
This research proceeds through several stages to achieve its objectives: data crawling, data labeling, account property feature extraction, account property feature expansion, data normalization, SVM classification, and evaluation.

Data Crawling
In this stage, the research data are collected from Twitter. The crawling process downloads account data and their properties, as well as the contents of tweets, from the Twitter server using the Twitter Application Programming Interface (API).
Data crawling was carried out from April 2021 to September 2021. In total, 202 thousand tweets were crawled from 202 hashtags that were trending topics at the time, with 1000 tweets per hashtag. Examples of hashtags crawled in this research include #Adzanbukamainan, #Hajat4nAsetku, #KRINanggala402, #ReshufflePresident, #17an, and others.
The crawling process yields information related to a Twitter account, such as the user id, the time when the account was created, the tweet content, and others. In total, 17 types of information were extracted from the crawling process. All of this information can be seen in Table 1.

Data Labelling
This process determines the label of a hashtag on Twitter as either a buzzer hashtag or a non-buzzer (natural) hashtag. Labeling was carried out by 3 respondents, each of whom was asked to label every hashtag. Before labeling, the respondents were given the definition of a buzzer from several references as a basis for deciding the hashtag label. Some property information about the account that created the hashtag was also given to the respondents. An example of a buzzer hashtag labeling form can be seen in Figure 1. As a result, 63% of the hashtags received the same label from all three respondents, while the rest were labeled by majority vote. The labeling results are balanced, as shown in Figure 2. Some examples of hashtags categorized as buzzer hashtags and natural hashtags can be seen in Table 2.
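The majority-vote rule described above can be illustrated with a minimal sketch. The hashtag names and votes below are hypothetical placeholders, not data from this study, and the helper names `majority_label` and `unanimous_rate` are introduced only for illustration.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators (assumes an odd number of votes)."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


def unanimous_rate(all_votes):
    """Fraction of items on which every annotator chose the same label."""
    unanimous = sum(1 for votes in all_votes if len(set(votes)) == 1)
    return unanimous / len(all_votes)


# Hypothetical votes from three respondents for four hashtags.
votes_per_hashtag = {
    "#tag_a": ["buzzer", "buzzer", "buzzer"],
    "#tag_b": ["natural", "buzzer", "natural"],
    "#tag_c": ["natural", "natural", "natural"],
    "#tag_d": ["buzzer", "natural", "buzzer"],
}

final_labels = {tag: majority_label(v) for tag, v in votes_per_hashtag.items()}
agreement = unanimous_rate(list(votes_per_hashtag.values()))

print(final_labels)
print(f"labeled identically by all respondents: {agreement:.0%}")
```

With three respondents and two labels, a tie is impossible, so the majority label is always well defined.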

Account Property Feature Extraction
This process generates features from the information obtained in the data crawling process. Not all of the information in Table 1 can be used as a feature, because a feature must have characteristics that help indicate whether a hashtag is a buzzer hashtag or not. For example, id_tweet, user_id, and username cannot be used as attributes because their values are unique and therefore cannot point to a particular hashtag pattern. The results of the account property feature extraction can be seen in Table 3. Some simple features can be obtained directly, with simple computations, from the information in Table 1, while other features must be generated from the hashtag data containing the 1000 collected tweets.
The account property feature extraction follows the research in [12]. The account property features listed in Table 3 are obtained by extracting data from the 1000 tweets in each hashtag. These account property features are also known as community features or attributes, that is, attributes that describe the position of an account within its community [16].

Twitter Account Property Feature Expansion
This process expands the basic account property features using descriptive statistical measures, namely the mean, the quartiles (1, 2, 3), and the range. The feature expansion was carried out on 9 previously obtained features, namely feature number 11 (reputation) to feature number 19 (Status_rate). The results of expanding these features with the 5 statistical measures can be seen in Table 4. The descriptive statistical formulas used in the feature expansion process in this study can be seen in equations 1 and 2:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1}$$
where $x_i$ is the $i$-th value of the feature and $n$ is the number of values. Q1 is the first quartile, the value below which 25% of the sorted data lie. Q2, the second quartile, is the middle value of the sorted data, while Q3, the third quartile, is the value below which 75% of the sorted data lie. The range is the difference between the largest and the smallest value.
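As a rough illustration of this expansion, the sketch below turns one list of per-tweet values of a basic feature into five expanded features. The input values are made up, the output feature names are hypothetical, and the quartile interpolation rule (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since the text does not state which quartile convention was used.

```python
from statistics import mean, quantiles


def expand_feature(name, values):
    """Expand one basic feature into mean, Q1, Q2, Q3 and range.

    `values` holds the feature's values over the tweets of one hashtag.
    The "inclusive" quartile method is an assumption for this sketch.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        f"{name}_mean": mean(values),
        f"{name}_q1": q1,
        f"{name}_q2": q2,
        f"{name}_q3": q3,
        f"{name}_range": max(values) - min(values),
    }


# Hypothetical Status_rate values for the accounts tweeting one hashtag.
status_rate = [10, 20, 30, 40, 50]
expanded = expand_feature("Status_rate", status_rate)

for key, value in expanded.items():
    print(key, value)
```

Applying the same function to each of the 9 basic features yields 45 additional features per hashtag.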

Data Normalization
Normalization is carried out so that every feature has the same range, namely [0,1]. The normalization formula for a feature X used in this research can be seen in equation 3:
$$X' = \frac{X - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \tag{3}$$
where Min is the minimum value and Max is the maximum value of the feature X.
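A minimal sketch of this min-max normalization for one feature column follows. Mapping a constant feature (Max equal to Min) to 0.0 is an assumption made here to avoid division by zero; the text does not discuss that case.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw feature values.
raw = [2, 4, 6]
normalized = min_max_normalize(raw)
print(normalized)
```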

SVM Classification
The classifier used in this study is the Support Vector Machine (SVM), a machine learning algorithm that separates two classes, positive and negative, using a hyperplane. The optimal hyperplane is obtained by maximizing the margin, that is, the distance between the classes. For binary linear classification, the SVM hyperplane can be written in the form of equation 4 [24], [25]:
$$f(x) = w^{T} x + b \tag{4}$$
where x is the data attribute vector, w is the coefficient vector of x, and b is the bias parameter; these parameters are searched for in order to obtain the optimal model. The class $y_i$ of an attribute vector $x_i$ based on $f(x_i)$ is defined in equation 5:
$$y_i = \begin{cases} 1, & f(x_i) \ge 0 \\ -1, & f(x_i) < 0 \end{cases} \tag{5}$$
Equation (5) states that if $f(x_i)$ for an attribute vector $x_i$ is greater than or equal to 0, the data point is classified in class "1", and if $f(x_i)$ is less than 0, then $x_i$ is classified in class "-1" [26].
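The decision rule can be sketched in a few lines: compute the linear score of an attribute vector and assign class 1 when the score is at least 0, otherwise class -1. The weight vector and bias below are arbitrary illustrative values, not parameters learned in this study.

```python
def decision_value(w, x, b):
    """Linear score w . x + b for an attribute vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Return 1 if the score is >= 0, otherwise -1."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Arbitrary illustrative parameters (not fitted to any data).
w = [2.0, -1.0]
b = -0.5

print(classify(w, [1.0, 0.5], b))   # score 1.0
print(classify(w, [0.0, 1.0], b))   # score -1.5
print(classify(w, [0.25, 0.0], b))  # score 0.0, on the hyperplane
```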
The difficulty lies in determining the optimal hyperplane. This is related to the choice of the parameter C, which represents the penalty cost. The value of C must be chosen carefully: it should be neither too large nor too small. The hyperplane optimization problem can be seen in equation (6):
$$\min_{w, b, \xi} \ \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{6}$$
where $\xi_i$ is the slack variable of sample $i$, which represents samples that violate the margin of the hyperplane or are misclassified.
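To make the role of C and the slack variables concrete, the sketch below evaluates the standard soft-margin objective for a fixed hyperplane. It assumes the usual formulation in which, for given w and b, each slack equals max(0, 1 - y(w . x + b)); the parameters and samples are illustrative only.

```python
def soft_margin_objective(w, b, C, samples):
    """Evaluate 0.5 * ||w||^2 + C * sum(slacks) for a fixed hyperplane.

    `samples` is a list of (x, y) pairs with y in {-1, 1}.
    For fixed w and b, the smallest feasible slack of a sample is
    max(0, 1 - y * (w . x + b)).
    """
    slacks = []
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks


# Illustrative hyperplane and samples (not from the study's data).
w, b, C = [1.0, 0.0], 0.0, 1.0
samples = [
    ([2.0, 0.0], 1),   # outside the margin: slack 0
    ([0.5, 0.0], 1),   # inside the margin, correct side: slack 0.5
    ([1.0, 0.0], -1),  # misclassified: slack 2
]

objective, slacks = soft_margin_objective(w, b, C, samples)
print(slacks, objective)
```

A larger C makes every unit of slack more expensive, so the optimizer tolerates fewer margin violations; a smaller C favors a wider margin at the cost of more violations.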
SVM uses kernel functions to transform the original data into a higher-dimensional vector space. Several kernel functions are available, but the four most commonly used are Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid [25], [27]. The kernel formulas and related parameters for these four kernels can be seen in Table 5.
Table 5. Frequently used SVM kernel functions and their parameters [28]
No | Kernel Function | Parameter
1 | Linear: ⟨x, x′⟩ |
The implementation of SVM in this research uses the Python programming language and the scikit-learn machine learning library. The SVM model is implemented in Python using several functions, namely SVC, NuSVC, and LinearSVC. All of these functions are executed with their default parameters.
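For reference, the four kernels named above can be written out directly. The sketch below follows the forms given in the scikit-learn documentation; the parameter names `gamma`, `coef0`, and `degree` and their default values here are taken from that convention for illustration and are not necessarily the notation of Table 5.

```python
import math


def dot(x, z):
    """Inner product <x, z> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)


# Two illustrative vectors.
x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))
print(polynomial_kernel(x, z, gamma=1.0, coef0=1.0, degree=2))
print(rbf_kernel(x, z, gamma=0.5))
print(sigmoid_kernel(x, z, gamma=0.01))
```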
SVC, NuSVC, and LSVC can perform binary classification, as is the case in this research, and they also support multi-class classification. SVC stands for C-Support Vector Classification; it is implemented on top of libSVM and has a parameter C with a default value of 1. NuSVC stands for Nu-Support Vector Classification. It is similar to SVC but uses a different parameter, nu, with a default value of 0.5. LSVC stands for Linear Support Vector Classification. LinearSVC is a faster implementation of SVC based on liblinear and only uses a linear kernel.
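A minimal sketch of how the three classifiers might be compared with their default parameters is given below. It runs on synthetic data from `make_classification`, not on the study's hashtag features, and the choice of 5 folds, the min-max scaling step inside a pipeline, and the `random_state` value are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC, LinearSVC, NuSVC

# Synthetic binary data standing in for the hashtag feature matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# All three classifiers are created with their default parameters.
models = {
    "SVC": SVC(),
    "NuSVC": NuSVC(),
    "LinearSVC": LinearSVC(),
}

mean_f1 = {}
for name, model in models.items():
    # Scale features to [0, 1] inside each fold, then fit the classifier.
    pipeline = make_pipeline(MinMaxScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {mean_f1[name]:.3f}")
```

Placing the scaler inside the pipeline means the minimum and maximum are learned from the training folds only, which avoids leaking information from the held-out fold.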

Performance Evaluation
The performance of each SVM function is measured using the F1-score. The F1-score was chosen because it is often used in text mining research and is fairly robust with respect to imbalanced data. The F1-score is calculated from the recall and precision values; the precision, recall, and F1 formulas can be seen in equations (7), (8), and (9):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
where TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative) are taken from the confusion matrix for the binary classification case in Figure 2. Positive here means that the data has a buzzer label, while negative means that the data is labeled as non-buzzer. Based on the confusion matrix, TP is the case where the predicted result is buzzer and the actual label is also buzzer; FP is the case where the predicted result is buzzer and the actual label is non-buzzer; FN is the case where the predicted result is non-buzzer and the actual label is buzzer; and TN is the case where the predicted result is non-buzzer and the actual label is also non-buzzer.
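The three measures can be computed from the confusion counts as in the following sketch. The label lists are hypothetical, with "buzzer" treated as the positive class; returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the text.

```python
def confusion_counts(y_true, y_pred, positive="buzzer"):
    """Count TP, FP, FN and TN with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; 0.0 is returned when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical actual and predicted labels for six hashtags.
y_true = ["buzzer", "buzzer", "buzzer", "natural", "natural", "natural"]
y_pred = ["buzzer", "buzzer", "natural", "buzzer", "natural", "natural"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```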

Results and Discussions
In this experiment, we compare the performance of the three SVM functions using basic and full features. Feature analysis is also added using correlation values.

Classification Results using Basic Features
The experimental results for buzzer account classification using the basic features as the baseline, measured with the F1-score, can be seen in Table 6. Based on these results, using only the basic features for buzzer hashtag detection gives poor results with all of the SVM models. The highest F1 value obtained is 60%, using the NuSVC model.
Another interesting finding concerns the effectiveness of the SVM functions in Python in classifying buzzer hashtags. NuSVC has a higher average F1 than the other two functions, although, fold by fold, its performance is still comparable to that of LSVC. In addition, LSVC, which is a faster implementation of SVC, actually achieves a better F1 than SVC in almost every fold.

Classification Results using Complete Features
Complete Features are the basic features as shown in Table 3 plus the results of the expansion of the basic features as shown in Table 4. The results of the hashtag buzzer classification using complete features with the F1-score performance measure can be seen in Table 7.
Based on the experimental results in Table 7, feature expansion based on statistical values is very effective in improving the F1 performance of the SVM models in detecting buzzer hashtags. All of the SVM models give the same F1 performance, an average of 84%. For some values of k, the performance reaches 100%. This is a clear improvement over the basic account property features, which only reach an average F1 ranging from 42% to 60%.

Feature Analysis
To further analyze the effect of the account property expansion features, we also use another measure, the correlation value (r²), to see how much the proposed new features contribute relative to all of the features used. The correlation value lies in the range [0,1]. A correlation value of 1 for a feature X means that feature X affects whether a hashtag is classified as a buzzer hashtag or not. The 10 features with the highest correlation values can be seen in Table 8. The correlation analysis in Table 8 shows that 60% of the features with the highest correlation are new features produced by the feature expansion of account properties. This result is in line with the experiment in Table 7, which showed that adding the new features improves the performance of the SVM models in detecting buzzer hashtags by 24% to 42%.
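One way such a ranking could be produced is sketched below. It assumes that r² is the squared Pearson correlation between a feature column and the binary label (1 for buzzer, 0 for natural); the text does not specify which correlation coefficient was used, and the feature names and values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_r2(features, labels):
    """Sort feature names by squared correlation with the labels, highest first."""
    scores = {name: pearson_r(values, labels) ** 2 for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized feature columns for four hashtags.
labels = [1, 1, 0, 0]  # 1 = buzzer, 0 = natural
features = {
    "reputation_mean": [0.9, 0.8, 0.2, 0.1],
    "Status_rate_range": [0.5, 0.1, 0.4, 0.2],
}

ranking = rank_by_r2(features, labels)
for name, r2 in ranking:
    print(f"{name}: r^2 = {r2:.2f}")
```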

Conclusion
Based on the experiments and analysis carried out, several conclusions can be drawn. First, the feature expansion based on descriptive statistics proposed in this study is effective in improving the performance of the SVM model in detecting buzzer hashtags compared to using the basic (baseline) features. Second, the combination of these features with the SVM classifier is effective in detecting buzzer hashtags. Another interesting finding relates to the implementation of SVM in Python: the NuSVC function gives better results than the SVC and LinearSVC functions, especially when the features used are not very complete.
Future work can examine the full set of SVM parameters to see their effect on the performance of the classifier. In addition, this research can be extended to other domains, for example detecting hoax news, or detecting buzzer accounts on other social media such as Instagram or TikTok, which are currently also very popular.