Indonesian Online News Topics Classification using Word2Vec and K-Nearest Neighbor

News is information disseminated by newspapers, radio, television, the internet, and other media. Survey results show that news titles on many different topics are spread across the internet, which makes it difficult for readers to find the topics they want to read. This problem can be addressed by grouping, that is, classification, carried out through a computerized process. This study aims to classify several Indonesian-language news topics using the KNN classification model together with word2vec, which converts words into vectors to facilitate classification. The study also determines the optimal value of K for KNN. The combination of word2vec and KNN achieves an accuracy of 89.2% with K=7 and outperforms the support vector machine, logistic regression, and random forest classification models.


Introduction
News is a term that refers to information disseminated by newspapers, radio, television, the internet, and other media [1]. Hundreds of news articles are written every day on various Indonesian online news portals, as many outlets have moved from print to electronic media that can be accessed online [2,3]. According to the Indonesian Digital Association (IDA), 96% of urban residents in Indonesia consume online information [4]. Meanwhile, a survey conducted by UC Browser in 2016 reported that 56.5% of internet users in Indonesia generally read 4-12 articles per day [5]. These surveys [2-5] show that news headlines on a wide range of topics are spread across the internet, which poses a significant problem for readers: they have difficulty finding the news topics they want to read. The problem can be solved by grouping, that is, classification, carried out by a computerized process, which has proved more effective than manual classification [6].
Several previous studies have classified news text; the following literature review serves as a comparison of contributions. Research [3] addresses how to classify large Indonesian news datasets accurately using computerized models such as neural networks, SVM, Naïve Bayes, and KNN. Research [7] used Mutual Information (MI) for feature selection and a Bayesian Network to classify Indonesian news text. Paper [8] discusses multilabel text classification that groups news articles into four labels with a proposed deep learning model. Paper [9] built a multilabel classification model for Indonesian news topics using the K-Nearest Neighbor (KNN) method. Study [10] used the Doc2vec word embedding method on the Turkish Text Classification 3600 dataset, consisting of Turkish news texts classified with deep learning.
Research [11] applied the Enhanced Porter Stemmer algorithm for stemming and the Likelihood method to classify news by category and identify topics. Study [12] presents an implementation of multilabel classification using word2vec-based semantic features. Research [13] used word2vec to process Indonesian news headlines, whose results were used to predict stock prices. Study [14] tested whether word2vec can serve as input for deep learning in categorizing web news. Paper [15] discusses the classification of Indonesian sports news using the BM25 and KNN methods. Research [16] classifies Indonesian news titles by positive-negative sentiment using word2vec with the LSTM, LSTM-CNN, and CNN-LSTM methods. Paper [17] discusses multilabel classification using the Pseudo Nearest Neighbor Rule (PNNR), a variant of the k-Nearest Neighbor (k-NN) algorithm. Study [18] focused on multilabel classification of Arabic text using Bidirectional Long Short-Term Memory networks (BiLSTM), which showed superior results. Paper [19] implements a categorization model based on a hybrid of BiLSTM and ANN that classifies news articles into selected topics using the hypernyms and hyponyms of the words they contain. Study [20] classified news headlines using an NLP algorithm, namely LSTM. Study [21] proposes three models to analyze the semantic similarity of Arabic question pairs using the XGBoost algorithm and word embeddings.
Paper [22] used text mining techniques to analyze ancient and modern English, introducing a Common-Words Counting algorithm and vector processing with TF-IDF. Study [23] presents an overview of concepts, search-and-answer (SQA) applications, and issues in text mining for Qur'anic surahs (ITQ) using tokenization and stemming techniques. Research [24] aims to improve an Indonesian stemming algorithm suited to Indonesian text containing slang from social media. Research [25] created a computational environment for mining the Qur'anic text, intended to help people understand each verse of the Qur'an, using SVM, Naïve Bayes, KNN, and J48 as classification methods. Paper [26] measures the capability of algorithms applied to text classification, comparing a Deep Learning Neural Network with two other commonly used algorithms, Naïve Bayes and Support Vector Machine (SVM).
Study [27] aims to detect cyberbullies through text and user-credibility analysis, and to warn users about the dangers of cyberbullying, using the SVM and KNN methods. Research [28] seeks the right algorithm to automatically classify Indonesian news articles using Naïve Bayes and SVM, with a dataset from the website www.cnnindonesia.com. Research [29] developed an Indonesian hoax filter based on Term Frequency and Document Frequency vector representations and an SVC classifier. Study [30] uses a sentiment classification system comprising text preprocessing, feature extraction, and SVM classification. Paper [31] uses text classification to predict personality from text written by Twitter users, in English and Indonesian, applying Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine.
Word2Vec is an efficient tool for transferring words into a distributed representation, mapping words to vectors in a K-dimensional space [32]. The word2vec concept in [13] is a clustering that measures the proximity of words to one another. The advantage of the Word2Vec model is that it reduces dimensionality efficiently while capturing a great deal of semantic meaning [32]. The general form of word2vec is shown in Figure 1. The model trains words based on the idea of a distributed representation and uses two architectures, the CBOW model and the Skip-gram model (Figure 1). CBOW uses the context of the word w(t) to predict the current word, while skip-gram uses w(t) to predict its context [32]. KNN, by contrast, is an instance-based lazy learning method with no offline training phase [34]; its main computation is the online search, given a test document, for its k nearest neighbors among the training documents [34].
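The two training objectives can be illustrated with a small sketch that extracts training pairs from a token window. This is an illustrative toy, not the paper's implementation; the tokens and window size are made up for the example.

```python
# Toy illustration of the two Word2Vec training objectives:
# CBOW predicts the current word w(t) from its context window,
# while skip-gram predicts each context word from w(t).

def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for every position in tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))                 # context -> w(t)
        skipgram.extend((target, c) for c in context)  # w(t) -> context
    return cbow, skipgram

tokens = ["berita", "olahraga", "indonesia", "hari", "ini"]
cbow, skipgram = training_pairs(tokens, window=1)
print(cbow[1])       # (['berita', 'indonesia'], 'olahraga')
print(skipgram[:2])  # [('berita', 'olahraga'), ('olahraga', 'berita')]
```

A real Word2Vec model would feed these pairs to a shallow neural network to learn the vectors; the point here is only how CBOW and skip-gram frame the prediction task in opposite directions.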
This research therefore aims to classify several Indonesian news topics using the word2vec model and K-Nearest Neighbor (KNN). Table 1 compares its contribution with previous studies.

Table 1. Comparison of Contributions
Study        Word2Vec   Method
…            No         KNN
[12]         Yes        Semantic Features
[13]         Yes        Neural Network
[14]         Yes        Deep Learning
[16]         Yes        LSTM, CNN
This study   Yes        KNN

Research Method
Figure 2 shows the design proposed for this research.

Dataset
The first step is to collect data on various Indonesian news topics from the website [34]. The topics used in this research are among the most discussed in Indonesia today.

Preprocessing
This stage cleans each sentence of special characters such as ! ? < ' " and the like. After these characters are removed, the word2vec process is carried out.
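A minimal sketch of this cleaning step follows; the exact character set and the example headline are illustrative assumptions, not taken from the paper's dataset.

```python
import re

def clean_title(text):
    """Remove special characters such as !?<'" from a news title,
    keeping letters, digits, and spaces, then collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_title('Harga "emas" naik lagi!?'))  # Harga emas naik lagi
```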

Word2Vec Model
At this stage, word2vec converts the news title of each topic into x and y vectors. The word2vec process is shown in Figure 3. It starts from a corpus containing a collection of texts. Case folding is then applied so that the corpus contains only the letters a-z; everything else is removed (Table 4). Next, tokenization divides each text or sentence into parts. The last step finds the base form of each word by removing its affixes, ultimately producing a vector matrix that can be used for classification. Table 3 is an example of the news data used. Table 4 shows a news title after cleaning and case folding: special characters are removed and every letter is converted to lower case. Table 5 is an example of a tokenized dataset. Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens [36]; its purpose is to expose the words in a sentence. The result is a list of tokens that serves as input for further processing such as parsing or text mining.
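The case-folding and tokenization steps described above can be sketched as follows, using the example sentence from later in the paper ("Mempersiapkan model pembelajaran pada era digitalisasi"); the helper names are my own.

```python
import re

def case_fold(text):
    """Lowercase the text and keep only the letters a-z (and spaces)."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def tokenize(text):
    """Split a sentence into word tokens."""
    return text.split()

title = "Mempersiapkan Model Pembelajaran pada Era Digitalisasi"
print(tokenize(case_fold(title)))
# ['mempersiapkan', 'model', 'pembelajaran', 'pada', 'era', 'digitalisasi']
```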

The tokenization results can be seen in Table 5, where a news title that was originally one sentence becomes a collection of words. The tokenized words are then passed to the stemming process.
After tokenization, the next process is stemming. Stemming reduces variant word forms to a representative general form [36]; for example, the word "hilangnya" can be reduced to the general representation "hilang". This process is widely used in preparing texts for information retrieval (IR), on the assumption that a query expressed with one form of a word implies interest in documents containing other forms of the same word. Table 6 shows a representative result of the stemming process.
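The "hilangnya" example can be reproduced with a toy suffix stripper. This is only a sketch: real Indonesian stemmers (for example, implementations of the Nazief-Adriani algorithm) also handle prefixes, confixes, and a root-word dictionary, and the suffix list below is an assumption for illustration.

```python
# Toy Indonesian suffix stripper. A real stemmer (e.g. one based on
# the Nazief-Adriani algorithm) handles prefixes, confixes, and a
# dictionary of root words; this sketch only strips a few suffixes.
SUFFIXES = ("nya", "lah", "kah", "kan", "an", "i")

def toy_stem(word):
    for suf in SUFFIXES:
        # Keep at least a 3-letter stem to avoid over-stripping.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(toy_stem("hilangnya"))  # hilang
```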

In the stemming process, two kinds of errors commonly occur: over-stemming and under-stemming [32]. Over-stemming is when two words with different meanings are reduced to the same root; this is also known as a false positive. Under-stemming is when two words that should be reduced to the same root are not.
After the stemming process is complete, the next step is to convert each word into a vector. Table 7 shows the vector representation of each word; these vectors are then reduced to two dimensions with T-SNE so that they can be visualized [37].
Here p_{j|i} is a conditional probability and σ_i is the variance of the Gaussian centered on the data point x_i. For the low-dimensional counterparts y_i and y_j of the high-dimensional points x_i and x_j, a similar conditional probability can be calculated. This research sets the variance of the Gaussian used in the calculation of the conditional probability q_{j|i} to 1/√2. Formula (2) [32] is used to calculate the low-dimensional probability.
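For reference, in the standard T-SNE formulation the conditional probabilities that this passage describes are:

```latex
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}
\qquad (1)

q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)}
\qquad (2)
```

Fixing the low-dimensional Gaussian variance to 1/√2 is what removes the 2σ² factor from the exponent in formula (2).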
This study uses a Student-t distribution with a single degree of freedom because it has the desirable property that (1 + ||y_i - y_j||^2)^{-1} approaches an inverse square law for large pairwise distances y_i, y_j in the low-dimensional space. Formula (3) [29] is used for the gradient descent.
Nur Ghaniaviyanto Ramadhan, Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. 5 No. 6 (2021) 1083-1089, DOI: https://doi.org/10.29207/resti.v5i6.3547, Creative Commons Attribution 4.0 International License (CC BY 4.0).
This study uses the gradient between two low-dimensional data points y_i and y_j as a function of the pairwise Euclidean distances in the high-dimensional and low-dimensional spaces, that is, as a function of x_i, x_j and y_i, y_j.
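In the standard T-SNE derivation, the gradient that formula (3) refers to has the form:

```latex
\frac{\partial C}{\partial y_i}
  = 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\,
    \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
\qquad (3)
```

The (1 + ||y_i - y_j||^2)^{-1} factor is exactly the single-degree-of-freedom Student-t term discussed above.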
Table 8 shows the 2D vector matrix obtained by reducing the dimensions of the vector matrix in Table 7. The vectors produced in this study are x and y values that define a 2D vector; in word2vec terms, x can be read as the independent variable and y as the dependent variable. As an example, Table 9 shows the result obtained from the sentence "Mempersiapkan model pembelajaran pada era digitalisasi" (Preparing learning models in the digitalization era). This study uses a supervised classification model, K-Nearest Neighbor (KNN), to measure the accuracy of the news topic classification. The algorithm works by finding the most optimal value of K, the number of nearest neighbors. To measure similarity efficiently in KNN, formulas (4), (5), (6), and (7) [34] are used, where d1 and d2 are the document vectors being compared.
Each neighbor is then given a weight based on its similarity to the test document d0, as shown in formula (5).
Finally, the KNN decision is made with formula (7).
The KNN algorithm was chosen because it has proved capable not only of text classification but also of tasks such as leaf image classification [38].
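The similarity-weighted voting that formulas (4)-(7) describe can be sketched as follows. The exact formulas from [34] are not reproduced here; this uses the common cosine-similarity-weighted k-NN decision rule, and the 2D document vectors and labels are made up for the example.

```python
import math
from collections import defaultdict

def cosine(d1, d2):
    """Cosine similarity between two document vectors d1 and d2."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return dot / norm if norm else 0.0

def knn_predict(train, d0, k):
    """Classify d0 by a similarity-weighted vote of its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: cosine(item[0], d0),
                       reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += cosine(vec, d0)  # weight each vote by similarity
    return max(scores, key=scores.get)

train = [((1.0, 0.1), "sport"), ((0.9, 0.2), "sport"), ((0.1, 1.0), "politics")]
print(knn_predict(train, (0.95, 0.15), k=3))  # sport
```

With similarity weighting, a close neighbor counts for more than a distant one, so the decision is less sensitive to the exact choice of K than a plain majority vote.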

Result and Discussion
The accuracy of the results is calculated with formula (8).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (8)
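Formula (8) applied to a small example; the confusion counts below are invented purely to reproduce the paper's reported 89.2% figure, not taken from its data.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), per formula (8)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts summing to 1000 that yield the reported accuracy.
print(accuracy(tp=446, tn=446, fp=54, fn=54))  # 0.892, i.e. 89.2%
```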
Table 10 shows the classification results of KNN over several experimental values of K. The most optimal value of K for this case is 7. Choosing the right K in the KNN model is very important and directly affects the classification results. The results also show that accuracy increases as K decreases and decreases as K grows. Figure 5 shows the confusion matrix for K=27: the value of 11 at predicted label 3 and actual label 3 indicates that the model guesses this label correctly less often than with K=7 (Figure 4), confirming that larger values of K lead to more classification errors.
Table 11 compares KNN with other classification models: a support vector machine, which is based on finding the largest-margin hyperplane; logistic regression, which is based on regression values; and the tree-based random forest. Based on the results in Table 11, the choice of KNN in this study is justified: KNN with K=7 remains superior to the other machine learning models, showing that the algorithm used in this study can produce high classification accuracy.

Conclusion
This research aimed to classify several Indonesian news topics using the word2vec model and K-Nearest Neighbor (KNN). Based on the experiments and analysis, it is concluded that word2vec and KNN are a combination that can be used for text-based multilabel classification. The choice of the K value in the KNN model also affects the classification results; smaller values of K gave better results. The accuracy of KNN is superior to the support vector machine, logistic regression, and random forest models.
Future research could use more news topics, news in other languages, and other word embedding and classification algorithms.

Figure 4
Figure 4 is a plot of the confusion matrix generated by KNN with K=7. The confusion matrix supports further analysis of the classification results, for example, how many instances are predicted as a label and actually belong to that label.

Table 3.
Example of Dataset

Table 7.
Word Matrix Vector. The vector results obtained in Table 7 are reduced to two dimensions (x and y) to ease visualization. The technique used is T-Distributed Stochastic Neighbor Embedding (T-SNE), a dimension-reduction technique that represents high-dimensional datasets in two- or three-dimensional spaces so that they can be visualized.

Table 11.
Comparison to Other Methods