Indonesian Online News Topics Classification using Word2Vec and K-Nearest Neighbor
Abstract
News is information disseminated through newspapers, radio, television, the internet, and other media. According to survey results, a large number of news titles covering various topics circulate on the internet, which makes it difficult for readers to find news on the topic they want to read. This problem can be addressed by automatically grouping news by topic, that is, by classification. This study classifies Indonesian-language news into several topics using the K-Nearest Neighbor (KNN) classification model, with word2vec, a word embedding model, used to convert words into vectors to facilitate the classification process. The study also determines the optimal value of K for KNN. The combination of word2vec and KNN achieves an accuracy of 89.2% at K=7 and outperforms the support vector machine, logistic regression, and random forest classification models.
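The pipeline summarized in the abstract (words converted to vectors, then KNN with a tuned K) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the tiny hand-written `EMBEDDINGS` table stands in for a trained word2vec model, titles are represented by averaging their word vectors, similarity is measured with cosine, and the toy titles, labels, and candidate K values are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors standing in for trained word2vec output.
EMBEDDINGS = {
    "gol": [0.9, 0.1, 0.0],
    "liga": [0.8, 0.2, 0.1],
    "pemain": [0.85, 0.05, 0.1],
    "saham": [0.1, 0.9, 0.1],
    "rupiah": [0.0, 0.95, 0.2],
    "bank": [0.1, 0.85, 0.15],
    "ponsel": [0.1, 0.1, 0.9],
    "aplikasi": [0.05, 0.2, 0.85],
    "internet": [0.15, 0.1, 0.8],
}


def title_vector(title):
    """Average the vectors of known words (assumed pooling); skip unknown words."""
    vectors = [EMBEDDINGS[w] for w in title.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def knn_predict(train, title, k):
    """Label a title by majority vote among its k most similar training titles."""
    query = title_vector(title)
    ranked = sorted(
        train,
        key=lambda item: cosine(query, title_vector(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation titles whose predicted label matches the true one."""
    hits = sum(knn_predict(train, title, k) == label for title, label in validation)
    return hits / len(validation)


# Hypothetical toy data: (title, topic) pairs.
TRAIN = [
    ("gol liga", "olahraga"),
    ("pemain liga", "olahraga"),
    ("gol pemain", "olahraga"),
    ("saham bank", "ekonomi"),
    ("rupiah bank", "ekonomi"),
    ("saham rupiah", "ekonomi"),
    ("ponsel aplikasi", "teknologi"),
    ("aplikasi internet", "teknologi"),
    ("ponsel internet", "teknologi"),
]
VALIDATION = [
    ("pemain gol liga", "olahraga"),
    ("bank saham rupiah", "ekonomi"),
    ("internet ponsel", "teknologi"),
]

# Pick the candidate K with the highest validation accuracy (first one wins ties).
best_k = max([1, 3, 5], key=lambda k: accuracy(TRAIN, VALIDATION, k))
print(best_k, knn_predict(TRAIN, "liga gol", best_k))
```

In practice the embedding table would be replaced by vectors from a word2vec model trained on Indonesian text, and K would be chosen over a wider range on a held-out split; the K=7 reported in the abstract is the result of the authors' own search on their data, not something this sketch reproduces.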
Copyright (c) 2021 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author
- The author acknowledges that Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) is the first publisher of the work, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate arrangements for the non-exclusive distribution of the manuscript as published in this journal in other forms (e.g., depositing it in the author's institutional repository or publishing it in a book), provided they acknowledge that the manuscript was first published in Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi).