Text-Preprocessing Model Youtube Comments in Indonesian
Model Text-Preprocessing Komentar Youtube Dalam Bahasa Indonesia
Abstract
YouTube is the most widely used platform in Indonesia, reaching 88% of the country's internet users. The volume of Indonesian-language comments posted by YouTube users has grown massively, and these comments can serve as datasets for examining the polarization of public opinion on government policies. The main challenge in opinion analysis is preprocessing, especially normalizing noise such as stop words and slang words. This research designs several preprocessing models for a YouTube comment dataset and examines their effect on the accuracy of sentiment analysis. The types of preprocessing used include standard Indonesian text preprocessing, removal of stop words and of subjects or objects, and conversion of slang to standard words according to the Indonesian Dictionary (KBBI). Four preprocessing scenarios are designed to show the impact of each type of preprocessing on the accuracy of the model. The investigation uses two feature sets: unigrams, and a combination of unigrams and bigrams. Count-Vectorizer and TF-IDF-Vectorizer are used to extract the features. The experiments show that unigram features perform better than the combination of unigram and bigram features. Converting slang words to standard words raises the accuracy of the model, and removing stop words also contributes to higher accuracy. In conclusion, the combination of standard preprocessing, stop-word removal, and conversion of Indonesian slang to standard words based on the Indonesian Dictionary (KBBI) raises accuracy by almost 3.5% with the unigram feature.
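The preprocessing steps and the two feature sets described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the slang mapping, the stop-word list, the sample comment, and the function names `preprocess` and `ngram_counts` are all hypothetical, and plain n-gram counting from the Python standard library stands in for the vectorizers named above.

```python
import re
from collections import Counter

# Hypothetical toy resources: the paper's actual stop-word list and
# slang-to-KBBI dictionary are not given in the abstract.
SLANG_TO_STANDARD = {"gak": "tidak", "bgt": "sekali", "yg": "yang"}
STOP_WORDS = {"yang", "ini", "itu", "dan", "di"}


def preprocess(comment, normalize_slang=True, remove_stop_words=True):
    """Lowercase, strip non-letters, then optionally normalize slang and drop stop words."""
    tokens = re.sub(r"[^a-z\s]", " ", comment.lower()).split()
    if normalize_slang:
        # Replace each slang token with its standard form when one is known.
        tokens = [SLANG_TO_STANDARD.get(t, t) for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngram_counts(tokens, max_n=1):
    """Count all n-grams for n = 1..max_n (max_n=2 gives unigram plus bigram features)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample comment containing slang ("gak", "bgt", "yg").
comment = "Kebijakan ini gak bagus bgt, yg rugi rakyat!"
tokens = preprocess(comment)
unigram_features = ngram_counts(tokens, max_n=1)
unigram_bigram_features = ngram_counts(tokens, max_n=2)
```

In this sketch slang normalization runs before stop-word removal, so that a slang form such as "yg" is first mapped to "yang" and can then be matched by the stop-word list. Whether the paper applies the steps in this order is an assumption.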
Copyright (c) 2020 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author.
- The author acknowledges that the RESTI Journal (System Engineering and Information Technology) is the first publisher of the work, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate arrangements for the non-exclusive distribution of manuscripts published in this journal in other versions (e.g. deposit in the author's institutional repository or publication in a book), provided they acknowledge that the manuscript was first published in the RESTI (Rekayasa Sistem dan Teknologi Informasi) journal.