Text-Preprocessing Model Youtube Comments in Indonesian

Model Text-Preprocessing Komentar Youtube Dalam Bahasa Indonesia

  • Siti Khomsah, Institut Teknologi Telkom Purwokerto
  • Agus Sasmito Aribowo, UPN Veteran Yogyakarta
Keywords: YouTube comments, sentiment analysis, text preprocessing, slang-word, N-Gram


YouTube is the most widely used platform in Indonesia, reaching 88% of the country's internet users. The volume of Indonesian-language YouTube comments has grown massively, and these comments can serve as datasets for studying the polarization of public opinion on government policies. The main challenge in opinion analysis is preprocessing, especially normalizing noise such as stop words and slang words. This research designs several preprocessing models for a YouTube comment dataset and then measures their effect on the accuracy of sentiment analysis. The preprocessing types used are standard Indonesian text processing, removal of stop words and of subjects or objects, and conversion of slang to standard words according to the Indonesian Dictionary (KBBI). Four preprocessing scenarios are designed to show the impact of each preprocessing type on model accuracy. The investigation uses two feature sets: unigrams, and a combination of unigrams and bigrams. Count-Vectorizer and TF-IDF-Vectorizer are used to extract the features. The experiments show that unigram features perform better than the combination of unigram and bigram features. Converting slang words to standard words raises the accuracy of the model, and removing stop words also contributes to higher accuracy. In conclusion, the combination of standard preprocessing, stop-word removal, and conversion of Indonesian slang to standard words based on the KBBI raises accuracy by almost 3.5% with unigram features.
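The preprocessing steps and the two feature sets described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The slang table, the stop-word list, the sample comment, and the function and flag names (`preprocess`, `normalize_slang`, `remove_stop_words`, `ngram_counts`) are all hypothetical, chosen only to show how slang conversion, stop-word removal, and unigram or unigram-bigram counting fit together. The abstract does not spell out the four scenarios, so the flags below are just one way to toggle individual steps.

```python
import re
from collections import Counter

# Hypothetical, tiny lookup tables for illustration only; the paper's actual
# slang dictionary and stop-word list are not given in the abstract.
SLANG_TO_STANDARD = {"gk": "tidak", "bgs": "bagus", "bgt": "sekali", "yg": "yang"}
STOP_WORDS = {"yang", "ini", "itu", "dan", "di"}


def standard_clean(text):
    """Lowercase, drop URLs, replace non-letters with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def preprocess(text, normalize_slang=False, remove_stop_words=False):
    """Apply standard cleaning, then the optional steps selected by the flags."""
    tokens = standard_clean(text)
    if normalize_slang:
        # Replace each slang token with its standard form when one is known.
        tokens = [SLANG_TO_STANDARD.get(t, t) for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngram_counts(tokens, n_min=1, n_max=1):
    """Count all n-grams with n_min <= n <= n_max (raw term counts)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up comment, roughly "This policy is really not good".
comment = "Kebijakan ini gk bgs bgt!!! https://example.com"
tokens = preprocess(comment, normalize_slang=True, remove_stop_words=True)
unigrams = ngram_counts(tokens, 1, 1)      # unigram features only
uni_bigrams = ngram_counts(tokens, 1, 2)   # unigram + bigram features
```

Slang conversion runs before stop-word removal here so that a slang form such as "yg" is first mapped to "yang" and can then be matched by the stop-word list. In a library setting, scikit-learn's `CountVectorizer` and `TfidfVectorizer` accept an `ngram_range` argument that plays the role of `n_min` and `n_max` in this sketch.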



How to Cite
Khomsah, S., & Aribowo, A. S. (2020). Text-Preprocessing Model Youtube Comments in Indonesian. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 4(4), 648–654. https://doi.org/10.29207/resti.v4i4.2035
Information Systems Engineering Articles