The Accuracy Comparison Between Word2Vec and FastText On Sentiment Analysis of Hotel Reviews

Siti Khomsah; Rima Dias Ramadhani; Sena Wijaya

doi:10.29207/resti.v6i3.3711

Siti Khomsah Institut Teknologi Telkom Purwokerto https://orcid.org/0000-0002-9967-4341
Rima Dias Ramadhani Institut Teknologi Telkom Purwokerto
Sena Wijaya Institut Teknologi Telkom Purwokerto

Keywords: Word2Vec, FastText, sentiment analysis, hotel review

Abstract

Word embedding produces more compact word vectors than Bag-of-Words. It also overcomes the loss of information about sentence context, word order, and semantic relationships between words in a sentence. Several word embedding methods are commonly considered for sentiment analysis, among them Word2Vec and FastText. FastText operates on character n-grams, while Word2Vec operates on whole words. This research compares the accuracy of sentiment analysis models built on Word2Vec and on FastText. Both are tested on sentiment analysis of Indonesian hotel reviews, using a dataset from TripAdvisor. Word2Vec and FastText both use the Skip-gram model and share the same parameters: number of features, minimum word count, number of parallel threads, and context window size. The resulting vectors are fed to ensemble learning classifiers: Random Forest, Extra Trees, and AdaBoost. A Decision Tree serves as the baseline for measuring the performance of both models. The results show that both FastText and Word2Vec increase accuracy with Random Forest and Extra Trees. FastText reaches higher accuracy than Word2Vec when Extra Trees and Random Forest are used as classifiers. FastText raises accuracy by 8 percentage points over the baseline (Decision Tree, 85%), reaching 93% with 100 estimators.
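The difference between word-level and n-gram-level units described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function name `char_ngrams`, the `<` and `>` boundary markers, and the default n-gram range of 3 to 6 are assumptions that follow the usual description of FastText's subword scheme, and the Indonesian words "kamar" (room) and "kamarnya" (the room) are illustrative examples, not taken from the dataset.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word (assumed FastText-style scheme).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# A word-level model such as Word2Vec treats each surface form as one unit,
# so "kamar" and "kamarnya" are unrelated vocabulary entries.
word_level_overlap = {"kamar"} & {"kamarnya"}

# An n-gram-level model sees the subword units the two forms share.
shared = set(char_ngrams("kamar", 3, 3)) & set(char_ngrams("kamarnya", 3, 3))

print(char_ngrams("where", 3, 3))
print(sorted(shared))
print(len(word_level_overlap))
```

Because related surface forms share n-grams, a subword model can relate inflected or misspelled review words that a word-level vocabulary would treat as separate entries or as out-of-vocabulary.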
