Big Five Personality Assessment Using KNN Method with RoBERTa
Abstract
Personality is the characteristic way a person responds to and interacts with others, and it is often defined as the quality that distinguishes one individual from another. Personality is commonly described along five dimensions known as the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN). Social media was created to help people communicate remotely and easily. This study analyzes Twitter user behavior to predict Big Five personality traits with the K-Nearest Neighbour (KNN) method, which classifies an object according to the training samples closest to it. To overcome class imbalance in the training data, we use K-Means SMOTE (Synthetic Minority Oversampling Technique). Additional components, namely LIWC (Linguistic Inquiry and Word Count) features, Information Gain, the Robustly Optimized BERT Approach (RoBERTa), and hyperparameter tuning, can improve the performance of the system we build. The key criterion for evaluating the method is its accuracy in classifying the Big Five personality traits. The experimental results show that the KNN method reaches an accuracy of 72.09%, a 95.28% gain over the specified baseline.
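To make the classification rule concrete, the following is a minimal sketch of KNN majority voting in plain Python. It is not the study's implementation: the two-dimensional feature vectors, the "low"/"high" trait labels, and the function name `knn_predict` are hypothetical, and the feature extraction (LIWC, RoBERTa), Information Gain selection, K-Means SMOTE, and hyperparameter tuning steps are deliberately left out.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training vector.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest neighbours (the sort is stable, so ties keep input order).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each tuple stands in for a user's feature vector,
# and each label for a low/high score on a single trait.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, (0.2, 0.2), k=3))  # low
print(knn_predict(train_X, train_y, (0.8, 0.8), k=3))  # high
```

In this sketch the number of neighbours `k` is the only hyperparameter; a tuning step would typically compare several values of `k` on held-out data and keep the one with the best accuracy.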
Copyright (c) 2022 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author.
- The author acknowledges that Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) is the first publisher of the article, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate, additional arrangements for the non-exclusive distribution of the published version of the manuscript (e.g., depositing it in the author's institutional repository or publishing it in a book), provided they acknowledge that the manuscript was first published in Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi).