Speech Emotion Detection in On-Demand Media using SVM and LSTM

  • Ainurrochman, Universitas Narotama
  • Derry Pramono Adi, Universitas Narotama
  • Agustinus Bimo Gumelar, Universitas Narotama
Keywords: Speech Emotion Detection, Media On-Demand, SVM, LSTM, Deep Learning


To date, many speech datasets with emotion classes exist, but they rely on impromptu or scripted acting: native speakers are given a stimulus for each emotional expression. Because natural conversation from secretly recorded daily communication raises ethical issues, using voice data sampled from movies and podcasts is the most appropriate way to draw the best insights from speech. Professional actors are trained to induce emotions that come closest to natural ones through the Stanislavski acting method. The speech dataset that meets this qualification is the Human voice Natural Language from On-demand media (HENLO). HENLO contains per-emotion audio clips of films and podcasts originating from Media On-Demand, a motion-video entertainment platform that allows playback and download at any time. In this paper, we describe the use of sound clips from HENLO and conduct learning using a Support Vector Machine (SVM) and Long Short-Term Memory (LSTM). Combining the two methods, we found the best strategy to be training the LSTM first and then feeding the trained model's output to the SVM, with the data split at an 80:20 ratio. Across the five training phases, the final accuracy increased by more than 17% compared to the first phase. These results show that the two methods complement each other and that both are important for improving classification accuracy.
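The LSTM-then-SVM pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the HENLO clips are replaced by synthetic random feature sequences, the four emotion classes and all dimensions are hypothetical, and the LSTM weights are left untrained (in the paper, the LSTM is trained first and only its learned representation is handed to the SVM). The sketch shows only the data flow: an LSTM layer summarizes each clip's frame sequence into a fixed-length vector, an 80:20 train/test split is applied, and an SVM classifies the vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def lstm_features(X, Wx, Wh, b):
    """Run one LSTM layer over each sequence; return the last hidden state."""
    n, T, d = X.shape
    H = Wh.shape[0]
    feats = np.zeros((n, H))
    for i in range(n):
        h = np.zeros(H)
        c = np.zeros(H)
        for t in range(T):
            z = X[i, t] @ Wx + h @ Wh + b            # all four gates at once, shape (4H,)
            i_g, f_g, o_g, g = np.split(z, 4)
            i_g = 1.0 / (1.0 + np.exp(-i_g))         # input gate
            f_g = 1.0 / (1.0 + np.exp(-f_g))         # forget gate
            o_g = 1.0 / (1.0 + np.exp(-o_g))         # output gate
            g = np.tanh(g)                           # candidate cell state
            c = f_g * c + i_g * g
            h = o_g * np.tanh(c)
        feats[i] = h                                 # clip-level feature vector
    return feats

# Synthetic stand-in for per-clip acoustic frame sequences (e.g. spectral frames)
n_clips, T, d, H = 200, 30, 8, 16
X = rng.normal(size=(n_clips, T, d))
y = rng.integers(0, 4, size=n_clips)                 # four hypothetical emotion classes

# Random LSTM weights stand in for the network trained in the first stage
Wx = rng.normal(scale=0.1, size=(d, 4 * H))
Wh = rng.normal(scale=0.1, size=(H, 4 * H))
b = np.zeros(4 * H)

feats = lstm_features(X, Wx, Wh, b)

# 80:20 split, as in the paper, then an SVM on the LSTM-derived features
X_tr, X_te, y_tr, y_te = train_test_split(feats, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

With random weights and random labels the accuracy is near chance; the point of the sketch is the hand-off of fixed-length LSTM representations to the SVM, which is where the paper reports the two methods complementing each other.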





How to Cite
Ainurrochman, Adi, D. P., & Gumelar, A. B. (2020). Deteksi Emosi Wicara pada Media On-Demand menggunakan SVM dan LSTM. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 4(5), 799-804.