Speaker Identification Using a Convolutional Neural Network

  • Suci Dwijayanti, Universitas Sriwijaya
  • Alvio Yunita Putri, Teknik Elektro, Universitas Sriwijaya
  • Bhakti Yudho Suprapto, Teknik Elektro, Universitas Sriwijaya
Keywords: speaker identification, CNN, spectrogram, feature extraction


Speech, a mode of communication between humans and machines, has various applications, including biometric systems for identifying people who have access to secure systems. Feature extraction is an important factor in achieving high speech-recognition accuracy. Therefore, we used the spectrogram, a pictorial representation of speech, as the raw feature for identifying speakers. These features were fed into a convolutional neural network (CNN), and a CNN-visual geometry group (CNN-VGG) architecture was used to recognize the speakers. We used 780 primary data samples from 78 speakers, each of whom uttered a number in Bahasa Indonesia. The proposed architecture, CNN-VGG-f, was trained with a learning rate of 0.001, a batch size of 256, and 100 epochs. The results indicate that this architecture can generate a suitable model for speaker identification, and the spectrogram proved to be the best feature for identifying the speakers. The proposed method exhibited an accuracy of 98.78%, which is significantly higher than the accuracies of the method involving Mel-frequency cepstral coefficients (MFCCs; 34.62%) and the combination of MFCCs and deltas (26.92%). Overall, CNN-VGG-f with the spectrogram can identify 77 of the 78 speakers in the samples, validating the usefulness of combining spectrograms and CNNs in speech recognition applications.
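The spectrogram feature described above is the magnitude of a short-time Fourier transform (STFT) of the speech signal. As an illustration only (the paper does not specify its frame length, hop size, or window), the following NumPy sketch computes such a spectrogram with an assumed 256-sample Hann window and 50% overlap:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT.

    frame_len and hop are illustrative choices, not the paper's settings.
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT magnitude per frame; this 2-D array is what would be
    # rendered as an image and fed to the CNN.
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: a 1 kHz tone sampled at 16 kHz for one second
fs = 16000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (124, 129): 124 frames, 129 frequency bins
```

With a bin resolution of fs / frame_len = 62.5 Hz, the energy of the 1 kHz tone concentrates in bin 16, which is how a spectrogram exposes the speaker-dependent spectral patterns the CNN learns from.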





How to Cite
Dwijayanti, S., Putri, A. Y., & Suprapto, B. Y. (2022). Speaker Identification Using a Convolutional Neural Network. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 6(1), 140 - 145. https://doi.org/10.29207/resti.v6i1.3795
