Emotion recognition from speech using deep neural networks

Date
2020
Publisher
UMT Lahore
Abstract
Emotions play an important role in human social interaction, and it is often said that emotions separate us from machines. Spoken words may have different interpretations depending on how they are uttered: the same sentence can carry different meanings under different emotional states, and the human brain recovers these meanings by perceiving the underlying emotion in speech. Recognizing the emotional content of speech signals is desirable because it allows us to build emotional intelligence into computers. Speech emotion recognition (SER) is an important field of study with applications including emotionally intelligent robots, audio surveillance, web-based e-learning, and computer games. The objective of this dissertation is to identify the emotional state of a speaker from audio using deep learning algorithms, namely Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). For this purpose, the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song, is used to study speech emotion recognition. The experiments use approximately 1247 speech and song files covering eight different emotions. The results show that the CNN-based model performed best with an accuracy of 74.57%, while the RNN model reached only 60.00%, considerably lower. This work will be extended in the future using different variants of RNNs and other deep neural networks such as autoencoders. Audio is a complex signal with linguistic and paralinguistic features, and our future goal is to combine these features with different neural network architectures to develop improved SER systems.
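The abstract does not spell out the feature front end, but a common first step in SER pipelines of this kind is converting each audio file into MFCC (mel-frequency cepstral coefficient) frames before feeding a CNN or RNN. The sketch below shows that pipeline in plain NumPy; the window size, hop length, and filter counts are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_coeffs=13):
    """Toy MFCC extractor: frame -> power spectrum -> mel filterbank -> log -> DCT."""
    # Slice the waveform into overlapping Hann-windowed frames
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    hz_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    # Log mel energies, then a DCT-II to decorrelate; keep the first n_coeffs
    mel_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_mels))
    return mel_energy @ dct.T  # shape: (num_frames, n_coeffs)

# One second of a 440 Hz tone stands in for a RAVDESS utterance
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

The resulting `(frames, coefficients)` matrix is the kind of 2-D input a CNN can convolve over, or that an RNN can consume frame by frame; production systems would typically use a tuned library implementation (e.g. librosa) rather than this sketch.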