Speech Emotion Recognition (SER) is a technology that aims to detect and classify emotions in speech signals. It uses machine learning algorithms to analyze audio recordings of human speech and identify the emotions conveyed by the speaker. SER systems typically work by extracting acoustic features from the speech signal, such as pitch, loudness, and spectral characteristics, and feeding these features into a machine learning model. The model is trained on a dataset of speech recordings that have been manually labeled with the corresponding emotions, such as happiness, sadness, anger, or fear; once trained, it can classify the emotions in new speech signals. SER has a wide range of potential applications: in call centers, it can monitor the emotional state of customers and improve customer service; in healthcare, it can help diagnose and monitor mental health conditions such as depression or anxiety. Conventional methods for recognizing emotions from speech typically entail the manual selection and design of features, including spectral and prosodic traits, that capture relevant speech information. Because these methods rely on expert knowledge and assumptions about which features matter, their ability to capture the full range of variability in speech can be constrained. SER also remains an active area of research, with open challenges such as variations in accent, background noise, and individual differences in speech patterns.
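As a minimal illustration of the hand-crafted-feature pipeline described above, the sketch below frames a toy signal and computes per-frame energy (a loudness proxy), zero-crossing rate, and spectral centroid with plain numpy. The feature choices, frame sizes, and function names are illustrative assumptions, not those of any particular SER system:

```python
import numpy as np

def frame_signal(y, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx]

def handcrafted_features(y, sr=16000):
    """Per-frame energy, zero-crossing rate, and spectral centroid --
    stand-ins for the loudness, prosodic, and spectral traits used in
    classic SER pipelines -- summarised by their mean and std."""
    frames = frame_signal(y)
    energy = np.sqrt(np.mean(frames ** 2, axis=1))             # loudness proxy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-10)
    feats = np.stack([energy, zcr, centroid])
    # Summary statistics give one fixed-length vector per utterance,
    # suitable as input to a conventional classifier (e.g. an SVM).
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

# Toy input: one second of a 220 Hz tone standing in for a speech clip.
sr = 16000
t = np.arange(sr) / sr
vec = handcrafted_features(np.sin(2 * np.pi * 220 * t), sr)
print(vec.shape)  # (6,)
```

In a real pipeline these summary vectors, computed over a labeled corpus, would be the training inputs to the classifier mentioned above.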
Deep learning techniques are being used to address the challenges of recognizing emotions in speech. By utilizing deep neural networks, these techniques can learn directly from raw speech data, identifying patterns and relationships within it. This enables the system to capture subtle and nuanced variations in speech that are difficult to capture with traditional, hand-crafted approaches.
In a new study published in the peer-reviewed journal Applied Acoustics, Professor Zhen-Tao Liu and co-workers from China University of Geosciences, Wuhan, China, proposed a new methodology for Speech Emotion Recognition (SER) that applies data augmentation directly to speech samples by resampling. This addresses the imbalanced sample distribution among emotional categories and provides sufficient inputs for the deep neural network model. To reflect the dynamic characteristics of the speech signal and increase the diversity of input features, the velocity (delta) and acceleration (delta-delta) coefficients of the log-Mel spectrograms are calculated to construct triple-channel inputs. Time-domain and frequency-domain filters were then used to process the triple-channel log-Mel spectrograms, further increasing feature diversity. The model comprises three main parts: triple-channel log-Mel spectrogram generation, feature extraction using a Convolutional Neural Network and an attention-based Bidirectional Long Short-Term Memory network, and output with multi-task learning. Z-score normalization was applied to the features to alleviate variations among different speakers. Two different shapes of convolutional filter were applied to the triple-channel log-Mel spectrograms to procure relevant information in the frequency domain and the time domain, respectively. Low-level features were extracted by the convolutional neural network; these were then concatenated and fed into the attention-based bidirectional long short-term memory network, whose attention layer produced high-level features. The output was created via multi-task learning, which enabled the model to recognize the emotions of different speakers.
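The triple-channel input construction described above can be sketched in plain numpy. The paper's exact STFT settings are not reproduced here, so the frame length, hop, and number of mel bands below are common defaults, and `np.gradient` stands in for the usual regression-based delta formula:

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=40):
    """Triangular mel filters mapping FFT power bins onto the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def triple_channel_logmel(y, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Log-mel spectrogram plus its velocity (delta) and acceleration
    (delta-delta) along time, z-score normalised per channel and stacked
    into a 3-channel array, mirroring the paper's input construction."""
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    logmel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10).T
    delta = np.gradient(logmel, axis=1)        # velocity along time
    delta2 = np.gradient(delta, axis=1)        # acceleration along time
    chans = np.stack([logmel, delta, delta2])  # (3, n_mels, T)
    mu = chans.mean(axis=(1, 2), keepdims=True)
    sd = chans.std(axis=(1, 2), keepdims=True) + 1e-10
    return (chans - mu) / sd

sr = 16000
y = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s toy "utterance"
x = triple_channel_logmel(y)
print(x.shape)  # (3, 40, 98) -- channels x mel bands x time frames
```

The resulting 3-channel array plays the same role as a 3-channel image, which is what lets standard 2-D convolutional filters of different shapes sweep it along the frequency and time axes.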
The research team evaluated how two techniques, balanced augmentation and time-frequency filters, affected the deep learning model's classification performance. The results demonstrated that balanced augmentation enhanced classification performance, raising both the weighted accuracy rate (WAR) and the unweighted accuracy rate (UAR) on both databases. The model achieved a WAR of 70.27% and a UAR of 66.27% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, and 60.90% and 61.83%, respectively, on the Multimodal Signal Processing (MSP)-IMPROV database, outperforming comparable works on both benchmarks.
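The two reported metrics are easy to pin down: WAR is plain overall accuracy, while UAR averages per-class recall so that rare emotion categories count as much as common ones, which is why it is the fairer measure on imbalanced corpora. A small sketch with toy labels (not the paper's data):

```python
import numpy as np

def war_uar(y_true, y_pred):
    """WAR: overall accuracy (each sample weighted equally, so majority
    classes dominate). UAR: recall averaged over classes (each emotion
    weighted equally, regardless of how many samples it has)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    war = np.mean(y_true == y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    uar = np.mean(recalls)
    return war, uar

# Imbalanced toy labels: four "neutral" (0) samples, one "angry" (1).
war, uar = war_uar([0, 0, 0, 0, 1], [0, 0, 0, 1, 1])
print(war, uar)  # 0.8 0.875 -- UAR rewards getting the rare class right
```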
In a nutshell, Professor Zhen-Tao Liu and colleagues introduced a new method to recognize speech emotions by utilizing deep learning approaches such as convolutional neural networks (CNNs) and bidirectional long short-term memory networks (BLSTMs), which have previously been shown to be effective in other speech processing domains. The study also used multi-task learning and balanced augmented sampling to address the challenges of speech emotion recognition and enhance the model's generalization performance and accuracy. To further establish this technique's effectiveness in real-world settings, future studies should apply it to practical applications and assess its impact.
The new SER method developed by Professor Zhen-Tao Liu and colleagues has potential applications across many fields, including the diagnosis and monitoring of mental health conditions such as depression and anxiety by detecting patients' emotional states through their speech. It can also be used in therapy sessions to provide real-time feedback to therapists on the emotional states of their patients. Moreover, it can be used in call centers to monitor the emotional state of customers and improve customer service. For example, if a customer is angry, the system can automatically transfer the call to a specialized agent who can handle the situation. Furthermore, the new SER method can improve human-computer interaction by allowing machines to respond to the emotional state of the user. For example, if the user is frustrated, the system can provide more detailed instructions or guidance to help them complete a task. Indeed, SER has many potential applications across fields, and as the technology continues to improve, it is likely to become even more widespread in the future.
Zhen-Tao Liu, Meng-Ting Han, Bao-Han Wu, Abdul Rehman. Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning. Applied Acoustics, Volume 202, January 2023, 109178.