Author: Lin, Hsiao-Pin
Thesis title: Adapt a New Emotion Class Detection by Speech using Mixture of Emotional Experts
Advisor: Lee, Chi-Chun
Keywords: speech emotion recognition; mixture of experts; few-shot learning
Most speech emotion recognition (SER) research focuses on classifying four emotions: neutral, angry, sad, and happy. To apply SER in real life, however, other emotions cannot be ignored. Humans have hundreds of emotions, and retraining on a large amount of data for each one would take too much time. The usual solution is to fine-tune an existing pre-trained model with a small amount of target data, but models pre-trained on categorical emotion labels alone do not give good results. Fortunately, recent research indicates that dimensional emotion labels help classify categorical emotions. Building on this idea, this study proposes a mixture of emotional experts (MOEE) for few-shot detection of a new speech emotion class. A small sample of the target emotion is used to fine-tune models pre-trained on the four emotion categories and on dimensional emotion labels, and a gating network learns the expert weights from the audio data combined with the distances between experts. On the IEMOCAP database, frustration detection reaches 63.26% unweighted average recall (UAR). On the MSP-PODCAST database, detection of surprise, disgust, and contempt with only 10 fine-tuning samples exceeds the results of training on all the data. For analysis, the per-expert weights that MOEE outputs can be used to analyze the similarity between emotions, which distinguishes MOEE from other few-shot learning methods.
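As a rough illustration of the gating idea described in the abstract, the sketch below shows how a gate might turn audio features and expert distances into one weight per expert, and how those weights might combine the experts' outputs. It is not the thesis's implementation: the linear gate, all parameter values, the feature sizes, and every function name are assumptions made only for illustration, and the actual expert networks and distance measure are defined in the thesis itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(audio_features, expert_distances, gate_matrix, gate_bias):
    """Map [audio features + expert distances] to one weight per expert.

    Assumption: a single linear layer followed by softmax stands in for
    the gating network; gate_matrix has one row per expert.
    """
    gate_input = list(audio_features) + list(expert_distances)
    logits = [
        sum(w * x for w, x in zip(row, gate_input)) + b
        for row, b in zip(gate_matrix, gate_bias)
    ]
    return softmax(logits)


def combine_experts(expert_scores, weights):
    """Weighted sum of each expert's score for the new emotion class."""
    return sum(w * s for w, s in zip(weights, expert_scores))


# Hypothetical toy values: 2 audio features, 3 experts, 3 distances.
audio_features = [0.2, -0.1]
expert_distances = [0.5, 1.0, 0.3]
gate_matrix = [
    [0.1, 0.0, -1.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, -1.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, -1.0],
]
gate_bias = [0.0, 0.0, 0.0]
expert_scores = [0.7, 0.4, 0.9]  # each expert's score for the target emotion

weights = gate_weights(audio_features, expert_distances, gate_matrix, gate_bias)
score = combine_experts(expert_scores, weights)
```

In this toy setup a larger distance lowers an expert's weight, so the third expert (smallest distance) contributes most. Because the weights sum to one, they can be read directly as a per-expert contribution, which is the property the abstract uses for emotion-similarity analysis.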
Abstract (in Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Database and Feature
2.3 Feature
Chapter 3 Methodology
3.1 Framework
3.1.1 Deep Neural Networks (DNN) and Gated Recurrent Unit (GRU)
3.1.2 Network of emotional experts
3.2 Training of emotional experts
3.3 Distance of emotional experts
3.4 Gating Network
Chapter 4 Experiment
4.1 Experimental Setup
4.1.1 Network Configurations
Chapter 5 Results and Analysis
5.1.1 Exp. 1-1 Comparison of Only Train by New Emotion and Tuning-Based Transfer
5.1.2 Exp. 1-2 Comparison of “An Enroll-to-Verify Approach for Cross-Task Unseen Emotion Class Recognition”
5.2 Exp. 2 Comparison of Different Distance of Experts
5.3 Exp. 3 Comparison of Different Combinations of Experts
5.4 Exp. 4 Comparison of Ensemble Approaches
5.5 Analysis
5.5.1 Effects on the Number of Fine-tune Samples
5.5.2 VAD Statistic
5.5.3 Weight Analysis
Chapter 6 Conclusions
Reference
Appendix

