Author: Lu, Chih-Chuan
Thesis title: Observe Critical Data in Emotion Recognition Using a Speech Front-End Network Learned from Media Data In-the-Wild
Advisor: Lee, Chi-Chun
Oral defense committee: Tsao, Yu
Hu, Min-Chun
Lai, Ying-Hui
Keywords: speech emotion recognition; convolutional neural network; speech front-end network; initialization fine-tuning
The rapid development of deep learning has advanced speech emotion recognition (SER), yet the complexity of emotion still makes it difficult to obtain large-scale annotated data quickly and to handle the high variability across domains. The initialization and fine-tuning strategy is a common remedy in deep learning research. However, abundant media data used as-is can still differ substantially from the SER task, and introducing emotion-related guidance can help close this gap. In this work, we propose to learn an initialization speech front-end network on large-scale media data collected in the wild, jointly with proxy arousal-valence labels derived multimodally from audio and text. We then build the SER prediction model by fine-tuning, assisted by an initialization-oriented sampling method. The results show that combining the speech front-end network with the sampling method achieves better performance than random initialization.
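The abstract does not spell out how the proxy labels are computed beyond the rule-based arousal and lexicon-based valence approaches named in Sections 3.1.1 and 3.1.2. The Python sketch below illustrates one possible form under loudly stated assumptions: each utterance is represented as per-frame energies plus a transcript, arousal is "high" when an utterance's mean energy exceeds the corpus median, and valence is the sign of the average score from a toy word lexicon. The names (`arousal_proxy`, `valence_proxy`, `proxy_labels`, `TOY_VALENCE_LEXICON`) and every threshold and score are hypothetical, not taken from the thesis.

```python
from statistics import mean, median

# HYPOTHETICAL toy lexicon (word -> valence score in [-1, 1]); not the thesis's lexicon.
TOY_VALENCE_LEXICON = {
    "great": 1.0, "love": 0.8, "good": 0.6,
    "bad": -0.6, "hate": -0.9, "awful": -1.0,
}


def arousal_proxy(frame_energies, threshold):
    """Assumed rule-based proxy: 1 = high arousal if mean frame energy exceeds threshold."""
    return 1 if mean(frame_energies) > threshold else 0


def valence_proxy(transcript, lexicon=TOY_VALENCE_LEXICON):
    """Lexicon-based proxy: sign of the mean score of lexicon words.

    Returns 1 (positive), -1 (negative), or 0 (neutral or no lexicon hits).
    """
    scores = [lexicon[w] for w in transcript.lower().split() if w in lexicon]
    if not scores:
        return 0
    avg = mean(scores)
    return 1 if avg > 0 else (-1 if avg < 0 else 0)


def proxy_labels(utterances):
    """Return (arousal, valence) proxy labels for (frame_energies, transcript) pairs.

    The arousal threshold is the median of per-utterance mean energies over the
    whole corpus, an assumption that labels roughly half the data high-arousal.
    """
    threshold = median(mean(energies) for energies, _ in utterances)
    return [(arousal_proxy(e, threshold), valence_proxy(t)) for e, t in utterances]


# Toy "in-the-wild" corpus: (per-frame energies, transcript) pairs.
corpus = [
    ([0.9, 0.9], "I love this great talk"),
    ([0.1, 0.1], "that was bad"),
    ([0.6, 0.6], "a good point"),
    ([0.3, 0.3], "we discuss data"),
]
print(proxy_labels(corpus))  # [(1, 1), (0, -1), (1, 1), (0, 0)]
```

Labels derived this way are noisy by design; in the setup the abstract describes, such proxy labels only guide the learning of the front-end network before it is fine-tuned on the target database (IEMOCAP).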
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Databases and Preprocessing
2.1 Database Overview
2.1.1 Background Database: TED-LIUM
2.1.2 Target Database: IEMOCAP
2.2 Data Preprocessing
2.2.1 Speech Data
2.2.2 Label Data
Chapter 3 Methodology
3.1 Proxy Labels
3.1.1 Rule-Based Arousal Labels
3.1.2 Lexicon-Based Valence Labels
3.2 Neural Networks
3.2.1 Deep Neural Network (DNN)
3.2.2 Convolutional Neural Network (CNN)
3.3 Initialization and Fine-Tuning
3.4 Training and Applying the Speech Front-End Network
3.4.1 Initialization Network
3.4.2 Sampling Method and Fine-Tuning Network
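Chapter 4 Experimental Design and Result Analysis
4.1 Experimental Design
4.2 Experiment 1: Front-End Network Architecture
4.3 Experiment 2: Fine-Tuning on the Target Database with a Small Amount of Data
4.4 Experiment 3: Sampling Critical Data
4.5 Experiment 4: Different Sampling Parameters
4.6 Analysis of Experimental Results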
Chapter 5 Conclusion and Future Work
References
