
Detailed Record

Author (Chinese): 林曉彬
Author (English): Lin, Hsiao-Pin
Title (Chinese): 使用多情緒專家模型偵測新進語音情緒類別
Title (English): Adapt a New Emotion Class Detection by Speech using Mixture of Emotional Experts
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee Members: 冀泰石, 曹昱
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061631
Year of Publication (R.O.C.): 111
Graduation Academic Year: 111
Language: English
Number of Pages: 41
Keywords (Chinese): 語音情緒辨識, 多專家模型, 小樣本學習
Keywords (English): speech emotion recognition, mixture of experts, few-shot learning
Most speech emotion recognition (SER) work focuses on four emotion classes: neutral, angry, sad, and happy. To deploy SER in real-life applications, however, other emotions cannot be ignored. Humans express hundreds of emotions, and retraining a model from scratch on large amounts of data for every new emotion is too time-consuming. Such problems are usually addressed by fine-tuning an existing pre-trained model with a small amount of target data, but a model pre-trained only on categorical emotion labels does not transfer well. Fortunately, recent research further shows that dimensional emotion labels can aid categorical emotion classification. Building on this idea, this study proposes a mixture of emotional experts (MOEE) to solve few-shot detection of new speech emotion classes: expert models pre-trained on the four categorical emotions and on dimensional emotion labels are fine-tuned with a small number of target-emotion samples, and a gating network learns expert weights from the audio data combined with inter-expert distances. On the IEMOCAP corpus, frustration detection reaches a UAR of 63.26%. On the MSP-PODCAST corpus, detection of surprise, disgust, and contempt fine-tuned with only 10 samples exceeds training on all available data. For analysis, the per-expert weights output by MOEE can be used to study the similarity between emotions, which distinguishes this approach from other few-shot learning methods.
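The abstract describes the core architecture: several pre-trained emotional experts (four-class categorical and dimensional) whose outputs are fused by a gating network that takes the audio representation together with inter-expert distances. The following is a minimal PyTorch-style sketch of that gating idea; the module names, layer sizes, and the way distances are fed to the gate are illustrative assumptions, not the thesis's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatingNetwork(nn.Module):
    # Learns one weight per expert from the audio feature plus a vector of
    # inter-expert distances (e.g., a divergence between expert outputs).
    def __init__(self, feat_dim: int, dist_dim: int, num_experts: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + dist_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_experts),
        )

    def forward(self, audio_feat, expert_dist):
        x = torch.cat([audio_feat, expert_dist], dim=-1)
        return F.softmax(self.net(x), dim=-1)          # (batch, num_experts)

class MixtureOfEmotionalExperts(nn.Module):
    # experts: pre-trained classifiers (categorical and dimensional), later
    # fine-tuned with a few target-emotion samples; each maps the audio
    # feature to a score for the new emotion class.
    def __init__(self, experts, feat_dim: int, dist_dim: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gate = GatingNetwork(feat_dim, dist_dim, len(experts))

    def forward(self, audio_feat, expert_dist):
        weights = self.gate(audio_feat, expert_dist)                          # (batch, E)
        scores = torch.stack([e(audio_feat) for e in self.experts], dim=1)    # (batch, E, 1)
        return (weights.unsqueeze(-1) * scores).sum(dim=1)                    # weighted fusion

# Hypothetical usage: two linear "experts" over a 128-dim utterance embedding.
if __name__ == "__main__":
    experts = [nn.Sequential(nn.Linear(128, 1), nn.Sigmoid()) for _ in range(2)]
    moee = MixtureOfEmotionalExperts(experts, feat_dim=128, dist_dim=2)
    out = moee(torch.randn(4, 128), torch.randn(4, 2))                        # shape (4, 1)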
Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Contents iv
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
Chapter 2 Database and Feature 5
2.1 IEMOCAP 5
2.2 MSP-PODCAST 6
2.3 Feature 8
Chapter 3 Methodology 9
3.1 Framework 9
3.1.1 Deep Neural Networks (DNN) and Gated Recurrent Unit (GRU) 10
3.1.2 Network of emotional experts 12
3.2 Training of emotional experts 14
3.3 Distance of emotional experts 15
3.4 Gating Network 16
Chapter 4 Experiment 17
4.1 Experimental Setup 17
4.1.1 Network Configurations 19
Chapter 5 Results and Analysis 20
5.1.1 Exp. 1-1 Comparison of Training Only on the New Emotion vs. Tuning-Based Transfer 20
5.1.2 Exp. 1-2 Comparison with “An Enroll-to-Verify Approach for Cross-Task Unseen Emotion Class Recognition” 23
5.2 Exp. 2 Comparison of Different Distance of Experts 24
5.3 Exp. 3 Comparison of Different Combinations of Experts 26
5.4 Exp. 4 Comparison of Ensemble Approaches 27
5.5 Analysis 28
5.5.1 Effect of the Number of Fine-tuning Samples 28
5.5.2 VAD Statistic 29
5.5.3 Weight Analysis 31
Chapter 6 Conclusions 34
References 36
Appendix 41

[1] M. Ayadi, M. S. Kamel and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572-587, March 2011.
[2] S. Ramakrishnan and I. M. M. E. Emary, "Speech emotion recognition approaches in human computer interaction," Telecommunication Systems, vol. 52, pp. 1467-1478, 2 September 2011.
[3] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143-19165, 2019.
[4] K. Oh, D. Lee, B. Ko and H. Choi, "A Chatbot for Psychiatric Counseling in Mental Healthcare Service Based on Emotional Dialogue Analysis and Sentence Generation," in IEEE International Conference on Mobile Data Management (MDM), 2017.
[5] V. Pitardi and H. R. Marriott, "Alexa, she's not human but… Unveiling the drivers of consumers' trust in voice-based artificial intelligence," Psychology & Marketing, 20 January 2021.
[6] B. G. C. Dellaert, S. B. Shu, T. A. Arentze, T. Baker, K. Diehl, B. Donkers, N. J. Fast, G. Häubl, H. Johnson, U. R. Karmarkar, H. Oppewal, B. H. Schmitt, J. Schroeder, S. A. Spiller and Steff, "Consumer decisions with artificially intelligent voice assistants," Marketing Letters, p. 335–347, 17 August 2020.
[7] A. B. Ingale and D. S. Chaudhari, "Speech Emotion Recognition," International Journal of Soft Computing and Engineering (IJSCE), pp. 235-238, March 2012.
[8] M. Swain, A. Routray and P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: a review," International Journal of Speech Technology, pp. 93-120, 19 January 2018.
[9] L. Muda, M. Begam and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, vol. 2, no. 3, March 2010.
[10] A. Baevski, H. Zhou, A. Mohamed and M. Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," Advances in Neural Information Processing Systems 33, pp. 12449-12460, 2020.
[11] Y.-L. Lin and G. Wei, "Speech emotion recognition based on HMM and SVM," International Conference on Machine Learning and Cybernetics, pp. 4898-4901, 2005.
[12] L. Tarantino, P. N. Garner and A. Lazaridis, "Self-Attention for Speech Emotion Recognition," Interspeech, pp. 2578-2582, 2019.
[13] J. Wang, M. Xue, R. Culhane, E. Diao, J. Ding and V. Tarokh, "Speech emotion recognition with dual-sequence LSTM architecture," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6474-6478, 2020.
[14] S. Latif, R. Rana, S. Khalifa, R. Jurdak and J. Epps, "Direct Modelling of Speech Emotion from Raw Speech," Interspeech 2019, pp. 3920-3924, 2019.
[15] J.-L. Li, T.-Y. Huang, C.-M. Chang and C.-C. Lee, "A waveform-feature dual branch acoustic embedding network for emotion recognition," Frontiers in Computer Science, vol. 2, p. 13, 2020.
[16] D. Wu, T. D. Parsons, E. Mower and S. Narayanan, "Speech emotion estimation in 3D space," 2010 IEEE International Conference on Multimedia and Expo, pp. 737-742, 2010.
[17] S. Gielen, E. Douglas-Cowie and R. Cowie, "Acoustic correlates of emotion dimensions in view of speech synthesis," Seventh European Conference on Speech Communication and Technology, 2001.
[18] R. Kehrein, "The prosody of authentic emotions," Speech Prosody 2002, International Conference, 2002.
[19] T. W. Smith, The Book of Human Emotions: An Encyclopedia of Feeling from Anger to Wanderlust, Profile Books, 2015.
[20] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu and H. Zhu, "A Comprehensive Survey on Transfer Learning," Proceedings of the IEEE, vol. 109, no. 1, pp. 43-76, 2020.
[21] J.-L. Li and C.-C. Lee, "An Enroll-to-Verify Approach for Cross-Task Unseen Emotion Class Recognition," IEEE Transactions on Affective Computing, 2022.
[22] R. Xia and Y. Liu, "A multi-task learning framework for emotion recognition using 2D continuous space," IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 3-14, 2015.
[23] R. Cai, K. Guo, B. Xu, X. Yang and Z. Zhang, "Meta Multi-task Learning for Speech Emotion Recognition," INTERSPEECH 2020, October 2020.
[24] G. Vrbančič and V. Podgorelec, "Transfer Learning With Adaptive Fine-Tuning," IEEE Access, vol. 8, pp. 196197-196211, 2020.
[25] S. Masoudnia and R. Ebrahimpour, "Mixture of experts: a literature survey," Artificial Intelligence Review, vol. 42, pp. 275-293, 2014.
[26] J. M. Joyce, "Kullback-Leibler Divergence," International Encyclopedia of Statistical Science, p. 720–722, 01 January 2014.
[27] C. Zhang and Y. Ma, Ensemble Machine Learning: Methods and Applications, Springer, 2012.
[28] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, December 2008.
[29] R. Lotfian and C. Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471-483, October-December 2019.

Related Theses

1. An Automated Scoring System of Couple Interaction Behavior Codes in Marital Therapy Based on Stacked Sparse Autoencoders Using Speech Features
2. Stroke Prediction Based on National Health Insurance Data, Using Hadoop as a Fast Feature-Extraction Tool
3. A New Framework for Full-Time Emotion Recognition Models Built on Human Thin-Slice Emotion Perception
4. Constructing an Automated Speech Scoring System for Reserve Principals Using Multi-Task and Multimodal Fusion Techniques
5. Building an Automated Scoring System for Reserve Principal Evaluation Based on Multimodal Active Learning to Analyze Relationships between Samples and Labels
6. Improving Speech Emotion Recognition by Incorporating fMRI BOLD Signals
7. Combining Multi-Level Convolutional Neural Network Features from fMRI to Improve Speech Emotion Recognition
8. Developing a Behavior-Measurement-Based Assessment System for Children with Autism Using an Embodied Conversational Interface
9. A Multimodal Continuous Emotion Recognition System and Its Application to Global Affect Recognition
10. Integrating Multi-Level Text Representations and Speech-Attribute Embeddings for Robust Automated Scoring of Reserve Principal Speeches
11. Using Joint Factor Analysis of Temporal Effects in Brain MRI to Improve Emotion Recognition
12. An LSTM-Based Assessment System for Identifying Children with Autism from ADOS Interviews
13. Automated Pain-Level Detection for Emergency Patients Using a Multimodal Model Combining CNN and LSTM Audio-Visual Features
14. Improving Automated Behavior Scoring for Marital Therapy Using a Bidirectional LSTM Architecture with Multi-Granularity Text Modalities
15. Improving Emotion Recognition on a Chinese Theatrical Performance Database Using Interaction Features from Performance Transcripts