Detailed Record

Author (Chinese): 呂志娟
Author (English): Lu, Chih-Chuan
Title (Chinese): 利用多媒體資料建構的語音前端網路觀察情緒辨識重要資料
Title (English): Observe Critical Data in Emotion Recognition Using a Speech Front-End Network Learned from Media Data In-the-Wild
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee (Chinese): 曹昱、胡敏君、賴穎暉
Committee (English): Tsao, Yu; Hu, Min-Chun; Lai, Ying-Hui
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 106061525
Publication Year (ROC): 109 (2020)
Graduation Academic Year: 108
Language: Chinese
Pages: 41
Keywords (Chinese): 語音情緒辨識、卷積神經網路、語音前端網路、初始化微調
Keywords (English): speech emotion recognition, convolutional neural network, speech front-end network, initialization fine-tuning
Usage statistics:
  • Recommendations: 0
  • Views: 377
  • Rating: *****
  • Downloads: 61
  • Bookmarks: 0
Thanks to deep learning, speech emotion recognition has achieved increasingly impressive results in recent years. However, the complexity of emotion still makes emotion corpora difficult to collect: speech emotion data are hard to accumulate quickly, and they vary greatly across domains. Initialization followed by fine-tuning is a common remedy in deep learning, but a network pretrained purely on background media data remains too distant from speech emotion recognition; emotional guidance during initialization, or a more precise procedure during fine-tuning, is still needed. This thesis therefore proposes to learn an initialization speech front-end network for speech emotion recognition from a large amount of readily available media data, together with proxy arousal and valence labels derived from its audio and text information; fine-tuning on the target corpus is then assisted by a sampling method oriented toward this initialization network, yielding the final speech emotion recognition model. The results show that, with the aid of the speech front-end network and the sampling method, performance consistently surpasses random initialization by a clear margin.
The rapid development of deep learning has brought substantial progress to speech emotion recognition (SER). However, the complexity of emotion still makes it difficult to rapidly obtain large-scale annotated data and to handle the high variability across domains. The initialization and fine-tuning strategy is a common solution in deep learning research, yet a network pretrained purely on abundant media data still exhibits a large discrepancy from the SER problem; introducing emotional guidance helps bridge this gap. In this work, we propose to learn an initialization speech front-end network on large-scale media data collected in-the-wild, jointly with proxy arousal-valence labels that are multimodally derived from audio and text information; we then build the SER prediction model by fine-tuning with the assistance of an initialization-oriented sampling method. The results show that integrating the speech front-end network and the sampling method achieves better performance than random initialization.
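To make the initialization and fine-tuning strategy described in the abstract concrete, the following is a minimal Python (PyTorch) sketch, not the thesis implementation: a small CNN speech front-end is first trained on proxy arousal-valence classes derived from media data, and the same front-end weights are then carried over to initialize an emotion classifier that is fine-tuned on the target corpus. Module names, label sets, data loaders, and hyperparameters (e.g. SpeechFrontEnd, media_loader, target_loader, the four proxy classes) are illustrative assumptions.

import torch
import torch.nn as nn

class SpeechFrontEnd(nn.Module):
    # CNN front-end mapping a log-Mel spectrogram to an utterance-level embedding.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over frequency and time
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        return self.proj(self.conv(x).flatten(1))

# Stage 1: learn the initialization front-end on media data with proxy labels
# (e.g. arousal from rule-based acoustic cues, valence from a sentiment lexicon).
front_end = SpeechFrontEnd()
proxy_head = nn.Linear(128, 4)          # assumed: four arousal-valence classes
pretrain_model = nn.Sequential(front_end, proxy_head)
pre_opt = torch.optim.Adam(pretrain_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# for spec, proxy_label in media_loader:          # hypothetical DataLoader
#     pre_opt.zero_grad()
#     loss_fn(pretrain_model(spec), proxy_label).backward()
#     pre_opt.step()

# Stage 2: reuse the pretrained front-end as initialization and fine-tune an
# SER classifier on the (much smaller) target emotion corpus.
ser_head = nn.Linear(128, 4)            # assumed: angry / happy / neutral / sad
ser_model = nn.Sequential(front_end, ser_head)    # front-end weights carried over
ft_opt = torch.optim.Adam(ser_model.parameters(), lr=1e-4)   # lower LR for fine-tuning
# for spec, emotion_label in target_loader:       # hypothetical DataLoader
#     ft_opt.zero_grad()
#     loss_fn(ser_model(spec), emotion_label).backward()
#     ft_opt.step()

The initialization-oriented sampling method described in the abstract (Section 3.4.2 of the thesis) would additionally decide which target samples enter the fine-tuning loop; it is omitted from this sketch.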
Abstract (Chinese) ii
Abstract (English) iii
Acknowledgments iv
Table of Contents v
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
1.1 Overview 1
1.2 Motivation and Objectives 3
1.3 Thesis Organization 3
Chapter 2 Corpora and Preprocessing 5
2.1 Corpus Description 5
2.1.1 Background Corpus: TED-LIUM 5
2.1.2 Target Corpus: IEMOCAP 6
2.2 Data Preprocessing 6
2.2.1 Speech Data 6
2.2.2 Annotation Data 9
Chapter 3 Methodology 10
3.1 Proxy Labels 10
3.1.1 Rule-Based Arousal Labels 10
3.1.2 Lexicon-Based Valence Labels 11
3.2 Neural Networks 12
3.2.1 Deep Neural Network (DNN) 13
3.2.2 Convolutional Neural Network (CNN) 15
3.3 Initialization and Fine-Tuning 16
3.4 Speech Front-End Network Training and Application 17
3.4.1 Initialization Network 17
3.4.2 Sampling Method and Fine-Tuning Network 19
Chapter 4 Experimental Design and Result Analysis 21
4.1 Experimental Design 21
4.2 Experiment 1: Front-End Network Architecture 22
4.3 Experiment 2: Fine-Tuning on the Target Corpus with Limited Data 24
4.4 Experiment 3: Sampling Critical Data 25
4.5 Experiment 4: Varying Sampling Parameters 28
4.6 Analysis of Experimental Results 30
Chapter 5 Conclusion and Future Work 32
References 34

Related Theses

1. An Automated Scoring System for Couple Interaction Behavioral Codes in Couples Therapy Based on Stacked Sparse Autoencoders Using Speech Features
2. Stroke Prediction Based on National Health Insurance Data Using Hadoop as a Fast Feature-Extraction Tool
3. A New Framework for Full-Time Emotion Recognition Models Built on Human Thin-Slice Emotion Perception
4. Constructing an Automatic Scoring System for Prospective School Principals' Speeches Using Multi-Task Learning and Multimodal Fusion
5. Building an Automated Scoring System for Prospective School Principal Evaluation Using Multimodal Active Learning to Analyze the Relation between Samples and Labels
6. Improving Speech Emotion Recognition by Incorporating fMRI BOLD Signals
7. Improving Speech Emotion Recognition Using Multi-Level Convolutional Neural Network Features Combined with fMRI
8. Developing a Behavior-Measurement-Based Assessment System for Children with Autism on an Embodied Conversational Interface
9. A Multimodal Continuous Emotion Recognition System and Its Application to Global Affect Recognition
10. Integrating Multi-Level Text Representations and Speech-Attribute Embeddings in Representation Learning for Robust Automated Scoring of Prospective School Principals' Speeches
11. Using Joint Factor Analysis to Study Temporal Effects in Brain MRI for Improving Emotion Recognition
12. An LSTM-Based Assessment System for Identifying Children with Autism from ADOS Interviews
13. Automatic Detection of Pain Levels in Emergency Department Patients Using a Multimodal Model Combining CNN and LSTM Audio-Visual Features
14. Improving an Automated Behavior Scoring System for Couples Therapy with a Bidirectional LSTM Combining Text Modalities at Multiple Temporal Granularities
15. Improving Emotion Recognition on a Chinese Theatrical Performance Corpus Using Interaction Features from Performance Transcripts
 