Author (Chinese): 黃昱霖
Author (English): Huang, Yu-Lin
Thesis Title (Chinese): 透過屬性對齊策略進行語音表徵學習
Thesis Title (English): Attribute-Aligned Speech Representation Learning Strategy
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee Members: 洪樂文, 曹昱, 簡仁宗
Degree: Master
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061539
Year of Publication (ROC calendar): 110 (2021)
Graduation Academic Year: 110
Language: English
Number of Pages: 41
Keywords (Chinese): 語音表徵, 隱私, 公平性, 屬性對齊
Keywords (English): speech representation, privacy, fairness, attribute alignment
Abstract (Chinese): In recent years, the rapid development of speech technology has brought great convenience to our lives, but it has also given rise to many problems. Because speech signals carry a large amount of private information, careless use can inadvertently lead to privacy leakage and unfair predictions. In this thesis, we propose an attribute-aligned learning strategy that learns a speech representation whose dimensions are disentangled and ordered, so that in different scenarios these problems can be handled flexibly through a simple attribute-removal procedure. We propose two architectures. The first is a layered-representation variational autoencoder (LR-VAE), in which disentanglement and a pre-defined ordering encourage different attributes to be arranged according to their relevance to the associated tasks. The second is a feature-scoring variational autoencoder (FS-VAE), which, through disentanglement and a scoring mechanism, learns the importance of each dimension to a specific task. With attribute alignment, we know what information each dimension carries, so protecting a particular attribute only requires clearing the corresponding dimensions. In this thesis, we select an emotion corpus, MSP-Podcast, and validate the attribute-aligned learning strategy on two scenarios: an emotion recognition system that protects identity information, and a speaker verification system that protects emotion information. Compared with the current mainstream approach, adversarial learning, our method achieves competitive results on identity-removed emotion recognition and further improves emotion-removed speaker verification. Beyond the improved privacy protection, our attribute-aligned learning strategy supports flexible privacy-preserving applications, in which different private information can be removed and protected according to the specific scenario; and because only a single trained model and training process are required, the proposed method also reduces wasted computing resources.
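The scoring mechanism mentioned in the abstract can be illustrated with a minimal sketch: an attention-style scorer assigns each latent dimension a per-task importance weight, and dimensions are then sorted by that weight. The linear scorer and dimension sizes below are placeholder assumptions for illustration, not the thesis's actual FS-VAE configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

latent_dim = 8
z = rng.normal(size=latent_dim)   # one latent vector from the encoder
w = rng.normal(size=latent_dim)   # placeholder scorer weights (assumed)

scores = softmax(w * z)           # per-dimension importance, sums to 1
order = np.argsort(-scores)       # most- to least-important dimensions
z_sorted = z[order]               # attribute-sorted representation
```

Sorting by score is what makes the subsequent attribute removal cheap: once dimensions are ranked by task relevance, dropping the least (or most) relevant block is a contiguous slice rather than a learned transformation.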
Abstract (English): Advances in speech technology have brought convenience to our lives. However, several concerns are on the rise, as speech signals contain multiple personal attributes whose exposure can lead to privacy leakage or unfair prediction. In this thesis, we propose an attribute-aligned learning strategy to obtain a speech representation that can flexibly address these issues through an attribute-selection mechanism. We first propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into sensitive attributes and guides those attributes into a desired order. We then propose an attention-based feature-scoring variational autoencoder (FS-VAE), which disentangles the speech representation into mutually independent sensitive attributes, with an additional scoring machine that captures the importance of each dimension. With properly aligned attributes, we derive a non-sensitive representation by simple masking. We evaluate the proposed methods on two scenarios, identity-free speech emotion recognition (SER) and emotionless speaker verification (SV), using a large emotion corpus, the MSP-Podcast. Compared with the current state-of-the-art method based on adversarial representation learning, we show competitive performance on identity-free SER and improved results on emotionless SV. Moreover, our learning strategy supports multiple privacy-preserving tasks through simple attribute selection, and it reduces computing cost because only a single model and training process are required.
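The masking step described above can be sketched in a few lines: once the latent vector is partitioned into aligned attribute blocks, removing a sensitive attribute is just zeroing its block. The block names and sizes here are illustrative assumptions, not the thesis's actual layout.

```python
import numpy as np

# Hypothetical partition of a 96-dim latent vector into attribute blocks;
# the real dimension assignment is learned/defined by LR-VAE or FS-VAE.
BLOCKS = {
    "emotion": slice(0, 32),
    "speaker": slice(32, 64),
    "residual": slice(64, 96),
}

def mask_attributes(z, remove):
    """Return a copy of latent z with the listed attribute blocks zeroed."""
    out = z.copy()
    for name in remove:
        out[..., BLOCKS[name]] = 0.0
    return out

z = np.random.randn(4, 96)                 # batch of 4 latent vectors
z_priv = mask_attributes(z, ["speaker"])   # identity-free representation
```

Because removal is a fixed mask rather than a retrained model, the same trained encoder serves both scenarios (identity-free SER and emotionless SV) by selecting different blocks to zero.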
1 Introduction
2 Database and Features
2.1 Database
2.2 Features
3 Task 1: Human-Defined Weighting for Attribute Alignment
3.1 Layered Representation Variational Autoencoder
3.1.1 Variational Autoencoder
3.1.2 Layered Dropout with Adversarial Multitask Learning
3.2 Experimental Setup and Results
3.2.1 Experimental Setup
3.2.2 Sensitive Attribute Protection
3.2.3 Analysis of Aligned Attributes
3.2.4 Attribute Characteristics Study
4 Task 2: Attention-Based Feature-Scoring for Attribute Sorting
4.1 Feature Scoring Variational Autoencoder
4.1.1 Feature-Scoring Machine
4.1.2 Penalization Term
4.2 Experimental Setup and Results
4.2.1 Experimental Setup
4.2.2 Sensitive Attribute Protection
4.2.3 Sensitive Attribute Alignment
4.2.4 Ablation Study
5 Conclusion
References

Related Theses

1. An automated scoring system for couple-interaction behavior coding in marital therapy, built from speech features with stacked sparse autoencoders
2. Stroke prediction based on national health insurance data, using Hadoop as a fast feature-extraction tool
3. A new framework for all-time emotion recognition models built on human thin-slice emotion perception
4. Applying multitask and multimodal fusion to build an automatic scoring system for reserve principals' speeches
5. Building an automated scoring system for reserve-principal evaluation with multimodal active learning to analyze the relationship between samples and labels
6. Improving speech emotion recognition by incorporating fMRI BOLD signals
7. Combining multi-level features from fMRI-based convolutional neural networks to improve speech emotion recognition
8. Developing a behavior-measurement-based assessment system for children with autism on an embodied conversational interface
9. A multimodal continuous emotion recognition system and its application to global affect recognition
10. Integrating multi-level text representations and speech-attribute embeddings for robust automated scoring of reserve principals' speeches
11. Using joint factor analysis of temporal effects in brain MR neuroimaging to improve emotion recognition
12. An LSTM-based assessment system for identifying children with autism from ADOS interviews
13. Automated pain-level detection for emergency patients with a multimodal model combining CNN and LSTM audio-visual features
14. Improving automated behavior scoring in marital therapy with a bidirectional LSTM over multi-granularity text modalities
15. Improving emotion recognition on a Chinese theater-performance corpus with interaction features from performance transcripts