Author (Chinese): 黃昱霖
Author (English): Huang, Yu-Lin
Thesis Title (Chinese): 透過屬性對齊策略進行語音表徵學習
Thesis Title (English): Attribute-Aligned Speech Representation Learning Strategy
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee Members: 洪樂文, 曹昱, 簡仁宗
Degree: Master
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061539
Year of Publication (ROC calendar): 110 (2021)
Graduation Academic Year: 110
Language: English
Number of Pages: 41
Keywords (Chinese): 語音表徵, 隱私, 公平性, 屬性對齊
Keywords (English): speech representation, privacy, fairness, attribute alignment
Abstract (Chinese): In recent years, the rapid development of speech technology has brought great convenience to our lives, but it has also given rise to many problems. Because speech signals carry a large amount of private information, careless use can inadvertently lead to privacy leakage and unfair predictions. In this thesis, we propose an attribute-aligned learning strategy that learns a speech representation whose dimensions are disentangled and ordered, so that in different scenarios these problems can be handled flexibly through a simple attribute-removal procedure. We propose two architectures. The first is a layered-representation variational autoencoder (LR-VAE), in which disentanglement and a pre-defined ordering encourage different attributes to be arranged according to their relevance to the associated tasks. The second is a feature-scoring variational autoencoder (FS-VAE), which, through disentanglement and a scoring mechanism, learns the importance of each dimension to a specific task. With attribute alignment, we know what information each dimension carries, so protecting a particular attribute only requires clearing the corresponding dimensions. In this thesis, we select an emotion corpus, MSP-Podcast, and validate the attribute-aligned learning strategy on two scenarios: an emotion recognition system that protects identity information, and a speaker verification system that protects emotion information. Compared with the current mainstream approach, adversarial learning, our method achieves competitive results on identity-removed emotion recognition and further improves emotion-removed speaker verification. Beyond the improved privacy protection, our attribute-aligned learning strategy supports flexible privacy-preserving applications, in which different private information can be removed and protected according to the specific scenario; and because only a single trained model and training process are required, the proposed method also reduces wasted computing resources.
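The scoring mechanism mentioned in the abstract can be illustrated with a minimal sketch: an attention-style scorer assigns each latent dimension a per-task importance weight, and dimensions are then sorted by that weight. The linear scorer and dimension sizes below are placeholder assumptions for illustration, not the thesis's actual FS-VAE configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

latent_dim = 8
z = rng.normal(size=latent_dim)   # one latent vector from the encoder
w = rng.normal(size=latent_dim)   # placeholder scorer weights (assumed)

scores = softmax(w * z)           # per-dimension importance, sums to 1
order = np.argsort(-scores)       # most- to least-important dimensions
z_sorted = z[order]               # attribute-sorted representation
```

Sorting by score is what makes the subsequent attribute removal cheap: once dimensions are ranked by task relevance, dropping the least (or most) relevant block is a contiguous slice rather than a learned transformation.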
Abstract (English): Advances in speech technology have brought convenience to our lives. However, several concerns are on the rise, as speech signals contain multiple personal attributes whose exposure can lead to privacy leakage or unfair prediction. In this thesis, we propose an attribute-aligned learning strategy to obtain a speech representation that can flexibly address these issues through an attribute-selection mechanism. We first propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into sensitive attributes and guides those attributes into a desired order. We then propose an attention-based feature-scoring variational autoencoder (FS-VAE), which disentangles the speech representation into mutually independent sensitive attributes, with an additional scoring machine that captures the importance of each dimension. With properly aligned attributes, we derive a non-sensitive representation by simple masking. We evaluate the proposed methods on two scenarios, identity-free speech emotion recognition (SER) and emotionless speaker verification (SV), using a large emotion corpus, the MSP-Podcast. Compared with the current state-of-the-art method based on adversarial representation learning, we show competitive performance on identity-free SER and improved results on emotionless SV. Moreover, our learning strategy supports multiple privacy-preserving tasks through simple attribute selection, and it reduces computing cost because only a single model and training process are required.
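The masking step described above can be sketched in a few lines: once the latent vector is partitioned into aligned attribute blocks, removing a sensitive attribute is just zeroing its block. The block names and sizes here are illustrative assumptions, not the thesis's actual layout.

```python
import numpy as np

# Hypothetical partition of a 96-dim latent vector into attribute blocks;
# the real dimension assignment is learned/defined by LR-VAE or FS-VAE.
BLOCKS = {
    "emotion": slice(0, 32),
    "speaker": slice(32, 64),
    "residual": slice(64, 96),
}

def mask_attributes(z, remove):
    """Return a copy of latent z with the listed attribute blocks zeroed."""
    out = z.copy()
    for name in remove:
        out[..., BLOCKS[name]] = 0.0
    return out

z = np.random.randn(4, 96)                 # batch of 4 latent vectors
z_priv = mask_attributes(z, ["speaker"])   # identity-free representation
```

Because removal is a fixed mask rather than a retrained model, the same trained encoder serves both scenarios (identity-free SER and emotionless SV) by selecting different blocks to zero.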
1 Introduction
2 Database and Features
2.1 Database
2.2 Features
3 Task 1: Human-Defined Weighting for Attribute Alignment
3.1 Layered Representation Variational Autoencoder
3.1.1 Variational Autoencoder
3.1.2 Layered Dropout with Adversarial Multitask Learning
3.2 Experimental Setup and Results
3.2.1 Experimental Setup
3.2.2 Sensitive Attribute Protection
3.2.3 Analysis of Aligned Attributes
3.2.4 Attribute Characteristics Study
4 Task 2: Attention-Based Feature-Scoring for Attribute Sorting
4.1 Feature Scoring Variational Autoencoder
4.1.1 Feature-Scoring Machine
4.1.2 Penalization Term
4.2 Experimental Setup and Results
4.2.1 Experimental Setup
4.2.2 Sensitive Attribute Protection
4.2.3 Sensitive Attribute Alignment
4.2.4 Ablation Study
5 Conclusion
References

Related Theses

1. An automated scoring system for couple-interaction behavior coding in marital therapy, built from speech features with stacked sparse autoencoders
2. Stroke prediction based on national health insurance data, using Hadoop as a fast feature-extraction tool
3. A new framework for all-time emotion recognition models built on human thin-slice emotion perception
4. Applying multitask and multimodal fusion to build an automatic scoring system for reserve principals' speeches
5. Building an automated scoring system for reserve-principal evaluation with multimodal active learning to analyze the relationship between samples and labels
6. Improving speech emotion recognition by incorporating fMRI BOLD signals
7. Combining multi-level features from fMRI-based convolutional neural networks to improve speech emotion recognition
8. Developing a behavior-measurement-based assessment system for children with autism on an embodied conversational interface
9. A multimodal continuous emotion recognition system and its application to global affect recognition
10. Integrating multi-level text representations and speech-attribute embeddings for robust automated scoring of reserve principals' speeches
11. Using joint factor analysis of temporal effects in brain MR neuroimaging to improve emotion recognition
12. An LSTM-based assessment system for identifying children with autism from ADOS interviews
13. Automated pain-level detection for emergency patients with a multimodal model combining CNN and LSTM audio-visual features
14. Improving automated behavior scoring in marital therapy with a bidirectional LSTM over multi-granularity text modalities
15. Improving emotion recognition on a Chinese theater-performance corpus with interaction features from performance transcripts