Author (Chinese): 邱俊嘉
Author (English): Chiu, Chun-Chia
Title (Chinese): 增強語音位置隱私保護之權重對抗式去噪器
Title (English): A Weighted Adversarial Denoiser for Improving Speech Location Privacy Protection
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee members (Chinese): 賴穎暉, 冀泰石
Committee members (English): Lai, Ying-Hui; Chi, Tai-Shih
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 109061517
Year of publication (ROC calendar): 111 (2022)
Graduation academic year: 111
Language: Chinese
Number of pages: 47
Keywords: speech location privacy, adversarial learning, denoise model, weighted adversarial denoiser (WAD), speech emotion recognition (SER), speaker verification (SV)
Speech carries a great deal of useful information that supports applications such as communication, speech emotion recognition (SER), automatic speech recognition (ASR), speaker verification (SV), and disease detection. However, speech also contains personal private information, such as the speaker's identity and location. When we hand our speech over to others, that private information is disclosed along with it, which creates the problem of speech privacy leakage. Once speech privacy is leaked, it can negatively affect individuals or society as a whole, so effectively using speech while protecting personal privacy has become an important issue. Speech privacy covers many categories, yet the location privacy leaked through background ambient sound has received very little attention in the literature. Moreover, a speaker is usually in a place with its own ambient sound, so identifying where the speaker is located is quite feasible, and speech location privacy is therefore easy to leak. Previously, speech location privacy could be protected either with adversarial learning methods that remove location information from speech representations, or with denoise models that directly remove the background ambient sound. However, adversarial learning methods output representations rather than audio files, so their applicability and downstream recognition performance are limited, while denoise models are not optimized for location privacy. To better protect speech location privacy, this thesis proposes a weighted adversarial denoiser (WAD), a framework that addresses both shortcomings at once: it produces audio files and is optimized for location privacy. Specifically, a denoising autoencoder generates the audio while an adversarial transformer, which takes the location labels into account, pushes the autoencoder to produce audio whose location is harder to recognize; a weighting mechanism further regulates this adversarial effect. Speech location privacy issues arise in many recognition tasks, the most common being speech emotion recognition and speaker verification, so these two tasks are used to evaluate the proposed framework.
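The abstract describes the core training signal of the weighted adversarial denoiser: a denoising autoencoder reconstructs clean speech while a location classifier, attached through a gradient reversal layer (GRL), adversarially pushes location cues out of the denoised audio, with a weight scaling the adversarial term. The sketch below is a minimal PyTorch illustration of that signal, not the thesis implementation: the handles denoiser and location_clf, the single L1 reconstruction term (the thesis also uses an STFT loss), and the fixed alpha and weight arguments (the thesis schedules the GRL coefficient up to an αmax and pairs a DEMUCS-style denoiser with a multi-head attention location branch) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales the gradient by -alpha in the backward pass."""
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is returned for alpha (a plain float).
        return -ctx.alpha * grad_output, None


def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)


def wad_training_step(denoiser, location_clf, optimizer,
                      noisy_wav, clean_wav, location_label,
                      alpha=1.0, weight=0.1):
    """One hypothetical training step of a weighted adversarial denoiser.

    denoiser:      waveform-to-waveform denoising autoencoder (e.g. a DEMUCS-style model)
    location_clf:  classifier predicting the acoustic-scene / location label
    alpha:         gradient reversal coefficient (scheduled up to some alpha_max in practice)
    weight:        weight on the adversarial location loss
    """
    optimizer.zero_grad()

    denoised = denoiser(noisy_wav)                    # enhanced waveform
    recon_loss = F.l1_loss(denoised, clean_wav)       # L1 reconstruction loss (STFT loss omitted here)

    # The location classifier sees the denoised audio through the GRL, so its
    # gradient pushes the denoiser toward outputs whose location is harder to recognize.
    loc_logits = location_clf(grad_reverse(denoised, alpha))
    loc_loss = F.cross_entropy(loc_logits, location_label)

    total_loss = recon_loss + weight * loc_loss
    total_loss.backward()
    optimizer.step()
    return recon_loss.item(), loc_loss.item()
```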
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Speech Location Privacy
1.2 Previous Works
1.2.1 Adversarial Learning
1.2.2 Denoise Model
1.3 In This Work
Chapter 2 Methodology
2.1 Data Preprocessing
2.1.1 BandMask
2.1.2 TimeShift
2.2 Denoising Autoencoder
2.2.1 DEMUCS
2.2.2 L1 Loss and STFT Loss
2.3 Adversarial Transformer
2.3.1 GRL
2.3.2 MHAN
2.3.3 Weight
2.3.4 Location Loss
Chapter 3 Task Definition
3.1 Noisy Speech Dataset
3.1.1 Clean Speech Dataset + Noise Dataset
3.1.1.1 IEMOCAP + TUT2018
3.1.1.2 MSP-PODCAST + TUT2018
3.1.2 Target SNR for Noisy Speech Dataset
3.1.3 Brief Summary for Noisy Speech Dataset
3.2 Denoise Model
3.3 Location Model
3.4 Emotion Model
3.5 SV Model
Chapter 4 Experiments
4.1 Experimental Setup
4.1.1 Noisy Speech Dataset
4.1.2 Model
4.2 Exp
4.2.1 Exp 1: Comparison of Different Denoise Models
4.2.2 Exp 2: Analysis of the Effect of Adversarial Transformer and Weight
4.2.3 Exp 3: Comparison of Different αmax
4.2.4 Exp 4: Comparison of Different Tasks for MHAN
4.2.5 Exp 5: Comparison of Different Architectures for MHAN
4.2.6 Exp 6: Analysis of the Effect of Pretrained Model
4.3 Additional Information
4.3.1 The Mode of the Wrong Prediction
4.3.2 Small Noise
4.3.3 Unseen Location Class
Chapter 5 Analysis
Chapter 6 Conclusion and Future Work
References
 
 
 
 

Related theses

1. Building a robust speech emotion recognition model with cycle-consistent adversarial network data augmentation and a gradient-aware purification defense mechanism
2. An automatic scoring system for couple interaction behavior ratings in marital therapy based on stacked sparse autoencoders and speech features
3. A study on stroke prediction from National Health Insurance data, using Hadoop as a fast feature-extraction tool
4. A new framework for full-time emotion recognition built on human thin-slice emotion perception
5. Constructing an automatic scoring system for reserve principals' speeches using multi-task and multimodal fusion techniques
6. Building an automatic scoring system for reserve principal evaluation with multimodal active learning to analyze the relationship between samples and labels
7. Improving speech emotion recognition by incorporating fMRI BOLD signals
8. Improving speech emotion recognition with multi-level convolutional neural network features from fMRI
9. Developing a behavior-measurement-based assessment system for children with autism using an embodied conversational interface
10. A multimodal continuous emotion recognition system and its application to global affect recognition
11. Integrating multi-level text representations and speech-attribute-embedded representation learning for robust automatic scoring of reserve principals' speeches
12. Using joint factor analysis to study temporal effects in brain MR neuroimaging to improve emotion recognition
13. An LSTM-based assessment system for identifying children with autism from Autism Diagnostic Observation Schedule interviews
14. Automatic pain-level detection for emergency patients using a multimodal model combining CNN and LSTM audio-visual features
15. Improving automatic behavior scoring in marital therapy with a bidirectional LSTM architecture mixing multiple temporal granularities of the text modality
 