Author: Chiu, Chun-Chia
Title: A Weighted Adversarial Denoiser for Improving Speech Location Privacy Protection
Advisor: Lee, Chi-Chun
Committee Members: Lai, Ying-Hui; Chi, Tai-Shih
Keywords: speech location privacy; adversarial learning; denoise model; weighted adversarial denoiser (WAD); speech emotion recognition (SER); speaker verification (SV)
Speech carries a wealth of useful information that supports a variety of applications, such as communication, speech emotion recognition (SER), automatic speech recognition (ASR), speaker verification (SV), and disease recognition. However, speech also carries private personal information, such as the speaker's identity and location. When we share our speech with others, this private information may be disclosed; in other words, speech privacy leakage occurs. Once speech privacy is revealed, it may have a negative impact on individuals or on society. How to use speech effectively while protecting personal privacy has therefore become an important issue. Speech privacy spans many categories, yet very little literature focuses on location privacy leaked through ambient noise. Because speakers are usually in places with ambient noise, the speaker's location can often be identified from that noise, so speech location privacy leaks easily.
Previously, speech location privacy could be protected either by using an adversarial learning method to remove location information from the speech representation, or by using a denoise model to eliminate the ambient noise directly. However, the adversarial learning method outputs representations rather than audio files, so its applicability and its performance on recognition tasks are poor, and the denoise model is not optimized for speech location privacy. To better protect speech location privacy, we propose a weighted adversarial denoiser (WAD). The framework addresses the shortcomings of the adversarial learning method and the denoise model at the same time: it generates audio files and is also optimized for speech location privacy. Specifically, we use a denoising autoencoder to generate audio files and add an adversarial transformer, which drives the denoising autoencoder to produce audio files from which the location is harder to recognize. In addition, we apply a weight mechanism to further improve the effect. The speech location privacy problem arises in a variety of recognition tasks, of which speech emotion recognition (SER) and speaker verification (SV) are among the more common. We therefore use these two recognition tasks to verify the performance of the proposed framework.
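The interaction between the denoising objective and the adversarial location objective can be illustrated with a minimal sketch. It assumes the adversarial branch is attached through a gradient reversal layer (the table of contents lists a GRL section) and that the weight is a single scalar `alpha`. The function names, the flat-list gradient representation, and the scalar weighting are illustrative assumptions, not the thesis's actual implementation or weighting scheme.

```python
def grl_backward(upstream_grad, alpha):
    """Backward pass of a gradient reversal layer (GRL).

    In the forward pass a GRL is the identity; in the backward pass it
    multiplies the incoming gradient by -alpha. `alpha` is a hypothetical
    scalar weight used only for this illustration.
    """
    return [-alpha * g for g in upstream_grad]


def denoiser_gradient(grad_denoise, grad_location, alpha):
    """Gradient reaching the denoiser parameters in this toy setup.

    grad_denoise:  gradient of the denoising loss w.r.t. the parameters
    grad_location: gradient of the location-classification loss w.r.t.
                   the same parameters, before reversal
    The denoiser follows the denoising gradient while moving against the
    location gradient, so its output becomes harder to classify by location.
    """
    reversed_grad = grl_backward(grad_location, alpha)
    return [gd + gr for gd, gr in zip(grad_denoise, reversed_grad)]


# Toy example with two parameters.
print(denoiser_gradient([1.0, 2.0], [0.5, -1.0], alpha=2.0))  # [0.0, 4.0]
```

With `alpha = 0` the adversarial term vanishes and the sketch reduces to a plain denoiser; larger values trade denoising fidelity for stronger suppression of location cues.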
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Speech Location Privacy
1.2 Previous Works
1.2.1 Adversarial Learning
1.2.2 Denoise Model
1.3 In This Work
Chapter 2 Methodology
2.1 Data Preprocessing
2.1.1 BandMask
2.1.2 TimeShift
2.2 Denoising Autoencoder
2.2.1 DEMUCS
2.2.2 L1 Loss and STFT Loss
2.3 Adversarial Transformer
2.3.1 GRL
2.3.2 MHAN
2.3.3 "Weight"
2.3.4 Location Loss
Chapter 3 Task Definition
3.1 Noisy Speech Dataset
3.1.1 Clean Speech Dataset + Noise Dataset
IEMOCAP + TUT2018
MSP-PODCAST + TUT2018
3.1.2 Target SNR for Noisy Speech Dataset
3.1.3 Brief Summary for Noisy Speech Dataset
3.2 Denoise Model
3.3 Location Model
3.4 Emotion Model
3.5 SV Model
Chapter 4 Experiments
4.1 Experimental Setup
4.1.1 Noisy Speech Dataset
4.1.2 Model
4.2 Exp
4.2.1 Exp 1: Comparison of Different Denoise Models
4.2.2 Exp 2: Analysis of the Effect of Adversarial Transformer and Weight
4.2.3 Exp 3: Comparison of Different αmax
4.2.4 Exp 4: Comparison of Different Tasks for MHAN
4.2.5 Exp 5: Comparison of Different Architectures for MHAN
4.2.6 Exp 6: Analysis of the Effect of Pretrained Model
4.3 Additional Information
4.3.1 The Mode of the Wrong Prediction
4.3.2 Small Noise
4.3.3 Unseen Location Class
Chapter 5 Analysis
Chapter 6 Conclusion and Future Work
References
