Author (Chinese): 陳舫慶
Author (English): Chen, Fang-Ching
Title (Chinese): 結合半監督式學習和領域適應以在多樣化環境中進行聲音事件檢測
Title (English): Combining Semi-Supervised Learning and Domain Adaptation for Sound Event Detection in Diverse Environments
Advisor (Chinese): 劉奕汶
Advisor (English): Liu, Yi-Wen
Committee Members (Chinese): 白明憲、賴穎輝
Committee Members (English): Bai, Ming-Sian; Lai, Ying-Hui
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 110061547
Publication Year: 112 (ROC calendar, i.e., 2023)
Graduating Academic Year: 112
Language: English
Pages: 48
Keywords (Chinese): 聲音事件檢測、半監督式學習、領域自適應
Keywords (English): Sound event detection; Semi-supervised learning; Domain adaptation
Abstract: Sound event detection (SED) is a challenging task that involves identifying the events occurring within an audio recording and precisely determining their onset and offset times. In prior work, various semi-supervised learning (SSL) strategies, such as pseudo-labeling (PL), shift consistency training (SCT), and interpolation consistency training (ICT), were explored to enhance the performance of a mean-teacher model. However, we observed that adversarial domain adaptation (ADA) failed to improve event detection accuracy when we attempted to integrate it with SSL. This thesis discusses the limitations of previous approaches and presents novel contributions. First, we propose improved SSL strategies. Additionally, we empirically observed that ICT tended to widen the gap between the distributions of synthetic and real data in t-SNE plots, which lowered overall performance; several modifications were therefore introduced to overcome this issue. Finally, the system was effectively combined with an ADA network, achieving an F1 score of 47.2% on the DCASE 2020 Task 4 dataset and surpassing previous work by 2.1%. Furthermore, we extended models developed for domestic sound event detection (DSED) to bird sound event detection (BSED) in natural environments. To this end, a strongly-labeled synthetic bird sound dataset covering 20 species was established. Subsequently, we conducted experiments on this dataset with the consistency training strategies and domain adaptation techniques that had proven effective for DSED. Evaluation on the Eastern North American bird sound dataset yielded an F1 score above 30%, approaching the upper-bound performance obtainable when strongly-labeled data are assumed to be available. Thus, the strategies proposed in this thesis show potential for advancing research in both indoor and outdoor SED.
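The two SSL components named in the abstract reduce to small update rules: the mean-teacher keeps an exponential moving average of the student's weights, and interpolation consistency penalizes a model whose prediction on a mixed input deviates from the same mix of its individual predictions. A minimal NumPy sketch, purely illustrative and not the thesis code (the linear `predict` stands in for the CRNN, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def ema_update(teacher_w, student_w, alpha=0.999):
    # Mean-teacher: after each training step, teacher weights track
    # an exponential moving average of the student weights.
    return alpha * teacher_w + (1.0 - alpha) * student_w

def ict_pair(x1, x2, predict, lam):
    # Interpolation consistency: the prediction on a mixed input
    # should match the same mix of the individual predictions.
    pred_mix = predict(lam * x1 + (1.0 - lam) * x2)
    target = lam * predict(x1) + (1.0 - lam) * predict(x2)
    return pred_mix, target

# Toy linear "model" so the sketch runs end to end; a real SED model
# would be a CRNN producing per-frame event probabilities.
w = rng.normal(size=4)
predict = lambda x: x @ w

x1, x2 = rng.normal(size=4), rng.normal(size=4)
pred_mix, target = ict_pair(x1, x2, predict, lam=0.3)
consistency_loss = np.mean((pred_mix - target) ** 2)
# A linear model satisfies the interpolation property exactly, so the
# loss is ~0 here; a nonlinear network pays a penalty for deviating.
```

In training, the consistency loss on unlabeled data is added to the supervised loss before the student update, and the teacher is then refreshed with `ema_update`.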
1 Introduction
1.1 Problem Statement
1.2 Thesis Organization
2 Related Work
2.1 Approaches in Sound Event Detection
2.2 Tasks Related to Bio-Acoustic Vocalization
3 Methods
3.1 Training Process
3.2 Feature Extraction
3.3 The Baseline Model
3.3.1 Convolutional Recurrent Neural Network (CRNN)
3.3.2 Feature-Pyramid Convolutional Recurrent Neural Network (FP-CRNN)
3.4 Semi-Supervised Learning (SSL)
3.4.1 The Mean-Teacher (MT) Approach
3.4.2 Pseudo-Labeling (PL)
3.4.3 Shift Consistency Mean-Teacher Training (SCMT)
3.4.4 Interpolation Consistency Training (ICT)
3.5 Domain Adaptation
3.5.1 The Adversarial Domain Adaptation (ADA) Approach
3.6 Post-Processing
4 Data Description
4.1 The DCASE 2020 Task 4 Dataset
4.2 The Eastern North American Bird Sound Dataset
4.3 Bird Synthetic Dataset
5 Experiments and Results
5.1 Metrics
5.2 Experimental Setup
5.2.1 Additional Data Pre-Processing Steps for the ENA Dataset
5.3 Evaluation on Domestic Sound Event Detection
5.4 Evaluation on Bird Sound Event Detection
6 Discussions
6.1 Domain Adaptation and Its Compatibility with Semi-Supervised Learning Strategies
6.1.1 t-SNE Visualization
6.1.2 Analysis of the Incompatibility Between the Semi-Supervised Learning Strategies and Domain Adaptation
6.2 Analysis of Applying the DSED Model to BSED
6.2.1 Direct Feature Visualization
6.2.2 Effectiveness of ADA in a Bio-Acoustic Environment
7 Conclusions
8 Future Works
References
Appendix A
A.1 Suggestions From the Oral Defense Committee
A.1.1 Prof. Bai, Ming-Sian
A.1.2 Prof. Lai, Ying-Hui
A.1.3 Prof. Liu, Yi-Wen