Author (Chinese): 高智遠
Author (English): Koh, Chih-Yuan
Thesis title (Chinese): 弱監督聲音事件偵測模型訓練策略
Thesis title (English): Weakly Supervised Sound Event Detection Model Training Strategies
Advisor (Chinese): 劉奕汶
Advisor (English): Liu, Yi-Wen
Committee members (Chinese): 陳宜欣、白明憲、蘇文鈺
Committee members (English): Chen, Yi-Shin; Bai, Ming-Sian; Su, Wen-Yu
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061515
Year of publication (ROC calendar): 110
Graduation academic year: 109
Language: English
Number of pages: 53
Keywords (Chinese): 聲音事件辨識、弱監督式學習、半監督式學習、領域自適應
Keywords (English): Sound event detection, Weakly supervised learning, Semi-supervised learning, Domain adaptation
Sound event detection is the task of classifying the sound events that appear in an audio clip and estimating when they occur. With the recent rise of artificial intelligence and big data, data-driven approaches have been widely adopted to solve this problem: given a large amount of audio together with annotations of the corresponding sound event classes and time boundaries, a highly accurate detection system can be trained. However, because producing complete labels is expensive, training detection systems with weakly-labeled and unlabeled data has gradually become the main research trend. To make effective use of the large amount of unlabeled data, we propose three semi-supervised learning strategies. The first is interpolation consistency training (ICT), which helps the model learn from ambiguous samples that lie between two classes; the second is shift consistency training (SCT), which improves the model's robustness to sound events occurring at different time positions; the third is weakly pseudo-labeling, which uses a separately trained audio tagging system to generate weak labels for the unlabeled data. Prior work has shown that adding synthetic strongly-labeled data to training further improves performance, which motivates us to study domain adaptation methods that aim to align the distributions of real and synthetic data; this work proposes a two-stage adversarial domain adaptation strategy applied to both clip-level and frame-level features. In addition, we propose a novel model architecture that learns temporal information at different scales, the feature-pyramid convolution recurrent neural network (FP-CRNN), as one of the models used to evaluate the training strategies. All experiments in this study are conducted on the dataset provided by DCASE 2020 Task 4, and the results confirm that the proposed training strategies help improve model performance. Finally, analyses of several of the training strategies also validate their effects.
Sound event detection (SED) is the task of classifying and localizing the occurrences of sound events in an audio clip. Data-driven approaches are widely used as a solution; that is, an SED system is trained with a large amount of audio recordings annotated with the sound event classes present and their temporal boundaries. Due to the high cost of large-scale strong labeling, SED using weakly-labeled and unlabeled data has drawn increasing attention in recent years. To exploit the large amount of unlabeled in-domain data efficiently, we apply three semi-supervised learning strategies. First, interpolation consistency training (ICT) allows a model to distinguish ambiguous samples. Second, shift consistency training (SCT) increases a model's robustness to the positions of sound events within an audio clip. Third, weakly pseudo-labeling provides reliable weak pseudo-labels for unlabeled data, generated by an additional well-trained audio tagging (AT) system. Previous research has verified that adding synthetic strongly-labeled data further improves performance, which inspires us to investigate domain adaptation, whose aim is to align the distributions of data in the real and synthetic domains. Two-stage adversarial domain adaptation methods on clip-level and frame-level features are proposed. In addition, a novel network architecture, the feature-pyramid convolution recurrent neural network (FP-CRNN), which leverages temporal information by utilizing features at different scales, is proposed as one of the backbone models. Experiments are conducted on DCASE 2020 Task 4, and the results show that the proposed training strategies help models obtain better performance. Finally, we carry out several analyses of the training strategies and validate their effects.
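
The two consistency objectives named in the abstracts, ICT and SCT, can be pictured with a short sketch. The snippet below is a minimal illustration under stated assumptions rather than the thesis's actual implementation: it assumes a mean-teacher setup in which student and teacher networks map a batch of log-mel features of shape (batch, time, frequency) to frame-level posteriors of the same temporal length, and the names ict_loss, sct_loss, alpha and max_shift are illustrative choices, not values taken from the thesis.

import torch
import torch.nn.functional as F

def ict_loss(student, teacher, x_unlabeled, alpha=0.5):
    # Interpolation consistency: the student's prediction on a mixed input
    # should match the same mixture of the teacher's predictions (mixup-style).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x_unlabeled.size(0))
    x_mix = lam * x_unlabeled + (1.0 - lam) * x_unlabeled[perm]
    with torch.no_grad():
        y_teacher = teacher(x_unlabeled)
        y_target = lam * y_teacher + (1.0 - lam) * y_teacher[perm]
    return F.mse_loss(student(x_mix), y_target)

def sct_loss(student, teacher, x_unlabeled, max_shift=16):
    # Shift consistency: predictions on a time-shifted clip should equal the
    # unshifted predictions shifted by the same number of frames. A circular
    # shift (torch.roll) along the time axis is used here purely for brevity.
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    x_shift = torch.roll(x_unlabeled, shifts=shift, dims=1)
    with torch.no_grad():
        y_target = torch.roll(teacher(x_unlabeled), shifts=shift, dims=1)
    return F.mse_loss(student(x_shift), y_target)

In a mean-teacher framework, consistency terms of this kind are typically added to the supervised classification loss with a ramp-up weight, while the teacher's weights are kept as an exponential moving average of the student's.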
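
The adversarial domain adaptation on clip-level and frame-level features can likewise be sketched. The code below shows one common realization, a domain classifier trained through a gradient reversal layer (GRL) on clip-level embeddings; the class names GradReverse and DomainClassifier, the hidden size, and the choice of a GRL are illustrative assumptions, and the thesis's exact two-stage procedure is not reproduced here.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lam in the
    # backward pass, so the feature extractor is pushed toward embeddings
    # that the domain classifier cannot separate.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    # Predicts whether a clip-level embedding comes from the real or the
    # synthetic domain; a frame-level variant would score every time step.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, clip_embedding, lam=1.0):
        return self.net(GradReverse.apply(clip_embedding, lam))

The domain logit would be trained with a binary cross-entropy loss against real/synthetic labels and added to the SED objective; how that loss is scheduled across the two training stages follows the procedure described in the thesis and is not shown here.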
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Works 4
2.1 Multiple Instance Learning (MIL) . . . . . . . . . . . . . . . . . . 4
2.2 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Methods 7
3.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Convolutional Recurrent Neural Network (CRNN) . . . . . 10
3.2.2 Feature-pyramid Convolution Recurrent Neural Network (FP-CRNN) . . 11
3.2.3 Model Training . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Semi-supervised Learning Strategies . . . . . . . . . . . . . . . . . 14
3.3.1 Mean-teacher approach . . . . . . . . . . . . . . . . . . . . 14
3.3.2 Interpolation Consistency Training (ICT) . . . . . . . . . . 16
3.3.3 Shift Consistency Training (SCT) . . . . . . . . . . . . . . 18
3.3.4 Weakly Pseudo-labeling . . . . . . . . . . . . . . . . . . . 19
3.4 Domain Adaptation Methods . . . . . . . . . . . . . . . . . . . . . 20
3.4.1 Adversarial Domain Adaptation . . . . . . . . . . . . . . . 21
3.4.2 Domain Adaptation at Clip- and Frame-level . . . . . . . . 22
3.4.3 Two-stage Domain Adaptation . . . . . . . . . . . . . . . . 24
3.5 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Experiments and Results 27
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.1 Different Network Architectures . . . . . . . . . . . . . . . 30
4.4.2 Semi-supervised Learning Strategies . . . . . . . . . . . . . 31
4.4.3 Domain Adaptation Methods . . . . . . . . . . . . . . . . . 32
4.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.1 Event’s Position Analysis for SCT . . . . . . . . . . . . . . 34
4.5.2 Embedded Feature Analysis for Domain Adaptation . . . . . 35
5 Conclusions 40
6 Future Works 42
References 43
Appendix 46
Appendix A 46
A.1 Discussion on Mutual Compatibility between Domain Adaptation
and Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . 46
A.1.1 Embedded Feature Analysis on the Proposed Semi-supervised
Learning Strategies . . . . . . . . . . . . . . . . . . . . . . 46
A.1.2 The Survey for Mutual Compatibility between Consistency
Regularization and Adversarial Domain Adaptation . . . . . 51
A.2 Suggestions From the Oral Defense Committees . . . . . . . . . . . 53
A.2.1 Prof. Chen, Yi-Shin . . . . . . . . . . . . . . . . . . . . . . 53
A.2.2 Prof. Bai, Ming-Sian . . . . . . . . . . . . . . . . . . . . . . 53
A.2.3 Prof. Su, Wen-Yu . . . . . . . . . . . . . . . . . . . . . . . . 53
A.2.4 Prof. Liu, Yi-Wen . . . . . . . . . . . . . . . . . . . . . . . 53