
Detailed Record

Author (Chinese): 陳榆安
Author (English): Chen, Yu-An
Title (Chinese): 結合陣列信號處理與深度學習於語音活性偵測、音質提升與聲源定位之應用
Title (English): Learning-based Voice Activity Detection, Speech Enhancement, and Source Localization on the basis of Array Signal Processing
Advisor (Chinese): 白明憲
Advisor (English): Bai, Ming-Sian R.
Committee Members (Chinese): 鄭泗東
李昇憲
田孟軒
Committee Members (English): Cheng, Stone
Li, Sheng-Shian
Tien, Meng-Hsuan
Degree: Master's
University: National Tsing Hua University
Department: Department of Power Mechanical Engineering
Student ID: 109033535
Publication Year (ROC calendar): 111 (2022)
Graduation Academic Year: 110
Language: English
Number of Pages: 47
Keywords (Chinese): 語音活性偵測、多通道語音強化、聲源定位、深度學習
Keywords (English): Voice activity detection, Multi-channel speech enhancement, Source localization, Deep learning
In today's voice algorithms for consumer electronics, voice activity detection (VAD) is a key technology for deploying these algorithms in real-world environments while keeping costs down. VAD therefore often serves as a front-end system that decides whether speech is present and, on that basis, whether the back-end system should be activated. Such a front end must not consume excessive computation, so low algorithmic complexity is an essential design consideration. In addition, the key challenge for VAD in practical applications is to remain sensitive and robust in extremely noisy environments. Building on array signal processing, this study proposes a novel spatial feature and combines it with a deep learning model to develop a voice activity detection system. The results show that this feature enables the system to effectively detect transient sounds under extremely noisy background conditions. Moreover, the deep learning model uses only 881 parameters, achieving excellent VAD performance while meeting the low cost required for practical deployment.
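As a rough illustration of what a detector in this low-footprint regime could look like (this is not the thesis's 881-parameter architecture; the feature dimension, layer sizes, and the name TinyVAD are placeholder assumptions), a minimal recurrent classifier over per-frame spatial features might be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

# Minimal sketch (not the thesis model): a lightweight recurrent VAD head that
# consumes a per-frame spatial feature vector and outputs a per-frame
# speech-presence probability. Feature dimension and layer sizes are illustrative.
class TinyVAD(nn.Module):
    def __init__(self, feat_dim=8, hidden=16):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        h, _ = self.gru(feats)                # (batch, frames, hidden)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, frames)

model = TinyVAD()
# Parameter count of this sketch; the thesis reports a comparable budget (881).
print(sum(p.numel() for p in model.parameters()))
```

Printing the parameter count shows how quickly even a small GRU head reaches roughly a thousand parameters, which is the budget regime the abstract describes.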
In current research, multi-channel signal processing algorithms such as speech enhancement, source localization, and voice activity detection are often integrated into a single system, where they provide each other with feature information and improve overall performance. Among speech enhancement methods, the single-channel Deep Complex Convolution Recurrent Network (DCCRN), which introduces complex-valued operations into deep learning models, has proven highly effective. Inspired by this, we propose an advanced multi-channel DCCRN that estimates complex-valued filters, yielding a neural-network beamforming algorithm. These complex filters serve not only speech enhancement but also source localization: in the proposed system, the subsequent localization module, trained with a loss function based on a distortionless-response constraint toward a specific direction, localizes the source effectively, further strengthens the beamforming-based enhancement, and simultaneously provides voice activity information. The results show that the proposed complex-valued neural beamforming algorithm performs speech enhancement effectively, suppressing noise while preserving signal quality, and at the same time accomplishes source localization and voice activity detection.
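For readers unfamiliar with neural beamforming, the core filter-and-sum step that the estimated complex filters perform can be sketched as follows. This is a minimal NumPy illustration with random stand-ins for the network outputs and arbitrary array dimensions, not the MIMO-DCCRN itself:

```python
import numpy as np

# Filter-and-sum beamforming in the STFT domain: per time-frequency complex
# weights (here random placeholders for what a network would estimate) are
# applied to the multichannel spectrogram and summed across microphones.
M, F, T = 4, 257, 100                                          # mics, freq bins, frames (illustrative)
X = np.random.randn(M, F, T) + 1j * np.random.randn(M, F, T)   # multichannel STFT
W = np.random.randn(M, F, T) + 1j * np.random.randn(M, F, T)   # estimated complex filters

Y = np.sum(np.conj(W) * X, axis=0)                             # beamformed spectrogram, shape (F, T)
print(Y.shape)
```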
To meet the requirements of consumer electronics, there has been great interest in real-time audio processing algorithms with low complexity, low latency, and a low memory footprint. Voice activity detection (VAD) in the front end aims to determine when to initiate the speech-processing system. The key issue of VAD is how to maintain robust detection performance in the presence of background noise. In this thesis, a novel spatial feature is introduced. The spatial information is useful in detecting transient sounds in noisy environments. The proposed system proves more robust than systems that make no use of such information and exhibits remarkable detection performance with a lightweight model. It is a growing trend for multiple signal processing algorithms, including speech enhancement, source localization, and VAD, to be integrated into an all-in-one system. In this thesis, a neural beamformer consisting of a beamformer and a novel multichannel Deep Complex Convolution Recurrent Network (DCCRN) is proposed for joint enhancement and localization tasks. Complex-valued filters are estimated as the weights of the beamformer. The proposed network combines a multi-channel DCCRN and an auxiliary network to model the sound field, where a distortionless constraint is employed as a loss function. Experimental results show that the proposed complex-valued neural beamformer achieves strong enhancement, localization, and VAD performance, owing to the distortionless constraint.
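The distortionless constraint mentioned above can be turned into a training penalty in a straightforward way: for a steering vector d(f) toward the target direction, a distortionless beamformer satisfies w(f)^H d(f) = 1 at every frequency bin. A hedged sketch of such a loss term (the function name, tensor shapes, and random inputs are illustrative assumptions, not the thesis's exact formulation) follows:

```python
import torch

# Sketch of a distortionless-style penalty: penalize any deviation of the
# beamformer's response toward the target direction from unity.
def distortionless_loss(w, d):
    # w, d: complex tensors of shape (freq_bins, mics)
    response = torch.sum(torch.conj(w) * d, dim=-1)     # w(f)^H d(f), shape (freq_bins,)
    return torch.mean(torch.abs(response - 1.0) ** 2)   # mean squared deviation from 1

F_bins, M = 257, 4
w = torch.randn(F_bins, M, dtype=torch.complex64)       # placeholder filter weights
d = torch.randn(F_bins, M, dtype=torch.complex64)       # placeholder steering vectors
print(distortionless_loss(w, d).item())
```

In practice such a term would be added to an enhancement loss, so that the network is discouraged from distorting signals arriving from the estimated source direction.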
ABSTRACT (IN CHINESE) i
ABSTRACT ii
ACKNOWLEDGEMENTS iii
CONTENTS iv
CHAPTER 1 INTRODUCTION 1
CHAPTER 2 VOICE ACTIVITY DETECTION 7
2.1 Spatial Feature Extraction 8
2.2 VAD Network 11
CHAPTER 3 NEURAL BEAMFORMER 13
3.1 Multiple-Input-Multiple-Output Deep Complex Convolution Recurrent Network (MIMO-DCCRN) 15
3.2 Localization Module 18
3.2.1 The Signal Processing Localization Module (SPLM) 20
3.2.2 The Neural Localization Module (NLM) 21
3.3 Loss Functions 22
CHAPTER 4 RESULTS AND DISCUSSIONS 24
4.1 Voice Activity Detection (VAD) 24
4.1.1 Data Generation 24
4.1.2 Results 26
4.2 Neural Beamformer 32
4.2.1 Data Generation 32
4.2.2 Training Setup and Baselines 34
4.2.3 Speech Enhancement Performance 35
4.2.4 Localization and VAD Performance 40
CHAPTER 5 CONCLUSIONS 43
REFERENCES 44

[1] L. R. Rabiner and M. R. Sambur, “Voiced-unvoiced-silence detection using Itakura LPC distance measure”, Proc. IEEE ICASSP, pp. 323-326, May 1977.
[2] J. D. Hoyt and H. Wechsler, “Detection of human speech in structured noise”, Proc. IEEE ICASSP, pp. 237-240, May 1994.
[3] L. F. Lamel, L. R. Rabiner, A. E. Rosenberg and J. G. Wilpon, “An improved endpoint detector for isolated word recognition”, IEEE Transactions on Audio Speech and Language Processing, vol. ASSP-29, pp. 777-785, Aug. 1981.
[4] B. Kotnik, Z. Kacic and B. Horvat, “A multiconditional robust front-end feature extraction with a noise reduction procedure based on improved spectral subtraction algorithm”, Proc. 7th Eurospeech, pp. 197-200, Sep. 2001.
[5] J. A. Haigh and J. S. Mason, “Robust voice activity detection using cepstral features”, Proc. IEEE TENCON, pp. 321-324, 1993.
[6] N. B. Yoma, F. McInnes and M. Jack, “Robust speech pulse-detection using adaptive noise modeling”, Electron. Letters, vol. 32, July 1996.
[7] R. Tucker, “Voice activity detection using a periodicity measure”, Proc. Inst. Elect. Eng., vol. 139, no. 4, pp. 377-380, Aug. 1992.
[8] J. Sohn and W. Sung, “A voice activity detector employing soft decision based noise spectrum adaptation”, Proc. IEEE ICASSP, vol. 1, pp. 365-368, 1998.
[9] J. Sohn, N. S. Kim and W. Sung, “A statistical model-based voice activity detector”, IEEE Signal Processing Letters, vol. 6, pp. 1-3, Jan. 1999.
[10] D. Enqing, L. Guizhong, Z. Yatong and Z. Xiaodi, “Applying support vector machines to voice activity detection”, Proc. IEEE Int. Conf. Signal Processing, pp. 1124-1127, Aug. 2002.
[11] J. Ramírez, P. Yélamos, J. M. Górriz, J. C. Segura and L. García, “Speech/non-speech discrimination combining advanced feature extraction and SVM learning”, Proc. Interspeech, pp. 1662-1665, 2006.
[12] X.-L. Zhang and Ji Wu, “Deep belief networks based voice activity detection”, IEEE Transactions on Audio Speech and Language Processing, vol. 21, no. 4, pp. 697-710, 2013.
[13] T. Hughes and K. Mierle, “Recurrent neural networks for voice activity detection”, Proc. IEEE ICASSP, pp. 7378-7382, May 2013.
[14] S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions”, Proc. IEEE ICASSP, pp. 2519-2523, May 2014.
[15] Y. Guo, K. Li, Q. Fu, and Y. Yan, “A two-microphone based voice activity detection for distant-talking speech in wide range of direction of arrival”, Proc. IEEE ICASSP, pp. 4901-4904, 2012.
[16] J. Park et al., “Dual Microphone Voice Activity Detection Exploiting Interchannel Time and Level Differences”, IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1335-1339, 2016.
[17] J. DiBiase, H. Silverman and M. Brandstein, “Robust Localization in Reverberant Rooms”, Microphone Arrays, pp. 157-180, 2001.
[18] C. Macartney and T. Weyde, “Improved speech enhancement with the wave-u-net”, 2018.
[19] R. Giri, U. Isik and A. Krishnaswamy, “Attention wave-U-Net for speech enhancement”, Proc. WASPAA, pp. 4049-4053, 2019.
[20] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation”, IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[21] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition”, Proc. IEEE ICASSP, pp. 7092-7096, 2013.
[22] K. Tan and D. Wang, “A convolutional recurrent neural network for real-time speech enhancement”, Proc. Interspeech, pp. 3229-3233, 2018.
[23] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, et al., “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement”, Proc. Interspeech, pp. 2472-2476, 2020.
[24] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming”, Proc. IEEE ICASSP, pp. 196-200, 2016.
[25] B. D. van Veen and K. M. Buckley, “Beamforming: A versatile approach to spatial filtering”, IEEE ASSP Mag., vol. 5, no. 2, pp. 4-24, Apr. 1988.
[26] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition”, IEEE Transactions on Audio Speech and Language Processing, vol. 15, no. 5, pp. 1529-1539, 2007.
[27] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, “ADL-MVDR: All deep learning MVDR beamformer for target speech separation”, Proc. IEEE ICASSP, pp. 6089-6093, 2021.
[28] Y. Luo, E. Ceolini, C. Han, S.-C. Liu, and N. Mesgarani, “FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing”, Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[29] Z. Q. Wang and D. L. Wang, “All-neural multi-channel speech enhancement”, Proc. Interspeech, pp. 3234-3238, 2018.
[30] A. Li, W. Liu, C. Zheng, and X. Li, “Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement”, arXiv preprint arXiv:2109.00265, 2021.
[31] X. Ren, X. Zhang, L. Chen, X. Zheng, et al., “A causal U-Net based neural beamforming network for real-time multichannel speech enhancement”, Proc. Interspeech 2021, Sep. 2021.
[32] U. Hamid, R. A. Qamar, and K. Waqas, “Performance comparison of time-domain and frequency-domain beamforming techniques for sensor array processing”, Int. Bhurban Conference on Appl. Sci. & Technology, pp. 379-385, Jan. 2014.
[33] B. C. J. Moore, Introduction to the Psychology of Hearing, New York: Academic, 1977.
[34] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation”, IEEE Transactions on Audio Speech and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.
[35] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books”, in Proc. IEEE ICASSP, South Brisbane, Australia, pp. 5206–5210, 2015.
[36] Wake Word Data Sets: https://summalinguae.com/data-sets/wake-words/
[37] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, “A scalable noisy speech dataset and online subjective test framework”, Proc. Interspeech, pp. 1816-1820, 2019.
[38] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in Proc. Int. Society for Music Information Retrieval Conf., Suzhou, China, pp. 316–323, 2017.
[39] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small room acoustics,” JASA, vol. 65, pp. 943–950, 1979.
[40] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, and X. Serra, “Learning sound event classifiers from web audio with noisy labels”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 21-25, 2019.
[41] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs”, in Proc. IEEE ICASSP, Salt Lake City, Utah, USA, pp. 749-752, 2001.
[42] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. IEEE ICASSP, Dallas, pp. 4214–4217, 2010.
(Full text available for external access after 2027-07-24)