
Detailed Record

Author (Chinese): 陳思涵
Author (English): Chen, Ssu-Han
Title (Chinese): 基於時空資訊之多通道語者標記及分離
Title (English): Multichannel speaker diarization and separation using spatial-temporal information
Advisor (Chinese): 白明憲
Advisor (English): Bai, Ming-Sian R.
Committee Members (Chinese): 劉奕汶、簡仁宗
Committee Members (English): Liu, Yi-Wen; Chien, Jen-Tzung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 110061534
Year of Publication (ROC calendar): 112 (2023)
Graduation Academic Year: 111
Language: English
Number of Pages: 47
Keywords (Chinese): 空間資訊、語者標記、多通道盲源分離、深度學習
Keywords (English): Spatial information; Speaker diarization; Multichannel blind source separation; Deep learning
Statistics:
  • Recommendations: 0
  • Views: 128
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese, translated): In this thesis, we propose a multichannel system that combines two stages, speaker diarization and speech separation. The system consists of two neural network models and takes the temporal and spatial information obtained from signal preprocessing as input features, so that it can cope with challenging acoustic environments. In the diarization stage, we use a lightweight recurrent neural network model, which processes temporally ordered speech signals more efficiently. In this stage, both spatial and spectral information serve as input features: the former is the spatial correlation matrix combined with a "whitening" technique that strengthens the representation of this matrix, and the latter is the signal spectrogram compressed to the equivalent rectangular bandwidth (ERB) scale, which effectively reduces the number of parameters and the complexity of the model. In the separation stage, we adopt an efficient and practical convolutional recurrent neural network model and feed the result of the diarization stage into it as an intermediate guiding feature, making the system more robust. In this stage, we likewise use both spatial and spectral information as input features: for the spatial information, we choose the interchannel phase differences, a feature widely used in speaker localization algorithms; for the spectral information, we use the uncompressed signal spectrogram in order to achieve accurate speech separation. The proposed two-stage system performs speaker diarization and speech separation effectively under unknown room impulse responses, various microphone array configurations, and unseen noise environments, and is therefore highly robust. In addition, its computational complexity is lower than that of all other methods compared against it in this thesis. The experiments in this thesis further verify that the proposed system achieves excellent diarization and separation performance without requiring prior knowledge of the microphone array configuration or the target speakers, which is very important in practical applications.
Abstract (English): In this thesis, we propose a two-stage spatial-temporal all-neural speaker diarization and separation (TSTAR) system for adverse acoustic conditions. Spatial information across time frames is adopted as the input feature of the neural networks. In the diarization stage, we employ the "whitened" spatial correlation matrix and the equivalent rectangular bandwidth (ERB)-scaled signal from the reference microphone as input features for a lightweight diarization network. In the separation stage, we train a multichannel speaker separation model using interchannel phase differences (IPDs) and the framewise activity probabilities predicted in the diarization stage. The proposed TSTAR system demonstrates superior robustness to unseen room impulse responses (RIRs), array configurations, and noise types. Furthermore, it has lower computational complexity than the digital signal processing (DSP)-based and neural network (NN)-based baselines. Experimental results show that the TSTAR system achieves superior diarization and separation performance while remaining agnostic to the array configuration and the target speakers, which is highly desirable in real-world scenarios.
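The English abstract names two kinds of multichannel input features: interchannel phase differences (IPDs) and a "whitened" spatial correlation matrix. As a rough, hedged sketch only (not the thesis's actual implementation), the following NumPy/SciPy code shows one plausible way such features could be extracted, assuming 16 kHz audio, a 512-point STFT, microphone 0 as the reference channel, and "whitening" interpreted as normalizing each correlation matrix entry by the channel auto-powers; the function name extract_spatial_features and all parameter values are illustrative, not taken from the thesis.

# Hedged sketch, not the thesis's implementation: extraction of IPDs and a
# "whitened" (auto-power-normalized) spatial correlation matrix per T-F bin.
import numpy as np
from scipy.signal import stft

def extract_spatial_features(x, fs=16000, n_fft=512):
    """x: (n_mics, n_samples) multichannel time-domain signal."""
    # STFT of every channel; X has shape (n_mics, n_freq, n_frames).
    _, _, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft // 2, axis=-1)

    # Interchannel phase differences w.r.t. reference microphone 0,
    # shape (n_mics - 1, n_freq, n_frames).
    ipd = np.angle(X[1:] * np.conj(X[0:1]))

    # Spatial correlation matrix R[f, t] = X[:, f, t] X[:, f, t]^H,
    # shape (n_freq, n_frames, n_mics, n_mics).
    Xp = np.transpose(X, (1, 2, 0))
    R = Xp[..., :, None] * np.conj(Xp[..., None, :])

    # "Whitening" read here as normalization by the channel auto-powers,
    # turning R into a spatial coherence matrix with unit diagonal.
    power = np.real(np.einsum('...ii->...i', R))
    denom = np.sqrt(power[..., :, None] * power[..., None, :]) + 1e-8
    R_white = R / denom

    return ipd, R_white

Under this reading, the whitened matrix depends only on interchannel relations rather than absolute source power, which is one way to make the spatial feature less sensitive to level differences; the thesis's exact definition of whitening may differ.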
ABSTRACT (CHINESE) i
ABSTRACT ii
ACKNOWLEDGEMENTS iii
CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES vii
CHAPTER 1 INTRODUCTION 1
CHAPTER 2 SPEAKER DIARIZATION 5
2.1 DSP-based method: GLOSS algorithm 6
2.1.1 Feature Extraction 7
2.1.2 Source Counting and Diarization 8
2.2 The Proposed NN-based methods 9
2.2.1 Feature Extraction 10
2.2.2 Model Architecture 13
CHAPTER 3 DIARIZATION DRIVEN SPEAKER SEPARATION 15
3.1 The DSP-based Methods 16
3.1.1 GLOSS Algorithm: Binary Mask 16
3.1.2 Tikhonov Regularization (TIKR): Filter weight 17
3.2 The proposed NN-based Methods 19
3.2.1 Feature Extraction 20
3.2.2 The Model Architecture 20
3.3 The Hybrid Methods 24
CHAPTER 4 RESULTS AND DISCUSSIONS 25
4.1 Training and Testing sets 25
4.2 Speaker Diarization 29
4.2.1 Evaluation metrics 29
4.2.2 Results 30
4.3 Diarization driven Speaker Separation 33
4.3.1 Evaluation metrics 33
4.3.2 Results 33
4.4 Runtime of the system 41
CHAPTER 5 CONCLUSIONS 42
REFERENCES 43

(Full text will be open to external access after 2028-07-18.)