Detailed Record

Author (Chinese): 許逸誠
Author (English): Hsu, Yi-Cheng
Title (Chinese): 基於深度學習聲學陣列信號處理之語音增強及遠距雙耳音訊重現
Title (English): Deep learning-based acoustic array signal processing for speech enhancement and binaural audio telepresence
Advisor (Chinese): 白明憲
Advisor (English): Bai, Ming-Sian R.
Committee Members (Chinese): 李大嵩、簡仁宗、鄭泗東、曹昱、張禎元
Committee Members (English): Lee, Ta-Sung; Chien, Jen-Tzung; Cheng, S-Tone; Tsao, Yu; Chang, Jen-Yuan
Degree: Doctoral
University: National Tsing Hua University
Department: Department of Power Mechanical Engineering
Student ID: 109033809
Publication Year (ROC): 113 (2024)
Graduation Academic Year: 112
Language: English
Number of Pages: 149
Keywords (Chinese): 盲語音分離、個人化語音增強、遠距雙耳音訊重現、麥克風陣列訊號處理、深度神經網路
Keywords (English): blind speech separation; personalized speech enhancement; binaural audio telepresence; microphone array signal processing; deep neural network
This dissertation combines microphone array signal processing with deep neural network techniques for speech enhancement and audio telepresence, so that the resulting systems can operate in a variety of acoustic environments and with different array configurations. The research covers three main topics: blind speech separation, personalized speech enhancement, and microphone-array-based binaural sound field reproduction. Blind speech separation aims to extract each speaker's signal using only the microphone observations. To this end, the dissertation proposes a spatial-temporal activity-informed speaker diarization and separation algorithm that combines a digital signal processing front-end with a deep neural network back-end, thereby improving speech quality and intelligibility. Personalized speech enhancement aims to suppress non-target speech-like interference, such as competing speakers and television noise. To make this technique broadly applicable to array-equipped recording devices of all kinds, the dissertation proposes an array-configuration-agnostic personalized speech enhancement algorithm that is not tied to any particular array configuration or topology. Binaural audio telepresence is another major focus of this dissertation: array-based binaural rendering converts the far-end acoustic scene captured by a microphone array into binaural signals so that the far-end sound field can be reproduced over headphones at the near end. The dissertation proposes a novel array-agnostic binaural audio telepresence system that accommodates a wide range of microphone array configurations. In addition, a forward-looking concept is introduced that allows users to switch freely between a speech enhancement mode and an ambience-preserving mode, making the proposed system flexible enough for a wide variety of audio telepresence scenarios.
This dissertation integrates Microphone Array Signal Processing (MASP) and Deep Neural Networks (DNNs) for Speech Enhancement (SE) and Audio Telepresence (AT) in various acoustic environments and array configurations. Three main topics are covered: Blind Speech Separation (BSS), Personalized Speech Enhancement (PSE), and array-based binaural rendering. BSS aims to extract the signal of each speaker using only the microphone signals. This dissertation proposes a Spatial-Temporal Activity-Informed Diarization and Separation (STAIDS) algorithm that combines a Digital Signal Processing (DSP)-based feature extraction front-end with a DNN-based diarization and separation back-end, leading to improved speech quality and intelligibility. PSE aims to suppress speech-like interference such as competing speakers and TV noise. The proposed PSE system is "array agnostic," meaning that it does not rely on a specific array configuration or topology. Binaural AT (BAT), another focus of this dissertation, seeks to convert the far-end acoustic scene captured by a microphone array into binaural signals for headphone reproduction at the near end. A novel BAT system is proposed that can be adapted to different array setups without reconfiguration. Finally, a "scalable" BAT system is proposed that can move between the two extremes of ambience preservation and signal enhancement, addressing the full range of AT application scenarios.
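To make the "array agnostic" idea concrete, the minimal Python sketch below shows one common way a geometry-independent spatial feature can be computed by a DSP front-end and combined with a spectral feature before being passed to a DNN back-end. It is only an illustration under assumed parameters; the function names, the 16 kHz sampling rate, the 512-point STFT, and the averaged pairwise phase-difference cue are hypothetical choices and do not reproduce the dissertation's actual LSTSC, STAIDS, or ARCA-PSE implementations.

```python
# Illustrative sketch of a geometry-agnostic DSP front-end feeding a DNN back-end.
# All names and parameters are assumptions for the example, not the dissertation's code.
import numpy as np
from scipy.signal import stft


def multichannel_stft(x, fs=16000, nperseg=512, noverlap=384):
    """STFT of an (n_mics, n_samples) signal -> complex (n_mics, n_freq, n_frames)."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap, axis=-1)
    return Z


def pairwise_phase_feature(Z):
    """Cos/sin of inter-channel phase differences, averaged over all microphone
    pairs so the feature shape does not depend on the number of microphones
    or on the array geometry."""
    n_mics = Z.shape[0]
    feats = []
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            ipd = np.angle(Z[i]) - np.angle(Z[j])              # inter-channel phase difference
            feats.append(np.stack([np.cos(ipd), np.sin(ipd)], axis=0))
    return np.mean(feats, axis=0)                              # (2, n_freq, n_frames)


if __name__ == "__main__":
    x = np.random.randn(6, 4 * 16000)                          # dummy 6-mic, 4-second recording
    Z = multichannel_stft(x)
    spatial = pairwise_phase_feature(Z)
    spectral = np.log1p(np.abs(Z[0]))                          # log-magnitude of a reference microphone
    dnn_input = np.concatenate([spectral[None], spatial], axis=0)
    print(dnn_input.shape)                                     # (3, n_freq, n_frames), fed to a mask-estimating DNN
```

Because the phase-difference features are averaged over all microphone pairs, the tensor handed to the network keeps the same shape regardless of how many microphones the device has or how they are arranged, which is the property referred to above as being array agnostic.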
Acknowledgements i
Abstract (Chinese) ii
Abstract iii
TABLE OF CONTENTS iv
LIST OF TABLES vii
LIST OF FIGURES x
CHAPTER 1. INTRODUCTION 1
CHAPTER 2. BLIND SPEECH SEPARATION 16
2.1 Problem formulation and signal model 17
2.2 Baseline method: the simplex-based method 18
2.2.1 Spatial feature extraction 19
2.2.2 Speaker diarization 20
2.2.3 Speaker separation 22
2.3 Proposed STAIDS system 25
2.3.1 Spatial feature extraction 26
2.3.2 Speaker diarization 27
2.3.3 Speaker separation 32
2.4 Experimental study 40
2.4.1 Data preparation 40
2.4.2 Implementation details and evaluation metrics 43
2.4.3 Spatial feature robustness 44
2.4.4 Speaker counting performance 47
2.4.5 Speaker diarization performance 50
2.4.6 Speaker separation performance 52
2.5 Summary 56
CHAPTER 3. PERSONALIZED SPEECH ENHANCEMENT 57
3.1 Problem formulation and signal model 58
3.2 Proposed ARCA-PSE system 59
3.2.1 The LSTSC feature 60
3.2.2 Speaker encoder 69
3.2.3 Target speech sifting network 69
3.3 Experimental study 72
3.3.1 Data preparation 72
3.3.2 Training and validation dataset 73
3.3.3 Test set 74
3.3.4 Baselines and implementation details 75
3.3.5 Evaluation metrics 77
3.3.6 Spatial feature robustness 78
3.3.7 Performance with different array configurations 81
3.3.8 Performance with different numbers of microphones 85
3.4 Summary 87
CHAPTER 4. ARRAY-BASED BINAURAL RENDERING 89
4.1 Problem formulation and signal model 90
4.2 Proposed ACIS-BAT system 92
4.2.1 The SCORE module 93
4.2.2 The BRnet 95
4.2.3 Training procedure and loss function 97
4.2.4 Training the model for both SE and AP tasks 98
4.3 Experimental study 99
4.3.1 Data preparation 100
4.3.2 Baselines and implementation details 104
4.3.3 Spatial feature robustness 105
4.3.4 Objective performance 108
4.3.5 Scalability 111
4.3.6 Subjective performance 113
4.4 Summary 116
CHAPTER 5. CONCLUSIONS 117
CHAPTER 6. FUTURE PERSPECTIVES 120
6.1 Online blind speech separation 120
6.2 Computationally efficient personalized speech enhancement 121
6.3 Global audio telepresence 121
ABBREVIATIONS 123
REFERENCES 127
PUBLICATIONS AND PATENT 147
(Full text will be open to external access after 2028-01-28.)