1. E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement, NJ, USA: Wiley, 2018.
2. M. Kawamoto, K. Matsuoka, and N. Ohnishi, "A method of blind separation for convolved nonstationary signals," Neurocomputing, vol. 22, no. 1, pp. 157-171, 1998.
3. H. Buchner, R. Aichner, and W. Kellermann, "A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics," IEEE Transactions on Audio, Speech and Language Processing, vol. 13, no. 1, pp. 120-134, 2005.
4. Z. Koldovsky and P. Tichavsky, "Time-domain blind separation of audio sources on the basis of a complete ICA decomposition of an observation space," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 2, pp. 406-416, 2011.
5. T. Kim, T. Eltoft, and T.-W. Lee, "Independent vector analysis: An extension of ICA to multivariate components," Proc. International Conference on Independent Component Analysis and Signal Separation, pp. 165-172, 2006.
6. T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
7. O. Dikmen and A. T. Cemgil, "Unsupervised single-channel source separation using Bayesian NMF," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 93-96, 2009.
8. A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550-563, 2010.
9. Y. Mitsufuji and A. Roebel, "Sound source separation based on non-negative tensor factorization incorporating spatial cue as prior knowledge," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 71-75, 2013.
10. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31-35, 2016.
11. Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 246-250, 2017.
12. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
13. Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46-50, 2020.
14. D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241-245, 2017.
15. M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, "Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 10, pp. 1901-1913, 2017.
16. J. Zhu, R. A. Yeh, and M. Hasegawa-Johnson, "Multi-decoder DPRNN: Source separation for variable number of speakers," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3420-3424, 2021.
17. K. Kinoshita, L. Drude, M. Delcroix, and T. Nakatani, "Listening to each speaker one by one with recurrent selective hearing networks," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5064-5068, 2018.
18. T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, "All-neural online source separation, counting, and diarization for meeting analysis," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 91-95, 2019.
19. Z. Jin, X. Hao, and X. Su, "Coarse-to-fine recursive speech separation for unknown number of speakers," arXiv:2203.16054, 2022.
20. L. Drude and R. Haeb-Umbach, "Tight integration of spatial and spectral features for BSS with deep clustering embeddings," Proc. Interspeech, pp. 2650-2654, 2017.
21. Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, 2018.
22. Z. Wang and D. Wang, "Combining spectral and spatial features for deep learning based blind speaker separation," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 2, pp. 457-468, 2019.
23. Y. Luo, C. Han, N. Mesgarani, E. Ceolini, and S. Liu, "FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing," Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 260-267, 2019.
24. K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Černocký, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800-814, 2019.
25. Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," Proc. Interspeech, pp. 2728-2732, 2019.
26. M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, "SpEx+: A complete time domain speaker extraction network," Proc. Interspeech, pp. 1406-1410, 2020.
27. A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, vol. 37, no. 4, pp. 1-11, 2018.
28. C. Li and Y. Qian, "Listen, watch and understand at the cocktail party: Audio-visual-contextual speech separation," Proc. Interspeech, pp. 1426-1430, 2020.
29. Z. Chen, T. Yoshioka, X. Xiao, L. Li, M. L. Seltzer, and Y. Gong, "Efficient integration of fixed beamformers and speech separation networks for multi-channel far-field speech separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5384-5388, 2018.
30. H. Taherian, K. Tan, and D. Wang, "Multi-channel talker-independent speaker separation through location-based training," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 30, pp. 2791-2800, 2022.
31. R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "Neural spatial filter: Target speaker speech separation assisted with directional information," Proc. Interspeech, pp. 4290-4294, 2019.
32. M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, "Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 691-695, 2020.
33. J. Han, W. Rao, Y. Wang, and Y. Long, "Improving channel decorrelation for multi-channel target speech extraction," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6094-6098, 2021.
34. M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, and T. Nakatani, "Speaker activity driven neural speech extraction," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6099-6103, 2021.
35. H. Taherian and D. Wang, "Multi-channel conversational speaker separation via neural diarization," arXiv:2311.08630, 2023.
36. B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "Diarization and separation based on a data-driven simplex," Proc. European Signal Processing Conference (EUSIPCO), pp. 842-846, 2018.
37. B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "Source counting and separation based on simplex analysis," IEEE Transactions on Signal Processing, vol. 66, no. 24, pp. 6458-6473, 2018.
38. B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "Global and local simplex representations for multichannel source separation," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 28, pp. 914-928, 2020.
39. S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and P. García, "Encoder-decoder based attractors for end-to-end neural diarization," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 30, pp. 1493-1507, 2022.
40. E. Lehmann and A. Johansson, "Prediction of energy decay in room impulse responses simulated with an image-source model," The Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 269-277, 2008.
41. E. Hadad, F. Heese, P. Vary, and S. Gannot, "Multichannel audio database in various acoustic environments," Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 313-317, 2014.
42. D. Powers, "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation," arXiv:2010.16061, 2020.
43. J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo, "The rich transcription 2006 spring meeting recognition evaluation," Proc. International Workshop on Machine Learning for Multimodal Interaction, pp. 309-322, 2006.
44. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 749-752, 2001.
45. T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," Proc. Interspeech, pp. 3244-3248, 2018.
46. D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, "An overview of deep-learning-based audio-visual speech enhancement and separation," arXiv:2008.09586, 2020.
47. T. Li, Q. Lin, Y. Bao, and M. Li, "Atss-Net: Target speaker separation via attention-based neural network," Proc. Interspeech, pp. 1411-1415, 2020.
48. Z. Zhang, B. He, and Z. Zhang, "X-TasNet: Robust and accurate time-domain speaker extraction network," Proc. Interspeech, pp. 1421-1425, 2020.
49. R. Giri, S. Venkataramani, J.-M. Valin, U. Isik, and A. Krishnaswamy, "Personalized PercepNet: Real-time, low-complexity target voice separation and enhancement," Proc. Interspeech, 2021.
50. S. E. Eskimez, T. Yoshioka, H. Wang, X. Wang, Z. Chen, and X. Huang, "Personalized speech enhancement: New models and comprehensive evaluation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 356-360, 2021.
51. M. Thakker, S. E. Eskimez, T. Yoshioka, and H. Wang, "Fast real-time personalized speech enhancement: End-to-end enhancement network (E3Net) and knowledge distillation," Proc. Interspeech, pp. 991-995, 2022.
52. E. Ceolini, J. Hjortkjær, D. D. Wong, J. O'Sullivan, V. S. Raghavan, J. Herrero, A. D. Mehta, S.-C. Liu, and N. Mesgarani, "Brain-informed speech separation (BISS) for enhancement of target speaker in multitalker speech perception," NeuroImage, vol. 223, 2020.
53. G. Li, S. Liang, S. Nie, W. Liu, M. Yu, L. Chen, S. Peng, and C. Li, "Direction-aware speaker beam for multi-channel speaker extraction," Proc. Interspeech, pp. 2713-2717, 2019.
54. C. Zorilă, M. Li, and R. Doddipatla, "An investigation into the multi-channel time domain speaker extraction network," Proc. IEEE Spoken Language Technology Workshop (SLT), pp. 793-800, 2021.
55. H. Taherian, S. E. Eskimez, T. Yoshioka, H. Wang, Z. Chen, and X. Huang, "One model to enhance them all: Array geometry agnostic multi-channel personalized speech enhancement," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 271-275, 2022.
56. B. C. J. Moore, An Introduction to the Psychology of Hearing, Brill, 2012.
57. C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 886-890, 2022.
58. J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - half-baked or well done?," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626-630, 2019.
59. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
60. F. Z. Kaghat, A. Azough, M. Fakhour, and M. Meknassi, "A new audio augmented reality interaction and adaptation model for museum visits," Computers and Electrical Engineering, vol. 84, no. 106606, 2020.
61. C. Kern and W. Ellermeier, "Audio in VR: Effects of a soundscape and movement-triggered step sounds on presence," Frontiers in Robotics and AI, vol. 7, pp. 1-13, 2020.
62. N. Nagele, V. Bauer, P. G. T. Healey, J. D. Reiss, H. Cooke, T. Cowlishaw, C. Baume, and C. Pile, "Interactive audio augmented reality in participatory performance," Frontiers in Virtual Reality, vol. 1, pp. 1-14, 2021.
63. J. Yang, A. Barde, and M. Billinghurst, "Audio augmented reality: A systematic review of technologies, applications, and future research directions," Journal of the Audio Engineering Society, vol. 70, pp. 788-809, 2022.
64. M. Zaunschirm, C. Schörkhuber, and R. Höldrich, "Binaural rendering of Ambisonic signals by head-related impulse response time alignment and a diffuseness constraint," The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3616-3627, 2018.
65. R. Gupta, J. He, R. Ranjan, W. Gan, F. Klein, C. Schneiderwind, A. Neidhardt, K. Brandenburg, and V. Välimäki, "Augmented/mixed reality audio for hearables: Sensing, control, and rendering," IEEE Signal Processing Magazine, vol. 39, no. 3, pp. 63-89, 2022.
66. B. Rafaely, V. Tourbabin, E. Habets, Z. Ben-Hur, H. Lee, H. Gamper, L. Arbel, T. Abhayapala, and P. Samarasinghe, "Spatial audio signal processing for binaural reproduction of recorded acoustic scenes - review and challenges," Acta Acustica, vol. 6, no. 47, 2022.
67. M. R. Bai, Y. Chen, Y. Hsu, and T. Wu, "Robust binaural rendering with the time-domain underdetermined multichannel inverse prefilters," The Journal of the Acoustical Society of America, vol. 146, no. 2, pp. 1302-1313, 2019.
68. F. Lluís, P. Martínez-Nuevo, M. B. Møller, and S. E. Shepstone, "Sound field reconstruction in rooms: Inpainting meets super-resolution," The Journal of the Acoustical Society of America, vol. 148, no. 2, pp. 649-659, 2020.
69. E. Fernandez-Grande, X. Karakonstantis, D. Caviedes-Nozal, and P. Gerstoft, "Generative models for sound field reconstruction," The Journal of the Acoustical Society of America, vol. 153, no. 2, pp. 1179-1190, 2023.
70. A. Wabnitz, N. Epain, A. McEwan, and C. Jin, "Upscaling Ambisonic sound scenes using compressed sensing techniques," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1-4, 2011.
71. M. Kentgens, S. A. Hares, and P. Jax, "On the upscaling of higher-order Ambisonics signals for sound field translation," Proc. European Signal Processing Conference (EUSIPCO), pp. 81-85, 2021.
72. A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic control by wave field synthesis," The Journal of the Acoustical Society of America, vol. 93, no. 5, pp. 2764-2778, 1993.
73. J. Ahrens and S. Spors, "Wave field synthesis of a sound field described by spherical harmonics expansion coefficients," The Journal of the Acoustical Society of America, vol. 131, no. 3, pp. 2190-2199, 2012.
74. T. D. Abhayapala and D. B. Ward, "Theory and design of high order sound field microphones using spherical microphone array," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1949-1952, 2002.
75. D. L. Alon, J. Sheaffer, and B. Rafaely, "Robust plane-wave decomposition of spherical microphone array recordings for binaural sound reproduction," The Journal of the Acoustical Society of America, vol. 138, no. 3, pp. 1925-1926, 2015.
76. W. Zhang, T. D. Abhayapala, R. A. Kennedy, and R. Duraiswami, "Insights into head-related transfer function: Spatial dimensionality and continuous representation," The Journal of the Acoustical Society of America, vol. 127, no. 4, pp. 2347-2357, 2010.
77. M. Jeffet, N. R. Shabtai, and B. Rafaely, "Theory and perceptual evaluation of the binaural reproduction and beamforming trade-off in the generalized spherical array beamformer," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 4, pp. 708-718, 2016.
78. Z. Ben-Hur, D. L. Alon, R. Mehra, and B. Rafaely, "Binaural reproduction based on bilateral Ambisonics and ear-aligned HRTFs," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 29, pp. 901-913, 2021.
79. L. Birnie, Z. Ben-Hur, V. Tourbabin, T. Abhayapala, and P. Samarasinghe, "Bilateral-Ambisonic reproduction by soundfield translation," Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1-5, 2022.
80. C. Borrelli, A. Canclini, F. Antonacci, A. Sarti, and S. Tubaro, "A denoising methodology for higher order Ambisonics recordings," Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 451-455, 2018.
81. M. Lugasi and B. Rafaely, "Speech enhancement using masking for binaural reproduction of Ambisonics signals," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 28, pp. 1767-1777, 2020.
82. A. Herzog and E. A. Habets, "Direction and reverberation preserving noise reduction of Ambisonics signals," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 28, pp. 2461-2475, 2020.
83. H. Beit-On, M. Lugasi, L. Madmoni, A. Menon, A. Kumar, J. Donley, V. Tourbabin, and B. Rafaely, "Audio signal processing for telepresence based on wearable array in noisy and dynamic scenes," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8797-8801, 2022.
84. I. Ifergan and B. Rafaely, "On the selection of the number of beamformers in beamforming-based binaural reproduction," EURASIP Journal on Audio, Speech, and Music Processing, vol. 6, 2022.
85. Y. Hsu, C. Ma, and M. R. Bai, "Model-matching principle applied to the design of an array-based all-neural binaural rendering system for audio telepresence," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, 2023.
86. H. Schröter, A. Maier, A. N. Escalante-B., and T. Rosenkranz, "DeepFilterNet2: Towards real-time speech enhancement on embedded devices for full-band audio," Proc. IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1-5, 2022.
87. E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, "FiLM: Visual reasoning with a general conditioning layer," Proc. 32nd AAAI Conference on Artificial Intelligence, pp. 3942-3951, 2018.
88. B. G. Shinn-Cunningham, S. G. Santarelli, and N. Kopco, "Tori of confusion: Binaural cues for sources within reach of a listener," The Journal of the Acoustical Society of America, vol. 107, no. 3, pp. 1627-1636, 2000.
89. J. Braasch, "Modelling of binaural hearing," in Communication Acoustics, New York: Springer, pp. 75-108, 2005.
90. O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, 2004.
91. S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Transactions on Signal Processing, vol. 49, no. 8, pp. 1614-1626, 2001.
92. C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
93. K. Scharnhorst, "Angles in complex vector spaces," Acta Applicandae Mathematicae, vol. 69, pp. 95-103, 2001.
94. B. Laufer-Goldshtein, R. Talmon, and S. Gannot, "Audio source separation by activity probability detection with maximum correlation and simplex geometry," EURASIP Journal on Audio, Speech, and Music Processing, vol. 5, 2021.
95. K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," Proc. Interspeech, pp. 3229-3233, 2018.
96. E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, "FiLM: Visual reasoning with a general conditioning layer," Proc. 32nd AAAI Conference on Artificial Intelligence, pp. 3942-3951, 2018.
97. Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46-50, 2020.
98. A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, vol. 37, no. 4, 2018.
99. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, 2015.
100. M. Ravanelli et al., "SpeechBrain: A general-purpose speech toolkit," arXiv:2106.04624, 2021.
101. N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
102. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329-5333, 2018.
103. L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883, 2018.
104. J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proc. Interspeech, 2018.
105. S. Braun, H. Gamper, C. K. A. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 656-660, 2021.
106. K. Tan and D. Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 28, pp. 380-390, 2020.
107. H. Schröter, A. N. Escalante-B., T. Rosenkranz, and A. Maier, "DeepFilterNet: A low complexity speech enhancement framework for full-band audio based on deep filtering," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7407-7411, 2022.
108. J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, "Spot the conversation: Speaker diarisation in the wild," arXiv:2007.01216, 2020.
109. M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, "FMA: A dataset for music analysis," Proc. International Society for Music Information Retrieval Conference (ISMIR), pp. 316-323, 2017.
110. C. Armstrong, L. Thresh, D. Murphy, and G. Kearney, "A perceptual evaluation of individual and non-individual HRTFs: A case study of the SADIE II database," Applied Sciences, vol. 8, no. 11, 2018.
111. J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays, Berlin, Germany: Springer, 2001.
112. H. L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory, New York: Wiley, 2002.
113. A. Brughera, L. Dunai, and W. M. Hartmann, "Human interaural time difference thresholds for sine tones: The high-frequency limit," The Journal of the Acoustical Society of America, vol. 133, pp. 2839-2855, 2013.
114. ITU-R Recommendation, "Method for the subjective assessment of intermediate sound quality (MUSHRA)," International Telecommunications Union, BS.1534-1, Geneva, Switzerland, 2001.
115. C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," Proc. Interspeech, pp. 1816-1820, 2019.