[1] L. R. Rabiner and M. R. Sambur, “Voiced-unvoiced-silence detection using Itakura LPC distance measure”, Proc. IEEE ICASSP, pp. 323-326, May 1977.
[2] J. D. Hoyt and H. Wechsler, “Detection of human speech in structured noise”, Proc. IEEE ICASSP, pp. 237-240, May 1994.
[3] L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon, “An improved endpoint detector for isolated word recognition”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 777-785, Aug. 1981.
[4] B. Kotnik, Z. Kacic, and B. Horvat, “A multiconditional robust front-end feature extraction with a noise reduction procedure based on improved spectral subtraction algorithm”, Proc. 7th Eurospeech, pp. 197-200, Sep. 2001.
[5] J. A. Haigh and J. S. Mason, “Robust voice activity detection using cepstral features”, Proc. IEEE TENCON, pp. 321-324, 1993.
[6] N. B. Yoma, F. McInnes, and M. Jack, “Robust speech pulse-detection using adaptive noise modeling”, Electronics Letters, vol. 32, July 1996.
[7] R. Tucker, “Voice activity detection using a periodicity measure”, Proc. Inst. Elect. Eng., vol. 139, no. 4, pp. 377-380, Aug. 1992.
[8] J. Sohn and W. Sung, “A voice activity detector employing soft decision based noise spectrum adaptation”, Proc. IEEE ICASSP, vol. 1, pp. 365-368, 1998.
[9] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detector”, IEEE Signal Processing Letters, vol. 6, pp. 1-3, Jan. 1999.
[10] D. Enqing, L. Guizhong, Z. Yatong, and Z. Xiaodi, “Applying support vector machines to voice activity detection”, Proc. IEEE Int. Conf. Signal Processing, pp. 1124-1127, Aug. 2002.
[11] J. Ramírez, P. Yélamos, J. M. Górriz, J. C. Segura, and L. García, “Speech/non-speech discrimination combining advanced feature extraction and SVM learning”, Proc. Interspeech, pp. 1662-1665, 2006.
[12] X.-L. Zhang and J. Wu, “Deep belief networks based voice activity detection”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 697-710, 2013.
[13] T. Hughes and K. Mierle, “Recurrent neural networks for voice activity detection”, Proc. IEEE ICASSP, pp. 7378-7382, May 2013.
[14] S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions”, Proc. IEEE ICASSP, pp. 2519-2523, May 2014.
[15] Y. Guo, K. Li, Q. Fu, and Y. Yan, “A two-microphone based voice activity detection for distant-talking speech in wide range of direction of arrival”, Proc. IEEE ICASSP, pp. 4901-4904, 2012.
[16] J. Park et al., “Dual microphone voice activity detection exploiting interchannel time and level differences”, IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1335-1339, 2016.
[17] J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization in reverberant rooms”, in Microphone Arrays: Signal Processing Techniques and Applications, Springer, pp. 157-180, 2001.
[18] C. Macartney and T. Weyde, “Improved speech enhancement with the Wave-U-Net”, 2018.
[19] R. Giri, U. Isik, and A. Krishnaswamy, “Attention Wave-U-Net for speech enhancement”, Proc. WASPAA, pp. 4049-4053, 2019.
[20] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[21] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition”, Proc. IEEE ICASSP, pp. 7092-7096, 2013.
[22] K. Tan and D. Wang, “A convolutional recurrent neural network for real-time speech enhancement”, Proc. Interspeech, pp. 3229-3233, 2018.
[23] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, et al., “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement”, Proc. Interspeech, pp. 2472-2476, 2020.
[24] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming”, Proc. IEEE ICASSP, pp. 196-200, 2016.
[25] B. D. van Veen and K. M. Buckley, “Beamforming: A versatile approach to spatial filtering”, IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24, Apr. 1988.
[26] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1529-1539, 2007.
[27] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, “ADL-MVDR: All deep learning MVDR beamformer for target speech separation”, Proc. IEEE ICASSP, pp. 6089-6093, 2021.
[28] Y. Luo, E. Ceolini, C. Han, S.-C. Liu, and N. Mesgarani, “FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing”, Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[29] Z.-Q. Wang and D. L. Wang, “All-neural multi-channel speech enhancement”, Proc. Interspeech, pp. 3234-3238, 2018.
[30] A. Li, W. Liu, C. Zheng, and X. Li, “Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement”, arXiv preprint arXiv:2109.00265, 2021.
[31] X. Ren, X. Zhang, L. Chen, X. Zheng, et al., “A causal U-Net based neural beamforming network for real-time multichannel speech enhancement”, Proc. Interspeech, Sep. 2021.
[32] U. Hamid, R. A. Qamar, and K. Waqas, “Performance comparison of time-domain and frequency-domain beamforming techniques for sensor array processing”, Proc. Int. Bhurban Conference on Applied Sciences & Technology, pp. 379-385, Jan. 2014.
[33] B. C. J. Moore, Introduction to the Psychology of Hearing, New York: Academic Press, 1977.
[34] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.
[35] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books”, Proc. IEEE ICASSP, South Brisbane, Australia, pp. 5206-5210, 2015.
[36] Wake Word Data Sets: https://summalinguae.com/data-sets/wake-words/
[37] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, “A scalable noisy speech dataset and online subjective test framework”, Proc. Interspeech, pp. 1816-1820, 2019.
[38] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis”, Proc. Int. Society for Music Information Retrieval Conf., Suzhou, China, pp. 316-323, 2017.
[39] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics”, J. Acoust. Soc. Amer., vol. 65, pp. 943-950, 1979.
[40] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, and X. Serra, “Learning sound event classifiers from web audio with noisy labels”, Proc. IEEE ICASSP, pp. 21-25, 2019.
[41] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs”, Proc. IEEE ICASSP, Salt Lake City, Utah, USA, pp. 749-752, 2001.
[42] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech”, Proc. IEEE ICASSP, Dallas, TX, USA, pp. 4214-4217, 2010.