|
[1] P. Alexandros and N. Shrikanth, “Robust recognition of children’s speech,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 603– 616, 2003. [2] C. H. Lee, “Adaptive compensation for robust speech recognition,” in Proc. ASRU, 1997. [3] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control, ch. 9. Springer, 2006. [4] Y. Ephraim and D. Malah, Fundamentals of Noise Reduction in Spring Handbook of Speech Processing, ch. 43. Springer, 2008. [5] S. Shahnawazuddin, A. Dey, and R. Sinha, “Pitch-adaptive front-end features for robust children’s ASR,” in Proc. Interspeech, 2016. [6] S. Shahnawazuddin, K. T. Deepak, G. Pradhan, and R. Sinha, “Enhancing noise and pitch robustness of children’s ASR,” in Proc. ICASSP, 2017. [7] W. Ahmad, S. Shahnawazuddin, H. Kathania, G. Pradhan, and A. B. Samaddar, “Improving children’s speech recognition through explicit pitch scaling based on iterative spectrogram inversion,” in Proc. Interspeech, 2017. [8] H. K. Kathania, S. Shahnawazuddin, N. Adiga, and W. Ahmad, “Role of prosodic features on children’s speech recognition,” in Proc. ICASSP, 2018. [9] S. Shahnawazuddin, N. Adiga, and H. K. Kathania, “Effect of prosody modification on children’s ASR,” IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1749–1753, 2017. [10] R. Serizel and D. Giuliani, “Vocal tract length normalisation approaches to DNN-based children’s and adults’ speech recognition,” in Proc. SLT, 2014. [11] J. Fainberg, P. Bell, M. Lincoln, and S. Renals, “Improving children’s speech recognition through out-of-domain data augmentation,” in Proc. Interspeech, 2016. [12] P. Sheng, Z. Yang, and Y. Qian, “Gans for children: a generative data augmentation strategy for children speech recognition,” in Proc. ASRU, 2019. [13] P. G. Shivakumar, A. Potamianos, S. Lee, and S. S. Narayanan, “Improving speech recognition for children using acoustic adaptation and pronunciation modeling,” in Proc. WOCCI, 2014. [14] P. G. Shivakumar and P. Georgiou, “Transfer learning from adult to children for speech recognition: evaluation, analysis and recommendations,” Computer Speech Language, vol. 63, p. 101077, 2020. [15] R. Tong, L. Wang, and B. Ma, “Transfer learning for children’s speech recognition,” in Proc. IALP, 2017. [16] N. F. Chen, R. Tong, D. Wee, P. Lee, B. Ma, and H. Li, “SingaKids-Mandarin: speech corpus of Singaporean children speaking Mandarin Chinese,” in Proc. Interspeech, 2016. [17] M. H. Yang, H. S. Lee, Y. D. Lu, K. Y. Chen, Y. Tsao, B. Chen, and H. M. Wang, “Discriminative autoencoders for acoustic modeling,” in Proc. Interspeech, 2017. [18] P. T. Huang, H. S. Lee, S. S. Wang, K. Y. Chen, Y. Tsao, and H. M. Wang, “Exploring the encoder layers of discriminative autoencoders for LVCSR,” in Proc. Interspeech, 2019. [19] X. lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. Interspeech, 2013. [20] R. E. Zezario, J. W. Huang, X. Lu, Y. Tsao, H. T. Hwang, and H. M. Wang, “Deep denoising autoencoder based post filtering for speech enhancement,” in Proc. APSIPA, 2018. [21] S. Jalalvand and D. Falavigna, “Stacked autoencoder for asr error detection and word error rate prediction,” in Proc. Interspeech, 2015. [22] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “Flat-start single-stage discriminatively trained HMM-based models for ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 1949– 1961, 2018. [23] K. Lee, “On large-vocabulary speaker-independent continuous speech recognition,” Speech Communication, vol. 7, no. 4, pp. 375 – 379, 1988. [24] A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012. [25] A. Senior, H. Sak, F. de Chaumont Quitry, T. Sainath, and K. Rao, “Acoustic modelling with CD-CTC-sMBR LSTM RNNs,” in Proc. ASRU, 2015. [26] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017. [27] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989. [28] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. Interspeech, 2015. [29] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. Interspeech, 2018. [30] M. Federico, N. Bertoldi, and M. Cettolo, “Irstlm: an open source toolkit for handling large scale language models,” 2008. [31] B. Juang and L. R. Rabiner, “The segmental K-means algorithm for estimating parameters of hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 9, pp. 1639–1641, 1990. [32] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: trainable text-speech alignment using Kaldi,” in Proc. Interspeech, 2017. [33] K. Kumar, C. Kim, and R. M. Stern, “Delta-spectral cepstral coefficients for robust speech recognition,” in Proc. ICASSP, 2011. [34] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 272– 281, 1999. [35] C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech Language, vol. 9, no. 2, pp. 171 – 185, 1995. [36] S. Rath, D. Povey, K. Veselý, and J. Cernocký, “Improved feature processing for deep neural networks,” in Proc. Interspeech, 2013. [37] K. Han, S. Hahm, B.-H. Kim, J. Kim, and I. Lane, “Deep learning-based telephony speech recognition in the wild,” in Proc. Interspeech, 2017. [38] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Proc. Interspeech, 2013. [39] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. Interspeech, 2016. [40] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006. [41] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, and M. Mason, “I-vector based speaker recognition on short utterances,” in Proc. Interspeech, 2011. [42] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech, 2017. [43] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, “Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification,” in Proc. Interspeech, 2009. [44] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, 2013. [45] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “Xvectors: robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018. [46] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech Language, vol. 16, no. 1, pp. 69–88, 2002. [47] D. Povey, M. Hannemann, G. Boulianne, L. Burget, M. Janda, M. Karafiát, S. Kombrink, P. Motlíček, Y. Qian, K. Riedhammer, K. Vesel, and T. Vu, “Generating exact lattices in the WFST framework,” in Proc. ICASSP, 2012. [48] D. Povey, G. Boulianne, L. Burget, P. Motlicek, and P. Schwarz, “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011. [49] J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) complete LDC93S6A,” Web Download. Philadelphia: Linguistic Data Consortium, 1993. [50] Linguistic Data Consortium and NIST Multimodal Information Group, “CSRII (WSJ1) complete LDC94S13A,” Web Download. Philadelphia: Linguistic Data Consortium, 1997. [51] M. Eskenazi, J. Mostow, and D. Graff, “The CMU kids corpus LDC97S63,” Web Download. Philadelphia: Linguistic Data Consortium, 1997. [52] A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, S. Steidl, and M. Wong, “The PF_STAR children’s speech corpus,” in Proc. Interspeech, 2005. [53] http://www.speech.cs.cmu.edu/cgi-bin/cmudict. [54] https://en.wikipedia.org/wiki/Perplexity. [55] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. Interspeech, 2015. [56] https://github.com/YannickJadoul/Parselmouth. [57] S. Ghai and R. Sinha, “Exploring the role of spectral smoothing in context of children’s speech recognition,” in Proc. Interspeech, 2009. [58] P. Ghahremani, B. Babaali, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, “A pitch extraction algorithm tuned for ASR,” in Proc. ICASSP, 2014. [59] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-supervised training of acoustic models using lattice-free MMI,” in Proc. ICASSP, 2018. [60] http://sox.sourceforge.net/sox.html. |