[1] H. Liu, W. Xu, and B. Yang, “Audio-visual speech recognition using a two-step feature fusion strategy,” in 25th International Conference on Pattern Recognition (ICPR), pp. 1896–1903, IEEE, 2021.
[2] Y. Yuan, W. Tang, M. Fan, Y. Cao, P. Zhang, and L. Xie, “Deep audio-visual system for closed-set word-level speech recognition,” in International Conference on Multimodal Interaction, pp. 540–545, 2019.
[3] W. H. Sumby and I. Pollack, “Visual contribution to speech intelligibility in noise,” The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[4] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[5] E. D. Petajan, Automatic Lipreading to Enhance Speech Recognition (Speech Reading). PhD thesis, University of Illinois at Urbana-Champaign, 1984.
[6] N. Puviarasan and S. Palanivel, “Lip reading of hearing impaired persons using HMM,” Expert Systems with Applications, vol. 38, no. 4, pp. 4477–4481, 2011.
[7] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham, “Lipreading using shape, shading and scale,” in AVSP’98 International Conference on Auditory-Visual Speech Processing, 1998.
[8] G. I. Chiou and J.-N. Hwang, “Lipreading from color motion video,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 4, pp. 2156–2159, 1996.
[9] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning (ICML), 2011.
[10] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Lipreading using convolutional neural network,” in 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[11] O. Koller, H. Ney, and R. Bowden, “Deep learning of mouth shapes for sign language,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 85–91, 2015.
[12] M. Wand, J. Koutník, and J. Schmidhuber, “Lipreading with long short-term memory,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115–6119, IEEE, 2016.
[13] S. Petridis and M. Pantic, “Deep complementary bottleneck features for visual speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2304–2308, IEEE, 2016.
[14] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision, pp. 87–103, Springer, 2016.
[15] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, “LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild,” in 2019 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pp. 1–8, IEEE, 2019.
[16] T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” arXiv preprint arXiv:1703.04105, 2017.
[17] T. Stafylakis, M. H. Khan, and G. Tzimiropoulos, “Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs,” Computer Vision and Image Understanding, vol. 176, pp. 22–32, 2018.
[18] Y. Zhang, S. Yang, J. Xiao, S. Shan, and X. Chen, “Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 356–363, IEEE, 2020.
[19] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6319–6323, IEEE, 2020.
[20] X. Weng and K. Kitani, “Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading,” arXiv preprint arXiv:1905.02540, 2019.
[21] P. Ma, B. Martinez, S. Petridis, and M. Pantic, “Towards practical lipreading with distilled and efficient models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7608–7612, IEEE, 2021.
[22] P. Ma, Y. Wang, S. Petridis, J. Shen, and M. Pantic, “Training strategies for improved lip-reading,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8472–8476, IEEE, 2022.
[23] D. Feng, S. Yang, S. Shan, and X. Chen, “Learn an effective lip reading model without pains,” arXiv preprint arXiv:2011.07557, 2020.
[24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, IEEE, 2016.
[26] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[27] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6548–6552, IEEE, 2018.
[28] Z. Miao, H. Liu, and B. Yang, “Part-based lipreading for audio-visual speech recognition,” in IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2722–2726, 2020.
[29] H. Liu, Z. Chen, and B. Yang, “Lip graph assisted audio-visual speech recognition using bidirectional synchronous fusion,” in INTERSPEECH, pp. 3520–3524, 2020.
[30] H. Liu, W. Li, and B. Yang, “Robust audio-visual speech recognition based on hybrid fusion,” in 25th International Conference on Pattern Recognition (ICPR), pp. 7580–7586, IEEE, 2021.
[31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[33] X. Liu, “Bi-directional gated recurrent unit neural network based nonlinear equalizer for coherent optical communication system,” Optics Express, vol. 29, no. 4, pp. 5923–5933, 2021.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[35] N. Krishnamurthy and J. H. Hansen, “Babble noise: Modeling, analysis, and applications,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1394–1407, 2009.