[1] H. Liu, W. Xu, and B. Yang, “Audio-visual speech recognition using a two-step feature fusion strategy,” in 25th International Conference on Pattern Recognition (ICPR), pp. 1896–1903, IEEE, 2021.
[2] Y. Yuan, W. Tang, M. Fan, Y. Cao, P. Zhang, and L. Xie, “Deep audio-visual system for closed-set word-level speech recognition,” in International Conference on Multimodal Interaction, pp. 540–545, 2019.
[3] W. H. Sumby and I. Pollack, “Visual contribution to speech intelligibility in noise,” The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[4] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[5] E. D. Petajan, Automatic Lipreading to Enhance Speech Recognition (Speech Reading). PhD thesis, University of Illinois at Urbana-Champaign, 1984.
[6] N. Puviarasan and S. Palanivel, “Lip reading of hearing impaired persons using HMM,” Expert Systems with Applications, vol. 38, no. 4, pp. 4477–4481, 2011.
[7] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham, “Lipreading using shape, shading and scale,” in AVSP’98 International Conference on Auditory-Visual Speech Processing, 1998.
[8] G. I. Chiou and J.-N. Hwang, “Lipreading from color motion video,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 4, pp. 2156–2159, 1996.
[9] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in International Conference on Machine Learning (ICML), 2011.
[10] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Lipreading using convolutional neural network,” in 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[11] O. Koller, H. Ney, and R. Bowden, “Deep learning of mouth shapes for sign language,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 85–91, 2015.
[12] M. Wand, J. Koutník, and J. Schmidhuber, “Lipreading with long short-term memory,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115–6119, IEEE, 2016.
[13] S. Petridis and M. Pantic, “Deep complementary bottleneck features for visual speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2304–2308, IEEE, 2016.
[14] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision, pp. 87–103, Springer, 2016.
[15] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, “LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild,” in 2019 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pp. 1–8, IEEE, 2019.
[16] T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” arXiv preprint arXiv:1703.04105, 2017.
[17] T. Stafylakis, M. H. Khan, and G. Tzimiropoulos, “Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs,” Computer Vision and Image Understanding, vol. 176, pp. 22–32, 2018.
[18] Y. Zhang, S. Yang, J. Xiao, S. Shan, and X. Chen, “Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 356–363, IEEE, 2020.
[19] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6319–6323, IEEE, 2020.
[20] X. Weng and K. Kitani, “Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading,” arXiv preprint arXiv:1905.02540, 2019.
[21] P. Ma, B. Martinez, S. Petridis, and M. Pantic, “Towards practical lipreading with distilled and efficient models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7608–7612, IEEE, 2021.
[22] P. Ma, Y. Wang, S. Petridis, J. Shen, and M. Pantic, “Training strategies for improved lip-reading,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8472–8476, IEEE, 2022.
[23] D. Feng, S. Yang, S. Shan, and X. Chen, “Learn an effective lip reading model without pains,” arXiv preprint arXiv:2011.07557, 2020.
[24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, IEEE, 2016.
[26] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[27] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6548–6552, IEEE, 2018.
[28] Z. Miao, H. Liu, and B. Yang, “Part-based lipreading for audio-visual speech recognition,” in IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2722–2726, 2020.
[29] H. Liu, Z. Chen, and B. Yang, “Lip graph assisted audio-visual speech recognition using bidirectional synchronous fusion,” in INTERSPEECH, pp. 3520–3524, 2020.
[30] H. Liu, W. Li, and B. Yang, “Robust audio-visual speech recognition based on hybrid fusion,” in 25th International Conference on Pattern Recognition (ICPR), pp. 7580–7586, IEEE, 2021.
[31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[33] X. Liu, “Bi-directional gated recurrent unit neural network based nonlinear equalizer for coherent optical communication system,” Optics Express, vol. 29, no. 4, pp. 5923–5933, 2021.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[35] N. Krishnamurthy and J. H. Hansen, “Babble noise: Modeling, analysis, and applications,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1394–1407, 2009.