
Detailed record

Author (Chinese): 宋展延
Author (English): Song, Zhan-Yan
Title (Chinese): 台灣小學生英語單字發音自動評量系統方法之開發
Title (English): Development of Automatic Evaluation Systems on Taiwanese Elementary School Students' English Pronunciation
Advisor (Chinese): 劉奕汶
Advisor (English): Liu, Yi-Wen
Committee (Chinese): 馮開明、徐憶萍、蘇文鈺
Committee (English): Feng, Kai-Ming; Hsu, Yi-Ping; Su, Wen-Yu
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 105061587
Year of publication (ROC era): 107 (2018)
Graduation academic year: 107
Language: English
Number of pages: 42
Keywords (Chinese): 語音評分 (speech scoring), 共振峰曲線 (formant contour), 支持向量機 (support vector machine), 發音評分 (pronunciation scoring), 梅爾倒頻譜參數 (Mel-frequency cepstral coefficients)
Keywords (English): speech assessment, pronunciation scoring, Mel-frequency cepstral coefficients, speech evaluation, computer-assisted language learning
The motivation for this thesis is to develop a speech-analysis system that tests whether an elementary school student has pronounced an English word correctly, and that evaluates the parts pronounced incorrectly, so that children in rural areas who lack English-teacher resources can practice on their own with a computer-assisted system and thereby improve their pronunciation accuracy and fluency. The data come from the Junyi Academy (均一教育平台) platform, which collected recordings from elementary school students in Sanxia District, New Taipei City, together with recordings from a reference teacher. Each word has its own phonetic characteristics, so the scoring rules differ from word to word. Based on the Junyi Academy teachers' scores for each student's pronunciation and the corresponding scoring rules, this study builds word-specific automatic evaluation systems. The main approach treats the teacher's recordings as the standard and measures how much the features of a student's pronunciation deviate from the teacher's. The features used are the volume intensity contour, the rate of rise of log energy, the zero-crossing rate, the sound duration, Mel-frequency cepstral coefficients, and the contours of the first and second formants. Boxplots are then used for statistical analysis to see how clearly these features separate the score classes. In addition, a second evaluation method is applied to the vowel segments: a support vector machine, with Mel-frequency cepstral coefficients, linear prediction coefficients, and volume as features. Data from 80% of the students are used to train the scoring model, and the remaining 20% are used to test its classification accuracy for subsequent vowel scoring. The boxplots show that when the grading criterion is an unclear vowel, the formant trajectories differ noticeably, so they can be used for classification; when the criterion is confusion between long and short vowels, the vowel duration differs noticeably, making it suitable for this judgment. Consonants can likewise be judged by the volume intensity contour, the rate of rise of log energy, and the zero-crossing rate. Without a reference teacher as the standard, an SVM that classifies vowel pronunciations directly from the students' data reaches an accuracy of roughly 60%. Combined with a previously developed recognition step, the evaluation system lets a student who pronounces a word entirely incorrectly keep practicing phonics against the teacher's reference recording; the detailed scoring criteria are then applied so that the student knows which part of the word was pronounced poorly and can practice English pronunciation without a teacher at hand.
The motivation behind this thesis is to help elementary school students from rural areas master English pronunciation. We explore several methods to judge whether a word is pronounced correctly.
Recordings of English words were collected from school children in New Taipei City, and scoring rules were defined by their teachers. We then develop an automatic evaluation system applicable to different words. The goal is to compare the audio files from the teachers with those from the pupils and to quantify the differences between them with respect to each word's characteristics. The features extracted from these audio signals include the short-term log root-mean-square energy, the rate of rise of log root-mean-square energy, the zero-crossing rate, the duration of a speech sound, the Mel-scale frequency cepstral coefficients, and the trajectories of the first two formants. Boxplots are then used for statistical analysis, to observe how the audio data with different scores are distributed. In addition, a support vector machine is used to evaluate the vowel segment of each word, with the Mel-scale frequency cepstral coefficients, the linear predictive coefficients, and the volume as features. Data from 80% of the students form the training set for the model, while the rest are held out for testing. The score prediction accuracy was about 64% for the vowel segment. We thus obtain an evaluation system that combines the boxplot analysis with the SVM classifier. A recognition system is also developed to detect whether a student has pronounced a different word altogether, and the two systems are combined into the overall evaluation system. In the future, we hope students can use the system to receive feedback on which parts of a word are poorly pronounced and to practice correct pronunciation with its assistance.
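Two of the frame-level features listed above, the short-term log RMS energy and the zero-crossing rate, can be sketched as below. This is a minimal illustration, not the thesis implementation: the frame length (25 ms) and hop size (10 ms) at 16 kHz are assumptions, and a synthetic 440 Hz tone stands in for a real recording.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def log_rms_energy(frames, eps=1e-10):
    """Short-term log RMS energy of each frame; eps avoids log10(0)."""
    return np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + eps)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Toy input: 1 s of a 440 Hz tone at 16 kHz instead of a real word.
fs = 16000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)

frames = frame_signal(x)
e = log_rms_energy(frames)
z = zero_crossing_rate(frames)
print(frames.shape, e.shape, z.shape)
```

For a pure 440 Hz tone the ZCR sits near 2 * 440 / 16000 per sample pair, and both features are flat across frames; for real speech, their contours over the word are what gets compared against the teacher's recording.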
摘要 (Abstract in Chinese) i
Abstract ii
1 Introduction 1
  1.1 Motivation 2
  1.2 Literature review 2
2 Methods 4
  2.1 System overview 4
  2.2 End-point detection (EPD) 5
  2.3 Features
    2.3.1 Short-term log root-mean-square (RMS) energy 5
    2.3.2 Rate of rise of log RMS energy 5
    2.3.3 Zero-crossing rate (ZCR) 5
    2.3.4 Mel-scale frequency cepstral coefficients (MFCC) 6
    2.3.5 Linear prediction coefficients (LPC) [5] 9
    2.3.6 Formant contour 9
  2.4 Dynamic time warping 10
  2.5 Comparing groups using boxplots 12
  2.6 Receiver operating characteristics [15] 13
  2.7 Support vector machine (SVM) 14
3 Dataset and grading criteria 17
  3.1 Random check 17
4 Experiment results 19
  4.1 Consonants 19
  4.2 Vowels 22
  4.3 Discussion 35
5 Conclusion and future works 36
  5.1 Conclusion 36
  5.2 Future works 38
Appendices 42
A Suggestions from the Committees 42
[1] L. Weigelt, S. Sadoff, and J. Miller, “Plosive/fricative distinction: the voiceless case,” J. Acoust. Soc. Am., vol. 87, no. 6, pp. 2729–2737, 1990.
[2] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[3] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Pearson Education, 2009.
[4] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[5] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, no. 5, pp. 561–580, 1975.
[6] H. Kamata, H. Oka, and Y. Ishida, “Estimation of vocal tract transfer function considering the glottis open and close characteristics,” in Proc. IEEE Pacific Rim Conf. Commun. Comput. Signal Process., vol. 1, pp. 137–140, 1993.
[7] J. Schroeter, “Techniques for estimating vocal-tract shapes from the speech signal,” IEEE Trans. Speech Audio Process., vol. 2, no. 1, pp. 133–150, 1994.
[8] P. Ladefoged, A Course in Phonetics, 5th ed. Boston, MA: Thomson Wadsworth, 2006.
[9] F. Keiler, D. Arfib, and U. Zölzer, “Efficient linear prediction for digital audio effects,” in Proc. Int. Conf. on Digital Audio Effects, Dec. 2000.
[10] N. Levinson, “The Wiener RMS error criterion in filter design and prediction,” J. Math. Phys., vol. 25, pp. 261–278, 1947.
[11] P. Escudero, P. Boersma, A. S. Rauber, and R. A. Bion, “A cross-dialect acoustic description of vowels: Brazilian and European Portuguese,” J. Acoust. Soc. Am., vol. 126, no. 3, pp. 1379–1393, 2009.
[12] F. Itakura and S. Saito, “Digital filtering techniques for speech analysis and synthesis,” in Proc. 7th Int. Conf. Acoust., 1971.
[13] N. Anderson, “On the calculation of filter coefficients for maximum entropy spectral analysis,” in Modern Spectral Analysis. New York: IEEE Press, 1978.
[14] Nayland College Mathematics, “Comparing box plots.” Available: http://maths.nayland.school.nz/Year_11/AS1.10_Multivar_data/11_Comparing_Boxplots.htm
[15] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[16] M. A. Aizerman, “Theoretical foundations of the potential function method in pattern recognition learning,” Automation and Remote Control, vol. 25, pp. 821–837, 1964.
[17] J. Mercer, “Functions of positive and negative type, and their connection with the theory of integral equations,” Philos. Trans. R. Soc. A Math. Phys. Eng. Sci., vol. 209, no. 441–458, pp. 415–446, Jan. 1909.
[18] R. Fletcher, Practical Methods of Optimization, 2nd ed. Wiley-Interscience, 1987.
[19] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Min. Knowl. Discov., vol. 2, no. 2, pp. 121–167, Jun. 1998.
[20] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, Aug. 2004.
[21] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” J. Acoust. Soc. Am., vol. 50, no. 2B, pp. 637–655, 1971.
[22] 李俊毅, “語音評分 [Speech scoring],” National Tsing Hua University, 2002 (in Chinese).
[23] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Trans. Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
[24] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. Fifth Annual Workshop on Computational Learning Theory, pp. 144–152, 1992.
[25] 何育澤, “基於支持向量機之混合聲響辨認 [Mixed-sound recognition based on support vector machines],” National Tsing Hua University, 2014 (in Chinese).
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[27] A. M. Kondoz and B. G. Evans, “A high quality voice coder with integrated echo canceller and voice activity detector for VSAT systems,” in Proc. 3rd Eur. Conf. Satellite Commun., pp. 196–200, 1993.
[28] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the endpoints of isolated utterances,” Bell Syst. Tech. J., vol. 54, no. 2, pp. 297–315, 1975.
[29] R. Bachu, S. Kopparthi, B. Adapa, and B. Barkana, “Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal,” in Proc. Am. Soc. for Eng. Education Zone Conf., pp. 1–7, 2008.
[30] A. Bala, “Voice command recognition system based on MFCC and DTW,” Int. J. Engineering Science and Technology, vol. 2, no. 12, pp. 7335–7342, 2010.
[31] M. Diogo, M. Eskenazi, J. Magalhaes, and S. Cavaco, “Robust scoring of voice exercises in computer-based speech therapy systems,” in Proc. Eur. Signal Processing Conf., pp. 393–397, Aug. 2016.
[32] O. Mich, A. Neri, and D. Giuliani, “The effectiveness of a computer assisted pronunciation training system for young foreign language learners,” in Proc. CALL Conf., Taylor & Francis, 2006.