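Sound Reconstruction Based on Features for Sound Recognition (基於聲音辨識特徵之聲音重建) - National Tsing Hua University Theses and Dissertations Full-Text System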

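Author (Chinese):	陳柏旻
Author (English):	Chen, Bo Min
Thesis title (Chinese):	基於聲音辨識特徵之聲音重建
Thesis title (English):	Sound reconstruction based on features for sound recognition
Advisor (Chinese):	劉奕汶
Advisor (English):	Liu, Yi Wen
Oral defense committee (Chinese):	冀泰石, 曹昱
Degree:	Master's
University:	National Tsing Hua University
Department:	Department of Electrical Engineering
Student ID:	101061590
Year of publication (ROC calendar):	104
Academic year of graduation:	104
Language:	Chinese
Number of pages:	51
Keywords (Chinese):	聲音重建 (sound reconstruction), 聲音辨認 (sound recognition), 梅爾 (mel)
Keywords (English):	sound reconstruction, sound recognition, mel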

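Abstract (translated from the Chinese)
In daily life, sound is the medium through which people communicate with one another and learn what events are taking place. Extracting features from a sound preserves its important characteristic information so that the sound can be recognized. If the transmitted features are also used to reconstruct the sound, the sound itself is effectively transmitted, which integrates sound recognition and sound transmission.
The sound features used in this thesis are the mel frequency cepstral coefficients (MFCC), which are widely used in sound recognition. Because MFCC represent only the spectral envelope of a sound and discard its fine detail, and because pitch is part of that fine detail in speech, pitch is added as a feature to make the reconstructed sound more complete. The reconstruction model is based on the source-filter model: the frequency response recovered from the MFCC serves as the spectral envelope of the original sound, and the pitch determines the source signal. For voiced speech, which has a pitch, the frequency ranges occupied by harmonics and by noise in the reconstructed source are chosen according to the human speech production mechanism. The spectral envelope and the source are then combined through a modified source-filter model to reconstruct the sound.
Both speech and non-speech sounds are analyzed and reconstructed in this thesis. The reconstruction process and its results are examined to identify factors that may affect the quality of the reconstructed sound, and the reconstructed sounds are graded by a subjective listening test with human listeners and by the objective Perceptual Evaluation of Audio Quality (PEAQ), on a scale from 1 (very bad) to 5 (very good). The listening test gives grades of roughly 3 to 4 for both non-speech and speech reconstruction, a level at which the sounds are clearly understandable. PEAQ gives grades of roughly 2 to 3.5 for non-speech reconstruction, whereas the grade for speech reconstruction is only slightly above 1.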

Abstract
Sounds play an important role in our lives. By listening to sounds, we communicate with each other and learn what is happening around us. Extracting features from a sound preserves the specific information needed to recognize it. If a sound can also be reconstructed from the transmitted features, the sound itself is effectively transmitted. In this research, we attempt to reconstruct sounds using features that are typically transmitted for recognition purposes.
In this thesis, we take the mel frequency cepstral coefficients (MFCC), a set of features commonly used for sound recognition, as the basic features for reconstruction. Because MFCC encode only the spectral envelope and not the fine detail of a sound, we use the pitch as additional information to make the feature set more complete. The reconstruction is based on a source-filter model that takes the frequency response recovered from the MFCC as the spectral envelope and determines the sound source from the pitch. The critical factors in the reconstructed source are the frequency ranges of the noise and of the harmonics, which can be determined from the human speech production mechanism. We then combine the spectral envelope with the sound source to reconstruct sounds through a modified source-filter model.
We test our methods by analyzing and reconstructing speech and non-speech materials, and we attempt to identify the factors that may affect the quality of the reconstructed sounds. We also evaluate the reconstructed sounds with a subjective listening test and with the objective Perceptual Evaluation of Audio Quality (PEAQ). Grades range from 1 (very bad) to 5 (very good). The listening test gives grades of about 3 to 4 for both speech and non-speech reconstruction. PEAQ gives grades of about 2 to 3.5 for non-speech reconstruction and a grade only slightly higher than 1 for speech reconstruction.
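
As a rough illustration of the MFCC-plus-pitch reconstruction outlined in the abstract, the following Python sketch inverts a cepstral vector to a smooth spectral envelope and uses that envelope to weight a harmonic source at a given pitch. It is not the thesis implementation: the function names, the 2595/700 mel formula, the unnormalized DCT, the band-center spacing, the linear interpolation, and the treatment of the recovered values as log amplitudes are all assumptions made for the example, and the noise component of the source is omitted.

```python
import math

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; the thesis may use another).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct_ii(x, n_coef):
    # Unnormalized DCT-II: log band values -> cepstral coefficients.
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n_coef)]

def idct_from_mfcc(c, n_bands):
    # Inverse of dct_ii; coefficients beyond len(c) are treated as zero,
    # which is what smooths the recovered envelope.
    return [c[0] / n_bands
            + (2.0 / n_bands) * sum(c[k] * math.cos(math.pi * k * (j + 0.5) / n_bands)
                                    for k in range(1, len(c)))
            for j in range(n_bands)]

def envelope_from_mfcc(c, n_bands, fs):
    # Returns env(f): amplitude envelope at frequency f in Hz.
    # Assumes n_bands >= 2 and band centers equally spaced on the mel axis.
    log_amp = idct_from_mfcc(c, n_bands)
    mel_max = hz_to_mel(fs / 2.0)
    centers = [mel_max * (j + 1) / (n_bands + 1) for j in range(n_bands)]
    step = centers[1] - centers[0]

    def env(f):
        m = hz_to_mel(f)
        if m <= centers[0]:
            return math.exp(log_amp[0])
        if m >= centers[-1]:
            return math.exp(log_amp[-1])
        pos = (m - centers[0]) / step
        i = min(int(pos), n_bands - 2)
        t = pos - i
        # Linear interpolation of the log amplitude between band centers.
        return math.exp((1.0 - t) * log_amp[i] + t * log_amp[i + 1])

    return env

def synthesize_voiced(env, f0, fs, n_samples):
    # Harmonic source at pitch f0, each harmonic weighted by the envelope.
    harmonics = []
    k = 1
    while k * f0 < fs / 2.0:
        harmonics.append((k * f0, env(k * f0)))
        k += 1
    return [sum(a * math.sin(2.0 * math.pi * f * n / fs) for f, a in harmonics)
            for n in range(n_samples)]

# Hypothetical frame: 8 log band values, 8 kHz sampling, 200 Hz pitch.
log_bands = [0.5, 1.0, 2.0, 1.5, 0.2, -0.3, -1.0, -2.0]
mfcc = dct_ii(log_bands, 8)
frame = synthesize_voiced(envelope_from_mfcc(mfcc, 8, 8000.0), 200.0, 8000.0, 160)
```

Keeping fewer coefficients than bands (for example `dct_ii(log_bands, 5)`) gives a smoother envelope, which mirrors the point in the abstract that MFCC retain the spectral envelope but not the fine detail, so the pitch has to be supplied separately.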

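Abstract (Chinese)
Abstract (English)
Acknowledgments
Chapter 1 Introduction
1.1 Motivation
1.2 Literature Review
1.3 Research Direction
1.4 Chapter Outline
Chapter 2 System Architecture and Methods
2.1 Sound Feature Extraction and Choice of Analysis Parameters
2.1.1 Mel Frequency Cepstral Coefficients [1]
2.1.2 Pitch
2.1.3 Analysis Parameters
2.2 Sound Reconstruction
2.2.1 Sound Reconstruction Model
2.2.2 Spectral Envelope Reconstruction
2.2.3 Source Reconstruction
2.2.4 Implementation of Sound Reconstruction
Chapter 3 Results and Discussion
3.1 Non-Speech Analysis and Reconstruction Results
3.1.1 Non-Speech Sounds of Short Duration
3.1.2 Non-Speech Sounds of Longer Duration
3.2 Speech Analysis and Reconstruction Results
3.2.1 Voiced Glottal Sounds
3.2.2 Unvoiced Aspirated Sounds and Voiced Glottal Sounds with Unvoiced Aspiration
3.2.3 Complete Speech
3.3 Listening Test Results
3.3.1 Subjective Grading
3.3.2 Objective Grading
Chapter 4 Conclusions and Future Work
4.1 Conclusions
4.2 Future Work
Appendix
References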

[1] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
[2] X. Huang, A. Acero, and H.-W. Hon, Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR, 2001.
[3] Z. Tychtl and J. Psutka, “Speech production based on the mel-frequency cepstral coefficients,” in EuroSpeech, 1999, vol. 99, pp. 2335–2338.
[4] B. P. Milner and X. Shao, “Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model,” in 7th International Conference on Spoken Language Processing (ICSLP-2002), 2002, pp. 2421–2424.
[5] D. Chazan, R. Hoory, G. Cohen, and M. Zibulski, “Speech reconstruction from mel frequency cepstral coefficients and pitch frequency,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 2000, vol. 3, pp. 1299–1302.
[6] X. Shao and B. Milner, “Clean speech reconstruction from noisy mel-frequency cepstral coefficients using a sinusoidal model,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., 2003, vol. 1, pp. I–704–I–707.
[7] B. Milner, “Pitch prediction from MFCC vectors for speech reconstruction,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 1, pp. I–97–100.
[8] X. Shao and B. Milner, “Predicting fundamental frequency from mel-frequency cepstral coefficients to enable speech reconstruction,” J. Acoust. Soc. Am., vol. 118, no. 2, pp. 1134–1143, 2005.
[9] B. Milner and X. Shao, “Prediction of fundamental frequency and voicing from mel-frequency cepstral coefficients for unconstrained speech reconstruction,” IEEE Trans. Audio, Speech Lang. Process., vol. 15, no. 1, pp. 24–33, Jan. 2007.
[10] J. O. Smith, Spectral Audio Signal Processing, 2011 edition. http://ccrma.stanford.edu/~jos/sasp/.
[11] E. Larson and R. Maddox, “Real-time time-domain pitch tracking using wavelets,” Proc. Univ. Illinois Urbana Champaign Res. Exp. Undergraduates Progr., 2005.
[12] C. T. Ferrand, “Speech science: An integrated approach to theory and clinical practice,” Ear Hear., vol. 22, no. 6, p. 549, 2001.
[13] D. P. W. Ellis, “PLP and RASTA (and MFCC, and inversion) in Matlab.” 2005.
[14] 王小川, 語音訊號處理 [Speech Signal Processing], revised 2nd ed. 全華圖書, 2008.
[15] S. N. Levine and J. O. Smith III, “A sines+ transients+ noise audio representation for data compression and time/pitch scale modifications,” in Audio Engineering Society Convention 105, 1998.
[16] R. J. McAulay and T. F. Quatieri, Sinusoidal coding. Defense Technical Information Center, 1995.
[17] A. V. Oppenheim, R. W. Schafer, J. R. Buck, and others, Discrete-time signal processing, vol. 2. Englewood Cliffs: Prentice-Hall, 1989.
[18] P. Kabal, “An examination and interpretation of ITU-R BS.1387: Perceptual evaluation of audio quality,” TSP Lab Tech. Report, Dept. Electr. Comput. Eng. McGill Univ., pp. 1–89, 2002.
[19] L. Besacier, S. Grassi, A. Dufaux, M. Ansorge, and F. Pellandini, “GSM speech coding and speaker recognition,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 2000, vol. 2, pp. II1085–II1088.
[20] ITU-T, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” ITU-T Recommendation P.862, 2001.
