Author: Chen, Bo Min
Thesis title: Sound reconstruction based on features for sound recognition
Advisor: Liu, Yi Wen
Keywords: sound reconstruction; sound recognition; mel
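The sound features used in this thesis are the mel frequency cepstral coefficients (MFCC), which are widely used in sound recognition. Because MFCC represents only the spectral envelope of a sound and discards fine detail, and because pitch is part of that fine detail in speech, pitch is added as a feature to make the reconstructed sound more complete. The reconstruction model is based on the source-filter model: the frequency response recovered from the MFCC serves as the spectral envelope of the original sound, and the pitch determines the source signal. For voiced speech, which has a pitch, the source is reconstructed by setting the frequency ranges of the harmonics and the noise according to the human speech production mechanism; the spectral envelope and the source are then combined through a modified source-filter model to reconstruct the sound.
The sounds analyzed and reconstructed in this thesis include both speech and non-speech. The reconstruction process and its results are analyzed to identify factors that may affect the quality of the reconstructed sound, and the reconstructed sounds are graded by a subjective listening test with human listeners and by the objective Perceptual Evaluation of Audio Quality (PEAQ), on a scale from 1 (very bad) to 5 (very good). The listening test gives grades of about 3 to 4 for both non-speech and speech reconstruction, a level at which the sounds are clearly intelligible. PEAQ gives grades of about 2 to 3.5 for non-speech reconstruction, whereas speech reconstruction scores only slightly above 1.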
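The step of recovering a spectral envelope from MFCC can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function names, the 26-filter bank, the 13 coefficients, the 16 kHz sampling rate, the 512-point FFT and the pseudo-inverse used to map mel energies back to linear frequency are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2))
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rise = (freqs - lo) / (mid - lo)
        fall = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return fb

def dct_matrix(n_coeffs, n_filters):
    """First n_coeffs rows of the orthonormal DCT-II matrix."""
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    d = np.sqrt(2.0 / n_filters) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    d[0] /= np.sqrt(2.0)
    return d

def mfcc_from_frame(frame, fb, dct):
    """Windowed power spectrum -> mel energies -> log -> DCT."""
    n_fft = 2 * (fb.shape[1] - 1)
    power = np.abs(np.fft.rfft(frame * np.hanning(frame.size), n=n_fft)) ** 2
    return dct @ np.log(fb @ power + 1e-10)

def envelope_from_mfcc(mfcc, fb, dct):
    """Approximate magnitude envelope; the discarded coefficients are treated as zero."""
    log_mel = dct.T @ mfcc                      # truncated inverse DCT
    mel_energy = np.exp(log_mel)
    # Assumption: a pseudo-inverse maps mel energies back to linear bins.
    power = np.maximum(np.linalg.pinv(fb) @ mel_energy, 1e-10)
    return np.sqrt(power)

# Illustrative parameters (assumed, not taken from the thesis).
sr, n_fft = 16000, 512
fb = mel_filterbank(26, n_fft, sr)
dct = dct_matrix(13, 26)
frame = np.random.default_rng(0).standard_normal(400)
mfcc = mfcc_from_frame(frame, fb, dct)
envelope = envelope_from_mfcc(mfcc, fb, dct)
```

Because only the low-order coefficients are kept, the recovered envelope is a smoothed version of the original spectrum, which is why the harmonic detail has to come from a separate feature such as pitch.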
Sounds play an important role in our lives. By listening, we communicate with each other and learn what is happening around us. Extracting features from a sound preserves the specific information needed to recognize it. If sounds could be reconstructed from those transmitted features, the features could also serve for sound transmission. In this research, we attempt to reconstruct sounds using features that are typically transmitted for recognition purposes.
In this thesis, we take the mel frequency cepstral coefficients (MFCC), a set of features commonly used for sound recognition, as the basic features for reconstruction. Because MFCC represents only the spectral envelope and does not encode the fine detail of sounds, we add the pitch as a further feature to make the feature set more complete. The sound reconstruction is based on a source-filter model, which takes the frequency response reconstructed from the MFCC as the spectral envelope and determines the sound source from the pitch. The critical factors in reconstructing the sound source are the frequency ranges of the noise and the harmonics, which are determined according to the human speech production mechanism. We then combine the spectral envelope with the sound source to reconstruct sounds through a modified source-filter model.
We test our methods by analyzing and reconstructing speech and non-speech materials, and we attempt to identify the factors that may affect the quality of the reconstructed sounds. We also evaluate the reconstructed sounds with a subjective listening test and with the objective Perceptual Evaluation of Audio Quality (PEAQ). Grades range from 1 (very bad) to 5 (very good). The listening test gives grades of about 3 to 4 for both speech and non-speech reconstruction. PEAQ gives grades of about 2 to 3.5 for non-speech reconstruction and only slightly above 1 for speech reconstruction.
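The source-filter reconstruction described above can be sketched as follows. This is a hedged illustration rather than the modified model of the thesis: the function names, the 4 kHz boundary between harmonics and noise, the noise gain and the placeholder envelope are hypothetical choices made only so the example runs.

```python
import numpy as np

def make_excitation(f0, n_samples, sr, voicing_cutoff_hz=4000.0,
                    noise_gain=0.1, rng=None):
    """Source signal: harmonics of f0 below the cutoff, noise above it.

    f0 <= 0 marks an unvoiced frame, which is excited by noise only.
    The cutoff and gain are assumed values, not taken from the thesis.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(n_samples)
    if f0 <= 0:
        return noise
    t = np.arange(n_samples) / sr
    n_harm = int(min(voicing_cutoff_hz, sr / 2 - 1) // f0)
    harmonics = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harm + 1))
    # Keep only the part of the noise at or above the cutoff frequency.
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n_samples, 1.0 / sr)
    spec[freqs < voicing_cutoff_hz] = 0.0
    hf_noise = np.fft.irfft(spec, n=n_samples)
    return harmonics + noise_gain * hf_noise

def apply_envelope(excitation, envelope):
    """Filter step: shape the source spectrum with a magnitude envelope."""
    spec = np.fft.rfft(excitation)
    # Interpolate the envelope onto the FFT bins of this frame.
    env = np.interp(np.linspace(0.0, 1.0, spec.size),
                    np.linspace(0.0, 1.0, envelope.size), envelope)
    return np.fft.irfft(spec * env, n=excitation.size)

# Illustrative use with a placeholder envelope (not one recovered from MFCC).
sr = 16000
voiced = make_excitation(250.0, 1024, sr)
unvoiced = make_excitation(0.0, 1024, sr)
envelope = 1.0 / (1.0 + np.linspace(0.0, 8.0, 257))
frame = apply_envelope(voiced, envelope)
```

In a full system the frames would be synthesized with overlap and added together, and the envelope would come from the transmitted MFCC rather than from a fixed curve.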
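Abstract (Chinese)
Abstract
Acknowledgements
Chapter 1 Introduction
1.1 Motivation
1.2 Literature Review
1.3 Research Direction
1.4 Chapter Outline
Chapter 2 System Architecture and Methods
2.1 Sound Feature Extraction and Choice of Analysis Parameters
2.1.1 Mel Frequency Cepstral Coefficients [1]
2.1.2 Pitch
2.1.3 Analysis Parameters
2.2 Sound Reconstruction
2.2.1 Sound Reconstruction Model
2.2.2 Spectral Envelope Reconstruction
2.2.3 Sound Source Reconstruction
2.2.4 Implementation of Sound Reconstruction
Chapter 3 Results and Discussion
3.1 Non-speech Analysis and Reconstruction Results
3.1.1 Non-speech Sounds of Short Duration
3.1.2 Non-speech Sounds of Longer Duration
3.2 Speech Analysis and Reconstruction Results
3.2.1 Voiced Glottal Sounds
3.2.2 Unvoiced Aspirated Sounds and Voiced Glottal Sounds with Unvoiced Aspiration
3.2.3 Complete Speech
3.3 Listening Test Results
3.3.1 Subjective Grading
3.3.2 Objective Grading
Chapter 4 Conclusions and Future Work
4.1 Conclusions
4.2 Future Work
Appendix
References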
