
Detailed Record

Author (Chinese): 吳啟聖
Author (English): Wu, Chi-Sheng
Title (Chinese): 可變形基頻之波網聲碼器
Title (English): Morphable Fundamental Frequency Wavenet Vocoder
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee Members (Chinese): 邱瀞德、黃國源
Committee Members (English): Chiu, Ching-Te; Huang, Kou-Yuan
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 106065703
Publication Year (ROC): 109 (2020)
Graduation Academic Year: 108
Language: English
Number of Pages: 28
Keywords (Chinese): 波網、捲積神經網路、聲碼器、可變形
Keywords (English): Wavenet, Convolutional Neural Network, vocoder
Statistics:
  • Recommendations: 0
  • Views: 628
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese, translated): Natural voice synthesis is a challenging research direction that has long attracted researchers to develop related techniques. In 2016, Google's research team introduced the Wavenet synthesis technique, bringing deep learning into the field and surpassing traditional parametric synthesis models. However, compared with traditional synthesis algorithms, Wavenet behaves like a black box: users cannot adjust parameters to further shape the intonation and emotion of the generated voice, which makes the architecture hard to carry into application markets. To address this problem, we developed a new Wavenet architecture that improves the original model with a pulse-resonant structure, so that the fundamental frequency of the voice can be adjusted independently.
Abstract (English): Natural voice synthesis is a challenging research topic that has long attracted researchers. In 2016, Google proposed Wavenet, a deep learning-based synthesis model that outperformed traditional parametric synthesis models. However, Wavenet acts as a black box that users can hardly customize, which limits its applications, especially tone and emotion modification. We therefore propose a pulse-resonant Wavenet, built on Wavenet, that lets users customize the synthesized voice by morphing its fundamental frequency.
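The idea the abstract describes, decoupling the excitation's fundamental frequency from a Wavenet-style stack of dilated causal convolutions, can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the thesis implementation; every function name and parameter below is hypothetical, and the learned network is stood in for by a fixed convolution kernel.

```python
def pulse_train(f0_contour, sample_rate=16000):
    """Impulse-train excitation whose local period follows the given
    fundamental-frequency contour (Hz). Morphing f0 is then just
    rescaling this contour before synthesis."""
    phase, out = 0.0, []
    for f0 in f0_contour:
        phase += f0 / sample_rate      # advance normalized phase
        if phase >= 1.0:               # one pulse per completed period
            phase -= 1.0
            out.append(1.0)
        else:
            out.append(0.0)
    return out

def dilated_causal_conv(x, weights, dilation):
    """One dilated causal convolution, the basic Wavenet building block:
    the output at time t sees only samples at t, t-d, t-2d, ..."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y

# 0.1 s of a flat 200 Hz contour at 16 kHz, then the same contour shifted
# up an octave: the excitation doubles its pulse rate while the
# convolutional ("resonant") part is left untouched.
f0 = [200.0] * 1600
excitation = pulse_train(f0)
octave_up = pulse_train([2.0 * v for v in f0])
filtered = dilated_causal_conv(excitation, [0.5, 0.25], dilation=2)
```

In the actual pulse-resonant model the filtering stage is a learned network rather than a fixed kernel; the point of the sketch is only that the f0 contour enters as an independent, user-adjustable input.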
Abstract (Chinese) i
Abstract ii
Acknowledgement iii
List of Tables vi
List of Figures vii
1 Introduction 1
2 Related Work 6
3 Methodology 10
3.1 Mathematical Symbol Conventions . . . . . . . . . . . . . . . . . . . . . . 10
3.2 The limitation of Wavenet . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Pulse-resonant convolutional neural layer . . . . . . . . . . . . . . . . . . 15
3.4 Model structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Experiments and Results 20
4.1 The Objective Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Effective Validation for the f0 input of Baseline Model . . . . . . . . . . . 21
4.4 Validation of the f0 input in the proposed model . . . . . . . . . . . . . . . 23
4.5 Independence of new model’s variables with f0r . . . . . . . . . . . . . . . 24
5 Conclusion and Future Work 25
References 26
(Full text not released for public access)
 
 
 
 