
Detailed Record

Author (Chinese): 朱贊全
Author (English): Chu, Chan-Chuan
Thesis Title (Chinese): 非同步學習「過渡」及「延音」以改善中文歌聲合成之基頻軌跡
Thesis Title (English): Improving the Fundamental Frequency (F0) Trajectories in Mandarin Singing Voice Synthesis by Asynchronously Learning the Transition and Sustain
Advisor (Chinese): 劉奕汶
Advisor (English): Liu, Yi-Wen
Committee Members (Chinese): 宋孔彬
冀泰石
吳尚鴻
Committee Members (English): Sung, Kung-Bin
Chi, Tai-Shih
Wu, Shan-Hung
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 107061593
Year of Publication (ROC calendar): 109 (2020)
Academic Year of Graduation: 109
Language: English
Number of Pages: 56
Keywords (Chinese): 歌聲合成、神經網路、深度學習
Keywords (English): singing voice synthesis, neural network, deep learning
Usage statistics:
  • Recommendations: 0
  • Views: 820
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese):
In the field of singing voice synthesis, the linguistic features that need to be considered differ from language to language. Mandarin, for example, is a tonal language, so compared with Japanese, Korean, or English, the effects of lexical tone on word meaning and on the fundamental frequency must additionally be taken into account. Our lab therefore previously redefined the composition of the context-dependent features, collected its own Mandarin pop song database, and built a Mandarin singing voice synthesis system based on a bidirectional recurrent neural network: given only a musical score, the system synthesizes a singing voice, and the singer's articulation and timbre can both be learned by the model.

However, the synthesized singing voice has several noticeable problems: transitions between different pitches are not smooth enough and can even sound distinctly mechanical, and over-smoothing of the synthesized fundamental frequency (F0) wipes out personal singing techniques such as vibrato. Because these problems relate directly to pitch accuracy, we believe they strongly affect the listener's impression. In this thesis we focus on improving F0 synthesis. We first use a residual connection to keep the pitch from deviating too far from the score, and we develop transition and sustain algorithms to address the problems above: a weighted computation makes the transition regions between pitches more natural while preserving the vibrato within individual notes. In addition, the extracted F0 contains many spurious points that jump too high or too low, and these outliers degrade the model's learning, so we borrow the idea of a dynamic-range compressor to reduce their influence. Objective evaluation shows that, with the proposed algorithms and the other pre-processing steps, synthesis quality improves and the root mean square error drops by more than 15 Hz; subjective evaluation likewise shows that the audio synthesized by the new method receives higher scores.
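As a rough illustration of the compressor-style pre-processing mentioned above, the sketch below soft-limits extracted F0 frames whose deviation from the score note pitch exceeds a threshold. The function name, threshold, and ratio are assumptions made for illustration, not the settings actually used in the thesis.

```python
import numpy as np

def compress_f0_outliers(f0_hz, note_hz, threshold_cents=700.0, ratio=4.0):
    """Soft-limit F0 frames that stray too far from the score note pitch.

    Frames whose deviation from the note (in cents) exceeds `threshold_cents`
    have the excess deviation divided by `ratio`, mimicking a dynamic-range
    compressor; unvoiced frames (F0 == 0) are left untouched.
    The threshold and ratio are illustrative guesses, not the thesis's values.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    note_hz = np.asarray(note_hz, dtype=float)
    out = f0_hz.copy()

    voiced = (f0_hz > 0) & (note_hz > 0)
    # Deviation from the score pitch, measured on a log scale (cents).
    cents = 1200.0 * np.log2(f0_hz[voiced] / note_hz[voiced])
    excess = np.abs(cents) - threshold_cents
    # Compress only the part of the deviation that exceeds the threshold.
    limited = np.where(excess > 0, threshold_cents + excess / ratio, np.abs(cents))
    out[voiced] = note_hz[voiced] * 2.0 ** (np.sign(cents) * limited / 1200.0)
    return out
```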
Abstract (English):
In the field of singing voice synthesis (SVS), the features that have to be taken into account differ from language to language. For example, unlike Japanese, Korean, or English, Mandarin is a tonal language, which means that the tone can change the meaning of a word. For SVS purposes, our team previously defined contextual factors that take tonality into consideration and created a Mandarin pop song database to train a bi-directional long short-term memory (BiLSTM) network. Given a musical score as input, the system models the singer's pronunciation and timbre and synthesizes the singing voice.
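For readers unfamiliar with this kind of architecture, here is a minimal sketch of a frame-level BiLSTM regressor that maps score-derived contextual features to acoustic parameters. The layer sizes and feature dimensions are placeholders, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    """Frame-level BiLSTM regressor: contextual score features in,
    acoustic features (e.g., F0 and spectral parameters) out.
    All dimensions below are placeholders."""

    def __init__(self, in_dim=128, hidden=256, out_dim=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):           # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)         # h: (batch, frames, 2 * hidden)
        return self.proj(h)         # (batch, frames, out_dim)
```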

However, a few problems persist regarding F0 synthesis (F0 stands for fundamental frequency). First, the F0 trajectory at note boundaries is not smooth enough, which causes the synthesized voice to sound unnatural. Second, over-smoothing of the F0 trajectory eliminates certain singing expressions, such as vibrato. In this research, we focus on improving the synthesis of F0. First, a residual connection is adopted to ensure the accuracy of the pitch produced by the neural network. Then, to alleviate the problems above, algorithms are developed to handle the transition and the sustain separately. Also, because the F0 trajectory extracted by the vocoder may not be perfect, extreme values often occur and reduce the effectiveness of learning; to mitigate this, a simple compressor is adopted. Objective evaluation shows that the root mean square error (RMSE) between the synthesized F0 trajectories and the ground truth decreases by more than 15 Hz after the proposed algorithms and pre-processing are applied. Subjective evaluation also shows that the audio synthesized by the new method receives higher scores.
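Since the objective evaluation above is reported as RMSE in Hz between synthesized and ground-truth F0 trajectories, a minimal sketch of such a metric follows. Restricting the comparison to frames voiced in both trajectories (F0 > 0) is a common convention, assumed here rather than taken from the thesis.

```python
import numpy as np

def f0_rmse_hz(f0_pred, f0_true):
    """RMSE (in Hz) between two frame-aligned F0 trajectories,
    computed over frames that are voiced (F0 > 0) in both."""
    f0_pred = np.asarray(f0_pred, dtype=float)
    f0_true = np.asarray(f0_true, dtype=float)
    voiced = (f0_pred > 0) & (f0_true > 0)
    if not np.any(voiced):
        raise ValueError("no commonly voiced frames to compare")
    diff = f0_pred[voiced] - f0_true[voiced]
    return float(np.sqrt(np.mean(diff ** 2)))
```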
Table of Contents:
1. Introduction p. 1
2. Related work p. 5
3. The proposed system p. 10
4. Experiments and results p. 25
5. Conclusions p. 39
6. Future works p. 40
References p. 42
Appendix p. 45
 
 
 
 