Author (Chinese): 楊馥榕
Author (English): Yang, Fu-Rong
Thesis Title (Chinese): 基於音韻學的時長模型之中文歌聲合成
Thesis Title (English): Mandarin Singing Voice Synthesis with a Phonology-based Duration Model
Advisor (Chinese): 劉奕汶
Advisor (English): Liu, Yi-Wen
Committee Members (Chinese): 吳誠文、吳尚鴻、楊奕軒
Committee Members (English): Wu, Cheng-Wen; Wu, Shan-Hung; Yang, Yi-Hsuan
Degree: Master's
University: National Tsing Hua University (國立清華大學)
Department: Department of Electrical Engineering (電機工程學系)
Student ID: 108061583
Year of Publication (ROC): 110 (2021)
Graduating Academic Year: 109
Language: English
Number of Pages: 52
Keywords (Chinese): 歌聲合成、時長模型、中文音韻學
Keywords (English): singing voice synthesis, duration model, Mandarin phonology
Abstract (Chinese): Singing voice synthesis (SVS) systems aim to synthesize natural, human-like singing from lyrics and the corresponding musical scores. In recent years, SVS has become an indispensable ingredient of popular human-computer interaction applications such as virtual singers, composition assistants, and other smart electronic devices. The current mainstream approach consists of two stages: given the musical score, a neural network first predicts acoustic features, and a vocoder then converts these frequency-domain features into a time-domain audio signal that the auditory system can perceive. In this way, a computer can efficiently generate singing with a specific timbre for given lyrics, pitches, and note lengths, even if the singer with that timbre has never performed the song. In addition, an SVS system is usually paired with a “duration model”: the predicted phoneme durations are used to pre-expand the input sequence so that its length roughly aligns with that of the output sequence. In most current SVS systems, the duration model uses a neural network to predict phoneme durations, which are then combined with the note lengths given by the score to compute the vowel durations. In this thesis, instead of a neural-network-based approach, we propose a rule-based phoneme duration prediction algorithm grounded in Mandarin phonology. Specifically, following rules formulated from an analysis of Mandarin phonology, we search the existing training set for all items that share the same consonant as the target lyric and have a similar note length, and use them to infer the duration of the target consonant. Furthermore, both predicting acoustic features from lyrics and recovering the time-domain signal are difficult mappings; improving synthesis quality often requires more complex neural networks, which may overfit small datasets and thus generalize poorly. To address this, using only three hours of recordings of a specific timbre from the MPop600 dataset previously released by our laboratory, we adopt the combination of Tacotron2 and Parallel WaveGAN as the backbone of our SVS system. We find that they are data-efficient on small datasets: in addition to synthesizing singing of good quality, the models also generalize well. Finally, the experimental results confirm that the singing synthesized with the proposed rule-based duration model outperforms that of the neural-network-based model in overall performance. Moreover, since Mandarin is a tonal language, taking tone into account in the proposed rule-based model further improves the naturalness of the synthesized singing.
Abstract (English): Singing voice synthesis (SVS) systems are built to generate human-like voice signals from lyrics and the corresponding musical scores. The mainstream synthesis techniques involve two stages: acoustic feature modeling and audio synthesis. In most SVS systems, an auxiliary neural-network-based duration model is employed to predict phoneme durations. According to these durations, the input sequence is pre-expanded so that its length roughly aligns with that of the output sequence and matches the rhythm of the singing. In this thesis, a rule-based algorithm inspired by Mandarin phonology is proposed for duration modeling in Mandarin SVS. Specifically, the algorithm infers the duration of an “initial” consonant by looking up syllables in the existing training set that begin with the same consonant and have similar note lengths, and then computing the average consonant duration. Around this duration model, we employ a combination of Tacotron2 and Parallel WaveGAN, trained on the 3-hour female singing voice subset of the MPop600 dataset, as the backbone of our SVS system for their robustness and favorable data efficiency on small datasets. Experimental results show that the singing voice synthesized with the proposed duration model is more expressive than that of a learning-based model. Moreover, since Mandarin is a tonal language, taking tonality into consideration further enhances the naturalness of the generated voices.
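To make the prose above concrete, the following Python sketch illustrates the two mechanisms the abstracts describe: averaging the initial-consonant durations of training syllables that share the same initial and have a similar note length, and then length-regulating (pre-expanding) the phoneme sequence with the resulting frame counts. This is only a minimal sketch under assumed conventions; the data format, the 0.1-second note-length tolerance, and all function names here are hypothetical and are not taken from the thesis, whose actual rules (Rules 1-3 in Chapter 3, including the tone-aware refinement) are more detailed.

from statistics import mean
import numpy as np

# Hypothetical record format: (initial, note_length_sec, initial_duration_sec),
# e.g. collected from force-aligned syllables in the training songs.
TrainingItem = tuple[str, float, float]

def estimate_initial_duration(initial: str,
                              note_length: float,
                              training_items: list[TrainingItem],
                              tolerance: float = 0.1) -> float | None:
    """Average the initial-consonant durations of training syllables that
    share the same initial and have a similar note length."""
    matches = [dur for ini, length, dur in training_items
               if ini == initial and abs(length - note_length) <= tolerance]
    return mean(matches) if matches else None

def length_regulate(phoneme_embeddings: np.ndarray,
                    frames_per_phoneme: list[int]) -> np.ndarray:
    """Repeat each phoneme embedding for its predicted number of spectrogram
    frames so the input sequence roughly aligns with the output length."""
    return np.repeat(phoneme_embeddings, frames_per_phoneme, axis=0)

# Toy usage: estimate how long the initial "zh" lasts under a 0.50 s note,
# then expand a 3-phoneme embedding sequence to 2 + 5 + 3 = 10 frames.
items = [("zh", 0.48, 0.07), ("zh", 0.55, 0.09), ("sh", 0.50, 0.11)]
print(estimate_initial_duration("zh", 0.50, items))      # ~0.08 s
frames = length_regulate(np.random.randn(3, 8), [2, 5, 3])
print(frames.shape)                                      # (10, 8)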
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Phoneme Duration Estimation Algorithm . . . . . . . . . . . . 2
1.2.2 Robust and Data Efficient Architecture of Neural SVS . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 4
2.1 Overview of the General SVS System . . . . . . . . . . . . . . 4
2.2 Previous Works for SVS . . . . . . . . . . . . . . . . . . . 6
2.3 Duration Modeling . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Mandarin Phonology in a Nutshell . . . . . . . . . . . . . . . 7
3 Duration Analysis 9
3.1 Phoneme Stretching Analysis . . . . . . . . . . . . . . . . . . 9
3.1.1 Distribution and Skewness . . . . . . . . . . . . . . . . . 10
3.1.2 Stretching of Phonemes and Syllables . . . . . . . . . . . . 11
3.1.3 Initial Ratio across Different Singers . . . . . . . . . . 12
3.2 The Rule-based Algorithm . . . . . . . . . . . . . . . . . . . 13
3.2.1 Rule 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 Rule 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Rule 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 The Proposed SVS system 17
4.1 Input Sequence Representation . . . . . . . . . . . . . . . . 17
4.1.1 Linguistic and Musical Information Extraction . . . . . . . 18
4.1.2 Length Regulator . . . . . . . . . . . . . . . . . . . . . . 19
4.1.3 Embedding and Final Input Sequence . . . . . . . . . . . . . 20
4.2 Mel-scale Spectrogram Computation . . . . . . . . . . . . . . 20
4.3 Acoustic Model . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Location-sensitive Attention . . . . . . . . . . . . . . . . 22
4.3.3 The Decoder . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 The Audio Synthesizer . . . . . . . . . . . . . . . . . . . . 26
4.4.1 Discriminator . . . . . . . . . . . . . . . . . . . . . . . 26
4.4.2 Generator . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Experiments 29
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.1 Content of the Selected Subset . . . . . . . . . . . . . . . 29
5.1.2 Phoneme Segmentation . . . . . . . . . . . . . . . . . . . . 30
5.2 Experimental Conditions . . . . . . . . . . . . . . . . . . . 32
5.3 Training Strategy . . . . . . . . . . . . . . . . . . . . . . 33
5.3.1 Teacher-forcing Mode . . . . . . . . . . . . . . . . . . . . 33
5.3.2 Inference Mode . . . . . . . . . . . . . . . . . . . . . . . 33
5.4 Hyperparameter and Training Setting . . . . . . . . . . . . . 34
5.4.1 Mel-feature Prediction Network . . . . . . . . . . . . . . . 34
5.4.2 Neural Vocoder . . . . . . . . . . . . . . . . . . . . . . . 34
6 Results and Evaluations 36
6.1 Objective Evaluation . . . . . . . . . . . . . . . . . . . 36
6.1.1 Root-mean-square Error . . . . . . . . . . . . . . . . . . . 36
6.1.2 Pearson Correlation Coefficient . . . . . . . . . . . . . . 37
6.1.3 Mel Cepstral Distortion . . . . . . . . . . . . . . . . . . 37
6.2 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . 38
6.3 Discussions on the Overall Performance . . . . . . . . . . . . 39
6.4 Evaluation on Duration Model by Preference Test . . . . . . . 42
7 Conclusions 43
8 Future Works 44
8.1 Multi-note Conditioned on One Character . . . . . . . . . . . 44
8.2 Multi-voice Singing Synthesizer . . . . . . . . . . . . . . . 44
8.3 Singing Style Transfer . . . . . . . . . . . . . . . . . . . . 45
References 46
Appendix 49
A.1 Phonetic Components of Mandarin Pinyin . . . . . . . . . . . . 49
A.2 Hyperparameter of Tacotron2 . . . . . . . . . . . . . . . . . 50
A.3 Hyperparameter of Parallel WaveGAN . . . . . . . . . . . . . 51
A.4 Suggestions From the Oral Defense Committees . . . . . . . . . 52
A.4.1 Prof. Wu, Cheng-Wen (吳誠文) . . . . . . . . . . . . . . . . 52
A.4.2 Prof. Wu, Shan-Hung (吳尚鴻) . . . . . . . . . . . . . . . . 52
A.4.3 Prof. Yang, Yi-Hsuan (楊奕軒) . . . . . . . . . . . . . . . . 52
A.4.4 Prof. Liu, Yi-Wen (劉奕汶) . . . . . . . . . . . . . . . . . 52
[1] P. Lu, J. Wu, J. Luan, X. Tan, and L. Zhou, “XiaoiceSing: A high-quality and integrated singing voice synthesis system,” arXiv preprint arXiv:2006.06261, 2020.
[2] J. Chen, X. Tan, J. Luan, T. Qin, and T.-Y. Liu, “HiFiSinger: Towards high-fidelity neural singing voice synthesis,” arXiv preprint arXiv:2009.01776, 2020.
[3] Y. Gu, X. Yin, Y. Rao, Y. Wan, B. Tang, Y. Zhang, J. Chen, Y. Wang, and Z. Ma, “ByteSing: A Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and WaveRNN vocoders,” in International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5, 2021.
[4] C.-C. Chu, F.-R. Yang, Y.-J. Lee, Y.-W. Liu, and S.-H. Wu, “MPop600: A Mandarin popular song database with aligned audio, lyrics, and musical scores for singing voice synthesis,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1647–1652, 2020.
[5] J. Kim, H. Choi, J. Park, M. Hahn, S. Kim, and J.-J. Kim, “Korean singing voice synthesis system based on an LSTM recurrent neural network,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1551–1555, 2018.
[6] Y. Wu, S. Li, C. Yu, H. Lu, C. Weng, L. Zhang, and D. Yu, “Peking opera synthesis via duration informed attention network,” arXiv preprint arXiv:2008.03029, 2020.
[7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, 2018.
[8] X. Ying, “An overview of overfitting and its solutions,” in Journal of Physics: Conference Series, IOP Publishing, 2019.
[9] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP, pp. 6199–6203, 2020.
[10] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[12] H. Kenmochi and H. Ohshita, “Vocaloid-commercial singing synthesizer based on sample concatenation,” in INTERSPEECH, 2007.
[13] K. Saino, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda, “An HMM-based singing voice synthesis system,” in International Conference on Spoken Language Processing, 2006.
[14] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on deep neural networks,” in INTERSPEECH, pp. 2478–2482, 2016.
[15] K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on convolutional neural networks,” arXiv preprint arXiv:1904.06868, 2019.
[16] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on generative adversarial networks,” in ICASSP, pp. 6955–6959, 2019.
[17] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[18] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
[19] J. Lee, H.-S. Choi, C.-B. Jeon, J. Koo, and K. Lee, “Adversarially trained end-to-end Korean singing voice synthesis system,” arXiv preprint arXiv:1908.01919, 2019.
[20] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning (ICML), pp. 2410–2419, 2018.
[21] J. Liu, C. Li, Y. Ren, F. Chen, P. Liu, and Z. Zhao, “DiffSinger: Diffusion acoustic model for singing voice synthesis,” arXiv preprint arXiv:2105.02446, 2021.
[22] H. Kawahara, “STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds,” Acoustical science and technology, vol. 27, no. 6, pp. 349–353, 2006.
[23] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, robust and controllable text to speech,” arXiv preprint arXiv:1905.09263, 2019.
[24] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei, et al., “DurIAN: Duration informed attention network for multimodal synthesis,” arXiv preprint arXiv:1909.01700, 2019.
[25] M. Blaauw and J. Bonada, “Sequence-to-sequence singing synthesis using the feed-forward transformer,” in ICASSP, pp. 7229–7233, 2020.
[26] C. Lü, Chinese literacy learning in an immersion program. Springer, 2019.
[27] Y.-J. Lee, T.-C. Liao, and Y.-W. Liu, “A simple strategy for natural Mandarin spoken word stretching via the vocoder,” in International Congress on Acoustics (ICA), 2019.
[28] E. Moulines and J. Laroche, “Non-parametric techniques for pitch-scale and time-scale modification of speech,” Speech Communication, vol. 16, no. 2, pp. 175–205, 1995.
[29] J. Driedger, “Time-scale modification algorithms for music audio signals,” Master’s thesis, Saarland University, 2011.
[30] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, “The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech,” in APSIPA ASC, pp. 1–9, 2013.
[31] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015.
[32] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[35] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
[36] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A convolutional representation for pitch estimation,” in ICASSP, pp. 161–165, 2018.
[37] R. Valle, J. Li, R. Prenger, and B. Catanzaro, “Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens,” in ICASSP, pp. 6189–6193, 2020.
[38] P. Verma and J. O. Smith, “Neural style transfer for audio spectograms,” arXiv preprint arXiv:1801.01589, 2018.
[39] C.-W. Wu, J.-Y. Liu, Y.-H. Yang, and J.-S. R. Jang, “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1807.02254, 2018.
 
 
 
 