Author: Yang, Fu-Rong
Thesis title: Mandarin Singing Voice Synthesis with a Phonology-based Duration Model
Advisor: Liu, Yi-Wen
Oral defense committee: Wu, Cheng-Wen
Wu, Shan-Hung
Yang, Yi-Hsuan
Keywords: singing voice synthesis; duration model; Mandarin phonology
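Singing voice synthesis (SVS) systems aim to synthesize natural, human-like singing from lyrics and the corresponding musical score. In recent years, SVS has become an indispensable component of popular human-computer interaction applications such as virtual singers, composition assistants, and other smart electronic devices. The current mainstream approach has two stages: given a musical score, a neural network first predicts acoustic features, and a vocoder then converts these frequency-domain features into a time-domain audio signal that the auditory system can perceive. In this way, a computer can effectively generate singing with a specified timbre, lyrics, pitch, and note lengths, even if the singer with that timbre has never performed the song. In addition, SVS systems are usually paired with a “duration model”: the predicted phoneme durations are used to pre-expand the input sequence so that its length roughly aligns with that of the output sequence. In most current SVS systems, the duration model uses a neural network to predict phoneme durations, which are then combined with the note lengths given by the score to compute vowel durations. In this thesis, instead of a neural network-based approach, we propose a rule-based phoneme duration prediction algorithm grounded in Mandarin phonology. Specifically, using rules derived from an analysis of Mandarin phonology, we search the existing training set for all entries that share the “same consonant” as the target lyric and have a “similar note length”, and use them to infer the duration of the target consonant. Moreover, both predicting acoustic features from lyrics and reconstructing the time-domain signal are difficult mappings; improving synthesis quality often requires more complex neural networks, which may overfit on small datasets and generalize poorly. To address this, we use the MPop600 dataset previously proposed by our laboratory and, for a specific voice with only three hours of data, adopt a combination of Tacotron2 and Parallel WaveGAN as the backbone of the SVS system. We find that they are data-efficient on small datasets: they synthesize singing of good quality and also generalize well. Finally, experimental results confirm that singing synthesized with the proposed rule-based duration model outperforms that of the neural network-based model in overall performance. Furthermore, because Mandarin is a tonal language, taking tone into account in the proposed rule-based model further improves the naturalness of the synthesized singing.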
Singing voice synthesis (SVS) systems are built to generate human-like voice signals from lyrics and the corresponding musical scores. Mainstream voice synthesis techniques involve two stages: acoustic feature modeling and audio synthesis. In most SVS systems, an auxiliary neural network-based duration model is employed to predict the duration of phonemes. Using these phoneme durations, the input sequence is pre-expanded to roughly align with the length of the output sequence and thus match the rhythm of the singing. In this thesis, a rule-based algorithm inspired by Mandarin phonology is proposed for duration modeling in Mandarin SVS. Specifically, the algorithm infers the duration of an “initial” consonant by looking up syllables in an existing training set that begin with the same consonant and have similar note lengths, and then computing the average consonant duration. In addition, using the 3-hour female singing voice subset of the MPop600 dataset, we employ a combination of Tacotron2 and Parallel WaveGAN as the backbone of our SVS system for their robustness and favorable data efficiency on small datasets. Experimental results show that the singing voice synthesized with the proposed duration model is more expressive than that of a learning-based model. Moreover, since Mandarin is a tonal language, taking tone into account further enhances the naturalness of the generated voices.
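The lookup-and-average idea described in the abstract can be sketched as follows. This is only a minimal illustration and not the thesis's actual Rules 1 to 3, whose details are not given here. The record layout, the function names `estimate_initial_duration` and `split_syllable`, the sample values, and the relative `tolerance` that defines a "similar" note length are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical training records: (initial consonant, note length in s,
# measured initial duration in s). The values are made up for illustration.
TRAINING_SET = [
    ("sh", 0.50, 0.10),
    ("sh", 0.52, 0.12),
    ("sh", 1.00, 0.20),
    ("b", 0.50, 0.03),
]


def estimate_initial_duration(initial, note_length, training_set, tolerance=0.1):
    """Average the initial durations of training syllables that start with
    the same consonant and whose note length is within a relative
    `tolerance` of the target note length (an assumed similarity criterion).
    Returns None when no training syllable matches."""
    matches = [
        duration
        for (ini, length, duration) in training_set
        if ini == initial and abs(length - note_length) <= tolerance * note_length
    ]
    return mean(matches) if matches else None


def split_syllable(initial, note_length, training_set, tolerance=0.1):
    """Split a note into (initial duration, remaining duration for the final).
    The remainder rule, note length minus the consonant duration, follows the
    abstract's description of how vowel durations are obtained."""
    initial_duration = estimate_initial_duration(
        initial, note_length, training_set, tolerance
    )
    if initial_duration is None:
        return None
    # Guard: the consonant cannot last longer than the note itself.
    initial_duration = min(initial_duration, note_length)
    return initial_duration, note_length - initial_duration


if __name__ == "__main__":
    print(split_syllable("sh", 0.5, TRAINING_SET))
```

A practical system would also need a fallback for consonants with no sufficiently similar entry, for which this sketch simply returns `None`.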
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.2.1 Phoneme Duration Estimation Algorithm
1.2.2 Robust and Data Efficient Architecture of Neural SVS
1.3 Contributions
1.4 Thesis Organization
2 Related Work
2.1 Overview of the General SVS System
2.2 Previous Works for SVS
2.3 Duration Modeling
2.4 Mandarin Phonology in a Nutshell
3 Duration Analysis
3.1 Phoneme Stretching Analysis
3.1.1 Distribution and Skewness
3.1.2 Stretching of Phonemes and Syllables
3.1.3 Initial Ratio across Different Singers
3.2 The Rule-based Algorithm
3.2.1 Rule 1
3.2.2 Rule 2
3.2.3 Rule 3
4 The Proposed SVS System
4.1 Input Sequence Representation
4.1.1 Linguistic and Musical Information Extraction
4.1.2 Length Regulator
4.1.3 Embedding and Final Input Sequence
4.2 Mel-scale Spectrogram Computation
4.3 Acoustic Model
4.3.1 Encoder
4.3.2 Location-sensitive Attention
4.3.3 The Decoder
4.4 The Audio Synthesizer
4.4.1 Discriminator
4.4.2 Generator
5 Experiments
5.1 Dataset
5.1.1 Content of the Selected Subset
5.1.2 Phoneme Segmentation
5.2 Experimental Conditions
5.3 Training Strategy
5.3.1 Teacher-forcing Mode
5.3.2 Inference Mode
5.4 Hyperparameter and Training Setting
5.4.1 Mel-feature Prediction Network
5.4.2 Neural Vocoder
6 Results and Evaluations
6.1 Objective Evaluation
6.1.1 Root-mean-square Error
6.1.2 Pearson Correlation Coefficient
6.1.3 Mel Cepstral Distortion
6.2 Subjective Evaluation
6.3 Discussions on the Overall Performance
6.4 Evaluation on Duration Model by Preference Test
7 Conclusions
8 Future Works
8.1 Multi-note Conditioned on One Character
8.2 Multi-voice Singing Synthesizer
8.3 Singing Style Transfer
References
Appendix
A.1 Phonetic Components of Mandarin Pinyin
A.2 Hyperparameter of Tacotron2
A.3 Hyperparameter of Parallel WaveGAN
A.4 Suggestions From the Oral Defense Committees
A.4.1 Prof. Wu, Cheng-Wen (吳誠文)
A.4.2 Prof. Wu, Shan-Hung (吳尚鴻)
A.4.3 Prof. Yang, Yi-Hsuan (楊奕軒)
A.4.4 Prof. Liu, Yi-Wen (劉奕汶)
