
Detailed Record

Author (Chinese): 胡祐瑄
Author (English): Hu, Yo-Shang
Thesis Title (Chinese): 利用中插法訓練深度學習模型以產生流行歌曲的饒舌歌曲與歌詞
Thesis Title (English): Generating Rap Music and Lyrics for a Popular Song by Interpolation Training on Deep Learning Models
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee Members (Chinese): 丁川康、許永真
Committee Members (English): Ting, Chuan-Kang; Hsu, Yung-Jen
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 110065531
Year of Publication (ROC calendar): 112 (2023)
Graduation Academic Year: 112
Language: English
Number of Pages: 101
Keywords (Chinese): 音樂生成、機器學習、自然語言處理、歌詞生成、音樂資訊檢索
Keywords (English): Music generation; Machine Learning; Natural Language Processing; Lyric generation; Music Information Retrieval
Abstract (Chinese): Interpolation of lyrics and music has received some study, but for lack of rap music and rap lyrics datasets, the interpolation of rap lyrics and rap songs has not yet been explored. This thesis therefore addresses the problem with two architectures, one for generating interpolated Chinese rap lyrics and one for generating interpolated rap music.

For rap lyric generation, we use a Transformer architecture based on the self-attention mechanism. Through operations on the embedding layers, the model learns to rhyme and to control the number of characters per line, while the masking strategy from BERT gives it the ability to fill in blanks. We also propose an enhanced top-k decoding strategy to improve generation quality, and introduce an additional module that helps users fill in vocabulary related to the preceding and following lyric segments, further improving the results.

For rap music generation, we adopt the XLNet generation architecture proposed by Chang et al. [4] and refine it for the characteristics and requirements of rap music. We train on a self-constructed MIDI rap music dataset, and experiments show that the generated samples align more closely with the traits of rap music.

Finally, through subjective and objective experiments, we verify that our architectures outperform baseline models on the interpolation of rap lyrics and rap music, and in some respects approach the quality of rap lyrics and music created by humans.
Abstract (English): Interpolation of lyrics and music has been studied, yet the interpolation of rap lyrics and rap music remains unexplored owing to the lack of rap music and rap lyrics datasets. This thesis therefore proposes two architectures, one tailored to generating interpolated Chinese rap lyrics and the other to generating interpolated rap music.

For rap lyric generation, we employ a Transformer architecture based on self-attention. Through embedding-layer operations, the model achieves rhyming and word-count control, and we adopt the masking strategy from BERT to give the model the ability to fill in blanks. Furthermore, an enhanced top-k decoding strategy is proposed to improve generation quality. Finally, we introduce additional modules that assist users in filling in vocabulary relevant to the preceding and subsequent segments, further improving the generated results.
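The mechanics behind these controls are specified in Chapter 3 of the thesis; as a rough illustration only, the sketch below shows (a) one plausible way rhyme and word-count signals could enter a Transformer, by summing extra embeddings with the token embeddings, and (b) plain top-k sampling, the baseline that the enhanced decoding strategy builds on. The names ControlledEmbedding and top_k_sample and the model interface are hypothetical, not the thesis's code.

import torch
import torch.nn.functional as F

class ControlledEmbedding(torch.nn.Module):
    """Token embeddings summed with rhyme-class and remaining-length
    embeddings; a guess at how rhyme / word-count control could be wired in."""
    def __init__(self, vocab_size: int, n_rhyme_classes: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = torch.nn.Embedding(vocab_size, d_model)
        self.rhyme = torch.nn.Embedding(n_rhyme_classes, d_model)
        self.remaining = torch.nn.Embedding(max_len, d_model)  # tokens left in the line

    def forward(self, tok_ids, rhyme_ids, remaining_ids):
        return self.tok(tok_ids) + self.rhyme(rhyme_ids) + self.remaining(remaining_ids)

@torch.no_grad()
def top_k_sample(model, input_ids: torch.Tensor, k: int = 10,
                 max_new_tokens: int = 32, eos_id: int = 2) -> torch.Tensor:
    """Plain top-k sampling: at each step keep the k most likely tokens,
    renormalize, and sample one of them. Assumes `model` maps (batch, seq)
    token ids to (batch, seq, vocab) logits."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)[:, -1, :]        # next-token logits
        top_vals, top_idx = torch.topk(logits, k)  # k best candidates
        probs = F.softmax(top_vals, dim=-1)        # renormalize over the top k
        next_tok = top_idx.gather(-1, torch.multinomial(probs, 1))
        input_ids = torch.cat([input_ids, next_tok], dim=-1)
        if (next_tok == eos_id).all():
            break
    return input_ids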

For rap music generation, we use the XLNet generation architecture proposed by Chang et al. [4], modified to meet the requirements of rap music interpolation. Training is conducted on a self-constructed MIDI rap music dataset, and experimental results demonstrate that the generated samples align more closely with the traits of rap music.
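Chapter 4 of the thesis defines its own MIDI token vocabulary (Bar, Position, Duration, IsRap, Track, and Chord representations). As a sketch only, the snippet below shows what such an event-token encoding might look like under the assumption of a 16-slot-per-bar grid; the Note fields and token spellings are illustrative, not the thesis's actual scheme.

from dataclasses import dataclass

@dataclass
class Note:
    bar: int       # bar index within the piece
    position: int  # onset slot, assuming 16 slots per bar
    pitch: int     # MIDI pitch number (0-127)
    duration: int  # length in grid slots
    track: str     # e.g. "MELODY", "RAP", or "CHORD"

def encode(notes: list[Note]) -> list[str]:
    """Flatten notes into an event-token sequence, emitting a Bar token
    whenever a new bar begins (in the spirit of REMI-style encodings)."""
    tokens: list[str] = []
    current_bar = -1
    for n in sorted(notes, key=lambda n: (n.bar, n.position)):
        if n.bar != current_bar:
            tokens.append("Bar")
            current_bar = n.bar
        tokens += [
            f"Position_{n.position}",
            f"Track_{n.track}",
            f"IsRap_{int(n.track == 'RAP')}",  # flag rap-track notes
            f"Pitch_{n.pitch}",
            f"Duration_{n.duration}",
        ]
    return tokens

# e.g. encode([Note(0, 0, 60, 4, "MELODY"), Note(0, 4, 62, 2, "RAP")])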

Finally, through subjective and objective evaluations, we show that our architectures outperform baseline models in the interpolation of rap lyrics and rap music; in certain respects, the results even approach the quality of rap lyrics and music created by human artists.
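Objective evaluation of generated symbolic music typically compares statistics of generated and reference pieces; one common measure, in the spirit of Yang and Lerch [53] cited below, is the distance between pitch-class histograms. Whether the thesis uses this exact metric is not stated in the abstract, so the sketch is illustrative only.

import numpy as np

def pitch_class_histogram(pitches: list[int]) -> np.ndarray:
    """Normalized 12-bin histogram over pitch classes (MIDI pitch mod 12)."""
    hist = np.bincount(np.asarray(pitches) % 12, minlength=12).astype(float)
    total = hist.sum()
    return hist / total if total else hist

def histogram_distance(generated: list[int], reference: list[int]) -> float:
    """Euclidean distance between two pitch-class distributions (smaller = closer)."""
    return float(np.linalg.norm(
        pitch_class_histogram(generated) - pitch_class_histogram(reference)))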
Table of Contents:
Abstract (Chinese)  I
Abstract  II
Acknowledgements  III
Contents  IV
1 Introduction  1
1.1 Rap Lyric Generation  2
1.2 Rap Music Generation  6
2 Related Work  10
2.1 Natural Text Generation  10
2.1.1 Rule-Based Text Generation  10
2.1.2 Statistical Models  11
2.1.3 Deep Learning Models  11
2.2 Lyric Generation and Poetry Generation  11
2.3 Rap Lyric Generation  13
2.4 Automatic Music Generation  14
2.4.1 Traditional Music Generation  14
2.4.2 Advanced Music Generation  14
2.5 Using the Transformer Model to Generate Symbolic Music  15
2.6 Interpolation Music Generation  17
2.7 Transformer  17
2.7.1 The Architecture  18
2.7.2 Top-k versus Beam Search  21
2.8 XLNet  21
2.8.1 The Architecture  22
3 Methodology for Lyric Generation  26
3.1 Overview of a Lyric Generation Framework  26
3.2 Model Architecture  28
3.2.1 Data Representations  29
3.2.2 Reverse-Order Language Model  32
3.2.3 Pre-training and Fine-tuning  32
3.3 Top-k Sampling  33
3.4 Topic Modelling  35
3.4.1 Topic Coherence of Text Interpolation  36
4 Methodology for Interpolation Music Generation  39
4.1 Overview of a Song Generation Framework  39
4.2 Model Architecture  41
4.3 Data Representation  43
4.3.1 Bar Representation  44
4.3.2 Position and Duration Representation  44
4.3.3 IsRap Representation  44
4.3.4 Track Representation  45
4.3.5 Chord Representation  45
5 Experiment for Rap Lyrics Generation  47
5.1 Experimental Setup  47
5.2 Dataset  48
5.3 Evaluation on Rap Lyrics Generation  48
5.3.1 Objective Evaluation on Rap Lyrics Generation  49
6 Experiment for Rap Music Generation  57
6.1 Experimental Setup  57
6.2 Dataset  57
6.3 Evaluation Metrics  59
6.3.1 Rap Music Characteristic Evaluation  59
6.3.2 Rap Music Evaluation  63
6.3.3 Experiment Results on Rap Music Generation  66
7 Subjective Evaluation  70
7.1 Subjective Evaluation on Rap Lyrics Generation  70
7.2 Subjective Evaluation on Rap Music Generation  72
7.3 Overall Subjective Evaluation on Combining Both Rap Music and Lyrics Generation  73
8 Conclusion  76
Bibliography  79
A Questionnaire  86
A.1 Part 1 - Personal Background Survey  86
A.2 Part 2 - Rap Lyrics Quality Assessment  86
A.3 Part 3 - Rap Music Quality Assessment  89
A.4 Part 4 - Combined Rap Music and Lyrics Assessment  89
B Demonstration of Rap Music Generation Examples for 10 Popular Songs  91
C Demonstration of Lyrics and Music Combined Generation Examples  92
C.1 Demonstration One - 至少還有你  92
C.2 Demonstration Two - 如果可以  94
C.3 Demonstration Three - 月亮代表我的心  96
C.4 Demonstration Four - 沒那麼簡單  98
C.5 Demonstration Five - 千里之外  99
Bibliography:
[1] Lilac Atassi. Generating symbolic music using diffusion models, 2023.
[2] Gabriele Barbieri, François Pachet, Pierre Roy, and Mirko Degli Esposti. Markov constraints for generating lyrics with style. In ECAI, volume 242, pages 115–120, 2012.
[3] Google Brain. Magenta, generating long sequences in stories and songs, 2016.
[4] Chin-Jui Chang, Chun-Yi Lee, and Yi-Hsuan Yang. Variable-length music score infilling via XLNet and musically specialized positional encoding, 2021.
[5] Yihao Chen and Alexander Lerch. Melody-conditioned lyrics generation with SeqGANs. In 2020 IEEE International Symposium on Multimedia (ISM), pages 189–196, 2020.
[6] Li-Wei Cheng. Lyrics generation based on a conditional self-attention encoder-decoder model, 2022.
[7] Chien-Hung Liu and Chuan-Kang Ting. Computational intelligence in music composition: A survey. IEEE Transactions on Emerging Topics in Computational Intelligence, PP:1–1, December 2016.
[8] Kevin Lex Tan Cua. Popular music instrumental accompaniment generation with solo via music interpolation using transformers learning.
[9] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, and Ziqing Yang. Pre-training with whole word masking for Chinese BERT, 2021.
[10] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context, 2019.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[12] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment, 2017.
[13] Zeyao Du. GPT2-Chinese: Tools for training GPT2 model in Chinese language. https://github.com/Morizeyao/GPT2-Chinese, 2019.
[14] Angela Fan, Mike Lewis, and Yann N. Dauphin. Hierarchical neural story generation. CoRR, abs/1805.04833, 2018.
[15] Steven Gilbers, Nienke Hoeksema, Kees de Bot, and Wander Lowie. Regional variation in West and East Coast African-American English prosody and rap flows. Language and Speech, 63(4):713–745, 2020. PMID: 31680609.
[16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[17] Gaëtan Hadjeres and Frank Nielsen. Interactive music generation with positional constraints using anticipation-RNNs, 2017.
[18] Jing He, Ming Zhou, and Long Jiang. Generating Chinese classical poems with statistical machine translation models. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[19] Carlos Hernandez-Olivan and Jose R. Beltran. Musicaiz: A Python library for symbolic music generation, analysis and visualization. SoftwareX, 22:101365, 2023.
[20] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music Transformer, 2018.
[21] Yu-Siang Huang and Yi-Hsuan Yang. Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions, 2020.
[22] Christian Ziegenhahn Jensen and Espen Sørhaug. The perfect rap lyrics - AI-generated rap lyrics that are better than lyrics from existing popular and critically acclaimed rap songs. abs/2107.01875, 2021.
[23] Daniel D. Johnson. Generating polyphonic music using tied parallel networks. In João Correia, Vic Ciesielski, and Antonios Liapis, editors, Computational Intelligence in Music, Sound, Art and Design, pages 128–143, Cham, 2017. Springer International Publishing.
[24] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2022.
[25] Nayu Liu, Wenjing Han, Guangcan Liu, Da Peng, Ran Zhang, Xiaorui Wang, and Huabin Ruan. ChipSong: A controllable lyric generation system for Chinese popular song. In Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022), pages 85–95, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
[27] Ruixuan Luo, Jingjing Xu, Yi Zhang, Zhiyuan Zhang, Xuancheng Ren, and Xu Sun. PKUSEG: A toolkit for multi-domain Chinese word segmentation. CoRR, abs/1906.11455, 2019.
[28] Pablo López Diéguez. Variational autoencoders for polyphonic music interpolation, 2020.
[29] Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, and Aristides Gionis. DopeLearning: A computational approach to rap lyrics generation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 195–204, 2016.
[31] Enrique Manjavacas, Mike Kestemont, and Folgert Karsdorp. Generation of hip-hop lyrics with hierarchical modeling and conditional templates. In Proceedings of the 12th International Conference on Natural Language Generation, pages 301–310, Tokyo, Japan, October–November 2019. Association for Computational Linguistics.
[32] Nikola I. Nikolov, Eric Malmi, Curtis Northcutt, and Loreto Parisi. Rapformer: Conditional rap lyrics generation with denoising autoencoders. In Proceedings of the 13th International Conference on Natural Language Generation, pages 360–373, Dublin, Ireland, December 2020. Association for Computational Linguistics.
[33] Alice H. Oh and Alexander I. Rudnicky. Stochastic natural language generation for spoken dialog systems. Computer Speech & Language, 16(3):387–407, 2002. Spoken Language Generation.
[34] Ashis Pati, Alexander Lerch, and Gaëtan Hadjeres. Learning to traverse latent spaces for musical score inpainting, 2019.
[35] Peter Potash, Alexey Romanov, and Anna Rumshisky. GhostWriter: Using an LSTM for automatic rap lyric generation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1919–1924, 2015.
[36] Naveen Ram, Tanay Gummadi, Rahul Bhethanabotla, Richard J. Savery, and Gil Weinberg. Say what? Collaborative pop lyric generation using multi-task transfer learning. In Proceedings of the 9th International Conference on Human-Agent Interaction. ACM, November 2021.
[37] Ehud Reiter and Robert Dale. Building Natural Language Generation Systems. Studies in Natural Language Processing. Cambridge University Press, 2000.
[38] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music, 2019.
[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
[40] Andreas Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201, 1995.
[41] Chih-Pin Tan, Alvin W. Y. Su, and Yi-Hsuan Yang. Melody infilling with user-provided structural context, 2022.
[42] Kees van Deemter, Emiel Krahmer, and Mariët Theune. Squibs and discussions: Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15–24, 2005.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[44] Jie Wang and Xinyan Zhao. Theme-aware generation model for Chinese lyrics. CoRR, abs/1906.02134, 2019.
[45] Su Wang, Greg Durrett, and Katrin Erk. Narrative interpolation for generating and understanding stories. CoRR, abs/2008.07466, 2020.
[46] Zhe Wang, Wei He, Hua Wu, Haiyang Wu, Wei Li, Haifeng Wang, and Enhong Chen. Chinese poetry generation with planning based neural network. arXiv preprint arXiv:1610.09889, 2016.
[47] Yu-Wei Wen and Chuan-Kang Ting. Recent advances of computational intelligence techniques for composing music. IEEE Transactions on Emerging Topics in Computational Intelligence, 7(2):578–597, 2023.
[48] Shih-Lun Wu and Yi-Hsuan Yang. The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures, 2020.
[49] Xianchao Wu, Chengyuan Wang, and Qinying Lei. Transformer-XL based music generation with multiple sequences of time-valued notes. CoRR, abs/2007.07244, 2020.
[50] Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L. Zhang, Tao Qin, Wei-Qiang Zhang, and Tie-Yan Liu. DeepRapper: Neural rap generation with rhyme and rhythm modeling. CoRR, abs/2107.01875, 2021.
[51] Cheng Yang, Maosong Sun, Xiaoyuan Yi, and Wenhao Li. Stylistic Chinese poetry generation via unsupervised style disentanglement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3960–3969, Brussels, Belgium, October–November 2018. Association for Computational Linguistics.
[52] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation using 1D and 2D conditions. CoRR, abs/1703.10847, 2017.
[53] Li-Chia Yang and Alexander Lerch. On the evaluation of generative models in music. Neural Computing and Applications, 32(9):4773–4784, May 2020.
[54] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding, 2020.
[55] Xingxing Zhang and Mirella Lapata. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680, Doha, Qatar, October 2014. Association for Computational Linguistics.
(The electronic full text will be available to external users after 2025/11/09.)