
Detailed Record

Author (Chinese): 郁祖達
Author (English): Yu, Zu-Da
Title (Chinese): 使用虛擬自我注意力機制的轉換器模型生成指定條件的音樂
Title (English): Conditional Symbolic Music Generation Using Transformer-XL with Pseudo-Self Attention
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee members (Chinese): 古倫維, 郭柏志
Committee members (English): Ku, Lun-Wei; Kuo, Po-Chih
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 108065466
Year of publication (ROC calendar): 110 (2021)
Graduation academic year: 109
Language: English
Number of pages: 75
Keywords (Chinese): 深度學習, 機器作曲, 條件生成
Keywords (English): Deep Learning, Machine Composition, Conditional Generation
Although deep-learning-based composition has achieved remarkable results in recent years and is approaching the level of human composers, controlling the generated music, for example by conditioning on style, genre, or emotion, remains a major open problem in automatic composition.

We propose applying the Pseudo-Self Attention mechanism to a Transformer-XL model: condition information is embedded directly into the self-attention layers, allowing music to be generated according to a given condition. Using a language-model-like symbolic music representation, we first train a language model on a large corpus of unlabeled music. On top of this pre-trained model, we then use labeled data from four genres to train a conditional encoder and the Pseudo-Self Attention parameters, so that changing the input condition controls the genre of the generated music.
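As a rough illustration of the mechanism, the following is a minimal single-head PyTorch sketch of Pseudo-Self Attention, not the thesis's actual implementation: the names (PseudoSelfAttention, u_k, u_v, d_cond) are illustrative, and it omits Transformer-XL's segment-level recurrence, relative positional encoding, and the low-rank matrix decomposition of the PSA parameters described in the thesis.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoSelfAttention(nn.Module):
    """Minimal single-head sketch of pseudo-self attention.

    A condition encoding is projected into extra key/value slots that are
    prepended to the sequence's own keys and values. The pretrained
    self-attention projections (w_q, w_k, w_v) stay untouched; only the
    condition projections (u_k, u_v) are newly trained.
    """

    def __init__(self, d_model: int, d_cond: int):
        super().__init__()
        # Pretrained self-attention projections (loaded from the language model).
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # New condition projections; zero-initialised so the condition slots
        # initially contribute (near) nothing and the pretrained behaviour
        # is approximately preserved at the start of training.
        self.u_k = nn.Linear(d_cond, d_model, bias=False)
        self.u_v = nn.Linear(d_cond, d_model, bias=False)
        nn.init.zeros_(self.u_k.weight)
        nn.init.zeros_(self.u_v.weight)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq_len, d_model)  hidden states of the music tokens
        # cond: (batch, n_cond, d_cond)    output of the conditional encoder
        q = self.w_q(x)
        k = torch.cat([self.u_k(cond), self.w_k(x)], dim=1)  # prepend condition keys
        v = torch.cat([self.u_v(cond), self.w_v(x)], dim=1)  # prepend condition values
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v
```

In this setting, cond would be the genre encoding produced by the conditional encoder; because only u_k, u_v, and the encoder are trained while the pretrained language-model weights stay fixed, far fewer parameters need to be trained and stored than with full fine-tuning.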

Experimental results show that the conditional encoder can steer the model's output, achieving results close to fine-tuning the model on labeled data while training fewer parameters and requiring less storage. When the conditional encoder is combined with fine-tuning, performance on certain genres even exceeds that of fine-tuning on labeled data alone.
Abstract (Chinese)
Abstract (English)
Acknowledgement
List of Tables
List of Figures
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Deep Learning Music Generation . . . . . . . . . . . . . . . . . . . . 6
2.2 Controllable Music Generation . . . . . . . . . . . . . . . . . . . . . 11
3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.4 Feed-Forward Networks . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Transformer-XL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.2 Recurrence Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Relative Position Encoding . . . . . . . . . . . . . . . . . . . . . 21
3.4 Pseudo-Self Attention . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Inject Genre Information Into Transformer-XL . . . . . . . . . . . . . 24
3.5.1 Conditional Encoder For Music Genre . . . . . . . . . . . . . . . . 24
3.5.2 PSA Matrix Decomposition . . . . . . . . . . . . . . . . . . . . . . 26
3.5.3 Transformer-XL With PSA . . . . . . . . . . . . . . . . . . . . . . 27
4 Experiment and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Baseline Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Preprocess And Training Parameters . . . . . . . . . . . . . . . . . . 31
4.4 Objective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.1 Cost Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
.1 Appendix: Training Loss . . . . . . . . . . . . . . . . . . . . . . . . 52
.2 Appendix: Subjective Questionnaire . . . . . . . . . . . . . . . . . . . 62
(Full text not authorized for public release)