Author (English): Yu, Zu-Da
Thesis Title (English): Conditional Symbolic Music Generation Using Transformer-XL with Pseudo-Self Attention
Advisor (English): Soo, Von-Wun
Oral Defense Committee (English): Ku, Lun-Wei; Kuo, Po-Chih
Keywords (English): Deep Learning; Machine Composition; Conditional Generation
Although machine composition has made remarkable progress in recent years and is approaching the level of human composers, controlling machine-generated music, for example by supplying style, genre, or emotion as conditions for generation, remains a major open problem in the field.

In this paper, we propose applying the Pseudo-Self Attention mechanism to the Transformer-XL model. The mechanism embeds conditions into the attention layers of Transformer-XL, so that music can be generated according to those conditions. We use a language-like representation of music data. A language model is first trained on a large amount of unlabeled music data. Starting from this pre-trained model, we then use labeled data covering four genres to train the conditional encoder and the Pseudo-Self Attention parameters. By changing the input condition, the model can control the genre of the output music.

The experimental results show that the conditional encoder can steer the model output toward results close to those obtained by fine-tuning the model with labeled data, while using fewer parameters and requiring less storage space. When the conditional encoder is combined with fine-tuning, performance on certain genres even exceeds that of fine-tuning the model with labeled data.
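
The conditioning idea summarized in the abstract can be illustrated with a minimal single-head sketch of pseudo-self attention in NumPy. This is not the thesis implementation: it assumes the general formulation in which keys and values projected from the condition are prepended to the keys and values of the token sequence, and it omits Transformer-XL's recurrence memory and relative position encoding as well as the PSA matrix decomposition listed in the outline. The function name, argument names, and dimensions are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pseudo_self_attention(cond, seq, W_q, W_k, W_v, U_k, U_v):
    """Single-head pseudo-self attention (illustrative sketch only).

    cond : (m, d_c) condition vectors, e.g. an encoded genre (assumed shape)
    seq  : (n, d)   hidden states of the music token sequence
    W_q, W_k, W_v : (d, d_h) projections taken from the pre-trained model
    U_k, U_v      : (d_c, d_h) newly trained projections for the condition
    """
    m, n = cond.shape[0], seq.shape[0]
    q = seq @ W_q                                        # (n, d_h)
    # Condition keys/values are prepended to the token keys/values.
    k = np.concatenate([cond @ U_k, seq @ W_k], axis=0)  # (m + n, d_h)
    v = np.concatenate([cond @ U_v, seq @ W_v], axis=0)  # (m + n, d_h)

    scores = q @ k.T / np.sqrt(q.shape[-1])              # (n, m + n)
    # Every position may attend to the condition; tokens are causally masked.
    causal = np.tril(np.ones((n, n), dtype=bool))
    mask = np.concatenate([np.ones((n, m), dtype=bool), causal], axis=1)
    scores = np.where(mask, scores, -np.inf)

    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy usage with random weights; all sizes are arbitrary.
rng = np.random.default_rng(0)
d, d_c, d_h, n, m = 8, 4, 8, 5, 1
seq = rng.normal(size=(n, d))
genre_a = rng.normal(size=(m, d_c))
genre_b = rng.normal(size=(m, d_c))
W_q, W_k, W_v = (rng.normal(size=(d, d_h)) for _ in range(3))
U_k, U_v = (rng.normal(size=(d_c, d_h)) for _ in range(2))

out_a, attn_a = pseudo_self_attention(genre_a, seq, W_q, W_k, W_v, U_k, U_v)
out_b, attn_b = pseudo_self_attention(genre_b, seq, W_q, W_k, W_v, U_k, U_v)
```

In this sketch only `U_k` and `U_v` (and whatever produces `cond`) are new parameters, which is consistent with the abstract's point that conditioning needs fewer additional parameters than storing a separately fine-tuned model per genre. Swapping `genre_a` for `genre_b` changes the attention output while the pre-trained projections stay untouched.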
List of Tables
List of Figures
1 Introduction
2 Related Work
2.1 Deep Learning Music Generation
2.2 Controllable Music Generation
3 Methodology
3.1 Data Representation
3.2 Transformer
3.2.1 Background
3.2.2 Positional Encoding
3.2.3 Self-Attention
3.2.4 Feed-Forward Networks
3.3 Transformer-XL
3.3.1 Background
3.3.2 Recurrence Mechanism
3.3.3 Relative Position Encoding
3.4 Pseudo-Self Attention
3.5 Inject Genre Information Into Transformer-XL
3.5.1 Conditional Encoder For Music Genre
3.5.2 PSA Matrix Decomposition
3.5.3 Transformer-XL With PSA
4 Experiment and Evaluation
4.1 Baseline Model
4.2 Dataset
4.3 Preprocess And Training Parameters
4.4 Objective Evaluation
4.4.1 Cost Comparison
4.4.2 Performance Evaluation
4.5 Subjective Evaluation
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
.1 Appendix: Training Loss
.2 Appendix: Subjective Questionnaire
