Author (English): Feng, Shao-Di
Thesis title (English): Generate Music with More Refined Emotions
Advisor (English): Soo, Von-Wun
Oral defense committee (English): Chu, Hung-Kuo; Lin, Hao-Qiang
Keywords (English): music generation; music emotion controlling; semi-supervised learning; Variational Autoencoder
Generating symbolic music that conveys a specific emotion is challenging because symbolic music datasets with emotion labels are scarce and incomplete. Such datasets are usually annotated only with general labels such as sadness or happiness, so a model trained on them can generate music only for the labeled emotions. This research aims to generate music with more refined emotions from a training dataset labeled only with the four quadrants of Russell's 2D emotion model. Building on the theory of Music FaderNets, we map arousal and valence to low-level musical attributes and construct a symbolic music generation model that combines a Transformer with a Gaussian Mixture Variational Autoencoder (GM-VAE). We adopt an in-attention mechanism for the model and improve it by allowing modulation by conditional information. We show that the model can control music generation according to emotions that users specify in high-level linguistic terms, by manipulating the corresponding low-level musical attributes. Finally, we evaluate the model with a pre-trained emotion classifier against EMOPIA, a pop piano MIDI dataset, and with a subjective listening evaluation; both indicate that the model can correctly generate music with more refined emotions.
Abstract (Chinese)
Abstract
Contents
1 Introduction
1.1 Related Work
1.1.1 Conditional Symbolic Music Generation
1.1.2 Emotion-conditioned Symbolic Music Generation
1.1.3 Relationship between Low-level Features and Arousal and Valence
2 Methodology
2.1 Music Representation
2.2 Variational Autoencoders (VAE)
2.3 Gaussian Mixture Variational Autoencoder (GM-VAE)
2.4 Transformer
2.5 Model Formulation
3 Experiment
3.1 Datasets and Model Hyperparameters
3.2 Low Level Music Attributes
3.3 Music Generation with Specific Emotion
3.4 Result and Discussion
4 Conclusion and Future Work
Bibliography
A. Questionnaire Design
B. Music Samples
