Author: Munguia Velez, Kelvin Xavier
Thesis title: Polyphonic Music Composition: An Adversarial Inverse Reinforcement Learning Approach
Advisor: Soo, Von-Wun
Oral defense committee: Chiu, Ching-Te; Chen, Chaur-Chin
Keywords: music composition; adversarial inverse reinforcement learning
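Automatic melody generation has traditionally relied on supervised deep learning models. However, this approach has shortcomings that can lead to unpleasant melodies, including excessive repetition of the same patterns. In view of the recent success of reinforcement learning in many fields, this thesis explores an alternative approach that combines novel supervised deep learning, deep reinforcement learning, and inverse reinforcement learning to compose melodies. Music generation can be viewed as a sequence of actions along the time axis, where each action selects a chord note in the composition; this allows us to treat melody composition as a reinforcement learning problem whose solution is the sequence of actions that maximizes the cumulative reward. We first train a Bi-axial LSTM model with supervised learning and then fine-tune it with deep Q-learning. However, designing a good reward function is tricky and difficult. We therefore use adversarial inverse reinforcement learning to learn a reward function from the trajectories of compositions by human experts. By combining this reward function with music theory rules, we improve the music generated by the supervised learning model. The results show that the music generated by our method improves on both objective metrics and subjective evaluations of user preference.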
Deep supervised learning models are traditionally used for automatic music harmony generation. However, this approach has limitations that may lead to unpleasing harmonies, including excessive repetition of patterns. Motivated by the recent success of reinforcement learning in multiple fields, this dissertation explores an alternative approach to harmony composition that combines novel deep supervised learning, deep reinforcement learning, and inverse reinforcement learning techniques. Music generation can be seen as a sequence of actions through time, where taking an action is equivalent to selecting the next chord in the composition. This allows us to model harmony composition as a reinforcement learning problem in which we seek to maximize an accumulated reward. We start by training a Bi-axial LSTM model with supervised learning and then fine-tune it with deep Q-learning. Designing a good reward function is known to be difficult, so we propose learning one from a set of human-composed tracks using adversarial inverse reinforcement learning. We combine this learned reward function with a reward based on music theory rules to improve the output of the model trained by supervised learning. In both a set of objective metrics and user preference studies, the results show an improvement over the pre-trained model that was not fine-tuned with reinforcement learning.
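The reward combination and Q-learning fine-tuning described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the thesis's implementation: the function names, the linear blending `weight`, the toy repetition rule, and the discount `gamma` are all hypothetical, and the learned reward is passed in as a plain number rather than coming from an actual adversarial inverse reinforcement learning discriminator.

```python
from typing import Sequence


def combined_reward(learned_reward: float, rule_reward: float,
                    weight: float = 0.5) -> float:
    """Blend a learned reward with a music-theory rule reward.

    The linear blend and the default weight are illustrative assumptions.
    """
    return weight * learned_reward + (1.0 - weight) * rule_reward


def repetition_penalty(history: Sequence[int], action: int,
                       max_repeats: int = 3) -> float:
    """Toy music-theory rule (hypothetical): penalize picking the same
    chord again after it has already been played max_repeats times in a row."""
    recent = list(history[-max_repeats:])
    if len(recent) == max_repeats and all(a == action for a in recent):
        return -1.0
    return 0.0


def q_target(reward: float, next_q_values: Sequence[float],
             gamma: float = 0.9, done: bool = False) -> float:
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Example step: chord 2 has been played three times and is chosen again.
rule_r = repetition_penalty([2, 2, 2], action=2)
step_r = combined_reward(learned_reward=0.8, rule_reward=rule_r)
target = q_target(step_r, next_q_values=[0.1, 0.4])
```

In this sketch the rule-based term discourages the excessive repetition mentioned in the abstract, while the learned term stands in for the reward recovered from human-composed tracks; how the two are actually weighted in the thesis is not stated in the abstract.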
