Author (English): Lu, Chien-Yu
Thesis title (Chinese): 奏其所願: 多峰分布之音色特徵強化音訊風格轉換
Thesis title (English): Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer
Advisors (English): Lee, Che-Rung; Su, Li
Oral defense committee (English): Chiu, Wei-Chen; Hsu, Chiou-Ting
Keywords (English): Machine Learning; Style Transfer; Music; Deep Learning
Style transfer of polyphonic music recordings is challenging because the output must be diverse, imaginative, and musically reasonable while departing from the style of the original piece. To achieve this, it is critical to learn stable multi-modal representations of both the domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner. In this paper, we propose an unsupervised music style transfer method that does not require parallel data. To characterize the multi-modal distribution of music pieces, we employ the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework, which allows one to generate diverse outputs from the learned latent distributions representing contents and styles. Moreover, to better capture the granularity of sound, such as the perceptual dimensions of timbre and the nuances of instrument-specific performance, cognitively plausible features, including mel-frequency cepstral coefficients (MFCC), spectral difference, and spectral envelope, are combined with the widely used mel-spectrogram into a timbre-enhanced multi-channel input representation. A Relativistic average Generative Adversarial Network (RaGAN) is also used to achieve faster convergence and higher stability. We conduct experiments on bilateral style transfer tasks among three different genres, namely piano solo, guitar solo, and string quartet. The results demonstrate two advantages of the proposed method: improved sound quality and the ability for users to manipulate the output.
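The relativistic average objective mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which RaGAN variant the thesis uses, so the code below assumes the standard cross-entropy form, in which each critic score is compared against the average score of the opposite set (real versus generated). The function names `softplus`, `mean`, and `ragan_losses` are hypothetical and chosen only for this illustration; the inputs are raw (pre-sigmoid) critic scores.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def ragan_losses(real_scores, fake_scores):
    """Return (discriminator_loss, generator_loss) for raw critic scores.

    Assumption: standard cross-entropy RaGAN, not necessarily the exact
    variant used in the thesis.
    """
    real_avg = mean(real_scores)
    fake_avg = mean(fake_scores)

    # Relativistic logits: each score minus the average of the opposite set.
    real_rel = [c - fake_avg for c in real_scores]
    fake_rel = [c - real_avg for c in fake_scores]

    # Identities used: -log(sigmoid(z)) == softplus(-z)
    #                  -log(1 - sigmoid(z)) == softplus(z)
    d_loss = mean([softplus(-z) for z in real_rel]) + mean(
        [softplus(z) for z in fake_rel]
    )
    # The generator loss swaps the roles of real and generated samples.
    g_loss = mean([softplus(-z) for z in fake_rel]) + mean(
        [softplus(z) for z in real_rel]
    )
    return d_loss, g_loss
```

Unlike a standard GAN generator loss, the generator term here depends on the real samples as well as the generated ones, which is the property usually credited for the more stable training behaviour.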
