Detailed Record

Author (Chinese): 盧建宇
Author (English): Lu, Chien-Yu
Title (Chinese): 奏其所願: 多峰分布之音色特徵強化音訊風格轉換
Title (English): Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer
Advisors (Chinese): 李哲榮, 蘇黎
Advisors (English): Lee, Che-Rung; Su, Li
Committee Members (Chinese): 邱維辰, 許秋婷
Committee Members (English): Chiu, Wei-Chen; Hsu, Chiou-Ting
Degree: Master
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 106062560
Publication Year (ROC calendar): 108 (2019)
Graduation Academic Year: 107 (2018–2019)
Language: English
Pages: 24
Keywords (Chinese): 機器學習, 風格轉換, 音樂, 深度學習
Keywords (English): Machine Learning, Style Transfer, Music, Deep Learning
Abstract: Style transfer of polyphonic music recordings is a challenging task when considering the modeling of diverse, imaginative, and reasonable music pieces in a style different from the original one. To achieve this, learning stable multi-modal representations for both domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner is critical. In this paper, we propose an unsupervised music style transfer method that requires no parallel data. In addition, to characterize the multi-modal distribution of music pieces, we employ the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed system. This allows one to generate diverse outputs from the learned latent distributions representing contents and styles. Moreover, to better capture the granularity of sound, such as the perceptual dimensions of timbre and the nuances of instrument-specific performance, cognitively plausible features including mel-frequency cepstral coefficients (MFCC), spectral difference, and spectral envelope are combined with the widely used mel-spectrogram into a timbre-enhanced multi-channel input representation. The Relativistic average Generative Adversarial Network (RaGAN) is also utilized to achieve fast convergence and high stability. We conduct experiments on bilateral style transfer tasks among three different genres, namely piano solo, guitar solo, and string quartet. Results demonstrate the advantages of the proposed method in music style transfer: improved sound quality and the ability for users to manipulate the output.
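For readers who want to experiment with the ideas above, here is a minimal sketch, not the thesis's released code, of how the timbre-enhanced multi-channel input could be assembled in Python with librosa. The hop size, mel-band count, MFCC padding scheme, and the moving-average approximation of the spectral envelope are all assumptions made for illustration.

import numpy as np
import librosa

def timbre_enhanced_input(path, sr=22050, n_fft=2048, hop=512, n_mels=128):
    # Load audio; these parameter defaults are illustrative assumptions.
    y, _ = librosa.load(path, sr=sr)
    # Channel 1: log-scaled mel-spectrogram, the base representation.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Channel 2: MFCCs, zero-padded along the frequency axis to n_mels rows.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)
    mfcc = np.pad(mfcc, ((0, n_mels - mfcc.shape[0]), (0, 0)))
    # Channel 3: spectral difference, taken here as the frame-to-frame
    # change of the log-mel bands (one common definition of the feature).
    diff = np.diff(log_mel, axis=1, prepend=log_mel[:, :1])
    # Channel 4: spectral envelope, crudely approximated by smoothing each
    # frame along frequency; the thesis may compute it differently.
    kernel = np.ones(9) / 9.0
    env = np.apply_along_axis(lambda f: np.convolve(f, kernel, mode="same"),
                              0, log_mel)
    # Stack into the timbre-enhanced multi-channel input: (4, n_mels, frames).
    return np.stack([log_mel, mfcc, diff, env])

Similarly, the following is a hedged sketch of the relativistic average GAN (RaGAN) objective mentioned in the abstract, following the formulation of Jolicoeur-Martineau (2018) cited in the bibliography; the binary-cross-entropy variant and the function names are assumptions, since the abstract does not spell out the exact loss used in the thesis.

import torch
import torch.nn.functional as F

def ragan_losses(c_real, c_fake):
    # c_real / c_fake: raw (pre-sigmoid) discriminator scores for real and
    # generated spectrogram batches of equal size.
    # Each sample is judged relative to the average score of the other class.
    real_rel = c_real - c_fake.mean()
    fake_rel = c_fake - c_real.mean()
    # Discriminator: push real scores above the average fake score and fake
    # scores below the average real score.
    d_loss = (F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
              + F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel)))
    # Generator: the same objective with the real/fake targets swapped.
    g_loss = (F.binary_cross_entropy_with_logits(fake_rel, torch.ones_like(fake_rel))
              + F.binary_cross_entropy_with_logits(real_rel, torch.zeros_like(real_rel)))
    return d_loss, g_loss

Table of Contents: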
Chinese Abstract i
Abstract ii
Contents iv
List of Figures vi
1 Introduction 1
2 Background 4
2.1 Related Works ............................... 4
2.1.1 Generative Adversarial Networks ........... 4
2.1.2 Domain Adaptation ......................... 4
2.1.3 Music Style Transfer ...................... 5
3 Implementation 7
3.1 Data Representation ......................... 7
3.2 Proposed Method ............................. 9
3.2.1 Overview .................................. 9
3.2.2 RaGAN .................................... 11
3.2.3 Intrinsic Consistency Loss ............... 11
3.2.4 Signal Reconstruction .................... 12
3.2.5 Implementation details ................... 13
4 Experiment and Results 14
4.0.1 Subjective Evaluation .................... 15
4.0.2 Illustration of Examples ................. 17
4.0.3 Style Code Interpolation ................. 17
5 Conclusion 20
Bibliography 21
1. Vinoo Alluri and Petri Toiviainen. Exploring perceptual and acoustical correlates of polyphonic timbre. Music Perception: An Interdisciplinary Journal, 27(3):223–242, 2010.
2. Jean-Julien Aucouturier and Emmanuel Bigand. Seven problems that keep MIR from attracting the interest of cognition and neuroscience. Journal of Intelligent Information Systems, 41(3):483–497, 2013.
3. O. B. Bohan. Singing style transfer, 2017. http://madebyoll.in/posts/singing_style_transfer/.
4. Anne Caclin, Stephen McAdams, Bennett K Smith, and Suzanne Winsberg. Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones. The Journal of the Acoustical Society of America, 118(1):471–482, 2005.
5. Marcelo Freitas Caetano and Xavier Rodet. Sound morphing by feature interpolation. In Proc. IEEE ICASSP, pages 22–27, 2011.
6. Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang. Deep photo enhancer: Unpaired learning for image enhancement from photographs with GANs. In CVPR, pages 6306–6314, 2018.
7. Shuqi Dai and Gus Xia. Music style transfer issues: A position paper. In the 6th International Workshop on Musical Metacreation (MUME), 2018.
8. Chris Donahue, Julian McAuley, and Miller Puckette. Synthesizing audio with generative adversarial networks. arXiv preprint arXiv:1802.04208, 2018.
9. Jonathan Driedger, Thomas Prätzlich, and Meinard Müller. Let it bee – towards NMF-inspired audio mosaicing. In ISMIR, pages 350–356, 2015.
10. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In IEEE CVPR, pages 2414–2423, 2016.
11. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
12. John M Grey. Multidimensional perceptual scaling of musical timbres. The Journal of the Acoustical Society of America, 61(5):1270–1277, 1977.
13. JunYoung Gwak, Christopher B. Choy, Animesh Garg, Manmohan Chandraker, and Silvio Savarese. Weakly supervised generative adversarial networks for 3d reconstruction. CoRR, abs/1705.10904, 2017.
14. Albert Haque, Michelle Guo, and Prateek Verma. Conditional end-to-end audio transforms. arXiv preprint arXiv:1804.00047, 2018.
15. Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, and Richard Socher. A multi-discriminator cyclegan for unsupervised non-parallel speech domain adaptation. arXiv preprint arXiv:1804.00522, 2018.
16. Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
17. Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. CoRR, abs/1807.00734, 2018.
18. Kazuhiro Kobayashi, Tomoki Toda, Graham Neubig, Sakriani Sakti, and Satoshi Nakamura. Statistical singing voice conversion with direct waveform modification based on the spectrum differential. In INTERSPEECH, 2014.
19. Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In Proc. ECCV, Part IV, pages 577–593, 2016.
20. Olivier Lartillot, Petri Toiviainen, and Tuomas Eerola. A Matlab toolbox for music information retrieval. In Data analysis, machine learning and applications, pages 261–268. Springer, 2008.
21. Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang. Generative face completion. In CVPR, pages 5892–5900, 2017.
22. Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. CoRR, abs/1703.00848, 2017.
23. Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, pages 2813–2821, 2017.
24. Noam Mor, Lior Wolf, Adam Polyak, and Yaniv Taigman. A universal music translation network. arXiv preprint arXiv:1805.07848, 2018.
25. Geoffroy Peeters, Bruno L Giordano, Patrick Susini, Nicolas Misdariis, and Stephen McAdams. The timbre toolbox: Extracting audio descriptors from musical signals. The Journal of the Acoustical Society of America, 130(5):2902–2916, 2011.
26. Kai Siedenburg, Ichiro Fujinaga, and Stephen McAdams. A comparison of approaches to timbre descriptors in music information retrieval and music psychology. Journal of New Music Research, 45(1):27–41, 2016.
27. Stanley S. Stevens. On the psychophysical law. Psychological review, 64(3):153, 1957.
28. Shih-Yang Su, Cheng-Kai Chiu, Li Su, and Yi-Hsuan Yang. Automatic conversion of pop music into chiptunes for 8-bit pixel art. In Proc. IEEE ICASSP, pages 411–415. IEEE, 2017.
29. D. Ulyanov and V. Lebedev. Singing style transfer, 2016. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.
30. Vesa Välimäki, Sira González, Ossi Kimmelma, and Jukka Parviainen. Digital audio antiquing – signal processing methods for imitating the sound quality of historical recordings. Journal of the Audio Engineering Society, 56(3):115–139, 2008.
31. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In SSW, page 125, 2016.
32. Prateek Verma and Julius O. Smith. Neural style transfer for audio spectograms. CoRR, abs/1801.01589, 2018.
33. Cheng-Wei Wu, Jen-Yu Liu, Yi-Hsuan Yang, and Jyh-Shing R Jang. Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1807.02254, 2018.
34. Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
35. Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In Proc. ECCV, Part III, 2016.
36. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.
37. Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, pages 465–476, 2017.