
Detailed Record

Author (Chinese): 許君展
Author (English): Kevin Lex Tan Cua
Title (Chinese): 透過內插訓練轉換器產生流行音樂具獨奏的樂器伴奏
Title (English): Popular Music Instrumental Accompaniment Generation with Solo via Music Interpolation Using Transformers Learning
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee members (Chinese): 劉奕汶, 蘇黎
Committee members (English): Liu, Yi-Wen; Su, Li
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Institute of Information Systems and Applications (資訊系統與應用研究所)
Student ID: 108065710
Publication year (ROC): 111 (2022)
Graduation academic year: 110
Language: English
Number of pages: 58
Keywords (Chinese): 音樂伴奏生成, 音樂分類, 變壓器, 自動編碼器, 多源變壓器, 器樂獨奏生成
Keywords (English): Music Accompaniment Generation, Music Classification, Transformer, Autoencoder, Automatic Accompaniment, Instrumental Solo Generation
Abstract:
In generating instrumental accompaniment, it is important that the generated accompaniment supports the lead melody, which in pop songs is the vocalist; the lead melody then sounds better together with the accompanying instrumental parts. An ability that many accompaniment generation models lack is generating an instrumental solo. In pop songs the lead singer may rest for stretches of the piece, and because there is then no lead melody, there is nothing to accompany: the generated instrumental accompaniment does not deviate from what it normally produces, which can cause listeners to lose interest. We improve upon current accompaniment generation models by adding an instrumental solo generation module, which generates an instrumental solo lead melody whenever the singer is absent. Additionally, we explore whether an instrumental lead melody differs from a vocal lead melody, and train a classifier to distinguish the two. Finally, we perform data augmentation to improve the generation process, as it increases the variety of keys the model learns from. Overall, our model is able to generate instrumental solos, and using them in the accompaniment improves upon the baseline accompaniment generation model. We find that a Transformer classifier can distinguish a vocal lead melody from an instrumental solo lead melody with greater than 85% accuracy. We also find that including an instrumental solo in the parts where the singer is quiet makes human evaluators twice as likely to prefer the resulting instrumental accompaniment over one that contains only a supporting instrumental accompaniment.
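To make the classification step described above concrete, the sketch below shows one way a vocal-vs.-instrumental lead-melody classifier could be built with a Transformer encoder. This is a minimal illustration, not the thesis's implementation: the use of PyTorch, the token vocabulary size, model dimensions, learned positional embeddings, and the mean-pooling classification head are all assumptions made for the example; the melody is assumed to already be encoded as a sequence of integer event tokens.

```python
# Minimal sketch (PyTorch) of a vocal-vs.-instrumental melody classifier.
# Assumptions: melodies are already encoded as integer event-token sequences
# (e.g. a REMI-like vocabulary); all sizes below are illustrative only.
import torch
import torch.nn as nn


class MelodyClassifier(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, nhead=8,
                 num_layers=4, max_len=1024, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)  # logits: [vocal, instrumental]

    def forward(self, tokens, padding_mask=None):
        # tokens: (batch, seq_len) integer event tokens
        x = self.embed(tokens) + self.pos[:, :tokens.size(1)]
        h = self.encoder(x, src_key_padding_mask=padding_mask)
        return self.head(h.mean(dim=1))  # average-pool over time, then classify


# Toy usage: a batch of 8 random token sequences of length 128.
model = MelodyClassifier()
logits = model(torch.randint(0, 512, (8, 128)))
print(logits.shape)  # torch.Size([8, 2])
```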
Abstract (Chinese) i
Abstract ii
Acknowledgements iii
Contents vi
List of Tables vii
List of Figures ix
1 Introduction 1
1.1 Motivation 1
1.2 Goal 3
1.3 Research Questions 5
1.4 Thesis Outline 6
1.5 Contributions 6
2 Literature Review and Related Work 8
2.1 MIDI 8
2.2 Symbolic Music Representation 8
2.3 Symbolic Music Genre Classification 10
2.4 Music Generation, Machine Translation, and Sequence-to-Sequence Learning 11
2.5 Music Interpolation and Autoencoders 13
2.6 Recurrent Neural Network 14
2.7 Transformers 15
3 Methodology 19
3.1 Dataset 19
3.1.1 Pop909 Dataset 19
3.1.2 Creating an instrumental solo interpolation dataset from Pop909 20
3.1.3 Instrumental Solo Melodies vs Vocal Melodies 21
3.2 System Overview 23
3.2.1 Symbolic Music Representation 23
3.3 Baseline Architecture 24
3.3.1 Classification Model 24
3.3.2 Solo Generation Model 24
3.4 Accompaniment Generation Model 25
3.5 Classification Model 26
3.6 Solo Generation Model 26
3.7 Accompaniment Generation Model 28
3.8 Evaluation Metrics 28
3.8.1 Classification Model 28
3.8.2 Solo Generation Model 30
3.8.3 Accompaniment Generation Model 32
3.9 Technologies Used 33
4 Experiments 34
4.1 Classification Model 34
4.2 Generation Model 35
4.2.1 Solo Generation 35
4.2.2 Accompaniment Generation 37
5 Results and Discussion 39
5.1 Objective Evaluation 39
5.1.1 Classification Model 39
5.1.2 Instrumental Solo Generation 40
5.2 Subjective Evaluation 45
6 Conclusion and Future Work 49
6.0.1 Future Work 51
References 52
 
 
 
 