
Detailed Record

Author (Chinese): 劉真濡
Author (English): Liu, Zhen-Ru
Thesis title (Chinese): 一個基於卷積的高效自注意力旋律生成神經網絡模型
Thesis title (English): An efficient music generator based on CNN with attention mechanism
Advisors (Chinese): 陳人豪, 蘇豐文
Advisors (English): Chen, Jen-Hao; Soo, Von-Wun
Committee members (Chinese): 陳仁純, 劉晉良
Committee members (English): Chen, Ren-Chuen; Liu, Jinn-Liang
Degree: Master's
University: National Tsing Hua University
Department: Institute of Computational and Modeling Science
Student ID: 108026466
Year of publication (ROC calendar): 110 (2021)
Academic year of graduation: 109
Language: English
Number of pages: 64
Keywords (Chinese): 深度學習, 自註意力機制, 音樂生成
Keywords (English): Deep Learning, Self-Attention, Music Generation
Usage statistics:
  • Recommendations: 0
  • Views: 126
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese, translated): We designed a neural network model that combines convolution with a self-attention mechanism for monophonic melody generation. Instead of the more widely used MIDI format, WAV files serve as the source data, from which sequences of main-melody pitches and durations are extracted as the training set. The result is a melody generation model that is easy to train and easy to deploy. Finally, we introduce a set of statistics-based metrics to evaluate the samples the model generates.
Abstract (English): We designed a neural network model built on the combination of convolution and a self-attention mechanism for monophonic melody generation. WAV files are used instead of the more widely adopted MIDI format, and the sequences of main-melody pitches and durations are extracted from them as the training dataset. The result is a melody generation model that is easy to train and to deploy. Finally, we introduce a set of statistics-based metrics to evaluate the samples the model generates.
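The record names the statistics-based evaluation but does not spell it out. As an illustrative sketch only (the function name and the particular features — distinct pitch count, pitch range, mean absolute interval, duration histogram — are assumptions in the spirit of such feature-based metrics, not the thesis's exact indicators):

```python
from collections import Counter

def melody_stats(pitches, durations):
    """Simple statistics-based features of a monophonic melody.

    pitches: list of MIDI pitch numbers; durations: list of note lengths
    in beats. Both the name and the feature set are illustrative.
    """
    # Signed semitone steps between consecutive notes
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    return {
        "pitch_count": len(set(pitches)),            # distinct pitches used
        "pitch_range": max(pitches) - min(pitches),  # span in semitones
        "mean_interval": sum(map(abs, intervals)) / max(len(intervals), 1),
        "duration_hist": Counter(durations),         # note-length distribution
    }

# Example: a C-major fragment (C4 E4 G4 E4 C4) in quarter and half notes
stats = melody_stats([60, 64, 67, 64, 60], [1, 1, 1, 1, 2])
print(stats["pitch_count"], stats["pitch_range"])  # → 3 7
```

Such per-sample statistics can then be compared between the training set and the generated set, either feature by feature (absolute measurement) or via the distance between the two feature distributions (relative measurement), matching the two measurement types listed in chapter 4.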
Abstract (Chinese) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
2. Backgrounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3
2.1. Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1. Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2. Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3. Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.4. Motive/Motif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.5. Melody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.6. Counterpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.7. Five core elements of music . . . . . . . . . . . . . . . . . . . . . . . 5
2.2. Digital Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1. System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2. Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3. Fourier transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.4. Spectral representation . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3. Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1. Convolutional Neural Network (CNN) . . . . . . . . . . . . . . . . . . 8
2.3.2. Residual Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3. Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.4. Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.5. Time-Distributed Operator Layer . . . . . . . . . . . . . . . . . . . . 14
3. Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1. Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1. Reasons for choosing the dataset . . . . . . . . . . . . . . . . . . . . . 15
3.1.2. Feature extraction and processing of the dataset . . . . . . . . . . . . . 16
3.1.3. Dataset processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2. The model design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1. RNN-Attention in Bahdanau’s Work . . . . . . . . . . . . . . . . . . . 20
3.2.2. Convolution with Bahdanau’s attention . . . . . . . . . . . . . . . . . 21
3.2.3. CNN based time-distributed Bahdanau’s attention . . . . . . .22
3.2.4. A more complex model . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.5. Further improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.6. The final model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4. Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1. Training efficiency evaluations . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2. A review of evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3. Objective music measurements with music features . . . . . . . . . 35
4.3.1. Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.2. Absolute measurement . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.3. Relative Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4. Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.1. Design of the subjective evaluation questionnaire . . . . . .47
4.4.2. The Result Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5. Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 52
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Appendix A. Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
