[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017.
[2] L.-C. Yang and A. Lerch, “On the evaluation of generative models in music,” Neural Computing and Applications, pp. 1–12, 2018.
[3] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, “Deep learning techniques for music generation,” pp. 4–5, Springer, 2020.
[4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” 2016.
[5] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” 2017.
[6] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” 2017.
[7] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” 2018.
[8] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” in International Conference on Learning Representations, 2019.
[9] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial neural audio synthesis,” in International Conference on Learning Representations, 2019.
[10] S. Dieleman, A. van den Oord, and K. Simonyan, “The challenge of realistic music generation: modelling raw audio at scale,” in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), vol. 31, Curran Associates, Inc., 2018.
[11] R. Manzelli, V. Thakkar, A. Siahkamari, and B. Kulis, “Conditioning deep generative raw audio models for structured automatic music,” 2018.
[12] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.
[13] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” 2019.
[14] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet: A generative flow for raw audio,” 2019.
[15] J. Serrà, S. Pascual, and C. Segura Perales, “Blow: A single-scale hyperconditioned flow for non-parallel raw-audio voice conversion,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.
[16] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” 2019.
[17] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, “High fidelity speech synthesis with adversarial networks,” in International Conference on Learning Representations, 2020.
[18] W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A compact flow-based model for raw audio,” 2020.
[19] E. Waite, “Project Magenta: Generating long-term structure in songs and stories.” https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn, 2016.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “MidiNet: A convolutional generative adversarial network for symbolic-domain music generation,” 2017.
[23] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” 2017.
[24] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” 2014.
[25] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. Courville, and D. Eck, “Counterpoint by convolution,” 2019.
[26] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” 2015.
[27] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, “A hierarchical latent vector model for learning long-term structure in music,” 2019.
[28] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR, vol. abs/1312.6114, 2014.
[29] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer,” in International Conference on Learning Representations, 2019.
[30] Y.-S. Huang and Y.-H. Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” 2020.
[31] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014.
[32] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L.-R. Dai, “An attention pooling based representation learning method for speech emotion recognition,” in Interspeech 2018, International Speech Communication Association, September 2018.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2015.
[34] “Note (i).” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000020121, 2001.
[35] W. Drabkin, “Scale.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000024691, 2001.
[36] “Whole tone.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000030241, 2001.
[37] W. Drabkin and M. Lindley, “Semitone.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000025395, 2001.
[38] “Heptatonic.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000012823, 2001.
[39] “Heptachord.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000012822, 2001.
[40] “Tonic.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000028121, 2001.
[41] W. Drabkin, “Motif.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000019221, 2001.
[42] D. Fallows, “Head-motif.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000012638, 2001.
[43] A. L. Ringer, “Melody.” https://www.oxfordmusiconline.com/grovemusic/view/10.1093/gmo/9781561592630.001.0001/omo-9781561592630-e-0000018357, 2001.
[44] K.-J. Sachs and C. Dahlhaus, “Counterpoint,” 2001.
[45] A. B. Downey, “Think DSP: Digital signal processing in Python,” ch. 8.2 Filtering and Convolution, pp. 91–93, O’Reilly Media, Inc., 1st ed., 2016.
[46] M. Hayes, Schaum’s Outline of Digital Signal Processing, ch. 1.4 Convolution, pp. 11–15. Schaum’s, McGraw-Hill, 1st ed., 1998.
[47] G. Blanchet and M. Charbit, Digital Signal and Image Processing using MATLAB, Volume 1: Fundamentals, ch. 4.1 Definitions and properties, pp. 115–120. Wiley-ISTE, 2nd ed., 2014.
[48] G. Blanchet and M. Charbit, Digital Signal and Image Processing using MATLAB, Volume 1: Fundamentals, ch. 5.4 Frequential content of an image, pp. 198–204. Wiley-ISTE, 2nd ed., 2014.
[49] M. Hayes, Schaum’s Outline of Digital Signal Processing, ch. 2 Fourier Analysis, pp. 55–67. Schaum’s, McGraw-Hill, 1st ed., 1998.
[50] G. Blanchet and M. Charbit, Digital Signal and Image Processing using MATLAB, Volume 1: Fundamentals, ch. 1.1.2 Spectral representation of signals, pp. 57–60. Wiley-ISTE, 2nd ed., 2014.
[51] G. Blanchet and M. Charbit, Digital Signal and Image Processing using MATLAB, Volume 1: Fundamentals, ch. 2 Discrete Time Signals and Sampling, pp. 65–93. Wiley-ISTE, 2nd ed., 2014.
[52] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” 2018.
[53] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” 2013.
[54] M. Lin, Q. Chen, and S. Yan, “Network in network,” 2014.
[55] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” in A Field Guide to Dynamical Recurrent Neural Networks (S. C. Kremer and J. F. Kolen, eds.), IEEE Press, 2001.
[56] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[57] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[58] P. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[59] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” 2013.
[60] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” 2014.
[61] H. Micko, “Attention in detection theory,” in Trends in Mathematical Psychology (E. Degreef and J. Van Buggenhaut, eds.), vol. 20 of Advances in Psychology, pp. 87–103, North-Holland, 1984.
[62] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, (Lisbon, Portugal), pp. 1412–1421, Association for Computational Linguistics, Sept. 2015.
[63] “tf.keras.layers.TimeDistributed.” https://www.tensorflow.org/api_docs/python/tf/keras/layers/TimeDistributed?hl=zh-tw.
[64] “TimeDistributed layer.” https://keras.io/api/layers/recurrent_layers/time_distributed/.
[65] B. McFee, V. Lostanlen, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee, O. Nieto, J. Mason, D. Ellis, E. Battenberg, S. Seyfarth, R. Yamamoto, K. Choi, viktorandreevichmorozov, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, D. Hereñú, F.-R. Stöter, P. Friesch, A. Weiss, M. Vollrath, and T. Kim, “librosa/librosa: 0.8.0,” July 2020.
[66] The pandas development team, “pandas-dev/pandas: Pandas,” Feb. 2020.
[67] J. Brownlee, Deep Learning for Time Series Forecasting - Predict the Future with MLPs, CNNs and LSTMs in Python. 2018.
[68] D. Foster, Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play. O'Reilly Media, 2019.
[69] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
[70] F. Schmidt, “Generalization in generation: A closer look at exposure bias,” 2019.
[71] C. Ariza, “The interrogator as critic: The Turing test and the evaluation of generative music systems,” Computer Music Journal, vol. 33, no. 2, pp. 48–70, 2009.
[72] M. Pearce and G. Wiggins, “Evaluating cognitive models of musical composition,” pp. 73–80, Jan. 2007.
[73] M. Bretan, G. Weinberg, and L. Heck, “A unit selection methodology for music generation using deep neural networks,” 2016.