[1] P. Lu, J. Wu, J. Luan, X. Tan, and L. Zhou, “XiaoiceSing: A high-quality and integrated singing voice synthesis system,” arXiv preprint arXiv:2006.06261, 2020.
[2] J. Chen, X. Tan, J. Luan, T. Qin, and T.-Y. Liu, “HiFiSinger: Towards high-fidelity neural singing voice synthesis,” arXiv preprint arXiv:2009.01776, 2020.
[3] Y. Gu, X. Yin, Y. Rao, Y. Wan, B. Tang, Y. Zhang, J. Chen, Y. Wang, and Z. Ma, “ByteSing: A Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and WaveRNN vocoders,” in International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5, 2021.
[4] C.-C. Chu, F.-R. Yang, Y.-J. Lee, Y.-W. Liu, and S.-H. Wu, “MPop600: A Mandarin popular song database with aligned audio, lyrics, and musical scores for singing voice synthesis,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1647–1652, 2020.
[5] J. Kim, H. Choi, J. Park, M. Hahn, S. Kim, and J.-J. Kim, “Korean singing voice synthesis system based on an LSTM recurrent neural network,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1551–1555, 2018.
[6] Y. Wu, S. Li, C. Yu, H. Lu, C. Weng, L. Zhang, and D. Yu, “Peking opera synthesis via duration informed attention network,” arXiv preprint arXiv:2008.03029, 2020.
[7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, 2018.
[8] X. Ying, “An overview of overfitting and its solutions,” Journal of Physics: Conference Series, IOP Publishing, 2019.
[9] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP, pp. 6199–6203, 2020.
[10] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[12] H. Kenmochi and H. Ohshita, “VOCALOID – commercial singing synthesizer based on sample concatenation,” in INTERSPEECH, 2007.
[13] K. Saino, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda, “An HMM-based singing voice synthesis system,” in International Conference on Spoken Language Processing, 2006.
[14] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on deep neural networks,” in INTERSPEECH, pp. 2478–2482, 2016.
[15] K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on convolutional neural networks,” arXiv preprint arXiv:1904.06868, 2019.
[16] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on generative adversarial networks,” in ICASSP, pp. 6955–6959, 2019.
[17] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[18] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
[19] J. Lee, H.-S. Choi, C.-B. Jeon, J. Koo, and K. Lee, “Adversarially trained end-to-end Korean singing voice synthesis system,” arXiv preprint arXiv:1908.01919, 2019.
[20] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning (ICML), pp. 2410–2419, 2018.
[21] J. Liu, C. Li, Y. Ren, F. Chen, P. Liu, and Z. Zhao, “DiffSinger: Diffusion acoustic model for singing voice synthesis,” arXiv preprint arXiv:2105.02446, 2021.
[22] H. Kawahara, “STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds,” Acoustical Science and Technology, vol. 27, no. 6, pp. 349–353, 2006.
[23] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech: Fast, robust and controllable text to speech,” arXiv preprint arXiv:1905.09263, 2019.
[24] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei, et al., “DurIAN: Duration informed attention network for multimodal synthesis,” arXiv preprint arXiv:1909.01700, 2019.
[25] M. Blaauw and J. Bonada, “Sequence-to-sequence singing synthesis using the feed-forward transformer,” in ICASSP, pp. 7229–7233, 2020.
[26] C. Lü, Chinese Literacy Learning in an Immersion Program. Springer, 2019.
[27] Y.-J. Lee, T.-C. Liao, and Y.-W. Liu, “A simple strategy for natural Mandarin spoken word stretching via the vocoder,” in International Congress on Acoustics (ICA), 2019.
[28] E. Moulines and J. Laroche, “Non-parametric techniques for pitch-scale and time-scale modification of speech,” Speech Communication, vol. 16, no. 2, pp. 175–205, 1995.
[29] J. Driedger, “Time-scale modification algorithms for music audio signals,” Master’s thesis, Saarland University, 2011.
[30] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, “The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech,” in APSIPA ASC, pp. 1–9, 2013.
[31] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015.
[32] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[35] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
[36] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A convolutional representation for pitch estimation,” in ICASSP, pp. 161–165, 2018.
[37] R. Valle, J. Li, R. Prenger, and B. Catanzaro, “Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens,” in ICASSP, pp. 6189–6193, 2020.
[38] P. Verma and J. O. Smith, “Neural style transfer for audio spectograms,” arXiv preprint arXiv:1801.01589, 2018.
[39] C.-W. Wu, J.-Y. Liu, Y.-H. Yang, and J.-S. R. Jang, “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1807.02254, 2018.