[1] Dreamtonics Co., Ltd., “Synthesizer V Studio.” Accessed: 2022-06-30.
[2] H. Kenmochi and H. Ohshita, “VOCALOID - commercial singing synthesizer based on sample concatenation,” in Proc. Interspeech 2007, pp. 4009–4010, 2007.
[3] R. Valle, J. Li, R. J. Prenger, and B. Catanzaro, “Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6189–6193, 2020.
[4] P. Lu, J. Wu, J. Luan, X. Tan, and L. Zhou, “XiaoiceSing: A high-quality and integrated singing voice synthesis system,” in Proc. Interspeech 2020, pp. 1306–1310, 2020.
[5] X. Zhuang, T. Jiang, S.-Y. Chou, B. Wu, P. Hu, and S. Lui, “LiteSing: Towards fast, lightweight and expressive singing voice synthesis,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7078–7082, 2021.
[6] Y. Gu, X. Yin, Y. Rao, Y. Wan, B. Tang, Y. Zhang, J. Chen, Y. Wang, and Z. Ma, “ByteSing: A Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and WaveRNN vocoders,” in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5, 2021.
[7] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, (Red Hook, NY, USA), Curran Associates Inc., 2020.
[8] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv, vol. abs/1701.07875, 2017.
[9] C.-C. Chu, F.-R. Yang, Y.-J. Lee, Y.-W. Liu, and S.-H. Wu, “MPop600: A Mandarin popular song database with aligned audio, lyrics, and musical scores for singing voice synthesis,” in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1647–1652, 2020.
[10] J. Liu, C. Li, Y. Ren, F. Chen, P. Liu, and Z. Zhao, “DiffSinger: Diffusion acoustic model for singing voice synthesis,” arXiv, vol. abs/2105.02446, 2021.
[11] J. Wu, Z. Huang, J. Thoma, D. Acharya, and L. Van Gool, “Wasserstein divergence for GANs,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[12] F.-R. Yang (楊馥榕), “Mandarin singing voice synthesis with a phonology-based duration model” (in Chinese), Master’s thesis, Department of Electrical Engineering, National Tsing Hua University, October 2021. https://hdl.handle.net/11296/5d946q.
[13] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, (Red Hook, NY, USA), Curran Associates Inc., 2020.
[14] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” in The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016, p. 125, ISCA, 2016.
[15] Y. Zhang, J. Cong, H. Xue, L. Xie, P. Zhu, and M. Bi, “VISinger: Variational inference with adversarial learning for end-to-end singing voice synthesis,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7237–7241, 2022.
[16] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
[17] J. Lee, H.-S. Choi, C.-B. Jeon, J. Koo, and K. Lee, “Adversarially trained end-to-end Korean singing voice synthesis system,” in Proc. Interspeech 2019, pp. 2588–2592, 2019.
[18] M. Blaauw and J. Bonada, “Sequence-to-sequence singing synthesis using the feed-forward transformer,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7229–7233, 2020.
[19] J. Chen, X. Tan, J. Luan, T. Qin, and T.-Y. Liu, “HiFiSinger: Towards high-fidelity neural singing voice synthesis,” arXiv, vol. abs/2009.01776, 2020.
[20] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, pp. 2410–2419, PMLR, 10–15 Jul 2018.
[21] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203, 2020.
[22] J. Kim, H. Choi, J. Park, M. Hahn, S. Kim, and J.-J. Kim, “Korean singing voice synthesis based on an LSTM recurrent neural network,” in Proc. Interspeech 2018, pp. 1551–1555, 2018.
[23] F.-R. Yang, Y.-P. Cho, Y.-H. Yang, D.-Y. Wu, S.-H. Wu, and Y.-W. Liu, “Mandarin singing voice synthesis with a phonology-based duration model,” in 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1975–1981, 2021.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[25] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Singing voice synthesis based on generative adversarial networks,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6955–6959, 2019.
[26] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Advances in Neural Information Processing Systems (M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds.), vol. 34, pp. 8780–8794, Curran Associates, Inc., 2021.
[27] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” in International Conference on Learning Representations, 2021.
[28] M. Jeong, H. Kim, S. J. Cheon, B. J. Choi, and N. S. Kim, “Diff-TTS: A denoising diffusion model for text-to-speech,” in Proc. Interspeech 2021, pp. 3605–3609, 2021.
[29] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion GANs,” in International Conference on Learning Representations, 2022.
[30] S. Liu, D. Su, and D. Yu, “DiffGAN-TTS: High-fidelity and efficient text-to-speech with denoising diffusion GANs,” arXiv, vol. abs/2201.11972, 2022.
[31] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations, 2021.
[32] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, “High fidelity speech synthesis with adversarial networks,” in International Conference on Learning Representations, 2020.
[33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[34] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv, vol. abs/2006.04558, 2021.
[35] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, vol. 107, pp. 3–11, 2018.
[36] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv, vol. abs/1505.00853, 2015.
[37] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4396–4405, 2019.
[38] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of StyleGAN,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8107–8116, 2020.
[39] Y.-J. Lee (李依哲), “Mandarin singing voice synthesis based on bidirectional recurrent neural networks” (in Chinese), Master’s thesis, Department of Electrical Engineering, National Tsing Hua University, October 2019. https://hdl.handle.net/11296/yf8qks.
[40] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
[42] Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, vol. 2, pp. 1398–1402, 2003.
[43] J. Nilsson and T. Akenine-Möller, “Understanding SSIM,” arXiv, vol. abs/2006.13846, 2020.
[44] C. Gan, X. Wang, M. Zhu, and X. Yu, “Audio quality evaluation using frequency structural similarity measure,” in IET International Communication Conference on Wireless Mobile and Computing (CCWMC 2011), pp. 299–303, 2011.
[45] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, pp. 5180–5189, PMLR, 10–15 Jul 2018.
[46] T. Li, S. Yang, L. Xue, and L. Xie, “Controllable emotion transfer for end-to-end speech synthesis,” in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–5, 2021.
[47] R. Daher, M. K. Zein, J. El Zini, M. Awad, and D. Asmar, “Change your singer: A transfer learning generative adversarial framework for song to song conversion,” in 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, 2020.
[48] S. Yong and J. Nam, “Singing expression transfer from one voice to another for a given song,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 151–155, 2018.
[49] R. Yoneyama, Y.-C. Wu, and T. Toda, “Unified Source-Filter GAN: Unified source-filter network based on factorization of quasi-periodic Parallel WaveGAN,” in Proc. Interspeech 2021, pp. 2187–2191, 2021.
[50] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter waveform models for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 402–415, 2020.
[51] V. Hampala, M. Garcia, J. G. Švec, R. C. Scherer, and C. T. Herbst, “Relationship between the electroglottographic signal and vocal fold contact area,” Journal of Voice, vol. 30, no. 2, pp. 161–171, 2016.
[52] S. Kim, K. Na, C. Lee, J. An, and I. Kim, “U-Singer: Multi-singer singing voice synthesizer that controls emotional intensity,” arXiv, vol. abs/2203.00931, 2022.
[53] Y. Agrawal, R. G. R. Shanker, and V. Alluri, “Transformer-based approach towards music emotion recognition from lyrics,” in Advances in Information Retrieval (D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, and F. Sebastiani, eds.), (Cham), pp. 167–175, Springer International Publishing, 2021.