
Detailed Record

Author (Chinese): 康冠儀
Author (English): Kang, Kuan-Yi
Title (Chinese): 綜合考量音色及音樂表情控制之歌聲轉換
Title (English): Singing voice conversion by joint consideration of timbre and expressive control
Advisor (Chinese): 劉奕汶
Advisor (English): Liu, Yi-wen
Committee Members (Chinese): 王新民, 黃朝宗, 吳尚鴻
Committee Members (English): Wang, Hsin-Min; Huang, Chao-Tsung; Wu, Shan-Hung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 106061538
Year of Publication (ROC calendar): 108 (2019)
Graduating Academic Year: 108
Language: English
Number of Pages: 51
Keywords (Chinese): 歌聲轉換; 韻律; 情感
Keywords (English): Singing voice conversion; Prosody; Expression
Statistics:
  • Recommendations: 0
  • Views: 261
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese):
Singing voice conversion transforms singer A's voice into singer B's by converting the features that carry personal identity. The singing voice is an information-rich signal: individual singing habits, preferences, and techniques may be reflected not only in timbre but also in the time series of pitch and intensity. To date, however, most voice conversion methods have concentrated on algorithms for timbre conversion. This thesis therefore uses a series of experiments to examine the individuality that singing voices carry in these prosodic features, and how replacing these features affects human listeners' identification of singers; finally, it implements and proposes a singing voice conversion method that performs both timbre and prosody conversion.
Features extracted from pitch and intensity reach 50% accuracy in a six-singer identification task using a random forest, and several features also show statistically significant between-singer differences. In the listening tests, the more prosodic features came from someone other than the original singer, the lower the singer identification rate and perceived similarity. For the proposed conversion method, adding dictionary-based expression conversion yields higher listener-rated similarity than timbre-only conversion, but lower quality ratings. In addition, listeners' familiarity with the singers also affects the listening-test results.
Abstract (English):
The task of singing voice conversion is to convert a source singer's singing voice into a target singer's. Changing the identity from source to target requires converting the features that carry individuality. Most voice conversion methods focus only on timbre conversion from the source to the target speaker, yet the singing voice carries plentiful information not only in timbre but also in expressive features such as pitch and intensity, and personal habits and techniques may reside in the temporal trajectories of these features.
This thesis therefore discusses the individuality of these expressive features in the singing voice and designs experiments to investigate how transforming them affects human perception of singer identity and similarity. Singing voice conversion methods that combine timbre and expression conversion are then proposed to improve the similarity score toward the target.
Features extracted from pitch and intensity achieve 50% accuracy in classifying six singers with a random forest, and a few of the features also show significant between-singer differences. The more the expressive features are modified away from the original singer's, the lower the similarity and identification scores. The proposed dictionary-based expression conversion improves the subjective similarity score but receives lower quality scores than the timbre-only conversion methods. Listeners' familiarity with the singers likewise leads to different subjective scores.
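As a rough illustration of the analysis summarized above, the following Python sketch classifies singers with a random forest using summary features of per-note pitch (F0) and intensity contours. It runs on synthetic stand-in data; the specific feature set, contour length, and data layout are assumptions made for illustration, not the thesis's actual pipeline.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def contour_features(f0, intensity):
    # Per-note summary features: mean, spread, range, and linear trend
    # of each contour (hypothetical stand-ins for the thesis features).
    feats = []
    for x in (f0, intensity):
        slope = np.polyfit(np.arange(len(x)), x, 1)[0]
        feats += [x.mean(), x.std(), np.ptp(x), slope]
    return feats

# Synthetic data: 6 "singers", 100 notes each; each singer gets a
# slightly different habitual F0 level, vibrato rate, and crescendo.
X, y = [], []
for singer in range(6):
    for _ in range(100):
        t = np.linspace(0, 1, 50)
        f0 = (200 + 10 * singer
              + 5 * np.sin(2 * np.pi * (5 + 0.5 * singer) * t)
              + rng.normal(0, 3, 50))
        intensity = 60 + 2 * singer * t + rng.normal(0, 1, 50)
        X.append(contour_features(f0, intensity))
        y.append(singer)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(clf, np.asarray(X), np.asarray(y), cv=5).mean())

On well-separated synthetic data like this the accuracy approaches 1.0; the thesis reports about 50% on real recordings of six singers, still well above the 1/6 chance level.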
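The dictionary-based expression conversion mentioned above can be pictured, in a much-simplified form, as exemplar substitution: paired expression exemplars (here, time-normalized F0 contour shapes) from the source and target singers are stored side by side, and an input contour is replaced by the target-side entry of its nearest source-side exemplar. The sketch below is minimal and built on those assumptions; it is not the thesis's implementation.

import numpy as np

rng = np.random.default_rng(1)
L = 50   # frames per time-normalized note contour (assumption)
N = 40   # number of paired exemplar notes in the dictionary

# Paired dictionary: rows i of D_src and D_tgt hold the source and
# target singers' F0 contour *shapes* (mean-removed) for the same note.
# Random placeholders stand in for real aligned recordings.
D_src = rng.normal(0.0, 1.0, (N, L))
D_tgt = rng.normal(0.0, 1.0, (N, L))

def convert_contour(f0):
    """Map a source-singer F0 contour onto the target singer's style."""
    mean = f0.mean()
    shape = f0 - mean
    # Find the nearest exemplar shape on the source side ...
    i = np.argmin(np.linalg.norm(D_src - shape, axis=1))
    # ... and substitute the paired target-side shape, re-centered on
    # the input's own mean pitch so the melody is preserved.
    return mean + D_tgt[i]

converted = convert_contour(200 + rng.normal(0.0, 1.0, L))
print(converted.shape)   # (50,): one converted F0 contour

A common refinement of this idea reconstructs the input as a sparse weighted combination of several source-side atoms and applies the same weights to the target-side atoms, rather than picking a single nearest neighbour.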
Abstract (Chinese) -----------------------------------------------I
Abstract --------------------------------------------------------II
Acknowledgements -----------------------------------------------III
Contents --------------------------------------------------------VI
List of Figures -----------------------------------------------VIII
List of Tables --------------------------------------------------IX
1 Introduction ---------------------------------------------------1
1.1 Background and Motivation ------------------------------------1
1.2 Thesis Overview ----------------------------------------------2
2 Individuality of Singing Voice in Pitch and Intensity ----------4
2.1 Introduction -------------------------------------------------4
2.2 Method -------------------------------------------------------5
2.2.1 Recordings -------------------------------------------------5
2.2.2 Labeling Process -------------------------------------------6
2.2.3 Features ---------------------------------------------------7
2.2.4 Models for Analysis ----------------------------------------8
2.2.4.1 One-way ANOVA --------------------------------------------8
2.2.4.2 Linear Discriminant Analysis ----------------------------10
2.2.4.3 Random Forest -------------------------------------------10
2.3 Results of Experiments and Discussion -----------------------10
2.3.1 Statistical Results ---------------------------------------10
2.3.2 Visualization ---------------------------------------------12
2.3.3 Singer Identification Results -----------------------------12
2.4 Summary -----------------------------------------------------13
3 Influences of Prosodic Feature Replacement on the Perceived Singing Voice Identity --------------------------------------------------15
3.1 Introduction ------------------------------------------------15
3.2 Methods -----------------------------------------------------16
3.2.1 Recordings ------------------------------------------------16
3.2.2 Participants ----------------------------------------------17
3.2.3 Experiment Design -----------------------------------------17
3.2.3.1 Singer Identification -----------------------------------17
3.2.3.2 Identification and Similarity Task of the Timbre-carrying Singer ----------------------------------------------------------17
3.2.3.3 Identification and Similarity Task of the Timbre-converted Singer ----------------------------------------------------------19
3.3 Results of Experiments and Discussion -----------------------20
3.3.1 Singer Identification -------------------------------------20
3.3.2 Identification and Similarity Task of the Timbre-carrying Singer ----------------------------------------------------------21
3.3.3 Identification and Similarity Task of the Timbre-converted Singer ----------------------------------------------------------24
3.3.4 Discussion ------------------------------------------------26
3.4 Summary -----------------------------------------------------27
4 Dictionary-based Singing Voice Conversion ---------------------29
4.1 Introduction ------------------------------------------------29
4.2 Methods -----------------------------------------------------31
4.2.1 Timbre Conversion -----------------------------------------31
4.2.1.1 Gaussian Mixture Model ----------------------------------31
4.2.1.2 Gaussian Mixture Model with Differential Filter ---------32
4.2.1.3 Locally Linear Embedded Method --------------------------34
4.2.1.4 Locally Linear Embedded Method with Differential Filter -35
4.2.2 Expression Conversion -------------------------------------35
4.2.2.1 Feature Extraction on Expression ------------------------35
4.2.2.2 Dictionary-based Expression Conversion ------------------36
4.3 Results of Experiments and Discussion -----------------------38
4.3.1 Experiment Setup ------------------------------------------38
4.3.2 Objective Test --------------------------------------------39
4.3.3 Subjective Test -------------------------------------------40
4.4 Conclusion --------------------------------------------------41
5 Conclusion and Future Work ------------------------------------42
References ------------------------------------------------------44
Appendix --------------------------------------------------------51