Detailed Record

Author (Chinese): 彭玉淮
Author (English): Peng, Yu-Huai
Title (Chinese): 基於局部線性嵌入語音轉換的快速演算法與改良
Title (English): Fast Algorithm and Quality Improvement for Locally Linear Embedding based Voice Conversion
Advisors (Chinese): 劉奕汶、王新民
Advisors (English): Liu, Yi-Wen; Wang, Hsin-Min
Committee Members (Chinese): 賴穎暉、陳新、吳順吉
Committee Members (English): Lai, Ying-Hui; Chen, Hsin; Wu, Shun-Chi
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 104061618
Year of Publication: 107 (ROC calendar; 2018 CE)
Academic Year of Graduation: 106
Language: Chinese
Number of Pages: 46
Keywords (Chinese): 語音轉換、局部線性嵌入演算法、快速演算法、音質改良
Keywords (English): Voice Conversion; Locally Linear Embedding Algorithm; Fast Algorithm; Quality Improvement
The locally linear embedding (LLE) algorithm has been shown to achieve good output quality, speaker similarity, and applicability in voice conversion (VC) tasks. However, the major shortcoming of LLE-based VC is its high time complexity in the conversion phase, which makes real-time application difficult. At the same time, we believe the speech quality can be improved further. In this thesis, we therefore propose several methods to improve the speech quality, together with a fast version of the LLE algorithm that significantly reduces the computational complexity.
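For context, a minimal sketch of the standard exemplar-based LLE conversion step follows: for each source frame, find its k nearest source exemplars, solve a constrained least-squares problem for the reconstruction weights, and apply those weights to the paired target exemplars. The names (lle_convert_frame, src_dict, tgt_dict) and the regularization constant are illustrative assumptions, not the thesis's actual code.

```python
import numpy as np

def lle_convert_frame(x, src_dict, tgt_dict, k=16, reg=1e-3):
    """Convert one source frame x of shape (D,) using paired (N, D)
    exemplar dictionaries src_dict / tgt_dict. Hypothetical sketch."""
    # Find the k nearest source exemplars of the input frame.
    nn = np.argsort(np.linalg.norm(src_dict - x, axis=1))[:k]
    neighbors = src_dict[nn]                     # (k, D)

    # Solve for weights w minimizing ||x - w @ neighbors||^2 subject to
    # sum(w) = 1, via the standard LLE local Gram-matrix solution.
    diff = neighbors - x                         # (k, D)
    G = diff @ diff.T                            # local Gram matrix, (k, k)
    G += reg * np.trace(G) * np.eye(k)           # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                 # enforce sum-to-one

    # Apply the same weights to the paired target exemplars.
    return w @ tgt_dict[nn]
```

The per-frame nearest-neighbor search over all N exemplars is exactly the conversion-phase cost that the fast algorithm described below is designed to remove.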
For quality improvement, we apply the LLE algorithm to predict the spectral details lost in the converted spectral envelopes (SEs), referred to as residuals, and then use the predicted residuals to compensate the converted SEs. Experimental results show that this method notably improves the speech quality.
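The abstract states only that LLE is reused to predict the residuals; one plausible reading, sketched below with hypothetical names (compensate_frame, conv_dict, res_dict), pairs converted training SEs with their residuals (target SE minus converted SE) as a second dictionary and reuses lle_convert_frame from the sketch above.

```python
def compensate_frame(y_conv, conv_dict, res_dict, k=16):
    """Hypothetical residual-compensation pass: conv_dict holds converted
    training SEs, res_dict the paired residuals (target minus converted)."""
    # Predict the residual of the converted frame with the same LLE
    # machinery, then add it back to restore the lost spectral details.
    r_hat = lle_convert_frame(y_conv, conv_dict, res_dict, k=k)
    return y_conv + r_hat
```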
For the fast version of the LLE algorithm, we describe the data manifold of each locally linear patch with a pre-computed cluster of exemplars, so that most of the online computation can be carried out beforehand in the offline phase. Experimental results demonstrate that the VC performance of the proposed fast LLE algorithm is comparable to that of the original LLE algorithm, while the greatly reduced time complexity makes a real-time VC system feasible. In addition, the fast LLE algorithm can be readily applied to the residual compensation method and to many-to-one VC systems.
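The abstract gives only the high-level idea of the pre-clustered dictionary, so the sketch below is a structural illustration rather than the thesis's actual design: the use of scikit-learn's KMeans, the class name FastLLEConverter, and the choice of what to precompute (each cluster's exemplar Gram matrix) are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

class FastLLEConverter:
    """Hypothetical sketch of pre-clustered fast LLE conversion."""

    def __init__(self, src_dict, tgt_dict, n_clusters=256, reg=1e-3):
        # Offline phase: cluster the source dictionary so that each
        # cluster's exemplars describe one locally linear patch.
        km = KMeans(n_clusters=n_clusters, n_init=4).fit(src_dict)
        self.centroids = km.cluster_centers_
        self.reg = reg
        self.clusters = []
        for c in range(n_clusters):
            idx = np.where(km.labels_ == c)[0]
            A, B = src_dict[idx], tgt_dict[idx]
            # Precompute A @ A.T offline, so assembling the online Gram
            # matrix costs O(kD + k^2) instead of O(k^2 D).
            self.clusters.append((A, B, A @ A.T))

    def convert(self, x):
        # Online phase: a nearest-centroid lookup over C centroids
        # replaces the k-NN search over all N exemplars.
        c = np.argmin(np.linalg.norm(self.centroids - x, axis=1))
        A, B, AAt = self.clusters[c]
        k = A.shape[0]
        # Gram matrix of (A - x): AA^T - (Ax)1^T - 1(Ax)^T + (x.x)11^T.
        Ax = A @ x
        G = AAt - Ax[:, None] - Ax[None, :] + x @ x
        G = G + self.reg * (np.trace(G) + 1e-8) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        w /= w.sum()
        return w @ B
```

Under this reading, the online cost per frame drops to one distance scan over the C cluster centroids plus one small k-by-k solve, which is what makes a real-time system plausible.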
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1 Research Background
1.2 Literature Review
1.3 Chapter Overview
2. Locally Linear Embedding based Voice Conversion (LLE-SC) System
2.1 Offline Phase
2.2 Online Phase
2.3 Post-processing (post-filtering)
2.4 Feature Extraction and Synthesis
2.5 Prosody Conversion
2.6 Many-to-one Voice Conversion
3. Methods
3.1 Quality Improvement
3.1.1 Iterative Dictionary Update
3.1.2 Contextual Features
3.1.3 Residual Compensation
3.2 Fast Algorithm
3.2.1 Algorithm Flow
3.2.2 Parameter Computation
3.2.3 Online Phase
4. Experimental Results and Analysis
4.1 Experimental Design
4.2 Results: Quality Improvement
4.3 Results: Fast Algorithm
5. Conclusion and Future Work
6. References