
Detailed Record

Author (Chinese): 李 享
Author (English): Lee, Shiang
Title (Chinese): 使用超長程關聯卷積來處理泛音與相位回復以解決歌聲分離問題
Title (English): Using a Long Range U-Net to Deal with Overtones and Phase Restoration in Singing Voice Separation Problems
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee Members (Chinese): 沈之涯、邱瀞德
Committee Members (English): Shen, Chih-Ya; Chiu, Ching-Te
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 106062620
Publication Year (ROC calendar): 109 (2020)
Graduation Academic Year: 108
Language: English
Number of Pages: 34
Keywords (Chinese): 歌聲分離、卷積神經網路、相位回復
Keywords (English): singing voice separation; convolutional layers; deep learning; phase reconstruction
Abstract (Chinese): Audio source separation is today a challenging research direction: it has not only drawn many researchers into developing related techniques, but has also been the subject of the Signal Separation Evaluation Campaign (SiSEC), a competition held almost annually since 2008. After studying the 2018 participants and the campaign's analysis report, we found that, lacking a data compression technique that is both effective and efficient, the vast majority of competitors discarded both the higher-frequency information in the training data and the phase information of the spectrograms. Motivated by this observation, we developed a new deep learning model, OvertoneNet (OveNet), which employs two new techniques: frequency 1x1 convolution layers (F1x1 convolution layers) and complex-spectrogram channels. These let us process full 44.1 kHz (high-resolution) audio and exploit the overtone relations that pervade music to improve both the efficiency and the quality of model training, an advantage that other models cannot realize. Our experimental results show that our separation performance surpasses all SiSEC 2018 competitors in both objective and subjective evaluations, confirming the effectiveness of our method.
Abstract (English): Audio source separation is a challenging topic that attracted various research teams to attend the Signal Separation Evaluation Campaign (SiSEC) in 2018. Most top-ranked competitors based on deep learning methods ignored the higher-frequency harmonic signals and the phase information due to the lack of efficient data compression methods. We propose a new deep learning model named OvertoneNet (OveNet) that adopts two novel concepts, frequency 1x1 convolution layers and complex-spectrogram channels, to handle 44.1 kHz audio signals (Hi-Res audio signals) with a wide range of overtones. The results of our experiment show that OveNet performs well in both objective and subjective evaluations of interference, using limited training data from SiSEC 2018.
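The abstract names two architectural ideas without giving code, so the following is a minimal illustrative sketch, not the thesis implementation. It assumes PyTorch, and every name, shape, and hyperparameter in it is hypothetical. It shows one plausible reading of each idea: the real and imaginary parts of the STFT stacked as input channels so the network retains phase (complex-spectrogram channels), and a kernel-size-1 convolution that treats frequency bins as channels so each output bin can mix all bins of the same time frame, including a fundamental and its overtones (frequency 1x1 convolution).

import torch
import torch.nn as nn

# Sketch of "complex-spectrogram channels": the real and imaginary parts
# of the STFT are stacked as two channels, so phase survives into the input.
def complex_spectrogram_channels(wave, n_fft=2048, hop=512):
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)  # (freq, time)
    return torch.stack([spec.real, spec.imag], dim=0)      # (2, freq, time)

# Sketch of a "frequency 1x1 convolution": frequency bins are treated as
# channels, so a kernel-size-1 convolution mixes every bin of a time frame
# with every other bin (e.g. a fundamental with its overtones).
class FrequencyOneByOneConv(nn.Module):
    def __init__(self, n_bins):
        super().__init__()
        self.mix = nn.Conv1d(n_bins, n_bins, kernel_size=1)

    def forward(self, x):       # x: (batch, freq, time)
        return self.mix(x)

wave = torch.randn(44100)                      # one second at 44.1 kHz
spec = complex_spectrogram_channels(wave)      # (2, 1025, 87)
layer = FrequencyOneByOneConv(n_bins=spec.shape[1])
out = layer(spec)                              # real/imag parts act as the batch
assert out.shape == spec.shape

Under this reading, the 1x1 mixing links harmonically related bins across the full 44.1 kHz band without widening the receptive field in time, which matches the abstract's claim of exploiting overtone relations in high-resolution audio.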
Abstract (Chinese)
Abstract (English)
Acknowledgement
List of Tables
List of Figures

1 Introduction

2 Related Work

3 Methodology
3.1 The Phase Interference Problem
3.2 Utilizing Phase Information
3.3 Convolution with 1x1 Kernel
3.4 Frequency 1x1 Convolution Layer
3.5 The Model Architecture
3.6 The Loss Functions
3.7 The Dataset

4 Experiments and Results
4.1 The Objective Score Metrics
4.2 Evaluation on Model Improvement
4.3 The Objective Evaluation
4.4 The Subjective Evaluation

5 Conclusion and Future Work

References