
Detailed Record

Author (Chinese): 李沂珊
Author (English): Lee, Yi-Shan
Title (Chinese): 基於跨模態提升匹配影片和音樂
Title (English): Video and Music Matching via Cross-Modality Lifting
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee Members (Chinese): 陳奕廷、劉奕汶
Committee Members (English): Chen, Yi-Ting; Liu, Yi-Wen
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 110061623
Publication Year (ROC): 112 (2023)
Graduation Academic Year: 111
Language: English
Number of Pages: 29
Keywords (Chinese): 跨模態、度量學習、匹配系統、短影片配樂、深度學習
Keywords (English): Cross-Modality, Metric Learning, Matching System, Short Video Background Music, Deep Learning
Statistics:
  • Recommendations: 0
  • Views: 370
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract:
We propose a content-based system for matching videos and background music. The system addresses the challenge of recommending music for short-form videos when the user or the music is new. To this end, we propose a cross-modality framework, VMCML (Video and Music Matching via Cross-Modality Lifting), that finds a shared embedding space between video and music representations. To ensure the embedding space can be effectively shared by both representations, we leverage the margin-based cosine similarity loss CosFace. Furthermore, to guarantee that the music is not the original sound of the video and that more than one video is matched to the same music, we collect videos and music from a well-known multimedia platform under these rules, since no existing dataset satisfies them. We establish a dataset called MSV, which provides 390 individual music tracks and the corresponding 150,000 matched short videos. We conduct extensive experiments on the YouTube-8M and MSV datasets. Our quantitative and qualitative results demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art video and music matching performance.
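For reference, the margin-based cosine similarity loss named in the abstract is the large margin cosine loss (LMCL) introduced by CosFace [1]. The form below is the published definition; how VMCML instantiates it across the two modalities (for instance, treating each music track as a class shared by the embeddings of its matched videos) is our reading of the abstract, and the exact training objectives are given in Section 3.2.

\begin{equation*}
\mathcal{L}_{\mathrm{LMCL}}
  = \frac{1}{N}\sum_{i=1}^{N}
    -\log
    \frac{e^{\,s(\cos\theta_{y_i} - m)}}
         {e^{\,s(\cos\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{\,s\cos\theta_j}},
\qquad
\cos\theta_j = \frac{W_j^{\top} x_i}{\lVert W_j \rVert\,\lVert x_i \rVert},
\end{equation*}

where x_i is an embedding with ground-truth class y_i, W_j is the weight vector of class j, s is a scale factor, and m is the cosine margin. Because both W_j and x_i are L2-normalized, the margin is imposed directly in cosine space, pushing classes apart on the unit hypersphere, which is what makes nearest-neighbor matching in a shared cross-modal embedding space effective.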
Abstract (Chinese)-----------------------3
Abstract---------------------------------5
Contents---------------------------------7
List of Figures--------------------------9
List of Tables---------------------------11
1 Introduction---------------------------1
2 Related Work---------------------------5
2.1 Cross-Modal Matching-----------------5
2.2 Metric Learning----------------------6
3 Approach-------------------------------7
3.1 Problem Definition-------------------7
3.2 Cross-Modality Training Objectives---8
3.3 VMCML Framework----------------------10
4 Experiments----------------------------13
4.1 Dataset------------------------------13
4.2 Implementation Details---------------15
4.3 Experimental Results-----------------16
4.4 Ablation Study-----------------------19
5 Conclusion-----------------------------23
References-------------------------------25
[1] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “CosFace: Large margin cosine loss for deep face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699, 2019.
[3] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “YouTube-8M: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
[4] S. Hong, W. Im, and H. S. Yang, “CBVMR: Content-based video-music retrieval using soft intra-modal structure constraint,” in ACM Conference on Multimedia (MM), pp. 353–361, 2018.
[5] D. Surís, A. Duarte, A. Salvador, J. Torres, and X. Giró-i-Nieto, “Cross-modal embeddings for video and audio retrieval,” in European Conference on Computer Vision Workshops (ECCV Workshops), 2018.
[6] J. Yi, Y. Zhu, J. Xie, and Z. Chen, “Cross-modal variational auto-encoder for content-based micro-video background music recommendation,” IEEE Transactions on Multimedia (TMM), 2021.
[7] M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586–587, 1991.
[8] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
[9] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep hypersphere embedding for face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 212–220, 2017.
[10] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The MegaFace benchmark: 1 million faces for recognition at scale,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4873–4882, 2016.
[11] Y. Jafarian and H. S. Park, “Learning high fidelity depths of dressed humans by watching social media dance videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12753–12762, 2021.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), pp. 740–755, Springer, 2014.
[13] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in IEEE International Conference on Computer Vision (ICCV), pp. 2641–2649, 2015.
[14] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in IEEE International Conference on Computer Vision (ICCV), pp. 706–715, 2017.
[15] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296, 2016.
[16] L. Zhen, P. Hu, X. Wang, and D. Peng, “Deep supervised cross-modal retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[17] J. Wei, X. Xu, Y. Yang, Y. Ji, Z. Wang, and H. T. Shen, “Universal weighting metric learning for cross-modal matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13005–13014, 2020.
[18] B. Li and A. Kumar, “Query by video: Cross-modal music retrieval,” in International Society for Music Information Retrieval Conference (ISMIR), pp. 604–611, 2019.
[19] D. Zeng, Y. Yu, and K. Oyama, “Audio-visual embedding for cross-modal music video retrieval through supervised deep CCA,” in IEEE International Symposium on Multimedia (ISM), pp. 143–150, 2018.
[20] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 539–546, 2005.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[22] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1386–1393, 2014.
[23] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition (SIMBAD), pp. 84–92, Springer, 2015.
[24] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision (ECCV), pp. 499–515, Springer, 2016.
[25] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[26] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The Kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: The Munich versatile and fast open-source audio feature extractor,” in ACM Conference on Multimedia (MM), pp. 1459–1462, 2010.
[29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[31] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780, 2017.