[1] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699, 2019.
[3] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
[4] S. Hong, W. Im, and H. S. Yang, “Cbvmr: Content-based video-music retrieval using soft intra-modal structure constraint,” in ACM Conference on Multimedia (MM), pp. 353–361, 2018.
[5] D. Surís, A. Duarte, A. Salvador, J. Torres, and X. Giró-i Nieto, “Cross-modal embeddings for video and audio retrieval,” in European Conference on Computer Vision Workshops (ECCV Workshops), 2018.
[6] J. Yi, Y. Zhu, J. Xie, and Z. Chen, “Cross-modal variational auto-encoder for content-based micro-video background music recommendation,” IEEE Transactions on Multimedia (TMM), 2021.
[7] M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–587, IEEE Computer Society, 1991.
[8] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[9] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 212–220, 2017.
[10] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The megaface benchmark: 1 million faces for recognition at scale,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4873–4882, 2016.
[11] Y. Jafarian and H. S. Park, “Learning high fidelity depths of dressed humans by watching social media dance videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12753–12762, June 2021.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV), pp. 740–755, Springer, 2014.
[13] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in IEEE International Conference on Computer Vision (ICCV), pp. 2641–2649, 2015.
[14] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in IEEE International Conference on Computer Vision (ICCV), pp. 706–715, 2017.
[15] J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridging video and language,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296, 2016.
[16] L. Zhen, P. Hu, X. Wang, and D. Peng, “Deep supervised cross-modal retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[17] J. Wei, X. Xu, Y. Yang, Y. Ji, Z. Wang, and H. T. Shen, “Universal weighting metric learning for cross-modal matching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13005–13014, 2020.
[18] B. Li and A. Kumar, “Query by video: Cross-modal music retrieval,” in International Society for Music Information Retrieval Conference (ISMIR), pp. 604–611, 2019.
[19] D. Zeng, Y. Yu, and K. Oyama, “Audio-visual embedding for cross-modal music video retrieval through supervised deep cca,” in 2018 IEEE International Symposium on Multimedia (ISM), pp. 143–150, IEEE, 2018.
[20] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 539–546, IEEE, 2005.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[22] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1386–1393, 2014.
[23] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12–14, 2015, Proceedings 3, pp. 84–92, Springer, 2015.
[24] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pp. 499–515, Springer, 2016.
[25] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[26] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: The munich versatile and fast open-source audio feature extractor,” in ACM Conference on Multimedia (MM), pp. 1459–1462, 2010.
[29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[31] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780, IEEE, 2017.