Detailed Record

Author (Chinese): 王冠勛
Author (English): Wang, Kuan-Hsun
Title (Chinese): 基於注意力之深度度量學習應用至幾近重複視頻檢索
Title (English): Attention Based Deep Metric Learning for Near-Duplicate Video Retrieval
Advisor (Chinese): 賴尚宏
Advisor (English): Lai, Shang-Hong
Committee Members (Chinese): 邱維辰, 李哲榮, 許秋婷
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 106062535
Year of Publication (ROC calendar): 108 (2019)
Graduation Academic Year: 107
Language: English
Number of Pages: 38
Keywords (Chinese): 注意力; 度量學習; 重複視頻檢索; 重複視頻偵測
Keywords (English): attention; metric learning; near-duplicate video retrieval; video copy detection
Statistics:
  • Recommendations: 0
  • Views: 661
  • Rating: *****
  • Downloads: 32
  • Bookmarks: 0
Abstract (Chinese, translated):
As more and more videos are uploaded to the Internet, near-duplicate video retrieval has drawn increasing attention, and many methods based on deep networks have been proposed. In our approach, we address the video retrieval problem with an attention-based deep learning architecture: through the attention mechanism, the network identifies the more important features and learns an embedding function, and in this embedding space similar videos lie closer to each other, which makes them easier to retrieve. In addition, we adopt a two-stream architecture that takes both RGB frames and optical flow as input features, and multiple feature vectors are used to represent each video. Our experiments show that the attention network can eliminate redundant and noisy segments and focus on the parts where two videos are similar, while multiple feature vectors preserve scene information. We evaluate our deep network on three large-scale datasets: VCDB, FIVR, and CC_WEB_VIDEO. To demonstrate the capability of our method, we conduct both within-dataset and cross-dataset evaluations, which show that our method outperforms other state-of-the-art approaches.
Abstract (English):
Near-duplicate video retrieval (NDVR) has recently attracted great attention as more and more videos are uploaded to the Internet. Several methods, especially ones leveraging deep neural networks, have recently been proposed. In this thesis, we propose an attention-based deep metric learning method for NDVR. Through our attention mechanism, the more important features are identified and used to learn an embedding function. In this embedding space, near-duplicate videos lie closer to each other, which makes them easier to retrieve. Additionally, we use a two-stream architecture to take advantage of both RGB frame features and optical flow features, and multiple features are used to represent each video. Our experiments show that the attention network can eliminate redundant and noisy frames while focusing on finding similar parts, and that multiple features preserve more scene information from the input video.
We evaluate our approach on the recent large-scale NDVR datasets VCDB, FIVR, and CC_WEB_VIDEO. To demonstrate its generalization ability, we report results in both within- and cross-dataset settings and show that the proposed method significantly outperforms the state-of-the-art approaches.
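
The following is a minimal, purely illustrative Python/PyTorch sketch (not the thesis code) of the core idea described in the abstracts: frame-level features are weighted by a learned attention score, pooled into a single video embedding, and trained with a triplet loss so that near-duplicate videos end up close together in the embedding space. All dimensions, layer choices, and the margin are assumed values, and the class and variable names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveVideoEmbedder(nn.Module):
    """Toy attention-pooled video embedding (hypothetical, for illustration only)."""
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)          # scores each frame's importance
        self.proj = nn.Linear(feat_dim, embed_dim)  # projects the pooled feature to the embedding

    def forward(self, frame_feats):
        # frame_feats: (num_frames, feat_dim) frame-level CNN features
        weights = torch.softmax(self.attn(frame_feats), dim=0)  # (num_frames, 1), sums to 1
        pooled = (weights * frame_feats).sum(dim=0)             # attention-weighted average
        return F.normalize(self.proj(pooled), dim=0)            # unit-length video embedding

model = AttentiveVideoEmbedder()
criterion = nn.TripletMarginLoss(margin=0.2)

# anchor and positive stand for a near-duplicate pair, negative for an unrelated video
anchor = model(torch.randn(30, 2048))
positive = model(torch.randn(25, 2048))
negative = model(torch.randn(40, 2048))

loss = criterion(anchor.unsqueeze(0), positive.unsqueeze(0), negative.unsqueeze(0))
loss.backward()  # gradients flow into both the attention and projection layers

This single-stream sketch omits the optical-flow branch and the multiple feature vectors per video mentioned in the abstracts; it only shows how attention-weighted pooling and metric learning fit together.
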
Contents
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Contributions
  1.4 Thesis Organization
2 Related Work
  2.1 Local feature-based method
  2.2 Global feature-based method
  2.3 CNN feature-based method
3 Method
  3.1 Frame-level Feature Extraction
  3.2 Attention Module
  3.3 Global Video Feature Learning
  3.4 Multiple Features
  3.5 Loss Function
  3.6 Implementation Details
4 Experiments
  4.1 Datasets
  4.2 Evaluation Metrics
  4.3 Experimental Results
    4.3.1 Cross-dataset evaluation
    4.3.2 Within-dataset evaluation
    4.3.3 Frame-level evaluation
  4.4 Analysis
  4.5 Visualization
    4.5.1 Attention
    4.5.2 Video Feature Space
5 Conclusions
References