Detailed Record

Author (Chinese): 林俞攸
Author (English): Lin, Yu-Yu
Title (Chinese): 基於時域轉換器網路之物體動作識別
Title (English): Temporal-based Transformer Network for Video Recognition
Advisor (Chinese): 林嘉文
Advisor (English): Lin, Chia-Wen
Committee Members (Chinese): 許秋婷, 林彥宇, 彭彥璁
Committee Members (English): Hsu, Chiu-Ting; Lin, Yen-Yu
Degree: Master
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 107061536
Year of Publication (ROC calendar): 110 (2021)
Graduation Academic Year: 110
Language: English
Number of Pages: 20
Keywords (Chinese): 影像動作識別, 影像表現, 轉換器
Keywords (English): Video Action Recognition, Video Representation, Transformer
In recent years, the Transformer architecture has been widely used in natural language processing: by learning self-attention, it captures the contextual relationships within a sentence and uses them to solve a wide variety of language tasks. The same architecture has also begun to be applied to images and even videos, letting the model learn where the important attention lies in an image or video so that it can perform the target task more accurately.
This thesis introduces a method that no longer relies on the 3D convolutional networks traditionally used for video action recognition, but instead classifies the actions in a video by attending over the entire video sequence. Our approach exploits the Transformer to learn attention along the temporal dimension of a video, and the computed attention identifies the time steps most relevant to the action segment, which in turn supports the video action recognition task. We hope this architecture allows videos to be classified more accurately. Although video recognition techniques have kept improving in recent years, recognizing actions in longer videos remains somewhat difficult. Our method therefore places particular emphasis on temporal attention and improves the attention mechanism of the standard Transformer to reduce its computational complexity, so that longer videos can be processed and their key moments highlighted. At the same time, we exploit these attention properties so that the whole video achieves better performance in action classification.
Recently, transformer networks have been widely used in the field of NLP (natural language processing). By learning self-attention, they capture the contextual information in a sentence and use it to complete a variety of tasks. This framework has also been applied to computer vision: the model learns the important attention information from images and videos so that it can complete the target task more accurately.
This thesis introduces a method that classifies actions through an attention mechanism over the entire video sequence, rather than through traditional video-recognition approaches such as 3D convolutional networks. In our approach, the transformer network learns temporal attention, from which we obtain the important parts of the action appearing in the video to complete the video action recognition task. Longer videos remain difficult to recognize because of the training time and the loss of information as features propagate through the network. We therefore emphasize temporal attention and modify the attention mechanism of the traditional transformer network to reduce its computational complexity. In this way, the model can process long sequences and highlight the key parts of the video through the attention results. At the same time, the overall model achieves better results because of the temporal attention mechanism.
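The abstract and the table of contents describe the pipeline only at a high level: per-frame 2D spatial feature extraction, a temporal-based transformer encoder, and an MLP head for classification. The following is a minimal PyTorch sketch of that pipeline, not the author's implementation: it assumes a ResNet-18 backbone (the thesis does not name its 2D extractor here) and uses the standard quadratic self-attention of nn.TransformerEncoder rather than the reduced-complexity attention variant the thesis proposes; the class name and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TemporalTransformerClassifier(nn.Module):
    """Sketch: per-frame 2D features -> temporal transformer encoder -> MLP head."""
    def __init__(self, num_classes, d_model=512, num_layers=4, num_heads=8, max_frames=64):
        super().__init__()
        # 2D spatial feature extractor applied to each frame independently
        # (ResNet-18 is an assumption; d_model must match its 512-d pooled features)
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # learnable classification token and temporal positional embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames + 1, d_model))
        # standard transformer encoder over the time axis (full self-attention,
        # not the reduced-complexity mechanism described in the thesis)
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # MLP head for action classification
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, num_classes))

    def forward(self, video):                        # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                 # (B*T, 3, H, W)
        feats = self.backbone(frames).view(b, t, -1) # (B, T, 512) per-frame features
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, feats], dim=1) + self.pos_embed[:, : t + 1]
        x = self.encoder(x)                          # temporal self-attention across frames
        return self.head(x[:, 0])                    # classify from the classification token

# usage example: 2 clips of 16 RGB frames, 400 action classes (as in Kinetics-400)
clip = torch.randn(2, 16, 3, 224, 224)
logits = TemporalTransformerClassifier(num_classes=400)(clip)
print(logits.shape)                                  # torch.Size([2, 400])
```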
Chinese Abstract--------------------------------i
Abstract---------------------------------------ii
Content-----------------------------------------1
Chapter 1 Introduction--------------------------2
Chapter 2 Related Work--------------------------4
2.1 ConvNet-based Video Action Recognition------4
2.2 Transformer---------------------------------4
2.3 Vision Transformer--------------------------5
2.4 Transformers in computer vision-------------5
2.5 Summary of related works--------------------6
Chapter 3 Proposed Method-----------------------7
3.1 Overview------------------------------------7
3.2 2D Spatial Feature Extraction---------------7
3.3 Temporal-based Transformer Encoder----------8
3.4 MLP Head to Classify------------------------9
Chapter 4 Experiments--------------------------10
4.1 Training and Testing dataset---------------10
4.2 Experiment Setup---------------------------11
4.3 Objective accuracy results-----------------11
Chapter 5 Conclusion---------------------------17
Reference--------------------------------------18