Author: Lin, Yu-Yu
Thesis Title: Temporal-based Transformer Network for Video Recognition
Advisor: Lin, Chia-Wen
Committee Members: Hsu, Chiu-Ting
Lin, Yen-Yu
Keywords: Video Action Recognition; Video Representation; Transformer
Recently, transformer networks have been widely used in natural language processing (NLP). They use learned self-attention to capture the contextual information in a sentence and thereby complete a variety of tasks. This kind of framework has also been applied to computer vision, where the model learns important attention information from images and videos so that it can complete the target task more accurately.
This thesis introduces a method that classifies actions by applying an attention mechanism over the entire video sequence, rather than relying on traditional video recognition approaches such as 3D convolutional networks. In our approach, the transformer network learns temporal attention, which identifies the important parts of the action that appear in the video and uses them to perform video action recognition. Recognizing actions in longer videos remains difficult because of the training time and the loss of information as it propagates through the network. We therefore emphasize temporal attention and modify the attention mechanism of the traditional transformer network to reduce its computational complexity. In this way, the model can process long sequences and highlight the key parts of the video through the attention results. At the same time, the temporal attention mechanism improves the overall recognition results.
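As a rough illustration of the temporal attention idea described in the abstract, the sketch below applies standard scaled dot-product self-attention across per-frame feature vectors, averages the attended features over time, and passes them to a linear classifier. It is not the thesis's implementation: the function names (`temporal_self_attention`, `classify_video`), the identity query/key/value projections, the mean pooling, and the single linear head are all assumptions made for brevity. The reduced-complexity attention mentioned in the abstract is not specified here, so the sketch uses ordinary attention, whose cost grows quadratically with the number of frames.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Standard scaled dot-product self-attention over the time axis.

    frames: list of T per-frame feature vectors, each a list of d floats
            (for example, the output of a 2D spatial feature extractor).
    Assumption: queries, keys and values are the frame features themselves
    (identity projections); a real model would learn these projections.
    Returns (attended_features, attention_weights).
    """
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in frames]
        weights.append(softmax(scores))
    # Each output frame is a weighted average of all frame features.
    attended = [
        [sum(w * v[j] for w, v in zip(row, frames)) for j in range(d)]
        for row in weights
    ]
    return attended, weights


def classify_video(frames, class_weights):
    """Hypothetical head: mean-pool attended frames, then one linear layer."""
    attended, weights = temporal_self_attention(frames)
    num_frames = len(attended)
    d = len(attended[0])
    pooled = [sum(f[j] for f in attended) / num_frames for j in range(d)]
    logits = [sum(w * x for w, x in zip(cw, pooled)) for cw in class_weights]
    return softmax(logits), weights


if __name__ == "__main__":
    # Three frames with 2-dimensional features and two action classes.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    probs, attn = classify_video(frames, [[1.0, 0.0], [0.0, 1.0]])
    print(probs)
    print(attn)
```

Each row of the returned attention weights sums to 1 and indicates how strongly one frame attends to every other frame, which is the kind of signal the abstract refers to when it speaks of highlighting the key parts of a video.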
Chapter 1 Introduction
Chapter 2 Related Work
2.1 ConvNet-based Video Action Recognition
2.2 Transformer
2.3 Vision Transformer
2.4 Transformers in computer vision
2.5 Summary of related works
Chapter 3 Proposed Method
3.1 Overview
3.2 2D Spatial Feature Extraction
3.3 Temporal-based Transformer Encoder
3.4 MLP Head to Classify
Chapter 4 Experiments
4.1 Training and Testing dataset
4.2 Experiment Setup
4.3 Objective accuracy results
Chapter 5 Conclusion
