Author (English): Chen, Shu-Ping
Thesis title (English): Video Object Detection with Temporal Feature Fusion
Advisor (English): LAI, SHANG-HONG
Oral examination committee (English): CHIU, CHING-TE
Keywords (English): object detection; deep learning
Object detection is a classical problem in computer vision, and it has improved significantly in recent years thanks to deep learning techniques. However, extending state-of-the-art static-image object detection techniques to video is challenging, since traditional object detectors usually work on a single frame and do not exploit the rich temporal information in video.

In this thesis, we propose a ConvNet architecture that utilizes temporal information, with the whole model trained jointly to perform video object detection. Our model uses optical flow to guide feature fusion between the current frame and the previous frame. We also use dense recursive aggregation to integrate features computed from past frames and thereby make use of temporal information. Our experiments on the ImageNet and ITRI datasets show that the proposed architecture achieves competitive detection results without significant additional time cost.
Abstract (Chinese)
Abstract
1 Introduction
1.1 Motivation
1.2 Proposed method
1.3 Contribution
1.4 Thesis Organization
2 Related Work
2.1 Object detection in static images
2.2 Object detection in videos
3 Method
3.1 Overview
3.2 Feature Pyramid Network
3.3 Optical Flow Network
3.4 Weight Map from Optical Flow
3.5 Recursive Aggregation
4 Experiments
4.1 Datasets
4.1.1 ImageNet Dataset
4.1.2 ITRI Dataset
4.2 Training and Testing
4.2.1 Feature Pyramid Network
4.2.2 FlowNetS
4.2.3 Joint Training
4.3 Results
4.3.1 ImageNet dataset
4.3.2 ITRI dataset
4.3.3 Ablation Study
4.3.4 Comparison with the state-of-the-art
4.3.5 Demo Results
5 Conclusion
References
