Author: Ho, Pin-Hsuan
Thesis Title: Decomposed Representation Learning of Motion and Appearance for Video Action Recognition
Advisor: Hsu, Chiou-Ting
Committee Members: Peng, Wen-Hsiao; Wang, Sheng-Jyh
Keywords: Action Recognition; Decomposed Representation Learning; Motion Representation
Spatiotemporal representation learning in videos is essential to action recognition in computer vision. We address video representation learning for action recognition and understanding by learning two components: (1) the static appearance within each frame, and (2) the temporal motion across consecutive frames. In this thesis, we propose a Flow-based Motion and Appearance Network (FMA-Net), which consists of a generator network, a classification network, and a discriminator network, to learn a decomposed representation of motion and appearance in videos. To capture motion details, the model is trained to predict optical flow, so no optical flow has to be computed at test time. The proposed FMA-Net is an end-to-end framework that jointly trains the classification network and, through adversarial training, learns to generate accurate optical flow. We perform experiments on two action recognition benchmarks, UCF101 and HMDB51. Under the same setting, our experimental results show that the proposed FMA-Net not only outperforms the baseline network but also achieves results competitive with state-of-the-art methods on both datasets.
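The abstract names three ingredients trained jointly: action classification, optical flow generation, and adversarial training. The sketch below illustrates one common way such terms are combined into a single generator-side objective. It is not the thesis's objective function, which is not reproduced in this record. The choice of loss terms (cross-entropy, mean L1 flow error, non-saturating GAN loss), the function names, and the weights `w_cls`, `w_flow`, and `w_adv` are all assumptions made for illustration.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])

def mean_l1(pred_flow, target_flow):
    """Mean absolute error between predicted and reference flow values."""
    return sum(abs(p - t) for p, t in zip(pred_flow, target_flow)) / len(pred_flow)

def generator_adversarial_loss(disc_score_on_fake):
    """Non-saturating GAN generator loss, -log D(G(x)), for a score in (0, 1]."""
    return -math.log(disc_score_on_fake)

def total_generator_loss(probs, label, pred_flow, target_flow, disc_score,
                         w_cls=1.0, w_flow=1.0, w_adv=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_cls * cross_entropy(probs, label)
            + w_flow * mean_l1(pred_flow, target_flow)
            + w_adv * generator_adversarial_loss(disc_score))

# Toy example: uniform two-class prediction, perfect flow, undecided discriminator.
loss = total_generator_loss([0.5, 0.5], 0, [1.0, 2.0], [1.0, 2.0], 0.5)
print(round(loss, 4))
```

In this toy call the flow term vanishes because the predicted and reference flows match, so the total reduces to the classification and adversarial terms.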
List of Contents
Chinese Abstract
Abstract
1. Introduction
2. Related Work
3. Proposed Method
3.1 Motivation
3.2 Baseline: Motion and Appearance Network (MA-Net)
3.3 Flow-based Motion and Appearance Network
3.4 Objective Function
4. Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Results
5. Conclusion
References