Detailed Record

Author (Chinese): 何品萱
Author (English): Ho, Pin-Hsuan
Title (Chinese): 應用於影片中動作辨識之運動和外觀的分解表示法學習
Title (English): Decomposed Representation Learning of Motion and Appearance for Video Action Recognition
Advisor (Chinese): 許秋婷
Advisor (English): Hsu, Chiou-Ting
Committee members (Chinese): 彭文孝、王聖智
Committee members (English): Peng, Wen-Hsiao; Wang, Sheng-Jyh
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 106062553
Year of publication (ROC calendar): 108 (2019)
Academic year of graduation: 107
Language: English
Number of pages: 24
Keywords (Chinese): 動作辨識、分解表示法學習、運動表示法
Keywords (English): Action Recognition, Decomposed Representation Learning, Motion Representation
Abstract (Chinese): Spatiotemporal representation learning in videos is crucial to action recognition research in computer vision. To address representation learning for action recognition and understanding in videos, we learn (1) the static appearance within each frame and (2) the temporal motion information across consecutive frames. In this thesis, we propose a Flow-based Motion and Appearance Network (FMA-Net), comprising a generator network, a classification network, and a discriminator network, to learn a decomposed representation of motion and appearance in videos. Furthermore, to capture detailed motion information, we propose to learn from optical flow prediction without requiring flow computation at test time. The proposed FMA-Net is an end-to-end framework that simultaneously learns the classification network and generates accurate optical flow under adversarial training. We conduct experiments on two action recognition datasets, UCF101 and HMDB51. Under the same settings, our experimental results show that the proposed FMA-Net not only outperforms the baseline network but also achieves results competitive with state-of-the-art methods.
Abstract (English): Spatiotemporal representation learning in videos is essential to action recognition in computer vision. We address the problem of video representation learning for video action recognition and understanding through learning: (1) static appearance in each frame, and (2) temporal motion across consecutive frames. In this thesis, we propose a Flow-based Motion and Appearance Network (FMA-Net), which includes a generator network, a classification network and a discriminator network, to learn a decomposed representation of motion and appearance in videos. Furthermore, in order to capture motion details, we propose to learn the model from optical flow prediction without flow computation at test time. The proposed FMA-Net is an end-to-end framework and can simultaneously learn the classification network and generate accurate optical flow in adversarial training. We perform experiments on two action recognition benchmarks: UCF101 and HMDB51. Under the same setting, our experimental results show that the proposed FMA-Net not only outperforms the baseline network but also achieves competitive results with state-of-the-art methods on these two datasets.
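
The abstract outlines a three-network design: a generator that predicts optical flow from consecutive frames, a classifier that fuses per-frame appearance with the predicted motion, and a discriminator that supplies adversarial supervision on the generated flow, with reference flow needed only during training. The following is a minimal PyTorch-style sketch of that structure; all module names, layer sizes, and loss terms are illustrative assumptions made here, not the thesis implementation.

import torch
import torch.nn as nn

# Hypothetical sketch of the three-network structure described in the abstract.
# Layer sizes and losses are placeholders, not the architecture used in the thesis.

class FlowGenerator(nn.Module):
    """Predicts a 2-channel optical-flow map from a pair of RGB frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1))

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class ActionClassifier(nn.Module):
    """Fuses appearance (RGB frame) and motion (predicted flow) features."""
    def __init__(self, num_classes):
        super().__init__()
        self.appearance = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.motion = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, num_classes)

    def forward(self, frame, flow):
        a = self.appearance(frame).flatten(1)
        m = self.motion(flow).flatten(1)
        return self.fc(torch.cat([a, m], dim=1))

class FlowDiscriminator(nn.Module):
    """Scores whether a flow map looks like a reference (pre-computed) flow."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, flow):
        return self.net(flow)

def train_step(gen, clf, disc, frame_t, frame_t1, ref_flow, label, opt_g, opt_d):
    """One illustrative joint step: adversarial flow loss + classification loss."""
    bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
    real, fake = torch.ones(ref_flow.size(0), 1), torch.zeros(ref_flow.size(0), 1)

    # Discriminator update: reference flow vs. generated flow.
    opt_d.zero_grad()
    pred_flow = gen(frame_t, frame_t1).detach()
    d_loss = bce(disc(ref_flow), real) + bce(disc(pred_flow), fake)
    d_loss.backward()
    opt_d.step()

    # Generator + classifier update: fool the discriminator and predict the action.
    opt_g.zero_grad()
    flow = gen(frame_t, frame_t1)
    g_loss = bce(disc(flow), real) + ce(clf(frame_t, flow), label)
    g_loss.backward()
    opt_g.step()

In this sketch, opt_g would be built over the joint parameters of the generator and classifier so that flow prediction and action classification are trained end-to-end, while the reference flow (pre-computed with a conventional optical-flow method) is required only during training; at test time only the generator and classifier run, consistent with the abstract's claim of no flow computation at inference.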
List of Contents
Chinese Abstract I
Abstract II
1. Introduction 1
2. Related Work 4
3. Proposed Method 6
3.1 Motivation 6
3.2 Baseline: Motion and Appearance Network (MA-Net) 6
3.3 Flow-based Motion and Appearance Network 8
3.4 Objective Function 10
4. Experiments 12
4.1 Datasets 12
4.2 Implementation Details 14
4.3 Results 15
5. Conclusion 21
References 22