|
[1] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):221–231, 2013.
[2] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.
[3] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[4] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
[5] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[7] L. Wang, W. Li, W. Li, and L. V. Gool. Appearance-and-relation networks for video classification. In CVPR, 2018.
[8] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid. MARS: Motion-augmented RGB stream for action recognition. In CVPR, 2019.
[9] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[10] S. Tulyakov, M. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
[11] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[14] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.
[15] E. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.
[16] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[17] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391, 2017.
[18] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[22] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.
[23] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[24] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. CoRR, abs/1604.04494, 2016.
[25] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. |