[1] Arnab, A., Sun, C., and Schmid, C. Unified graph structured models for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 8117–8126.
[2] Carreira, J., and Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 6299–6308.
[3] Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., and Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 13359–13368.
[4] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), IEEE, pp. 248–255.
[5] Feichtenhofer, C. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 203–213.
[6] Feichtenhofer, C., Fan, H., Malik, J., and He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), pp. 6202–6211.
[7] Feichtenhofer, C., Pinz, A., and Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[8] Girdhar, R., Carreira, J., Doersch, C., and Zisserman, A. Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 244–253.
[9] Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1440–1448.
[10] Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., and He, K. Detectron. URL: https://github.com/facebookresearch/detectron (2018).
[11] Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 6047–6056.
[12] Gupta, P., Thatipelli, A., Aggarwal, A., Maheshwari, S., Trivedi, N., Das, S., and Sarvadevabhatla, R. K. Quo vadis, skeleton action recognition? International Journal of Computer Vision 129, 7 (2021), 2097–2112.
[13] He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2961–2969.
[14] Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M. J. Towards understanding action recognition. In International Conference on Computer Vision (ICCV) (Dec. 2013), pp. 3192–3199.
[15] Kahatapitiya, K., and Ryoo, M. S. Coarse-fine networks for temporal activity detection in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 8385–8394.
[16] Köpüklü, O., Wei, X., and Rigoll, G. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644 (2019).
[17] Kumar, A., and Rawat, Y. S. End-to-end semi-supervised learning for video action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 14700–14710.
[18] Li, C., Zhong, Q., Xie, D., and Pu, S. Collaborative spatiotemporal feature learning for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 7872–7881.
[19] Li, Y., Chen, L., He, R., Wang, Z., Wu, G., and Wang, L. MultiSports: A multi-person video dataset of spatio-temporally localized sports actions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 13536–13545.
[20] Li, Y., Lin, W., Wang, T., See, J., Qian, R., Xu, N., Wang, L., and Xu, S. Finding action tubes with a sparse-to-dense framework. In Proceedings of the AAAI Conference on Artificial Intelligence (2020), vol. 34, pp. 11466–11473.
[21] Li, Y., Wang, Z., Wang, L., and Wu, G. Actions as moving points. In European Conference on Computer Vision (2020), Springer, pp. 68–84.
[22] Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 2117–2125.
[23] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2980–2988.
[24] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (2014), Springer, pp. 740–755.
[25] Liu, Y., Yang, F., and Ginhac, D. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation. Pattern Recognition Letters 145 (2021), 118–126.
[26] Liu, Z., Zhang, H., Chen, Z., Wang, Z., and Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 143–152.
[27] Ma, C.-Y., Kadav, A., Melvin, I., Kira, Z., AlRegib, G., and Graf, H. P. Attend and interact: Higher-order object interactions for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 6790–6800.
[28] Materzynska, J., Xiao, T., Herzig, R., Xu, H., Wang, X., and Darrell, T. Something-Else: Compositional action recognition with spatial-temporal interaction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 1049–1059.
[29] Mo, S., Xia, J., Tan, X., and Raj, B. Point3D: Tracking actions as moving points with 3D CNNs.
[30] Ni, J., Qin, J., and Huang, D. Identity-aware graph memory network for action detection. In Proceedings of the 29th ACM International Conference on Multimedia (2021), pp. 3437–3445.
[31] Pan, J., Chen, S., Shou, M. Z., Liu, Y., Shao, J., and Li, H. Actor-context-actor relation network for spatio-temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 464–474.
[32] Pramono, R. R. A., Chen, Y.-T., and Fang, W.-H. Hierarchical self-attention network for action localization in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), pp. 61–70.
[33] Qiu, Z., Yao, T., and Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 5533–5541.
[34] Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015), 91–99.
[35] Seong, H., Hyun, J., and Kim, E. Video multitask transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019).
[36] Singh, G., Saha, S., Sapienza, M., Torr, P. H., and Cuzzolin, F. Online real-time multiple spatiotemporal action localisation and prediction. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 3637–3646.
[37] Song, L., Zhang, S., Yu, G., and Sun, H. TACNet: Transition-aware context network for spatio-temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 11987–11995.
[38] Soomro, K., Zamir, A. R., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[39] Su, R., Ouyang, W., Zhou, L., and Xu, D. Improving action localization by progressive cross-stream cooperation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 12016–12025.
[40] Sun, C., Shrivastava, A., Vondrick, C., Murphy, K., Sukthankar, R., and Schmid, C. Actor-centric relation network. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 318–334.
[41] Sun, L., Jia, K., Yeung, D.-Y., and Shi, B. E. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4597–4605.
[42] Tang, J., Xia, J., Mu, X., Pang, B., and Lu, C. Asynchronous interaction aggregation for action detection. In European Conference on Computer Vision (2020), Springer, pp. 71–87.
[43] Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4489–4497.
[44] Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 7794–7803.
[45] Wu, C.-Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., and Girshick, R. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 284–293.
[46] Wu, C.-Y., and Krahenbuhl, P. Towards long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 1884–1894.
[47] Wu, C.-Y., Li, Y., Mangalam, K., Fan, H., Xiong, B., Malik, J., and Feichtenhofer, C. MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 13587–13597.
[48] Wu, J., Kuang, Z., Wang, L., Zhang, W., and Wu, G. Context-aware RCNN: A baseline for action detection in videos. In European Conference on Computer Vision (2020), Springer, pp. 440–456.
[49] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1492–1500.
[50] Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851 1, 2 (2017), 5.
[51] Yang, X., Fan, H., Torresani, L., Davis, L. S., and Wang, H. Beyond short clips: End-to-end video-level learning with collaborative memories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 7567–7576.
[52] Zhao, J., Zhang, Y., Li, X., Chen, H., Shuai, B., Xu, M., Liu, C., Kundu, K., Xiong, Y., Modolo, D., et al. TubeR: Tubelet transformer for video action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 13598–13607.
[53] Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., and Ding, Z. 3D human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 11656–11665.
[54] Zhou, B., Andonian, A., Oliva, A., and Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 803–818.