[1] J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,” Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733, May 2017.
[2] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” Advances in Neural Information Processing Systems, vol. 1, pp. 568–576, June 2014.
[3] L. Ding and C. Xu, “TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation,” arXiv preprint arXiv:1705.07818, 2017.
[4] Y. A. Farha and J. Gall, “MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584, March 2019.
[5] Z. Wang, Z. Gao, L. Wang, Z. Li, and G. Wu, “Boundary-Aware Cascade Networks for Temporal Action Segmentation,” Lecture Notes in Computer Science, vol. 12370, pp. 34–51, 2020.
[6] F. Yi, H. Wen, and T. Jiang, “ASFormer: Transformer for Action Segmentation,” Proceedings of the British Machine Vision Conference, October 2021.
[7] Y. Ishikawa, S. Kasai, Y. Aoki, and H. Kataoka, “Alleviating Over-segmentation Errors by Detecting Action Boundaries,” Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2321–2330, July 2020.
[8] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the Robustness of Deep Neural Networks via Stability Training,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4480–4488, April 2016.
[9] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A Simple Framework for Contrastive Learning of Visual Representations,” International Conference on Machine Learning, pp. 1597–1607, 2020.
[10] H. Kuehne, J. Gall, and T. Serre, “An End-to-End Generative Framework for Video Segmentation and Recognition,” Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), September 2015.
[11] K. Tang, L. Fei-Fei, and D. Koller, “Learning Latent Temporal Structure for Complex Event Detection,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1250–1257, 2012.
[12] J. Y. H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4694–4702, 2015.
[13] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 677–691, November 2014.
[14] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv preprint arXiv:1609.03499, 2016.
[15] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, “Neural Machine Translation in Linear Time,” arXiv preprint arXiv:1610.10099, October 2016.
[16] P. Lei and S. Todorovic, “Temporal Deformable Residual Networks for Action Segmentation in Videos,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6742–6751, 2018.
[17] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal Convolutional Networks for Action Segmentation and Detection,” Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1003–1012, November 2016.
[18] S. Li, Y. A. Farha, Y. Liu, M.-M. Cheng, and J. Gall, “MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3570–3579, June 2020.
[19] D. Wang, D. Hu, X. Li, and D. Dou, “Temporal Relational Modeling with Self-Supervision for Action Segmentation,” Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2729–2737, December 2020.
[20] J. Mitrovic, B. McWilliams, and M. Rey, “Less Can Be More in Contrastive Learning,” Proceedings of the “I Can’t Believe It’s Not Better!” Workshop at NeurIPS, vol. 137, pp. 70–75, December 2020.
[21] T. Wang and P. Isola, “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere,” Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 9871–9881, May 2020.
[22] F. Wang and H. Liu, “Understanding the Behaviour of Contrastive Loss,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2495–2504, December 2020.
[23] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive Learning with Hard Negative Samples,” arXiv preprint arXiv:2010.04592, October 2020.
[24] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised Feature Learning via Non-parametric Instance Discrimination,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.
[25] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum Contrast for Unsupervised Visual Representation Learning,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 9726–9735, 2020.
[26] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, “The Kinetics Human Action Video Dataset,” arXiv preprint arXiv:1705.06950, May 2017.
[27] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” Advances in Neural Information Processing Systems, vol. 32, December 2019.
[28] S. Stein and S. J. McKenna, “Combining Embedded Accelerometers with Computer Vision for Recognizing Food Preparation Activities,” Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738, 2013.