|
[1] Bao, H., Dong, L., Piao, S., and Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).
[2] Chen, C.-H., Tyagi, A., Agrawal, A., Drover, D., MV, R., Stojanov, S., and Rehg, J. M. Unsupervised 3d pose estimation with geometric self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 5714–5724.
[3] Chen, T., Fang, C., Shen, X., Zhu, Y., Chen, Z., and Luo, J. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. arXiv preprint arXiv:2002.10322 (2020).
[4] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., and Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 7103–7112.
[5] Chen, Z., Sugimoto, A., and Lai, S.-H. Learning monocular 3d human pose estimation with skeletal interpolation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), IEEE, pp. 4218–4222.
[6] Cheng, Y., Yang, B., Wang, B., Yan, W., and Tan, R. T. Occlusion-aware networks for 3d human pose estimation in video. In Proceedings of the IEEE/CVF international conference on computer vision (2019), pp. 723–732.
[7] Choi, H., Moon, G., and Lee, K. M. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16 (2020), Springer, pp. 769–787.
[8] Ci, H., Wang, C., Ma, X., and Wang, Y. Optimizing network structure for 3d human pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision (2019), pp. 2262–2271.
[9] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Drover, D., MV, R., Chen, C.-H., Agrawal, A., Tyagi, A., and Phuoc Huynh, C. Can 3d pose be learned from 2d projections alone? In Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018).
[11] Fang, H.-S., Xie, S., Tai, Y.-W., and Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE international conference on computer vision (2017), pp. 2334–2343.
[12] Gholami, M., Rezaei, A., Rhodin, H., Ward, R., and Wang, Z. J. Self-supervised 3d human pose estimation from video. Neurocomputing 488 (2022), 97–106.
[13] Gong, K., Zhang, J., and Feng, J. Poseaug: A differentiable pose augmentation framework for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 8575–8584.
[14] Habibie, I., Xu, W., Mehta, D., Pons-Moll, G., and Theobalt, C. In the wild human pose estimation using explicit 2d features and intermediate 3d representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), pp. 10905–10914.
[15] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 16000–16009.
[16] Hossain, M. R. I., and Little, J. J. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European conference on computer vision (ECCV) (2018), pp. 68–84.
[17] Ionescu, C., Papava, D., Olaru, V., and Sminchisescu, C. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36, 7 (2013), 1325–1339.
[18] Iqbal, U., Molchanov, P., and Kautz, J. Weakly-supervised 3d human pose learning via multi-view images in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020).
[19] Joo, H., Neverova, N., and Vedaldi, A. Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation. In 2021 International Conference on 3D Vision (3DV) (2021), IEEE, pp. 42–52.
[20] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[21] Kipf, T. N., and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[22] Kolotouros, N., Pavlakos, G., Black, M. J., and Daniilidis, K. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF international conference on computer vision (2019), pp. 2252–2261.
[23] Lee, K., Lee, I., and Lee, S. Propagating lstm: 3d pose estimation based on joint interdependency. In Proceedings of the European conference on computer vision (ECCV) (2018), pp. 119–135.
[24] Li, S., Ke, L., Pratama, K., Tai, Y.-W., Tang, C.-K., and Cheng, K.-T. Cascaded deep monocular 3d human pose estimation with evolutionary training data. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020).
[25] Li, W., Liu, H., Ding, R., Liu, M., Wang, P., and Yang, W. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Transactions on Multimedia (2022).
[26] Li, W., Liu, H., Tang, H., Wang, P., and Van Gool, L. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 13147–13156.
[27] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (2014), Springer, pp. 740–755.
[28] Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.-c., and Asari, V. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 5064–5073.
[29] Martinez, J., Hossain, R., Romero, J., and Little, J. J. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Oct 2017).
[30] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., and Theobalt, C. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 international conference on 3D vision (3DV) (2017), IEEE, pp. 506–516.
[31] Pavllo, D., Feichtenhofer, C., Grangier, D., and Auli, M. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), pp. 7753–7762.
[32] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. Technical report (2018).
[33] Shan, W., Liu, Z., Zhang, X., Wang, S., Ma, S., and Gao, W. P-stmo: Pretrained spatial temporal many-to-one model for 3d human pose estimation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V (2022), Springer, pp. 461–478.
[34] Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision (2019), pp. 7464–7473.
[35] Sun, X., Xiao, B., Wei, F., Liang, S., and Wei, Y. Integral human pose regression. In Proceedings of the European conference on computer vision (ECCV) (2018), pp. 529–545.
[36] Sun, Y., Wang, S., Feng, S., Ding, S., Pang, C., Shang, J., Liu, J., Chen, X., Zhao, Y., Lu, Y., et al. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137 (2021).
[37] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning (2008), pp. 1096–1103.
[38] Wandt, B., and Rosenhahn, B. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), pp. 7782–7791.
[39] Wandt, B., Rudolph, M., Zell, P., Rhodin, H., and Rosenhahn, B. Canonpose: Self-supervised monocular 3d human pose estimation in the wild. In Computer Vision and Pattern Recognition (CVPR) (June 2021).
[40] Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Jiang, Y.-G., Zhou, L., and Yuan, L. Bevt: Bert pretraining of video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 14733–14743.
[41] Yang, C.-Y., Luo, J., Xia, L., Sun, Y., Qiao, N., Zhang, K., Jiang, Z., Hwang, J.-N., and Kuo, C.-H. Camerapose: Weakly-supervised monocular 3d human pose estimation by leveraging in-the-wild 2d annotations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (January 2023), pp. 2924–2933.
[42] Zhang, J., Tu, Z., Yang, J., Chen, Y., and Yuan, J. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 13232–13242.
[43] Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., and Liu, Q. Ernie: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129 (2019).
[44] Zhao, L., Peng, X., Tian, Y., Kapadia, M., and Metaxas, D. N. Semantic graph convolutional networks for 3d human pose regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), pp. 3425–3435.
[45] Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., and Ding, Z. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 11656–11665.
[46] Zhou, X., Huang, Q., Sun, X., Xue, X., and Wei, Y. Towards 3d human pose estimation in the wild: a weakly-supervised approach. In Proceedings of the IEEE international conference on computer vision (2017), pp. 398–407.
[47] Zou, Z., and Tang, W. Modulated graph convolutional network for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 11477–11487. |