APTPose: Anatomy-aware Pre-Training for 3D Human Pose Estimation (National Tsing Hua University Theses and Dissertations Full-Text System)

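Author (Chinese):	楊晴雯
Author (English):	Yang, Qing-Wen
Thesis title (Chinese):	解剖學感知預訓練模型應用於三維人體姿態估測
Thesis title (English):	APTPose: Anatomy-aware Pre-Training for 3D Human Pose Estimation
Advisor (Chinese):	賴尚宏
Advisor (English):	Lai, Shang-Hong
Committee members (Chinese):	郭柏志、鄭嘉珉、林彥宇
Committee members (English):	Kuo, Po-Chih; Cheng, Chia-Ming; Lin, Yen-Yu
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Department of Computer Science
Student ID:	109062702
Publication year (ROC calendar):	112
Academic year of graduation:	111
Language:	English
Number of pages:	44
Keywords (Chinese):	三維人體姿態估測、預訓練模型、計算機視覺
Keywords (English):	3D human pose estimation, pre-training model, computer vision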

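Abstract (translated from the Chinese): In this thesis, we introduce APTPose, a novel anatomy-aware pre-training method for accurate 3D human pose estimation. Our method introduces Hierarchical Masked Pose Modeling (HMPM), a subtask that operates at the body-component level in a weakly supervised manner and decomposes the body skeleton into independent body components, overcoming the limitations of earlier keypoint-level masking strategies. In addition, unlike previous methods that consider only 2D pose reconstruction, our method combines 2D and 3D information during pre-training by introducing an initial 3D estimate and using the large number of 3D pseudo-labels available from existing datasets. This comprehensive approach lets us better model the human skeletal structure in 3D space and improves the accuracy and stability of 3D human pose estimation. We also introduce a geometric knowledge constraint within the supervised framework to strengthen the kinematic representation by capturing bone orientation and length. This constraint makes the model's predictions more consistent. Experimental results show that the proposed method performs well on the challenging MPI-INF-3DHP dataset and substantially outperforms state-of-the-art methods.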

In this thesis, we present a novel anatomy-aware pre-training method, named APTPose, for accurate 3D human pose estimation. Our approach introduces a Hierarchical Masked Pose Modeling (HMPM) subtask, which operates at the body-component level with weak supervision. It decomposes the body skeleton into distinct components, overcoming the limitations of previous keypoint-level masking strategies. Moreover, unlike prior methods that focus solely on 2D pose reconstruction, our method leverages both 2D and 3D information by incorporating an initial 3D estimation and using a large number of 3D pseudo-labels from existing datasets for pre-training. This comprehensive approach allows us to model the human skeleton structure in 3D space effectively, improving both the accuracy and robustness of 3D human pose estimation. Furthermore, we propose a geometric knowledge constraint within the supervised framework to enhance the kinematic representation by capturing bone orientation and length characteristics. This constraint improves the consistency of the predictions and yields more realistic pose estimates. Experimental results demonstrate that our proposed method excels on the challenging MPI-INF-3DHP dataset, outperforming state-of-the-art approaches by a large margin.
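The two ideas named in the abstract, masking at the body-component level and describing a pose by bone length and orientation, can be illustrated with a minimal sketch. This is not the thesis's implementation. The 17-joint ordering (Human3.6M-style), the `BODY_PARTS` grouping, the `PARENTS` table, and both function names are assumptions made only for this example.

```python
import math
import random

# Hypothetical 17-joint skeleton; a Human3.6M-style joint ordering is assumed.
# The grouping into five components is illustrative, not taken from the thesis.
BODY_PARTS = {
    "torso": [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg": [4, 5, 6],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
}
# PARENTS[j] is the parent joint of joint j; -1 marks the root (assumed layout).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def mask_body_components(pose, num_parts, rng, mask_value=(0.0, 0.0)):
    """Mask whole body components of a 2D pose rather than single keypoints.

    Returns the masked pose and the sorted indices of the masked joints.
    """
    chosen = rng.sample(sorted(BODY_PARTS), num_parts)
    masked_joints = {j for part in chosen for j in BODY_PARTS[part]}
    masked_pose = [
        mask_value if j in masked_joints else tuple(p)
        for j, p in enumerate(pose)
    ]
    return masked_pose, sorted(masked_joints)


def bone_lengths_and_directions(pose3d):
    """Return (length, unit direction) for each parent-to-child bone."""
    bones = []
    for child, parent in enumerate(PARENTS):
        if parent < 0:
            continue  # the root joint has no incoming bone
        vec = [c - p for c, p in zip(pose3d[child], pose3d[parent])]
        length = math.sqrt(sum(v * v for v in vec))
        # Guard against zero-length bones to avoid division by zero.
        direction = [v / length for v in vec] if length > 0 else [0.0] * len(vec)
        bones.append((length, direction))
    return bones


if __name__ == "__main__":
    rng = random.Random(0)
    pose2d = [(float(j), float(j) + 0.5) for j in range(17)]
    masked, idx = mask_body_components(pose2d, num_parts=2, rng=rng)
    print("masked joints:", idx)

    pose3d = [(float(j), 0.0, 0.0) for j in range(17)]
    print("first bone:", bone_lengths_and_directions(pose3d)[0])
```

Masking a whole limb forces a model to reconstruct it from the rest of the body instead of interpolating from adjacent joints, which is the intuition behind component-level masking. A constraint of the kind the abstract describes could then compare the predicted and reference values of the per-bone lengths and directions; how the thesis weights or combines such terms is not stated in this record.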

1 Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Contributions
1.4 Thesis Organization
2 Related Work
2.1 3D Human Pose Estimation
2.2 Scarcity of 3D pose annotation in 3DHPE
2.3 Pre-Training of Transformer
3 Proposed Method
3.1 Overview
3.1.1 Model Architecture
3.1.2 Model flow
3.2 Masking strategy for Pre-Training
3.2.1 Preliminary on Masked Pose Modeling (MPM)
3.2.2 Hierarchical MPM (HMPM)
3.3 Reprojection Module and Noising Pipeline
3.4 Geometric Knowledge Constraints
3.5 Loss Function
4 Experiments
4.1 Datasets and Evaluation Metrics
4.2 Implementation Details
4.3 Comparison with State-of-the-art Methods
4.3.1 Testing on Human3.6M
4.3.2 Testing on MPI-INF-3DHP
4.3.3 Analysis on computational complexity
4.4 Qualitative Results
4.5 Ablation study
4.5.1 Impact of Individual Components
4.5.2 Impact of Pre-Training
4.5.3 Impact of Geometric Knowledge
5 Conclusions
References

