
Detailed Record

Author (Chinese): 陳子軼
Author (English): Chen, Zi-Yi
Title (Chinese): 利用骨骼動畫及動作變換網路學習單眼三維人體姿態估測
Title (English): Learning Monocular 3D Human Pose Estimation with Skeletal Animation and Motion Transformer
Advisor (Chinese): 賴尚宏
Advisor (English): Lai, Shang-Hong
Committee members (Chinese): 許秋婷, 徐繼聖, 陳祝嵩
Committee members (English): Hsu, Chiu-Ting; Hsu, Gee-Sern Jison; Chen, Chu-Song
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 107065466
Year of publication (ROC calendar): 110 (2021)
Academic year of graduation: 109
Language: Chinese
Number of pages: 34
Keywords (Chinese): 計算機視覺, 深度學習, 三維人體姿態估計, 數據增強, 骨骼內插
Keywords (English): computer vision, deep learning, 3D human pose estimation, data augmentation, skeletal interpolation
Abstract

Deep learning has achieved unprecedented accuracy in monocular 3D human pose estimation. However, current learning-based 3D human pose estimation still suffers from two problems: 1) poor generalization and 2) projection ambiguity. When a deep network encounters poses outside the training domain, its performance is prone to degrade because of the gap between the limited training data and highly variable in-the-wild data. Inspired by skeletal animation, a technique popular in game development and animation production, we propose a simple yet effective method that synthesizes new 3D human pose sequences from existing ones as augmented training data, giving the resulting model strong generalization ability. We also put forward a new lifting network built upon the transformer encoder, termed the Motion Transformer, which exploits the powerful self-attention mechanism to perform the geometric mapping from 2D to 3D and thereby resolve projection ambiguity. Experimental results on unseen domains demonstrate superior 3D human pose estimation accuracy when our data augmentation method is combined with the proposed Motion Transformer: we achieve state-of-the-art generalization accuracy on publicly available datasets such as MPI-INF-3DHP.
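The skeletal-animation augmentation summarized above blends existing motion keyframes to synthesize new pose sequences. As a minimal sketch of that idea, and not the thesis's actual implementation, per-joint rotations stored as (w, x, y, z) quaternions can be blended with spherical linear interpolation (slerp), the standard skeletal-animation primitive; the function names and quaternion layout here are our own assumptions:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                      # take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: linear blend is stable
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_pose(key0, key1, t):
    """Blend two skeleton keyframes given as (J, 4) per-joint quaternions."""
    return np.stack([slerp(a, b, t) for a, b in zip(key0, key1)])
```

Sweeping t over (0, 1) between two real keyframes yields in-between skeleton configurations that, after forward kinematics, become additional 3D training poses.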
1 Introduction 1
1.1 Problem Statement .......................... 2
1.2 Motivation............................... 3
1.3 Contributions ............................. 4
2 Related Work 5
2.1 Monocular 3D pose estimation.................... 5
2.2 Video-based monocular 3D pose estimation ................... 5
2.3 Improving generalization for pose estimation ............... 6
2.4 Transformer.............................. 6
3 Pose Augmentation 8
3.1 Local coordinate transformation ................... 9
3.2 Keyframe interpolation........................ 9
3.2.1 Interpolation processing ................... 10
3.2.2 Validation function...................... 10
3.3 Distribution of augmented pose data ................. 11
4 Lifting Network 13
4.1 Preprocess the input and output data ................. 13
4.2 Architecture of Motion Transformer ................. 14
4.2.1 Transformer encoder ..................... 15
4.3 Multi-scale training strategy ..................... 16
4.4 Loss function ............................. 17
4.5 Implementation details ........................ 17
5 Experimental Results 19
5.1 Evaluation under different scenarios................. 20
5.1.1 Intra-dataset experiment ................... 20
5.1.2 Extra-dataset experiment................... 21
5.2 Ablation study............................. 23
5.2.1 Ablation study on network architecture . . . . . . . . . . . 23
5.2.2 Ablation study on data augmentation . . . . . . . . . . . . 24
5.2.3 Ablation study on model type ................ 25
5.2.4 Ablation study on generated data amount . . . . . . . . . . 26
6 Discussion & Conclusion 28
6.1 Discussion............................... 28
6.2 Future work.............................. 29
6.3 Conclusion .............................. 29
References 30
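The lifting network outlined in Chapter 4 maps a sequence of 2D keypoints to 3D joints with a transformer encoder. The following NumPy sketch shows only the core self-attention step with random, untrained weights; the embedding width `d`, the single-head design, and all names are illustrative assumptions rather than the actual Motion Transformer architecture:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of frame embeddings."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over frames
    return weights @ V

def lift_sequence(pose_2d, d=16, seed=0):
    """Toy 2D-to-3D lifting: embed T frames of J 2D joints, attend across
    frames, then project each frame embedding to J 3D joints.
    All weights are random here; a trained model would learn them."""
    T, J, _ = pose_2d.shape
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(J * 2, d))
    Wq, Wk, Wv = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
    W_out = rng.normal(scale=0.1, size=(d, J * 3))
    h = self_attention(pose_2d.reshape(T, J * 2) @ W_in, Wq, Wk, Wv)
    return (h @ W_out).reshape(T, J, 3)
```

Because the attention weights couple every frame with every other frame, each 3D estimate can draw on temporal context from the whole clip, which is the property the abstract appeals to for resolving projection ambiguity.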