
Detailed Record

Author (Chinese): 陳子軼
Author (English): Chen, Zi-Yi
Title (Chinese): 利用骨骼動畫及動作變換網路學習單眼三維人體姿態估測
Title (English): Learning Monocular 3D Human Pose Estimation with Skeletal Animation and Motion Transformer
Advisor (Chinese): 賴尚宏
Advisor (English): Lai, Shang-Hong
Committee members (Chinese): 許秋婷, 徐繼聖, 陳祝嵩
Committee members (English): Hsu, Chiu-Ting; Hsu, Gee-Sern Jison; Chen, Chu-Song
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 107065466
Year of publication (ROC calendar): 110 (2021)
Academic year of graduation: 109
Language: Chinese
Number of pages: 34
Keywords (Chinese): 計算機視覺, 深度學習, 三維人體姿態估計, 數據增強, 骨骼內插
Keywords (English): computer vision, deep learning, 3D human pose estimation, data augmentation, skeletal interpolation
Abstract

Deep learning has achieved unprecedented accuracy in monocular 3D human pose estimation. However, current learning-based 3D human pose estimation still suffers from two problems: 1) poor generalization and 2) projection ambiguity. When a deep network encounters poses outside the training domain, its performance is prone to degrade because of the gap between the limited training data and highly variable in-the-wild data. Inspired by skeletal animation, a technique popular in game development and animation production, we propose a simple yet effective method that synthesizes new 3D human pose sequences from existing ones as augmented training data, giving the resulting model strong generalization ability. We also put forward a new lifting network built upon the transformer encoder, termed the Motion Transformer, which exploits the powerful self-attention mechanism to perform the geometric mapping from 2D to 3D and thereby resolve projection ambiguity. Experimental results on unseen domains demonstrate superior 3D human pose estimation accuracy when our data augmentation method is combined with the proposed Motion Transformer: we achieve state-of-the-art generalization accuracy on publicly available datasets such as MPI-INF-3DHP.
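The skeletal-animation augmentation summarized above blends existing motion keyframes to synthesize new pose sequences. As a minimal sketch of that idea, and not the thesis's actual implementation, per-joint rotations stored as (w, x, y, z) quaternions can be blended with spherical linear interpolation (slerp), the standard skeletal-animation primitive; the function names and quaternion layout here are our own assumptions:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                      # take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: linear blend is stable
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_pose(key0, key1, t):
    """Blend two skeleton keyframes given as (J, 4) per-joint quaternions."""
    return np.stack([slerp(a, b, t) for a, b in zip(key0, key1)])
```

Sweeping t over (0, 1) between two real keyframes yields in-between skeleton configurations that, after forward kinematics, become additional 3D training poses.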
1 Introduction 1
1.1 Problem Statement .......................... 2
1.2 Motivation............................... 3
1.3 Contributions ............................. 4
2 Related Work 5
2.1 Monocular 3D pose estimation.................... 5
2.2 Video-based monocular 3D pose estimation ................... 5
2.3 Improving generalization for pose estimation ............... 6
2.4 Transformer.............................. 6
3 Pose Augmentation 8
3.1 Local coordinate transformation ................... 9
3.2 Keyframe interpolation........................ 9
3.2.1 Interpolation processing ................... 10
3.2.2 Validation function...................... 10
3.3 Distribution of augmented pose data ................. 11
4 Lifting Network 13
4.1 Preprocess the input and output data ................. 13
4.2 Architecture of Motion Transformer ................. 14
4.2.1 Transformer encoder ..................... 15
4.3 Multi-scale training strategy ..................... 16
4.4 Loss function ............................. 17
4.5 Implementation details ........................ 17
5 Experimental Results 19
5.1 Evaluation under different scenarios................. 20
5.1.1 Intra-dataset experiment ................... 20
5.1.2 Extra-dataset experiment................... 21
5.2 Ablation study............................. 23
5.2.1 Ablation study on network architecture . . . . . . . . . . . 23
5.2.2 Ablation study on data augmentation . . . . . . . . . . . . 24
5.2.3 Ablation study on model type ................ 25
5.2.4 Ablation study on generated data amount . . . . . . . . . . 26
6 Discussion & Conclusion 28
6.1 Discussion............................... 28
6.2 Future work.............................. 29
6.3 Conclusion .............................. 29
References 30
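The lifting network outlined in Chapter 4 maps a sequence of 2D keypoints to 3D joints with a transformer encoder. The following NumPy sketch shows only the core self-attention step with random, untrained weights; the embedding width `d`, the single-head design, and all names are illustrative assumptions rather than the actual Motion Transformer architecture:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of frame embeddings."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over frames
    return weights @ V

def lift_sequence(pose_2d, d=16, seed=0):
    """Toy 2D-to-3D lifting: embed T frames of J 2D joints, attend across
    frames, then project each frame embedding to J 3D joints.
    All weights are random here; a trained model would learn them."""
    T, J, _ = pose_2d.shape
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(J * 2, d))
    Wq, Wk, Wv = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
    W_out = rng.normal(scale=0.1, size=(d, J * 3))
    h = self_attention(pose_2d.reshape(T, J * 2) @ W_in, Wq, Wk, Wv)
    return (h @ W_out).reshape(T, J, 3)
```

Because the attention weights couple every frame with every other frame, each 3D estimate can draw on temporal context from the whole clip, which is the property the abstract appeals to for resolving projection ambiguity.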