Author (English): Chen, Zi-Yi
Thesis title (English): Learning Monocular 3D Human Pose Estimation with Skeletal Animation and Motion Transformer
Advisor (English): Lai, Shang-Hong
Oral defense committee (English): Hsu, Chiu-Ting
Hsu, Gee-Sern Jison
Chen, Chu-Song
Keywords (English): computer vision; deep learning; 3D human pose estimation; data augmentation; skeletal interpolation
Deep learning has achieved unprecedented accuracy in monocular 3D human pose estimation. However, current learning-based methods still suffer from two problems: (1) poor generalization and (2) projection ambiguity. When a deep network encounters poses outside the training domain, its performance tends to degrade because of the gap between limited training data and highly variable in-the-wild data. Inspired by skeletal animation, which is widely used in game development and animation production, we propose a simple yet effective technique that synthesizes new 3D human pose sequences from existing sequences as augmented data, which improves the generalization of the resulting model. We also propose a new lifting network built on a transformer encoder, termed Motion Transformer, which uses self-attention to learn the geometric mapping from 2D to 3D and thereby resolve pose ambiguity. Experimental results on unseen domains show that combining our data augmentation method with the proposed Motion Transformer yields superior 3D human pose estimation accuracy, achieving state-of-the-art generalization accuracy on publicly available datasets such as MPI-INF-3DHP.
1 Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Contributions
2 Related Work
2.1 Monocular 3D pose estimation
2.2 Video-based monocular 3D pose estimation
2.3 Improve generalization for pose estimation
2.4 Transformer
3 Pose Augmentation
3.1 Local coordinate transformation
3.2 Keyframe interpolation
3.2.1 Interpolation processing
3.2.2 Validation function
3.3 Distribution of augmented pose data
4 Lifting Network
4.1 Preprocess the input and output data
4.2 Architecture of Motion Transformer
4.2.1 Transformer encoder
4.3 Multi-scale training strategy
4.4 Loss function
4.5 Implementation details
5 Experimental Results
5.1 Evaluation under different scenarios
5.1.1 Intra-dataset experiment
5.1.2 Extra-dataset experiment
5.2 Ablation study
5.2.1 Ablation study on network architecture
5.2.2 Ablation study on data augmentation
5.2.3 Ablation study on model type
5.2.4 Ablation study on generated data amount
6 Discussion & Conclusion
6.1 Discussion
6.2 Future work
6.3 Conclusion
References
