
Detailed Record

Author (Chinese): 許廷睿
Author (English): Hsu, Ting-Jui
Thesis Title (Chinese): 適用於多場景渲染之基於深度引導的人體神經輻射場模型
Thesis Title (English): Improving One Model for One Scene in HumanNeRF via Depth Guidance
Advisor (Chinese): 林嘉文
Advisor (English): Lin, Chia-Wen
Committee Members (Chinese): 胡敏君、林彥宇、劉育綸
Committee Members (English): Hu, Min-Chun; Lin, Yen-Yu; Liu, Yu-Lun
Degree: Master's
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 110064524
Year of Publication (ROC calendar): 113 (2024)
Academic Year of Graduation: 112
Language: English
Number of Pages: 48
Keywords (Chinese): 人體神經輻射場、人體神經輻射場泛用性、參數化人體模型、深度圖、點雲
Keywords (English): Human NeRF, Generalizable Human NeRF, Parametric Human Model, Depth Map, Point Cloud
Abstract (Chinese, translated): HumanNeRF aims to recover a three-dimensional human body from a single-view video and to synthesize unseen novel views. Its difficulty, however, is that each scene requires its own model, which incurs a high retraining time cost and makes it hard to generalize to multiple scenes. Previous methods must take human images from multiple viewpoints as input, and they fail when the input lacks sufficient multi-view information. Recent methods have introduced point-level features, but because the explicit parametric human model they rely on represents the extremities of the limbs poorly, they render blurry and incorrect motions.

To address these problems, and considering that the parametric human model uses three-dimensional vertices but is inaccurate at the extremities, we turn to point clouds as an aid. Point clouds carry three-dimensional information, and depth maps provide good cues about the front-to-back relationships in a scene, so we start from an estimated depth map and then generate a better point cloud from that depth. Converting depth into a point cloud of the human body is not easy either: the depth is tied to a specific viewpoint, so the resulting three-dimensional point cloud is incomplete and inaccurate. We therefore propose a depth-guided module that derives a more accurate point cloud from the predicted depth to guide the motion and improve the rendering results. Experiments show that our method not only surpasses the current best methods in novel view synthesis but also resolves the one-model-per-scene problem of HumanNeRF.
Abstract (English): HumanNeRF aims to reconstruct a 3D human from a monocular video and to synthesize novel views that have never been seen by feeding in the camera parameters of other viewpoints. However, its difficulty lies in the problem of one model for one scene, which leads to high time costs for retraining and makes it hard to generalize to multiple scenes. Previous methods require multi-view human images as input, yet they fail when the input contains insufficient multi-view information. Recent methods have proposed point-level features; nevertheless, because the explicit parametric human body model performs poorly on the fine details of the limbs, they render blurry and erroneous movements.

To solve the above problems, and considering that the parametric human body model uses three-dimensional vertices but is inaccurate in the fine details at the extremities, we use point clouds as a supplement. Since point clouds encode three-dimensional information and depth maps provide reasonable cues about scene depth, we start from an estimated depth map and then generate better point clouds from that depth. However, converting depth into a point cloud of the human body is not easy, because the depth is limited to a specific viewpoint and yields a three-dimensional point cloud with missing and inaccurate information. Therefore, we propose a depth-guided module that uses the predicted depth map to produce accurate 3D point clouds, which guide the poses and improve the rendering results. Experiments demonstrate that our method not only achieves better novel-view-synthesis results than the current best methods, but also requires training only a single model that can be used across multiple scenes, thereby solving the problem of one model for one scene in HumanNeRF.
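The geometric step described above, lifting an estimated depth map into a 3D point cloud of the observed person, can be illustrated with a minimal sketch. The snippet below is not the thesis implementation; it assumes a simple pinhole camera whose intrinsics (fx, fy, cx, cy) are known, and the function name depth_to_point_cloud and the optional foreground mask are illustrative only.

# Minimal sketch (assumption, not the thesis code): back-project an estimated
# depth map into a camera-frame point cloud under a pinhole camera model.
from typing import Optional
import numpy as np

def depth_to_point_cloud(depth: np.ndarray,
                         fx: float, fy: float, cx: float, cy: float,
                         mask: Optional[np.ndarray] = None) -> np.ndarray:
    """Unproject an (H, W) depth map into an (N, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    # u indexes image columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)      # shape (H, W, 3)
    if mask is not None:
        return points[mask.astype(bool)]       # keep foreground (human) pixels only
    return points.reshape(-1, 3)

As the abstract notes, a point cloud obtained this way covers only the surface visible from a single viewpoint, so it is incomplete on its own; the proposed depth-guided module is what turns such single-view depth into guidance usable across poses and scenes.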
Table of Contents:
摘要 (Abstract in Chinese) i
Abstract ii
1 Introduction 1
1.1 Research Background 1
1.2 Motivation 2
1.3 Contributions 5
2 Related Work 7
2.1 Static-scene Generalizable NeRF Approach 7
2.2 Dynamic-scene Generalizable NeRF Approach 8
2.3 Generalizable HumanNeRF Approach 8
3 Proposed Method 11
3.1 Overview 11
3.2 Depth-guided Module 13
3.3 Pixel-aligned Feature 16
3.4 Weight Function 18
3.5 Implementation 19
3.5.1 Training Stage 19
3.5.2 Loss Function 19
3.5.3 Inference Stage 21
4 Experiments Result 22
4.1 Database and Baseline 23
4.2 Experiment Settings 23
4.2.1 Comparison Method 23
4.2.2 Training Setting 24
4.2.3 Inference Setting 25
4.2.4 Evaluation Metrics 25
4.3 Experiments on Quantitative Quality 26
4.4 Experiments on Visual Quality 32
4.5 Ablation Experiments 41
5 Conclusion 45
References 46