
Detailed Record

Author (Chinese): 蕭子謦
Author (English): Hsiao, Tsu-Ching
Title (Chinese): 基於特殊歐幾里得三維群之分數擴散模型解決六維物體姿態估計中的模糊性
Title (English): Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
Advisor (Chinese): 李濬屹
Advisor (English): Lee, Chun-Yi
Committee Members (Chinese): 陳煥宗; 劉育綸
Committee Members (English): Chen, Hwann-Tzong; Liu, Yu-Lun
Degree: Master's
University: National Tsing Hua University
Department: Computer Science
Student ID: 107062560
Year of Publication (ROC calendar): 112 (2023)
Graduation Academic Year: 111
Language: English
Pages: 45
Keywords (Chinese): 電腦視覺; 物體姿態估計; 擴散模型; 李群
Keywords (English): Computer Vision; Object Pose Estimation; Diffusion Model; Lie Group
Usage statistics:
  • Recommendations: 0
  • Views: 81
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese, translated): Accurately predicting the 6D pose of an object from a single RGB image, while resolving the pose ambiguity caused by object symmetries or occlusions, is a significant challenge. To address it, we propose a novel score-based diffusion model on the special Euclidean group SE(3). This is the first application of SE(3) score-based diffusion models to pose estimation in the image domain. Experimental results show that the method excels at handling pose ambiguity and mitigating perspective-induced ambiguity, and demonstrate the robustness of our proposed surrogate Stein score formulation on SE(3). This formulation not only improves the convergence of Langevin dynamics on SE(3) but also enhances the computational efficiency of the Stein score. We thereby arrive at a promising approach to 6D object pose estimation.
Abstract (English): Addressing accuracy limitations and pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge, particularly due to object symmetries or occlusions. In response, we introduce a novel score-based diffusion method applied to the SE(3) group, marking the first application of diffusion models to SE(3) within the image domain, specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity, mitigating perspective-induced ambiguity, and showcasing the robustness of our surrogate Stein score formulation on SE(3). This formulation not only improves the convergence of Langevin dynamics but also enhances computational efficiency. Thus, we pioneer a promising strategy for 6D object pose estimation.
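The abstract's sampling procedure — following the Stein score with Langevin dynamics on SE(3) — can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the helper names (`hat`, `so3_exp`, `so3_log`, `langevin_step`), the toy Gaussian-around-identity score standing in for the thesis's trained score network, and the zero-temperature setting are not from the thesis itself; the sketch only shows the geodesic update pattern of stepping in the Lie algebra and retracting onto the group via the exponential map.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix (an so(3) element)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3) + hat(w)  # first-order approximation near identity
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Logarithm map SO(3) -> so(3), returned as a rotation vector."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    vee = 0.5 * np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return vee
    return theta / np.sin(theta) * vee

def langevin_step(R, t, score_R, score_t, eps, rng, noise_scale=1.0):
    """One geodesic Langevin update on SE(3): step along the (estimated)
    Stein score in the Lie algebra, add Gaussian noise, and retract the
    rotation increment back onto SO(3) with the exponential map."""
    d_R = eps * score_R(R, t) + np.sqrt(2.0 * eps) * noise_scale * rng.standard_normal(3)
    d_t = eps * score_t(R, t) + np.sqrt(2.0 * eps) * noise_scale * rng.standard_normal(3)
    return R @ so3_exp(d_R), t + d_t

# Toy score: gradient of a Gaussian centred at the identity pose, expressed
# in exponential coordinates -- a stand-in for a trained score network.
score_R = lambda R, t: -so3_log(R)
score_t = lambda R, t: -t

rng = np.random.default_rng(0)
R = so3_exp(np.array([1.2, -0.5, 0.8]))  # arbitrary starting rotation
t = np.array([0.3, 0.2, -0.1])           # arbitrary starting translation
for _ in range(300):
    # noise_scale=0 is the zero-temperature limit: the chain descends
    # deterministically to the mode of the toy distribution (identity pose).
    R, t = langevin_step(R, t, score_R, score_t, eps=0.05, rng=rng, noise_scale=0.0)
```

With the toy score, each step contracts the rotation vector and translation toward zero, so after a few hundred iterations the pose sits at the identity; in the actual method the score would instead come from the learned (surrogate) Stein score network, with annealed noise.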
Table of Contents

Abstract (Chinese)
Abstract (English)
1 Introduction
2 Related Work
  2.1 Methodologies for Dealing with Pose Ambiguity Issues
  2.2 Previous Diffusion Probabilistic Models and Their Application Domains
3 Background
  3.1 Lie Groups and Their Application in Pose Estimation
  3.2 Parametrization of SE(3)
  3.3 Score-Based Generative Modeling
4 Preliminaries
  4.1 Comparative Analysis of Diffusion Models on Lie Groups
  4.2 The Benefits of SE(3) over ℝ³×SO(3) in Perspective-Affected Pose Estimation
5 Methodology
  5.1 Problem Formulation
  5.2 Efficient Computation of Stein Score
  5.3 Surrogate Stein Score Calculation on SE(3)
  5.4 Proposed Framework
6 Experimental Results
  6.1 Experimental Hypotheses and Validation Objectives
  6.2 Datasets and Baselines
  6.3 Quantitative Results on SYMSOL
  6.4 Quantitative Results on SYMSOL-T
  6.5 Analysis of SE(3) and ℝ³×SO(3) in the Presence of Image Perspective Ambiguity
  6.6 Performance Analysis: Surrogate Score versus Automatically Differentiated True Score
  6.7 Comparison with Other Diffusion Models
7 Limitations and Future Directions
8 Conclusion
9 Appendix
  9.1 Additional Experimental Details
    9.1.1 Calculation of Stein Scores Using Automatic Differentiation in JAX
    9.1.2 Algorithms
    9.1.3 Datasets
    9.1.4 Hyperparameters
    9.1.5 Evaluation Metrics
    9.1.6 Visualization of SYMSOL-T Results
  9.2 Proofs
    9.2.1 Closed-Form of Stein Scores
    9.2.2 Left and Right Jacobians on SO(3)
    9.2.3 Eigenvector of the Jacobians
    9.2.4 Closed-Form of Stein Scores on SE(3)
References