Detailed Record

Author (Chinese): 薛亘裕
Author (English): Hsueh, Hsuan-Yu
Title (Chinese): 使用時間空間融合神經網路對連續幀RGB-D序列影像關鍵幀進行物件6D姿態估計
Title (English): 6D Pose Estimation from Monocular RGB-D Images via a Spatio-Temporal Feature Fusion Network
Advisor (Chinese): 朱宏國
Advisor (English): Chu, Hung-Kuo
Committee members (Chinese): 姚智原、胡敏君
Committee members (English): Yao, Chih-Yuan; Hu, Min-Chun
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 108065703
Year of publication (ROC era): 112 (2023)
Graduation academic year: 111
Language: Chinese
Pages: 43
Keywords (Chinese): 電腦視覺、機器學習、6D姿態估計
Keywords (English): Computer Vision, Machine Learning, 6D Pose Estimation
Abstract: This thesis presents an RGB-D spatio-temporal fusion network for 6D pose estimation. The network takes a sequence of consecutive RGB-D frames as input, fuses their spatio-temporal features, and estimates the 6D poses of objects in a keyframe. In the overall architecture, an RGB-D feature fusion network first extracts fused RGB-D features from every input frame, and an optical flow detector computes the optical flow between each neighboring frame and the keyframe. The fused features and the flow information are then passed to a spatio-temporal feature fusion network, which aggregates the features of the consecutive frames under the keyframe's camera view. Finally, the resulting keyframe-view RGB-D spatio-temporal features are fed to a two-stage 3D-keypoint-based 6D pose estimation method that produces the final object poses for the keyframe. Because the network takes a set of temporally related consecutive frames as input, we adopt the YCB-Video dataset, which provides real consecutive-frame training and test data; to further improve training, we also generate our own synthetic consecutive-frame training data. Scored and compared with metrics that measure both accuracy and stability, the proposed RGB-D spatio-temporal fusion network is shown to produce more accurate and stable results than single-frame estimation methods.
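To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of how such a forward pass could be organized. It is an illustration under assumptions, not the thesis's actual implementation: every name here (estimate_keyframe_pose, rgbd_fusion, flow_detector, temporal_fusion, keypoint_head, fit_pose, warp_by_flow) is a hypothetical stand-in, and the optical flow is assumed to map each keyframe pixel to its location in the neighboring frame so that features can be backward-warped into the keyframe view.

```python
import torch
import torch.nn.functional as F


def warp_by_flow(feat, flow):
    """Backward-warp a feature map into the keyframe view.

    feat: (B, C, H, W) neighbor-frame features.
    flow: (B, 2, H, W) flow in pixels, assumed to map each keyframe
          pixel to its corresponding location in the neighbor frame.
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device),
        torch.arange(w, device=feat.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys)).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                # (B, 2, H, W)
    # grid_sample expects sampling coordinates normalized to [-1, 1].
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


def estimate_keyframe_pose(frames, key_idx, rgbd_fusion, flow_detector,
                           temporal_fusion, keypoint_head, fit_pose):
    """frames: list of (rgb, depth) tensor pairs; key_idx: keyframe index."""
    key_rgb, _ = frames[key_idx]

    # 1. Per-frame RGB-D feature fusion.
    feats = [rgbd_fusion(rgb, depth) for rgb, depth in frames]

    # 2. Flow between the keyframe and each neighbor, then warp the
    #    neighbor's fused features into the keyframe camera view.
    aligned = []
    for i, f in enumerate(feats):
        if i == key_idx:
            aligned.append(f)
        else:
            flow = flow_detector(key_rgb, frames[i][0])  # (B, 2, H, W)
            aligned.append(warp_by_flow(f, flow))

    # 3. Spatio-temporal fusion of the keyframe-aligned feature stack.
    fused = temporal_fusion(torch.stack(aligned, dim=1))  # (B, T, C, H, W)

    # 4. Two-stage keypoint method: predict 3D keypoints, then fit a
    #    rigid transform (R, t) to them, e.g. by least squares.
    keypoints_3d = keypoint_head(fused)
    return fit_pose(keypoints_3d)
```

Stacking the flow-aligned features before fusion mirrors the abstract's idea of aggregating consecutive-frame information under the keyframe viewpoint; in this sketch the fusion module could be any temporal aggregator, such as a temporal convolution or attention over the frame axis.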
Table of Contents
Abstract (Chinese)  i
Abstract (English)  ii
Contents  iii
1 Introduction  1
2 Related Work  6
2.1 6D Pose Estimation  6
2.1.1 Template Matching Methods  6
2.1.2 End-to-End Methods  7
2.1.3 Keypoint-Based Methods  7
2.1.4 Dense Correspondence Methods  8
2.2 Feature Fusion Networks  9
2.2.1 RGB-D Feature Fusion Networks  9
2.2.2 Spatio-Temporal Feature Fusion Networks  10
3 System Architecture  11
3.1 Problem Definition  11
3.2 Overall Model Architecture  12
3.2.1 RGB-D Image Feature Fusion  13
3.2.2 Optical Flow Estimation and Feature Mapping  13
3.2.3 Spatio-Temporal Feature Fusion  15
3.2.4 3D-Keypoint-Based 6D Pose Estimation  17
3.3 Loss Functions  20
4 Experimental Results  22
4.1 Datasets  22
4.2 Implementation and Training Details  25
4.2.1 Training Procedure  25
4.2.2 Training Details and Settings  26
4.3 Evaluation Metrics  27
4.3.1 Accuracy Metrics  27
4.3.2 Stability Metrics  29
4.4 Quantitative Comparison  30
4.4.1 Comparison of Accuracy Scores  30
4.4.2 Comparison of Stability Scores  32
4.5 Qualitative Comparison  33
4.5.1 Ablation Study  36
5 Conclusion  37
6 Limitations and Future Work  38