
Detailed Record

Author (Chinese): 陳守中
Author (English): Chen, Shou-Zhong
Thesis Title (Chinese): 室內機器人輔助導向之基於深度學習之俯瞰視角物品偵測
Thesis Title (English): Deep Learning based Bird's Eye View Detection toward Indoor Robot Assistant System
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee Members (Chinese): 馬席彬、邱維辰
Committee Members (English): Ma, Hsi-Pin; Chiu, Wei-Chen
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 104061545
Year of Publication (ROC calendar): 107 (2018)
Graduating Academic Year: 106
Language: English
Number of Pages: 31
Keywords (Chinese): 目標偵測、深度學習、鳥瞰偵測、室內物品偵測
Keywords (English): Object detection; Deep learning; Bird's eye view detection; Indoor object detection
Abstract (Chinese): Most existing object detection techniques in computer vision operate on either 2D data (flat images from a normal human viewpoint) or 3D data (data with explicit spatial structure). For certain tasks, however, such as finding objects indoors, we often only need to know in which direction an object lies and how far away it is. Moreover, 2D-based and 3D-based detection each has its drawbacks: detection based on 2D data cannot recover the actual position of an object in space, while detection based on 3D data is computationally slow. With these considerations in mind, we propose a detection technique that sits between 2D and 3D object detection: Bird's Eye View Detection. Using 2D images, depth information, and a special feature-encoding technique (HHA), we extract the height, surface orientation, distance, and appearance at every location in the environment and generate a dedicated bird's-eye-view representation of the indoor scene for object detection. The advantage of this approach is that we can run mature, existing 2D object detection models (Faster-RCNN, Yolo v2) while avoiding the high computational complexity and resource demands of training a 3D detection model. We run experiments on two challenging real-scene indoor datasets, the NYUv2 dataset and the ScanNet dataset, and achieve 31.7% mAP on the 19 most common object classes. We project the results of the current best 3D object detection model into the bird's-eye-view domain so that the results can be compared directly, and find that our mAP is nine percentage points higher than theirs.
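To make the bird's-eye-view construction concrete, below is a minimal Python sketch under stated assumptions: a metric depth map, pinhole intrinsics (fx, fy, cx, cy), a hypothetical 5 cm grid cell, and a 512x512 grid; the function name depth_to_bev is invented for illustration. It back-projects depth pixels to 3D and bins them into a two-channel top-down grid (maximum height and point density), a much simplified stand-in for the thesis's HHA-based, multi-layer encoding.

```python
import numpy as np

def depth_to_bev(depth, fx, fy, cx, cy, cell=0.05, grid=(512, 512)):
    """Bin a metric depth map into a top-down (bird's-eye-view) grid.

    Hypothetical parameters; the thesis's encoding uses HHA channels and
    multiple height slices rather than this two-channel simplification.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth                      # forward distance from the camera (m)
    x = (u - cx) * z / fx          # right (m)
    y = (v - cy) * z / fy          # down, in the camera frame (m)
    height = -y                    # height relative to the camera; a real
                                   # pipeline would use the estimated floor
    # x (right) and z (forward) become the two axes of the BEV image.
    col = np.clip((x / cell + grid[1] // 2).astype(int), 0, grid[1] - 1)
    row = np.clip((grid[0] - 1 - z / cell).astype(int), 0, grid[0] - 1)
    bev = np.zeros((grid[0], grid[1], 2), dtype=np.float32)
    valid = z > 0                  # skip invalid (zero) depth readings
    np.maximum.at(bev[..., 0], (row[valid], col[valid]), height[valid])
    np.add.at(bev[..., 1], (row[valid], col[valid]), 1.0)
    return bev  # channel 0: max height per cell, channel 1: point count
```

A real pipeline would additionally estimate the floor plane to measure height, fill the remaining channels with HHA-style values, and handle visibility and occupancy as described in Chapter 4 of the thesis.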
Abstract (English): Most object detection techniques in computer vision take 2D data or 3D data as input. However, in some object-finding scenarios we only need to know in which direction an object lies and how far away it is; that is, knowing its position on the ground plane is enough. We therefore propose a method that leverages the benefits of both 2D and 3D detection: Bird's Eye View Detection. Using the depth image and a special encoding method, we gather information such as the height, the surface-normal angle, and the texture appearance of each point in the scene, and then use this information to create a bird's-eye-view image for indoor object detection. The benefit of Bird's Eye View Detection is that we can leverage the strength of 2D detection models (Faster-RCNN or Yolo v2) while avoiding the high memory and time consumption of training a 3D detector. We run experiments on the challenging real-scene indoor datasets NYUv2 and ScanNet, and obtain 31.7% mAP on the 19 most common indoor object classes.
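As an illustration of reusing a mature 2D detector on the encoded bird's-eye-view image, the sketch below instantiates a Faster R-CNN from torchvision as a stand-in for the Faster-RCNN and Yolo v2 implementations actually used in the thesis; the 20-way head (19 indoor classes plus background), the 512x512 input, and the 0.5 score threshold are assumptions rather than the thesis's real configuration, and the model would still need to be trained on BEV images before its outputs mean anything.

```python
import torch
import torchvision

# Stand-in detector: torchvision Faster R-CNN with a 20-way head
# (19 indoor classes + background). Untrained here; this only shows the
# interface a BEV image would be fed through after fine-tuning.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=20)
model.eval()

# A BEV image from the encoding step, as a CHW float tensor in [0, 1].
bev_image = torch.rand(3, 512, 512)  # placeholder for a real encoded BEV map

with torch.no_grad():
    pred = model([bev_image])[0]

keep = pred["scores"] > 0.5        # assumed confidence threshold
boxes = pred["boxes"][keep]        # (N, 4) xyxy boxes on the BEV plane
labels = pred["labels"][keep]      # indices into the 19 object categories
```

Because the predicted boxes live on the ground-plane grid, each detection directly gives the direction and distance of an object relative to the camera, which is exactly the quantity the indoor robot assistant needs.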
Acknowledgements ................................. ii
Abstract (Chinese) ................................. iii
Abstract ................................. iv
1 Introduction ................................. 1
1.1 Motivation ................................. 1
1.2 Problem Description ................................. 1
2 Related Work ................................. 3
2.1 Object Detection ................................. 3
3 Preliminaries ................................. 6
3.1 Object Detection ................................. 6
3.1.1 Faster R-CNN ................................. 7
3.1.2 Yolo and Yolo v2 ................................. 7
3.1.3 SSD ................................. 9
4 Our Method ................................. 11
4.1 BV Detection ................................. 11
4.1.1 Mapping Encoded Value to BV ................................. 11
4.1.2 HHA [1] ................................. 13
4.1.3 Visibility and Occupation Handling ................................. 14
4.1.4 Slice 3D Scene Into Multiple Layers ................................. 14
4.1.5 Hybrid Encoding ................................. 15
4.1.6 Point Cloud Upsampling and Refine Resolution ................................. 16
5 Experiment ................................. 18
5.1 BV Detection ................................. 18
5.1.1 Dataset ................................. 18
5.1.2 Features ................................. 19
5.1.3 Model Learning ................................. 20
5.1.4 Evaluation Metric ................................. 21
5.1.5 Baseline Method ................................. 22
5.1.6 Results ................................. 22
6 Conclusion ................................. 28

References ................................. 29
[1] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in ECCV, 2014.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[3] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in CVPR, 2017.
[4] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
[5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in CVPR, 2017.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016.
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[8] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in CVPR, 2015.
[9] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," in CVPR, 2017.
[10] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
[12] R. Girshick, "Fast R-CNN," in ICCV, 2015.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.
[14] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders, "Segmentation as selective search for object recognition," in ICCV, 2011.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[16] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in CVPR, 2014.
[17] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in CVPR, 2017.
[18] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," in International Journal of Computer Vision, 2010.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[20] S. Gupta, J. Hoffman, and J. Malik, "Cross modal distillation for supervision transfer," in CVPR, 2016.
[21] S. Song and J. Xiao, "Sliding Shapes for 3D object detection in depth images," in ECCV, 2014.
[22] S. Song and J. Xiao, "Deep Sliding Shapes for amodal 3D object detection in RGB-D images," in CVPR, 2016.
[23] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[24] Z. Ren and E. B. Sudderth, "Three-dimensional object detection and layout prediction using clouds of oriented gradients," in CVPR, 2016.
[25] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in CVPR, 2014.
[26] Z. Deng and L. J. Latecki, "Amodal detection of 3D objects: Inferring 3D bounding boxes from 2D ones in RGB-depth images," in CVPR, 2017.
[27] Z. Wang, W. Zhan, and M. Tomizuka, "Fusing bird view LIDAR point cloud and front view camera image for deep object detection," in arXiv, 2017.
[28] D. Lowe, "Distinctive image features from scale-invariant keypoints," in Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[31] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," in PAMI, 2016.
[32] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, "Dynamic image networks for action recognition," in CVPR, 2016.
[33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in ICCV, 2015.
[34] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in arXiv, 2014.
[35] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in CVPR, 2014.
[36] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in IEEE International Conference on Robotics and Automation (ICRA), 2011.
[37] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell, "A category-level 3-D object dataset: Putting the Kinect to work," in ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.