Detailed Record

Author (Chinese): 卜延宇
Author (English): Pu, Yen-Yu
Title (Chinese): 基於RGB-D輔助光線不足與小尺度物件之即時一階物件偵測
Title (English): Real-time One-stage Object Detection Based on RGB-D for Insufficient Light Scene and Small-scale Object
Advisor (Chinese): 邱瀞德
Advisor (English): Chiu, Ching-Te
Committee members (Chinese): 蘇豐文、林輝堂
Committee members (English): Soo, Von-Wun; Lin, Hui-Tang
Degree: Master's
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 107064548
Year of publication (ROC calendar): 109 (2020)
Graduating academic year: 109
Language: English
Number of pages: 64
Keywords (Chinese): 一階物件偵測、深度資訊、模塊化、即時物件偵測
Keywords (English): One-stage object detection; Depth information; Modular; Real-time object detection
In recent years, the rapid development of convolutional neural networks has drawn more researchers into computer vision. Object detection is a very important task in computer vision, and many well-known detectors have been proposed, such as the one-stage detectors SSD [1] and YOLO [2] and the two-stage detector Faster R-CNN [3]. However, most current object detection methods operate only on RGB images, so detection errors are prone to occur when the light is insufficient. Therefore, one of the main purposes of this thesis is to add depth information so that object detection achieves better results under insufficient light.
Many object recognition studies have already incorporated depth information to assist recognition, e.g., [4], [5], [6], which shows that adding depth information does improve recognition accuracy. We use the one-stage detector SSD as our main architecture because we want to keep its speed advantage. Although its accuracy is lower than that of two-stage detectors, we compensate for the defects of the original SSD architecture by adding modules. Our second main goal is to propose a modular design that raises accuracy and can be quickly applied to different backbones as required, such as VGG16 [7], MobileNetV3 [8], DenseNet [9], or SE-ResNeXt50 [10].
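As a rough illustration of this modularity, the sketch below builds interchangeable feature extractors behind a single interface, so the detection modules can attach to any of them. It is a minimal sketch under our own assumptions, not the thesis's actual code: the helper name build_backbone and the use of torchvision are ours.

```python
# Minimal sketch of swappable backbones (assumed helper, not the thesis's code).
import torch.nn as nn
import torchvision.models as models

def build_backbone(name: str) -> nn.Module:
    """Return a convolutional feature extractor; the detection modules
    attach on top, so swapping backbones leaves the rest unchanged."""
    if name == "vgg16":
        return models.vgg16(pretrained=False).features
    if name == "mobilenet_v3_large":
        return models.mobilenet_v3_large(pretrained=False).features
    if name == "densenet169":
        return models.densenet169(pretrained=False).features
    # SE-ResNeXt50 is not in torchvision; it would come from a
    # third-party implementation such as timm.
    raise ValueError(f"unknown backbone: {name}")
```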
In this work, we propose five modules. The first adds a depth path to improve accuracy when the light is insufficient. Next, the Enhanced Feature Block and the Upsample Block address SSD's poor accuracy on small objects. The Edge Feature Block then provides additional useful information. Finally, the Weight Fusion Layer takes a weighted sum of the detection results from the RGB and depth paths, strengthening the use of depth information without adding redundant parameters. In addition, we use Edge Enhanced NMS, which raises the final confidence score in proportion to the fraction of edge pixels inside a bounding box, helping select the final results.
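To make the fusion idea concrete, here is a minimal sketch of a weight-fusion layer under our own assumptions (the class name, the single learnable scalar, and the sigmoid constraint are ours); the abstract only states that the RGB and depth detections are combined by a weighted sum without redundant extra parameters.

```python
# Hypothetical sketch of a weight-fusion layer; names are assumptions.
import torch
import torch.nn as nn

class WeightFusionLayer(nn.Module):
    """Fuses predictions from the RGB and depth detection paths with one
    learnable weight, so fusion adds a single parameter per head rather
    than an extra convolutional layer."""

    def __init__(self):
        super().__init__()
        # Unconstrained scalar; sigmoid keeps the mixing ratio in (0, 1).
        # Initialized at 0 so both paths start equally weighted (0.5).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, rgb_logits, depth_logits):
        w = torch.sigmoid(self.alpha)
        return w * rgb_logits + (1.0 - w) * depth_logits

# Usage: fuse the class predictions of one detection head from both paths.
fusion = WeightFusionLayer()
rgb = torch.randn(1, 21, 38, 38)    # e.g. 21 Pascal VOC classes on a 38x38 grid
depth = torch.randn(1, 21, 38, 38)
fused = fusion(rgb, depth)          # same shape, weighted sum of both paths
```

Similarly, a sketch of the edge-enhanced NMS idea: confidences are boosted in proportion to the fraction of edge pixels inside each box before standard greedy NMS runs. The boost formula, beta, and the binary edge_map input are our assumptions.

```python
# Hypothetical sketch of edge-enhanced NMS; only the "boost confidence by
# in-box edge ratio" idea comes from the abstract.
import numpy as np

def edge_ratio(box, edge_map):
    """Fraction of edge pixels inside a box, given a binary edge map."""
    x1, y1, x2, y2 = [int(v) for v in box]
    region = edge_map[y1:y2, x1:x2]
    return region.mean() if region.size > 0 else 0.0

def edge_enhanced_nms(boxes, scores, edge_map, iou_thresh=0.45, beta=0.1):
    """Greedy NMS where scores are first raised in proportion to the
    edge density inside each candidate box."""
    boost = np.array([edge_ratio(b, edge_map) for b in boxes])
    order = (scores * (1.0 + beta * boost)).argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box against the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```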
We evaluated our method on the SUN RGB-D and Pascal VOC datasets. On SUN RGB-D, the VGG-16 backbone achieves 46.53% accuracy with 57.38M parameters at 27 FPS. The DenseNet-169 backbone reaches 47.52% accuracy with 40.15M parameters. The MobileNetV3 backbone achieves 43.92% accuracy at 33 FPS, enabling real-time detection.
1 Introduction
1.1 Background and Motivation
1.2 Goal and Contribution
1.3 Thesis Organization
2 Related Works
2.1 Object Detection
2.2 RGB-D Object Detection
2.3 Two-stream Fusion Method
3 Real-time One-stage Object Detection Based on RGB-D for Insufficient Light Scene and Small-scale Object
3.1 Overview of Our Architecture
3.2 Network Architecture
3.2.1 Enhanced Features Block
3.2.2 Upsample Block
3.2.3 Edge Features Extraction
3.3 RGB and Depth Fusion
3.3.1 Weight Fusion Layer
3.4 Edge Enhanced NMS
3.5 Loss Function
4 Experimental Results
4.1 Implementation Details and Datasets
4.1.1 Implementation Details
4.1.2 Datasets
4.1.3 Results
4.2 Ablation Studies
4.2.1 Upsample Block
4.2.2 Edge Feature Extraction
4.2.3 Weight Fusion Layer
4.3 Comparison with Other Works on SUN RGB-D
4.4 Comparison with Other Works on Pascal VOC
5 Conclusions
[1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[4] G. Goswami, S. Bharadwaj, M. Vatsa, and R. Singh, "On RGB-D face recognition using Kinect," in 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS). IEEE, 2013, pp. 1–6.
[5] B. Y. Li, A. S. Mian, W. Liu, and A. Krishna, "Using Kinect for face recognition under varying poses, expressions, illumination and disguise," in 2013 IEEE Workshop on Applications of Computer Vision (WACV). IEEE, 2013, pp. 186–192.
[6] H. Zhang, H. Han, J. Cui, S. Shan, and X. Chen, "RGB-D face recognition via deep complementary and common feature learning," in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 8–15.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[8] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., "Searching for MobileNetV3," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1314–1324.
[9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[10] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 345–360.
[12] S. Gupta, J. Hoffman, and J. Malik, "Cross modal distillation for supervision transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2827–2836.
[13] G. Li, Y. Gan, H. Wu, N. Xiao, and L. Lin, "Cross-modal attentional context learning for RGB-D object detection," IEEE Transactions on Image Processing, vol. 28, no. 4, p. 1591, 2019.
[14] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2. IEEE, 1999, pp. 1150–1157.
[15] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in European Conference on Computer Vision. Springer, 2006, pp. 404–417.
[16] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[17] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[18] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[19] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," arXiv preprint arXiv:1701.06659, 2017.
[20] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.
[21] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[23] M. Najibi, M. Rastegari, and L. S. Davis, "G-CNN: An iterative grid based object detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2369–2377.
[24] Y.-C. Lee, J. Chen, C. W. Tseng, and S.-H. Lai, "Accurate and robust face recognition from RGB-D images with a deep learning approach," in BMVC, vol. 1, no. 2, 2016, p. 3.
[25] A. Chowdhury, S. Ghosh, R. Singh, and M. Vatsa, "RGB-D face recognition via learning-based reconstruction," in 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2016, pp. 1–7.
[26] S.-Z. Li, B. Yu, W. Wu, S.-Z. Su, and R.-R. Ji, "Feature learning based on SAE-PCA network for human gesture recognition in RGB-D images," Neurocomputing, vol. 151, pp. 565–573, 2015.
[27] K. O. Rodriguez and G. C. Chavez, "Finger spelling recognition from RGB-D information using kernel descriptor," in 2013 XXVI Conference on Graphics, Patterns and Images. IEEE, 2013, pp. 1–7.
[28] M. Ma, X. Xu, J. Wu, and M. Guo, "Design and analyze the structure based on deep belief network for gesture recognition," in 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI). IEEE, 2018, pp. 40–44.
[29] M. Schwarz, A. Milan, A. S. Periyasamy, and S. Behnke, "RGB-D object detection and semantic segmentation for autonomous manipulation in clutter," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 437–451, 2018.
[30] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 918–927.
[31] S. Song and J. Xiao, "Deep sliding shapes for amodal 3D object detection in RGB-D images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
[32] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[33] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, "FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture," in Asian Conference on Computer Vision. Springer, 2016, pp. 213–228.
[34] S.-J. Park, K.-S. Hong, and S. Lee, "RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4980–4989.
[35] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Transactions on Intelligent Transportation Systems, 2020.
[36] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[37] E. Chen, X. Bai, L. Gao, H. C. Tinega, and Y. Ding, "A spatiotemporal heterogeneous two-stream network for action recognition," IEEE Access, vol. 7, pp. 57267–57275, 2019.
[38] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[39] Q. Gao, J. Liu, Z. Ju, Y. Li, T. Zhang, and L. Zhang, "Static hand gesture recognition with parallel CNNs for space human-robot interaction," in International Conference on Intelligent Robotics and Applications. Springer, 2017, pp. 462–473.
[40] T. Weng, A. Pallankize, Y. Tang, O. Kroemer, and D. Held, "Multimodal transfer learning for grasping transparent and specular objects," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 3791–3798, 2020.
[41] K.-H. Shih, C.-T. Chiu, J.-A. Lin, and Y.-Y. Bu, "Real-time object detection with reduced region proposal network via multi-feature concatenation," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[42] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in International Conference on Machine Learning, 2013, pp. 1319–1327.
[43] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in International Conference on Learning Representations (ICLR), 2016.
[44] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, "Understanding convolution for semantic segmentation," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1451–1460.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[47] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567–576.
[48] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[49] ——, "The PASCAL visual object classes challenge 2007 (VOC2007) results," 2007.
[50] M. Everingham and J. Winn, "The PASCAL visual object classes challenge 2012 (VOC2012) development kit," Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep., vol. 8, 2011.
[51] S. Hou, Z. Wang, and F. Wu, "Object detection via deeply exploiting depth information," Neurocomputing, vol. 286, pp. 58–66, 2018.
[52] J.-A. Lin, C.-T. Chiu, and Y.-Y. Cheng, "Object detection with color and depth images with multi-reduced region proposal network and multi-pooling," in ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1618–1622.