
Detailed Record

Author (Chinese): 羅文易
Author (English): Lo, Wen-Yi
Title (Chinese): 基於多重擴張率路徑特徵提取器之單張影像深度預測
Title (English): Depth Estimation from A Single Image through Multi-path-multi-rate Diverse Feature Extractor
Advisor (Chinese): 邱瀞德
Advisor (English): Chiu, Ching-Te
Committee Members (Chinese): 賴尚宏
楊家輝
范倫達
Committee Members (English): Lai, Shang-Hong
Yang, Jar-Ferr
Van, Lan-Da
Degree: Master
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 106064533
Year of Publication (ROC): 108 (2019)
Graduation Academic Year: 107
Language: English
Number of Pages: 86
Keywords (Chinese): 深度預估、卷積神經網絡、空洞卷積、多重擴張率路徑特徵提取器
Keywords (English): Depth estimation; Convolutional neural network; Dilated convolution; Multi-path-multi-rate feature extractor
In recent years, a growing number of studies have found that depth maps benefit many computer vision applications, such as bokeh, 3D reconstruction, gesture/face recognition, and object detection. Depth information not only gives users a better visual experience but also makes biometric technologies more reliable. However, compared with acquiring RGB images, obtaining depth information is more difficult and usually relies on additional sensors (Microsoft Kinect, Intel RealSense, etc.), so estimating an accurate depth map from a single RGB image has attracted increasing attention.
In the past, most traditional methods used hand-crafted features (SIFT [1], GIST [2], HOG [3]) to estimate depth, but they could only produce accurate estimates for a single type of scene. With the evolution of neural networks, estimating accurate depth maps from a single RGB image has made great progress [4] [5] [6] [7]: the effective features learned by convolutional neural networks allow depth to be estimated accurately across different scenes at the same time. However, these methods still fail on small objects and scenes with complex backgrounds, so the predicted depth maps often contain blurred regions and lose object contours.
In this thesis, we propose a method that estimates depth accurately. We adopt a U-Net-like [8] architecture so that the generated depth maps keep a high resolution, and we apply our proposed multi-path-multi-rate feature extractor. The extractor is composed of multiple dilated convolutions; it effectively transfers features between the encoder and the decoder and produces a large amount of global information, which lets the whole system estimate the depth of small objects precisely while preserving object contours.
We conduct experiments on indoor scenes (NYUv2 [9]) and outdoor scenes (Make3D [10]). On the NYUv2 dataset, the root mean squared error, average relative error, and average log10 error reach 0.506, 0.137, and 0.058, respectively. On the Make3D dataset, after removing invalid pixels, they reach 6.426, 0.188, and 0.076, respectively.
Nowadays, an increasing number of studies indicate that accurate depth information benefits computer vision tasks such as bokeh, 3D reconstruction, object detection, and image classification. Depth information not only allows people to have a better visual experience, but also makes biometric technology more reliable. However, compared with the acquisition of RGB images, obtaining depth maps is more difficult and relies on additional sensors (Microsoft Kinect, Intel RealSense, etc.). Predicting a depth map from a single RGB image removes the need for these sensors and reduces hardware cost, so accurate monocular depth estimation has become increasingly important.
Traditional methods usually extract hand-crafted features, such as SIFT [1], GIST [2], and HOG [3], to predict depth, but they only achieve good depth estimation in a single type of scene. With the development of neural networks, depth estimation from a single image has made great progress: effective features learned by convolutional neural networks can predict depth in different types of scenes at the same time. However, these works cannot predict accurate depth for small objects or scenes with complex backgrounds. They usually enlarge the feature maps with bilinear up-sampling during training, or fail to transfer multi-scale information to the end of the network effectively, which leaves the depth maps with blurred regions and lost contours.
In this thesis, we propose a multi-path-multi-rate feature extractor that extracts effective multi-scale information for accurate depth prediction. We use a U-Net-like [8] architecture to keep the depth maps at high resolution, and our multi-path-multi-rate extractor, composed of several dilated convolutions, transfers useful features from the encoder to the decoder. Dilated convolutions with different rates provide information from different fields of view, which makes the estimated depth more precise and preserves object contours.
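The abstract only describes the extractor at a high level; the exact number of paths, dilation rates, and channel counts are given in Chapter 3. As an illustration only, the snippet below is a minimal sketch of a block of parallel dilated convolutions in TensorFlow (the framework the thesis uses [48]); the function name, the rates (1, 2, 4, 8), and the channel counts are hypothetical and not the thesis's actual configuration.

```python
import tensorflow as tf

def multi_rate_block(x, channels=64, rates=(1, 2, 4, 8)):
    """Hypothetical multi-path block: parallel dilated convolutions with
    different rates, concatenated and fused by a 1x1 convolution."""
    paths = []
    for r in rates:
        # Each path sees a different field of view via its dilation rate.
        p = tf.keras.layers.Conv2D(channels, 3, padding="same",
                                   dilation_rate=r, activation="relu")(x)
        paths.append(p)
    merged = tf.keras.layers.Concatenate()(paths)
    # Reduce channels before passing the features across the
    # encoder-decoder skip connection.
    return tf.keras.layers.Conv2D(channels, 1, activation="relu")(merged)

# Example usage inside a functional model:
inputs = tf.keras.Input(shape=(240, 320, 128))
outputs = multi_rate_block(inputs)
```

Because each path keeps the same spatial resolution while enlarging the receptive field, concatenating them gives the decoder both local detail and global context, which is the property the abstract attributes to the extractor.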
We conducted experiments on indoor scenes (NYUv2 [9]) and outdoor scenes (Make3D [10]). In NYUv2 the depth ranges from 0 to 10 meters, and in Make3D it ranges from 0 to 70 meters. On the NYUv2 dataset, our root mean squared error (RMSE), average relative error (rel), and average log10 error (log10) are 0.498 meters, 0.136, and 0.058, respectively. On the Make3D dataset, evaluated without invalid pixels (depth > 70 meters) in the ground truth, our RMSE, rel, and log10 are 6.426 meters, 0.188, and 0.076, respectively. Finally, compared with T-net [11], on NYUv2 we improve RMSE from 0.572 meters to 0.498 meters (12.9%), rel from 0.151 to 0.136 (9.9%), and log10 from 0.064 to 0.058 (9.3%); on Make3D we improve RMSE from 6.9 meters to 6.308 meters (8.5%), rel from 0.207 to 0.191 (7.7%), and log10 from 0.084 to 0.074 (11.9%).
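For reference, the three metrics quoted above have standard definitions: RMSE is the root of the mean squared difference, rel is the mean of |pred − gt| / gt, and log10 is the mean absolute difference of the base-10 logarithms. The snippet below is a minimal NumPy sketch of one way to compute them while masking invalid ground-truth pixels (e.g., depths beyond 70 meters on Make3D); the function name and threshold argument are illustrative, not taken from the thesis code.

```python
import numpy as np

def depth_errors(pred, gt, max_depth=70.0):
    """RMSE, average relative error, and average log10 error over valid pixels."""
    valid = (gt > 0) & (gt <= max_depth)   # drop invalid ground-truth depths
    p = np.clip(pred[valid], 1e-6, None)   # avoid log10(0) for degenerate predictions
    g = gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rel = np.mean(np.abs(p - g) / g)
    log10 = np.mean(np.abs(np.log10(p) - np.log10(g)))
    return rmse, rel, log10
```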
1 Introduction . . . . . . . . . . . . . . . . . . . . . .1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goal and Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Works . . . . . . . . . . . . . . . . . . . . . .8
2.1 Traditional Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Monocular Depth Estimation by CNN-based Methods . . . . . . . . 9
3 Depth Estimation from A Single Image through Multi-path-multi-rate Diverse Feature Extractor . . . . . . . . . . . . . . . . . . . . . .14
3.1 Overview of Our Architecture . . . . . . . . . . . . . . . . . . . . . 15
3.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Multi-path-multi-rate Feature Extractor . . . . . . . . . . . 20
3.2.2 Upsampling method . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Experimental Results . . . . . . . . . . . . . . . . . . . . . .31
4.1 Implementation Details and Datasets . . . . . . . . . . . . . . . . . 31
4.1.1 Implementation Details . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 T-net [11] and our MIG, rate = 1 . . . . . . . . . . . . . . . 37
4.2.2 MIGs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 Multi-scale Information . . . . . . . . . . . . . . . . . . . . 43
4.2.5 Channel Reduction Method . . . . . . . . . . . . . . . . . . 47
4.2.6 Traditional Conv vs. Dilated Conv . . . . . . . . . . . . . . 50
4.3 Comparison with Other works on NYUv2 Dataset [9] . . . . . . . . 53
4.4 Comparison with Other works on Make3D Dataset [10] . . . . . . . 62
4.5 Results on Gesture Dataset . . . . . . . . . . . . . . . . . . . . . . 67
4.5.1 Gesture Dataset . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Parameter Counts and Speed . . . . . . . . . . . . . . . . . . . . . 73
4.7 High Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.7.1 Problem Description and Experiment Setting . . . . . . . . 74
4.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5 Conclusions . . . . . . . . . . . . . . . . . . . . . .77
References . . . . . . . . . . . . . . . . . . . . . .78
[1] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[2] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
[3] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2005, pp. 886–893.
[4] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
[5] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2650–2658.
[6] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV). IEEE, 2016, pp. 239–248.
[7] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2024–2039, 2016.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[9] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision. Springer, 2012, pp. 746–760.
[10] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 824–840, 2009.
[11] L. He, G. Wang, and Z. Hu, “Learning depth from single images with deep neural network embedding focal length,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4676–4689, 2018.
[12] H. Zhang, H. Han, J. Cui, S. Shan, and X. Chen, “Rgb-d face recognition via deep complementary and common feature learning,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 8–15.
[13] Y. Cao, C. Shen, and H. T. Shen, “Exploiting depth from single monocular images for object detection and semantic segmentation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 836–846, 2016.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[16] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox, “Learning to generate chairs with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1538–1546.
[17] Q. Gao, J. Liu, Z. Ju, Y. Li, T. Zhang, and L. Zhang, “Static hand gesture recognition with parallel cnns for space human-robot interaction,” in International Conference on Intelligent Robotics and Applications. Springer, 2017, pp. 462–473.
[18] J. Cui, H. Zhang, H. Han, S. Shan, and X. Chen, “Improving 2d face recognition via discriminative face depth estimation,” in 2018 International Conference on Biometrics (ICB). IEEE, 2018, pp. 140–147.
[19] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “Rgb-d mapping: Using depth cameras for dense 3d modeling of indoor environments,” in Experimental robotics. Springer, 2014, pp. 477–491.
[20] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3dmatch: Learning local geometric descriptors from rgb-d reconstructions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1802–1811.
[21] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from rgb-d images for object detection and segmentation,” in European conference on computer vision. Springer, 2014, pp. 345–360.
[22] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, “Indoor scene understanding with rgb-d images: Bottom-up segmentation, object detection and semantic segmentation,” International Journal of Computer Vision, vol. 112, no. 2, pp. 133–149, 2015.
[23] X. Ren, L. Bo, and D. Fox, “Rgb-(d) scene labeling: Features and algorithms,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2759–2766.
[24] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 681–687.
[25] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[26] D. Hoiem, A. A. Efros, and M. Hebert, “Automatic photo pop-up,” in ACM transactions on graphics (TOG), vol. 24, no. 3. ACM, 2005, pp. 577–584.
[27] A. Saxena, S. H. Chung, and A. Y. Ng, “3-d depth reconstruction from a single still image,” International journal of computer vision, vol. 76, no. 1, pp. 53–69, 2008.
[28] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[29] R. Girshick, “Fast r-cnn,” in International Conference on Computer Vision (ICCV), 2015.
[30] G. Wang, H.-T. Tsui, Z. Hu, and F. Wu, “Camera calibration and 3d reconstruction from a single view based on scene constraints,” Image and Vision Computing, vol. 23, no. 3, pp. 311–323, 2005.
[31] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in Advances in neural information processing systems, 2006, pp. 1161–1168.
[32] B. Liu, S. Gould, and D. Koller, “Single image depth estimation from predicted semantic labels,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 1253–1260.
[33] K. Karsch, C. Liu, and S. Kang, “Depth extraction from video using non-parametric sampling,” in European Conference on Computer Vision. Springer, 2012.
[34] C. Liu, J. Yuen, and A. Torralba, “Sift flow: Dense correspondence across scenes and its applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 978–994, 2010.
[35] M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 716–723.
[36] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1119–1127.
[37] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille, “Towards unified depth and semantic prediction from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2800–2809.
[38] A. Roy and S. Todorovic, “Monocular depth estimation using neural regression forest,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5506–5514.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[40] J. Li, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single rgb images,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3372–3380.
[41] D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 675–684.
[42] L. He, M. Yu, and G. Wang, “Spindle-net: Cnns for monocular depth inference with dilation kernel method,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 2504–2509.
[43] Z. Hao, Y. Li, S. You, and F. Lu, “Detail preserving depth estimation from a single image using attention guided networks,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 304–313.
[44] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[46] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, no. 7, pp. 59–72, 2007.
[47] L. Zwald and S. Lambert-Lacroix, “The berhu penalty and the grouped effect,” arXiv preprint arXiv:1207.6868, 2012.
[48] M. Abadi et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016. Software available from tensorflow.org.
[49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[50] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5162–5170.