Author (Chinese): 龔芠瑤
Author (English): Kung, Wen-Yao
Thesis Title (Chinese): 根據語意式切割與物體危險等級預測的道路場景理解
Thesis Title (English): Road Scene Understanding with Semantic Segmentation and Object Hazard Level Prediction
Advisor (Chinese): 陳煥宗
Advisor (English): Chen, Hwann-Tzong
Committee Members (Chinese): 劉庭祿、賴尚宏
Committee Members (English): Liu, Tyng-Luh; Lai, Shang-Hong
Degree: Master
University: National Tsing Hua University
Department: Computer Science
Student ID: 103062559
Publication Year (ROC): 105 (2016)
Graduation Academic Year: 104
Language: English
Number of Pages: 35
Keywords (Chinese): 卷積類神經網路、場景理解、語意式切割、危險程度預測
Keywords (English): Convolutional Neural Network; Scene Understanding; Semantic Segmentation; Hazard Level Prediction
Abstract (translated from Chinese): We propose a method based on a fully convolutional network that performs road scene understanding and predicts object hazard levels on a three-level scale. Given a single input image, the multi-task model produces both a fine segmentation result and a hazard-level prediction represented as a heatmap. The model consists of three parts: a shared net, a segmentation net, and a hazard level net. The shared net and the segmentation net adopt the encoder-decoder architecture for image segmentation proposed by Badrinarayanan et al. The hazard level net is a fully convolutional architecture that expresses the hazard level at each image location as a coarse segmentation result.
To train and test the proposed deep network, we build a dataset annotated with both object segmentation labels and hazard levels. To show that the network can learn more abstract object attributes, we evaluate with two measures and compare against a visual-saliency method, concluding that predicting object hazard levels is a different problem from estimating visual saliency.
Abstract (English): We introduce a method for understanding road scenes and simultaneously predicting the hazard levels of three categories of objects in road scene images by using a fully convolutional network (FCN) architecture. In our approach, given a single input image, the multi-task model produces a fine segmentation result and a prediction of hazard levels in the form of a heatmap. The model can be divided into three parts: a shared net, a segmentation net, and a hazard level net. The shared net and the segmentation net use the encoder-decoder architecture proposed by Badrinarayanan et al. [2]. The hazard level net is a fully convolutional network that estimates the hazard level of each segment as a coarse segmentation result.
We also provide a dataset with object segmentation ground truth and hazard-level annotations for training and evaluating the proposed deep networks. To show that our network can learn highly semantic attributes of objects, we use two measurements to evaluate the performance of our method, and we compare our method with a saliency-based method to show the difference between predicting hazard levels and estimating human eye fixations.
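The abstract describes a multi-task objective: a fine per-pixel segmentation output trained jointly with a coarse three-level hazard heatmap. The exact loss functions are in Section 3.3 of the thesis and not reproduced in this record, so the following NumPy sketch is only an illustration of the joint-training idea; the per-pixel cross-entropy form and the weighting factor `lam` are assumptions, not the author's exact formulation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all spatial positions.
    logits: (H, W, C) class scores; labels: (H, W) integer class ids."""
    probs = softmax(logits)
    h, w = labels.shape
    # Pick out the predicted probability of the true class at each pixel.
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.mean(np.log(picked + 1e-12))

def multitask_loss(seg_logits, seg_labels, haz_logits, haz_labels, lam=1.0):
    """Joint loss: fine segmentation CE at full resolution plus a weighted CE
    on the coarse 3-level hazard map. `lam` is an assumed balancing weight."""
    seg_loss = pixelwise_cross_entropy(seg_logits, seg_labels)
    haz_loss = pixelwise_cross_entropy(haz_logits, haz_labels)
    return seg_loss + lam * haz_loss
```

In this sketch the segmentation head would emit logits at the input resolution (e.g. `(H, W, num_classes)`), while the hazard head emits a lower-resolution `(H', W', 3)` map, matching the "coarse segmentation result" phrasing of the abstract.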
1 Introduction 8
2 Related Work 11
2.1 Road Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Deep Learning Based Semantic Segmentation . . . . . . . . . . . . . 11
2.3 Human Visual Attention . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Approach 13
3.1 Data Collection and Annotation . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 CamVid Road Scene Database . . . . . . . . . . . . . . . . . . 13
3.1.2 Hazard Level Annotation . . . . . . . . . . . . . . . . . . . . . 14
3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Experiments 23
4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Training Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Conclusion and Future Work 32
[1] J. M. Alvarez, F. Lumbreras, A. M. Lopez, and T. Gevers. Understanding road scenes using visual cues and GPS information. In ECCV Workshops (3), volume 7585 of Lecture Notes in Computer Science, pages 635-638. Springer, 2012.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, 2015.
[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, (2):88-97, 2009.
[4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR, abs/1412.7062, 2014.
[5] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650-2658, 2015.
[6] A. Ess, T. Mueller, H. Grabner, and L. J. V. Gool. Segmentation-based urban traffic scene understanding. In BMVC, pages 1-11. British Machine Vision Association, 2009.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026-1034, 2015.
[8] W. Huang and X. Gong. Fusion based holistic road scene understanding. CoRR, abs/1406.7525, 2014.
[9] X. Huang, C. Shen, X. Boix, and Q. Zhao. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In ICCV, pages 262-270, 2015.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, JMLR Proceedings, pages 448-456, 2015.
[11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146-2153, 2009.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[13] S. S. Kruthiventi, K. Ayush, and R. V. Babu. DeepFix: A fully convolutional neural network for predicting human eye fixations. CoRR, abs/1510.02927, 2015.
[14] M. Kummerer, L. Theis, and M. Bethge. Deep Gaze I: Boosting saliency prediction with feature maps trained on ImageNet. CoRR, abs/1411.1045, 2014.
[15] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu. Predicting eye fixations using convolutional neural networks. In CVPR, pages 362-370. IEEE Computer Society, 2015.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[17] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520-1528, 2015.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[19] P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, pages 1-11. British Machine Vision Association, 2009.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[21] C. L. Thomas. OpenSALICON: An open source implementation of the SALICON saliency model. Technical Report TR-2016-02, University of Pittsburgh, 2016.
[22] H. Yang, B. Lin, K. Chang, and C. Chen. Automatic age estimation from face images via deep ranking. In BMVC, pages 55.1-55.11, 2015.
[23] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529-1537. IEEE Computer Society, 2015.