
Detailed Record

Author (Chinese): 許毓軒
Author (English): Xu, Yu-Syuan
Title (Chinese): 可動態調整影片語義分割網路
Title (English): Dynamic Video Segmentation Network
Advisor (Chinese): 李濬屹
Advisor (English): Lee, Chun-Yi
Committee Members (Chinese): 陳煥宗、黃稚存
Committee Members (English): Chen, Hwann-Tzong; Huang, Chih-Tsun
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 106062504
Year of Publication (ROC calendar): 107 (2018)
Academic Year of Graduation: 106
Language: English
Number of Pages: 37
Keywords (Chinese): 電腦視覺、深度學習、語意分割、機器學習
Keywords (English): Computer Vision, Deep Learning, Machine Learning, Semantic Segmentation
Statistics:
  • Recommendations: 0
  • Views: 384
  • Rating: *****
  • Downloads: 23
  • Bookmarks: 0
Abstract (Chinese):
In recent years, semantic image segmentation using deep convolutional neural networks has achieved unprecedented accuracy on a variety of datasets. Accurate semantic segmentation enables many applications, such as autonomous vehicles, surveillance cameras, and drones. These applications typically demand real-time responses, so a high frame rate is essential; however, deep convolutional neural networks require long execution times and cannot meet this real-time requirement.
The dynamic video segmentation network (DVSNet) is designed to achieve fast and accurate video semantic segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former produces accurate segmentation results but has more layers and is slower; the latter is much faster, but its output requires additional processing to obtain segmentation results and is less accurate. DVSNet uses a decision network to dynamically assign different frame regions to different networks according to an expected confidence score: frame regions with a higher expected confidence score are processed by the flow network, while frame regions with a lower expected confidence score must pass through the segmentation network. Experimental results show that DVSNet achieves 70.4% mIoU at 19.8 fps on the Cityscapes dataset, a high-speed version of DVSNet delivers 30.4 fps with 63.2% mIoU on the same dataset, and DVSNet can reduce the computational workload by up to 95%.
Abstract (English):
In this paper, we present a detailed design of the dynamic video segmentation network (DVSNet) for fast and efficient video semantic segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former generates highly accurate semantic segmentations, but is deeper and slower. The latter is much faster than the former, but its output requires further processing to generate less accurate semantic segmentations. We explore the use of a decision network to adaptively assign different frame regions to different networks based on a metric called the expected confidence score. Frame regions with a higher expected confidence score traverse the flow network, while frame regions with a lower expected confidence score have to pass through the segmentation network. We have performed extensive experiments on various configurations of DVSNet and investigated a number of variants of the proposed decision network. The experimental results show that DVSNet is able to achieve up to 70.4% mIoU at 19.8 fps on the Cityscapes dataset. A high-speed version of DVSNet is able to deliver 30.4 fps with 63.2% mIoU on the same dataset. DVSNet is also able to reduce up to 95% of the computational workload.
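
To make the region-routing idea in the abstract concrete, below is a minimal Python sketch of one DVSNet-style inference step. All names (segmentation_net, flow_net, decision_net, warp) and the threshold value 0.8 are illustrative placeholders assumed for this sketch, not the authors' implementation: each network is replaced by a dummy stand-in so the example runs on its own.

import numpy as np

# Minimal sketch of DVSNet-style per-region inference (illustrative only; all
# names below are assumed placeholders, not the authors' implementation).

def segmentation_net(region):
    # Stand-in for the slow, accurate segmentation network: returns a label map.
    return np.zeros(region.shape[:2], dtype=np.int32)

def flow_net(key_region, region):
    # Stand-in for the fast optical-flow network: returns a per-pixel flow field.
    return np.zeros(region.shape[:2] + (2,), dtype=np.float32)

def decision_net(key_region, region):
    # Stand-in for the decision network: predicts the expected confidence score.
    return 1.0 - float(np.mean(np.abs(region - key_region))) / 255.0

def warp(key_segmentation, flow):
    # Stand-in warp: propagates the key frame's labels along the flow
    # (identity here, since the dummy flow is all zeros).
    return key_segmentation

def dvsnet_step(regions, key_regions, key_segs, threshold=0.8):
    # Route each frame region to the flow path (fast) or the segmentation
    # path (accurate) based on its expected confidence score.
    outputs = []
    for region, key_region, key_seg in zip(regions, key_regions, key_segs):
        if decision_net(key_region, region) >= threshold:
            outputs.append(warp(key_seg, flow_net(key_region, region)))
        else:
            outputs.append(segmentation_net(region))
    return outputs

# Usage example: four regions of the current frame compared against the key frame.
regions = [np.random.randint(0, 256, (256, 512, 3)).astype(np.float32) for _ in range(4)]
key_regions = [r + np.random.normal(0, 5, r.shape).astype(np.float32) for r in regions]
key_segs = [np.zeros((256, 512), dtype=np.int32) for _ in regions]
print(len(dvsnet_step(regions, key_regions, key_segs)))

The sketch omits the adaptive key frame scheduling policy (Section 3.2 of the thesis), under which regions routed to the segmentation network would also refresh their key frames for subsequent frames.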
Table of Contents:
Chinese Abstract i
Abstract ii
Acknowledgements iii
Contents iv
List of Figures vi
List of Tables vii
List of Algorithms viii
1 Introduction 1
2 Background 7
2.1 Image Semantic Segmentation 7
2.2 Optical Flow 8
2.3 Video Semantic Segmentation 8
3 DVSNet 9
3.1 Dynamic Video Segmentation Network 9
3.2 Adaptive Key Frame Scheduling 11
3.3 Frame Region Based Execution 12
3.4 DVSNet Inference Algorithm 13
3.5 DN and Its Training Methodology 14
4 Experiments 16
4.1 Experimental Setup 16
4.2 Validation of DVSNet 18
4.3 Validation of DVSNet’s Adaptive Key Frame Scheduling Policy 19
4.4 Computation Time Analysis 21
4.5 Comparison of DN Configurations 21
4.6 Impact of Frame Division Schemes 22
4.7 Impact of Overlapped Regions on Accuracy 23
4.8 Results 23
5 Conclusion and Future Work 25
5.1 Conclusion 25
5.2 Future Work 25