
Detailed Record

Author (Chinese): 楊詠旭
Author (English): Yang, Yung-Hsu
Thesis Title (Chinese): 使用基於注意力模型的特徵聚合用於每個像素的密集預測
Thesis Title (English): Dense Prediction with Attentive Feature Aggregation
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee Members (Chinese): 徐宏民、王傑智
Committee Members (English): Hsu, Winston; Wang, Chieh-Chih
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 109061519
Publication Year (ROC calendar): 111 (2022)
Graduation Academic Year: 110
Language: English
Number of Pages: 45
Keywords (Chinese): 密集預測、特徵聚合、深度學習、電腦視覺
Keywords (English): Dense Prediction, Feature Aggregation, Deep Learning, Computer Vision
Abstract (Chinese): 聚合來自不同層的特徵資訊對於密集預測模型至關重要。儘管表現力有限,但簡單的特徵連接在聚合操作的選擇中仍占據主導地位。在本文中,我們引入了基於注意力模型的特徵聚合(AFA),以更具表現力的非線性操作來融合不同層的特徵。AFA 利用空間(Spatial)和通道(Channel)的注意力模型來計算每層激活函數值的加權平均。受到神經網路立體渲染(Neural Volume Rendering)的啟發,我們進一步擴展 AFA 成尺度空間渲染(Scale-Space Rendering)以執行多尺度預測的融合。AFA 適用於廣泛的現有網路設計。我們的實驗表明,在具有挑戰性的語義分割(Semantic Segmentation)基準上,包括 Cityscapes 和 BDD100K,在幾乎不增加計算量和參數量的情況下取得了一致且顯著的改進。特別是,AFA 在 Cityscapes 上將 DLA 模型的性能提高了近 6% mIoU。我們的實驗分析表明,AFA 學習逐步細化分割圖並改善邊界細節,從而在 NYUDv2 和 BSDS500 上的邊界偵測(Boundary Detection)基準上取得新的最先進結果。
Abstract (English): Aggregating information from features across different layers is essential for dense prediction models. Despite its limited expressiveness, vanilla feature concatenation dominates the choice of aggregation operations. In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive non-linear operations. AFA exploits both spatial and channel attention to compute weighted averages of the layer activations. Inspired by neural volume rendering, we further extend AFA with Scale-Space Rendering (SSR) to perform a late fusion of multi-scale predictions. AFA is applicable to a wide range of existing network designs. Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes and BDD100K, at negligible computational and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model by nearly 6% mIoU on Cityscapes. Our experimental analyses show that AFA learns to progressively refine segmentation maps and improve boundary details, leading to new state-of-the-art results on boundary detection benchmarks on NYUDv2 and BSDS500.
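The abstract's core idea, an attention-weighted average in place of plain concatenation, can be pictured with a short PyTorch sketch. This is a minimal illustration under assumed design choices: the module name, the squeeze-and-excitation-style channel branch, and the 7×7 spatial-attention convolution are all hypothetical, not the binary fusion described in Section 3.1.1 of the thesis.

```python
import torch
import torch.nn as nn

class AttentiveBinaryFusion(nn.Module):
    """Sketch of attention-weighted fusion of two feature maps (hypothetical design)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: global pooling followed by a bottleneck MLP.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single map in [0, 1] from the concatenated inputs.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Both inputs are assumed already resized to the same (N, C, H, W) shape.
        w = self.channel_att(high) * self.spatial_att(torch.cat([low, high], dim=1))
        # Weighted average of the two layer activations instead of concatenation.
        return w * low + (1.0 - w) * high
```

Scale-Space Rendering can likewise be pictured as volume-rendering-style compositing, with scales playing the role of samples along a ray: each scale contributes its prediction in proportion to a learned "opacity" and the transmittance left over by the scales already composited. The weighting below mirrors NeRF-style compositing and is only an assumption; the thesis's own derivation and its choice of φ (Section 3.2) may differ.

```python
import torch

def scale_space_render(preds: list, alphas: list) -> torch.Tensor:
    """Late fusion of multi-scale predictions via transmittance weighting (a sketch).

    preds:  per-scale predictions, each resized to a common (N, C, H, W).
    alphas: per-scale opacity maps in [0, 1], each of shape (N, 1, H, W).
    """
    fused = torch.zeros_like(preds[0])
    transmittance = torch.ones_like(alphas[0])
    for pred, alpha in zip(preds, alphas):
        weight = transmittance * alpha          # w_i = a_i * prod_{j<i} (1 - a_j)
        fused = fused + weight * pred
        transmittance = transmittance * (1.0 - alpha)
    return fused
```

In both sketches the output is a convex combination of the inputs, which is what lets the network softly select layers or scales per pixel rather than committing to fixed concatenation weights.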
Table of Contents:

Declaration
Acknowledgements

摘要 (Chinese Abstract)
Abstract
1 Introduction
2 Related Work
2.1 Multi-Scale Context
2.2 Feature Aggregation
2.3 Attention Module
2.4 Multi-Scale Inference
3 Method
3.1 Attentive Feature Aggregation
3.1.1 Binary Fusion
3.1.2 Multiple Feature Fusion
3.2 Scale-Space Rendering
3.2.1 Derivation
3.2.2 Choice of φ
3.3 Training Details
4 Experiments
4.1 Results on Cityscapes
4.2 Results on BDD100K
4.3 Boundary Detection
4.3.1 Results on NYUDv2
4.3.2 Results on BSDS500
4.4 Ablation Experiments
4.4.1 Binary Fusion
4.4.2 Attention Mechanism
4.4.3 Auxiliary Segmentation Head
4.4.4 Multiple Feature Fusion
4.4.5 Scale-Space Rendering
4.5 Attention Visualization
4.6 Segmentation Visualization
5 Conclusion
Appendix A
A.1 Ablation Study on Post-processing of Semantic Segmentation
A.2 Training Losses
A.2.1 Semantic Segmentation
A.2.2 Boundary Detection
A.3 Implementation Details
A.3.1 Semantic Segmentation
A.3.2 Boundary Detection
A.4 Visualization of Attention Maps
A.5 Qualitative Results
References

 
 
 
 