
Detailed Record

Author (Chinese): 孫士益
Author (English): Sun, Shih-Yi
Title (Chinese): 適用於快速圖像處理之具記憶體效率分組排序膨脹卷積引擎
Title (English): Memory-Efficient Dilated Convolution Engine with Group Ordering for Fast Image Processing
Advisor (Chinese): 黃朝宗
Advisor (English): Huang, Chao-Tsung
Committee Members (Chinese): 賴永康, 林嘉文
Committee Members (English): Lai, Yeong-Kang; Lin, Chia-Wen
Degree: Master
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061624
Publication Year (ROC calendar): 111 (2022)
Graduation Academic Year: 110
Language: English
Number of Pages: 43
Keywords (Chinese): 膨脹卷積, 硬體, 群組順序
Keywords (English): Dilated Convolution, Hardware, Group Order
Abstract (Chinese):
Compared with models that use only plain convolution, models that use dilated convolution can greatly enlarge the receptive field and avoid losing spatial information while keeping the same parameter size, computation amount, and model depth, and they achieve better results when learning a variety of image processing applications. However, supporting dilated convolution in hardware runs into two problems: (1) excessively large dilation rates are hard to support, and (2) the data required by a dilated convolution are not spatially adjacent. For the former, we stack convolution layers with dilation rates no greater than 16; once the receptive field grows large enough, the stack reaches performance similar to models built with excessively large dilation rates. For the latter, prior work proposed a delay-cell approach, but it must read the input data repeatedly to obtain the final result and must buffer all input and output data in on-chip memory. Under this baseline architecture, the on-chip memory can occupy 95% of the total area and 95% of the total power, and the system throughput drops substantially, placing a very heavy burden on the entire system.
In this thesis, we propose an engine that performs dilated convolution in group order. First, we change the order of the input and output data from spatial order to a group order in which the data are grouped according to the dilation rate; this solves the problem of spatially scattered data and avoids using a large amount of on-chip memory. Next, so that the output can be used directly by the next convolution layer, we look ahead to the dilation rate of the next layer, regroup the output data, and then deliver the results. Finally, we design a suitable number of on-chip memory banks and a memory-mapping mechanism so that system efficiency is not limited by the speed of the on-chip memory.
We implement two hardware designs that support dilation rates no greater than 16 to evaluate the benefits of these techniques: one supports all dilation rates, and the other supports dilation rates that are powers of 2. Synthesized in a TSMC 40 nm process, the system uses 0.5 KB/0.5 KB of on-chip memory in total, with a total synthesized area of 2.036 mm²/1.966 mm². Compared with the baseline architecture at the same parallelism, on-chip memory usage is reduced by 69.1%/69.1% and total power by 52.43%/55.24%. Running at 200 MHz, the system processes 12,200 blocks with a side length of 256 per second, a 2.997x improvement in processing speed.
Abstract (English):
Compared with plain convolutional models, dilated convolutional neural network models can greatly enlarge the receptive field without losing spatial information while keeping the same model depth, computation amount, and parameter size, and they perform better on several image processing tasks. However, supporting dilated convolution in hardware raises two problems: (1) excessively large dilation rates are hard to support, and (2) the input data required by a dilated convolution are not spatially adjacent. To solve the former, we stack convolution layers with dilation rates no greater than 16 and reach performance similar to models that use excessively large dilation rates on several image processing applications. For the latter, a delay-cell method has been proposed, but it must access the input data several times to obtain the final result and must buffer all input and output data in SRAM. In this baseline structure, the SRAM can occupy 95% of the total area and consume 95% of the total power, and the system throughput drops substantially, which is a huge overhead for the entire system.
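To make the non-adjacency problem concrete, here is a minimal NumPy sketch of a single-channel 2D dilated convolution (an illustration only, not the engine's datapath; the function and variable names are mine). With dilation d, each tap of the 3x3 kernel reads input samples that are d pixels apart.

import numpy as np

def dilated_conv2d(x, w, dilation):
    # Single-channel 2D dilated convolution with 'valid' padding.
    k = w.shape[0]                        # kernel side length, e.g. 3
    span = (k - 1) * dilation + 1         # spatial footprint of the dilated kernel
    H, W = x.shape
    out = np.zeros((H - span + 1, W - span + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The taps are 'dilation' pixels apart: the non-adjacent access
            # pattern that is awkward to feed from ordinary line buffers.
            patch = x[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = np.sum(patch * w)
    return out

# A 3x3 kernel with dilation 16 covers a 33x33 window, so its 9 taps are
# spread far apart in the input feature map.
x = np.random.randn(64, 64).astype(np.float32)
w = np.random.randn(3, 3).astype(np.float32)
y = dilated_conv2d(x, w, dilation=16)     # output shape: (32, 32)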
In this thesis, we propose an engine that computes dilated convolution in group order. First, we change the order of the input and output data from spatial order to group order, which solves the problem of spatially scattered data and saves a large amount of SRAM. Then, we look ahead to the dilation rate of the next layer and shuffle the groups before delivering the output data, so that the output can be consumed directly by the following layer. Furthermore, we choose a suitable number of SRAM banks and design a memory-mapping mechanism so that system efficiency is not limited by SRAM access speed.
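The abstract does not spell out the exact grouping, but group ordering builds on the matrix splitting and merging idea listed in Section 1.2.3 of the table of contents: a feature map can be split into d*d interleaved groups so that the taps of a dilation-d kernel become adjacent within each group. The sketch below illustrates only that splitting and its inverse, under my own naming; it is not the engine's actual data layout or SRAM mapping.

import numpy as np

def split_into_groups(x, d):
    # Split a feature map into d*d interleaved groups, one per phase (gy, gx).
    # Pixels read together by a dilation-d kernel fall into the same group,
    # so within a group a plain (dilation-1) kernel can be applied.
    H, W = x.shape
    assert H % d == 0 and W % d == 0
    return [x[gy::d, gx::d] for gy in range(d) for gx in range(d)]

def merge_groups(groups, d, H, W):
    # Inverse operation: interleave the groups back into spatial order.
    x = np.empty((H, W), dtype=groups[0].dtype)
    for idx, g in enumerate(groups):
        gy, gx = divmod(idx, d)
        x[gy::d, gx::d] = g
    return x

x = np.arange(8 * 8).reshape(8, 8)
groups = split_into_groups(x, d=4)               # 16 groups, each of shape (2, 2)
assert np.array_equal(merge_groups(groups, 4, 8, 8), x)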
Finally, we implement two hardware designs that support dilation rates no greater than 16 to evaluate these techniques: one supports all dilation rates, and the other supports only dilation rates that are powers of 2. The proposed engine is synthesized in a TSMC 40 nm process; it uses 0.5 KB/0.5 KB of on-chip memory, and its total synthesized area is 2.036 mm²/1.966 mm². Compared with the baseline architecture at the same parallelism, the engine reduces SRAM usage by 69.1%/69.1% and total power by 52.43%/55.24%. Running at 200 MHz, it processes 12,200 blocks with a side length of 256 per second, a 2.997x speedup over the baseline.
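As a rough consistency check on the quoted numbers (assuming a block of side length 256 corresponds to 256 x 256 output pixels), 12,200 blocks per second at 200 MHz works out to roughly four pixels per clock cycle:

blocks_per_second = 12200
pixels_per_block = 256 * 256            # assumed square block, side length 256
clock_hz = 200e6                        # 200 MHz operating frequency

pixels_per_second = blocks_per_second * pixels_per_block   # ~8.0e8 pixels/s
pixels_per_cycle = pixels_per_second / clock_hz            # ~4.0 pixels/cycle
print(f"{pixels_per_second:.3e} pixels/s, {pixels_per_cycle:.2f} pixels/cycle")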
Table of Contents:
Abstract (Chinese)
Abstract
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Related Work
1.2.1 Dilated Convolution
1.2.2 Image processing applications
1.2.3 Matrix splitting and matrix merging
1.3 DT-CNN
1.4 Thesis Organization
2 Software model evaluation
2.1 Overview of Fast Image Processing models
2.1.1 Multi-Scale Context Aggregation Network
2.1.2 Adaptive Normalization
2.1.3 Baseline model of Fast Image Processing applications
2.2 Quality Comparison
3 System Architecture of Group Ordering Dilated Convolution Engine
3.1 System Overview
3.2 Group Ordering
3.3 Memory Mapping
3.4 Layer-lookahead and Group Shuffling
4 Implementation of Group Ordering Dilated Convolution Engine
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work