Author (English): Sun, Shih-Yi
Thesis Title (English): Memory-Efficient Dilated Convolution Engine with Group Ordering for Fast Image Processing
Advisor (English): Huang, Chao-Tsung
Oral Defense Committee (English): Lai, Yeong-Kang
Lin, Chia-Wen
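Compared with models that use only ordinary convolution, models that use dilated convolution can greatly enlarge the receptive field while avoiding loss of spatial information, given the same parameter count, computation, and model depth, and they perform better when trained on a variety of image processing applications. However, supporting dilated convolution in hardware raises two problems: (1) excessively large dilation rates are hard to support, and (2) the data a dilated convolution needs are not spatially adjacent. To address the former, we stack convolution layers whose dilation rates are no greater than 16; once the receptive field of the model has grown large enough, it reaches performance similar to that of models built with excessively large dilation rates. For the latter, prior work has proposed a delay cell approach, but it must read the input data repeatedly to obtain the final result and must buffer all input and output data in on-chip memory. Under this baseline architecture, on-chip memory may account for 95% of the total area and 95% of the total power, and system throughput drops substantially, which places a very heavy burden on the whole system.
We implement two hardware designs that support dilation rates no greater than 16 to evaluate the hardware benefits of these approaches: one supports all dilation rates, and the other supports dilation rates that are powers of 2. We synthesize the designs in the TSMC 40 nm process. The system uses a total of 0.5 KB/0.5 KB of on-chip memory, and the total synthesized area is 2.036 mm²/1.966 mm². Compared with a baseline architecture of the same parallelism, on-chip memory usage is reduced by 69.1%/69.1% and total power by 52.43%/55.24%. Running at 200 MHz, the system can process 12200 blocks of side length 256 per second, raising the processing rate to 2.997 times.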
Compared to plain convolutional neural network (CNN) models, dilated CNN models can enlarge the receptive field without losing information in the spatial domain at the same model depth, computation amount, and parameter size. They also perform better than plain models in several image processing tasks. However, supporting dilated convolution in hardware raises two problems: first, excessive dilation rates are hard to support; second, the input data of a dilated convolution are not adjacent in the spatial domain. To solve the former problem, we stack convolution layers with dilation rates no greater than 16 and reach performance similar to that of models with excessive dilation rates in several image processing applications. For the latter, a delay cell method has been proposed. However, this method must access the input data several times to obtain the final computation result, and the entire input and output data must be temporarily stored in SRAMs. In this baseline structure, SRAMs might occupy 95% of the total area and consume 95% of the total power, and system throughput decreases substantially, which is a huge overhead for the entire system.
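To make the receptive-field argument concrete, the sketch below computes the receptive field of a stack of stride-1 convolution layers, using the standard relation that each layer widens the field by (k - 1)·d for kernel size k and dilation rate d. The 3×3 kernel size, the five-layer depth, and the dilation schedule 1, 2, 4, 8, 16 are illustrative assumptions, not the configuration evaluated in the thesis.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (per spatial axis) of stacked stride-1 conv layers.

    Each layer with dilation d and kernel size k adds (k - 1) * d
    to the receptive field of the stack below it.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical five-layer stacks with identical depth and parameter count.
plain_rf = receptive_field([1, 1, 1, 1, 1])      # plain convolution only
dilated_rf = receptive_field([1, 2, 4, 8, 16])   # dilation rates capped at 16

print(plain_rf, dilated_rf)  # 11 63
```

Under these assumptions both stacks have the same depth and the same number of weights, yet the dilated stack covers a 63-pixel span per axis against 11 for the plain one.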
In this thesis, we propose an engine that calculates dilated convolution in group order. First, we change the order of input and output data from spatial order to group order, which solves the problem caused by sparse data and saves a large amount of SRAM. Then, we look ahead to the dilation rate of the next layer and perform group shuffling before delivering output data, so that the output can be used directly by the following layer. Furthermore, we suggest the number of SRAM banks and design a memory-mapping mechanism that prevents system efficiency from being limited by SRAM speed.
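The idea behind group ordering can be illustrated in one dimension. The sketch below is a software analogy, not the thesis's hardware design: it splits a signal into d groups by index modulo the dilation rate d, so that the samples a dilated kernel touches become adjacent within a group, and it checks that plain convolution on each group reproduces the directly computed dilated convolution. All function names are hypothetical, and `regroup` is only a conceptual stand-in for group shuffling (it passes through spatial order, which a real engine would presumably avoid).

```python
def dilated_conv1d(x, w, d):
    """Direct 'valid' dilated correlation: y[i] = sum_k w[k] * x[i + k*d]."""
    span = (len(w) - 1) * d
    return [sum(w[k] * x[i + k * d] for k in range(len(w)))
            for i in range(len(x) - span)]


def to_groups(x, d):
    """Spatial order -> group order: group g holds x[g], x[g+d], x[g+2d], ..."""
    return [x[g::d] for g in range(d)]


def from_groups(groups):
    """Group order -> spatial order (interleave the groups back)."""
    d = len(groups)
    out = [None] * sum(len(g) for g in groups)
    for g, vals in enumerate(groups):
        out[g::d] = vals
    return out


def grouped_dilated_conv1d(x, w, d):
    """Dilated convolution computed as a plain convolution inside each group."""
    outs = [dilated_conv1d(group, w, 1) for group in to_groups(x, d)]
    return from_groups(outs)


def regroup(groups, d_next):
    """Re-emit grouped data in the group order of the next layer's rate."""
    return to_groups(from_groups(groups), d_next)


x = [(i * i) % 7 for i in range(20)]
w = [1, -2, 3]
assert grouped_dilated_conv1d(x, w, 4) == dilated_conv1d(x, w, 4)
print(to_groups(list(range(8)), 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Within each group the kernel reads consecutive elements, which is why data stored in group order no longer look sparse to the convolution.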
Finally, we implement two hardware designs that support dilation rates no greater than 16 to evaluate the proposed engine: one supports all dilation rates, and the other supports dilation rates that are powers of 2. We synthesize the engine in the TSMC 40 nm process. It uses 0.5 KB/0.5 KB of on-chip memory, and its total synthesis area is 2.036 mm²/1.966 mm². Compared to the baseline architecture, the engine reduces SRAM usage by 69.1%/69.1% and total power by 52.43%/55.24%. When running at 200 MHz, it can process 12200 blocks with a side length of 256 per second, which is 2.997 times faster.
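As a rough sanity check on the reported throughput, the figures from the abstract can be converted into pixels per second and pixels per clock cycle. This is a back-of-the-envelope sketch: it assumes that a block is a square of 256 × 256 pixel positions and that "2.997 times faster" is a ratio of block rates, neither of which the abstract states explicitly. The derived values are not results reported by the thesis.

```python
# Figures quoted in the abstract.
BLOCKS_PER_SECOND = 12_200
BLOCK_SIDE = 256
CLOCK_HZ = 200_000_000      # 200 MHz operating frequency
SPEEDUP = 2.997

# Assumption: one block covers BLOCK_SIDE x BLOCK_SIDE pixel positions.
pixels_per_block = BLOCK_SIDE * BLOCK_SIDE
pixels_per_second = BLOCKS_PER_SECOND * pixels_per_block
pixels_per_cycle = pixels_per_second / CLOCK_HZ

# Assumption: the speedup is a ratio of block rates against the baseline.
implied_baseline_blocks_per_second = BLOCKS_PER_SECOND / SPEEDUP

print(f"{pixels_per_second} pixel positions per second")
print(f"{pixels_per_cycle:.3f} pixel positions per cycle")
print(f"{implied_baseline_blocks_per_second:.0f} baseline blocks per second (implied)")
```

Under these assumptions the engine finishes roughly four pixel positions of a block per clock cycle.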
Abstract (Chinese)
Abstract
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Related Work
1.2.1 Dilated Convolution
1.2.2 Image processing applications
1.2.3 Matrix splitting and matrix merging
1.3 DT-CNN
1.4 Thesis Organization
2 Software model evaluation
2.1 Overview to Fast Image Processing models
2.1.1 Multi-Scale Context Aggregation Network
2.1.2 Adaptive Normalization
2.1.3 Baseline model of Fast Image Processing applications
2.2 Quality Comparison
3 System Architecture of Group Ordering Dilated Convolution Engine
3.1 System Overview
3.2 Group Ordering
3.3 Memory Mapping
3.4 Layer-lookahead and Group Shuffling
4 Implementation of Group Ordering Dilated Convolution Engine
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future work
