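A convolutional neural network (CNN) is a deep learning algorithm for image recognition. Because it achieves highly accurate results, it is widely used in applications such as artificial intelligence and computer vision. However, its high-dimensional network structure and the computational load of processing large amounts of data make throughput, and the heavy data access that comes with it, a problem, which has made this a popular research topic. Even if highly parallel computation improves throughput, the orchestration of data movement must be considered at the same time in order to reduce bandwidth requirements. We therefore propose a dataflow with a matching hardware architecture (extended from MIT's Eyeriss, an energy-efficient reconfigurable CNN accelerator) that eliminates a large amount of unnecessary data access without sacrificing performance.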


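Exploiting the highly parallel nature of convolution layers, we use a large number of processing elements (PEs) computing simultaneously to achieve high throughput. The core of our proposed dataflow is to make full use of the various forms of data sharing in convolution layers and to keep data in the local memory of each PE. When computing each convolution window, a PE repeatedly reads data directly from its local buffer, so it can immediately accumulate consecutive outputs and also reduce the burden of moving intermediate results and operands to and from DRAM. On this basis, we propose a tiling method that increases PE memory to raise computational efficiency: the larger the buffer, the more data can be reused directly inside the PE and the more intermediate results can be accumulated, giving better performance. In addition, because each layer of a CNN has a different structure, our dataflow and hardware architecture are reconfigurable and, together with a network-on-chip built on a tag-based identification scheme, achieve effective utilization of space and functional units in every layer. For the second convolution layer of AlexNet, using an additional 2.63 KB of buffer yields a 4.64 times performance improvement; compared with Eyeriss on the same layer, we are 1.07 times faster while using an additional 18.38 KB of buffer. For the third convolution layer, adding 12.8 KB of buffer makes us 14.55 times faster; compared with Eyeriss on the same layer, we are 1.03 times faster while using an additional 28.55 KB of memory. Across all AlexNet convolution layers, our performance reaches 41.7 fps (55.5 GOPS) at a 250 MHz clock frequency. Compared with Eyeriss at the same 200 MHz clock frequency, convolution layers 2 to 4 of AlexNet run 1.03 to 1.07 times faster, while using 16 KB more on-chip memory.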
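The effect of keeping operands in a PE-local buffer can be illustrated with a small sketch. This is not the thesis's dataflow or hardware: it is a one-dimensional toy model with stride 1, and the function names `conv1d_no_buffer` and `conv1d_local_buffer` are hypothetical. It only counts how many input values must be fetched from higher-level memory with and without a sliding-window buffer.

```python
from collections import deque


def conv1d_no_buffer(row, weights):
    """Toy model: every input operand is fetched from higher-level memory on each use."""
    k = len(weights)
    fetches = 0
    outputs = []
    for i in range(len(row) - k + 1):
        acc = 0
        for j in range(k):
            acc += row[i + j] * weights[j]
            fetches += 1  # one higher-level memory read per multiply
        outputs.append(acc)
    return outputs, fetches


def conv1d_local_buffer(row, weights):
    """Toy model: a k-entry local buffer holds the current window, so each input is fetched once."""
    k = len(weights)
    window = deque(maxlen=k)  # assumed local buffer of exactly one window
    fetches = 0
    outputs = []
    for value in row:
        window.append(value)  # the only higher-level memory read for this input
        fetches += 1
        if len(window) == k:
            # All k operands come from the local buffer; accumulate immediately.
            outputs.append(sum(a * b for a, b in zip(window, weights)))
    return outputs, fetches


if __name__ == "__main__":
    row = list(range(1, 9))      # 8 input values
    weights = [1, 0, -1]         # 3-tap filter
    print(conv1d_no_buffer(row, weights))     # 18 fetches
    print(conv1d_local_buffer(row, weights))  # 8 fetches, same outputs
```

In this toy case both versions produce the same six outputs, but the buffered version reads each of the 8 inputs once instead of performing 18 reads. A larger buffer would extend the same idea to whole rows or tiles.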
A convolutional neural network (CNN) is a deep learning method for visual recognition. Its state-of-the-art accuracy has made it widely used in artificial intelligence, computer vision, self-driving cars, and other applications. However, CNNs are computationally complex and demand high memory bandwidth. Although highly parallel computation yields effective throughput, data movement must also be orchestrated carefully to limit the resulting increase in memory bandwidth. To address these problems, we present a specialized dataflow with a spatial hardware architecture (extended from MIT's Eyeriss, an energy-efficient reconfigurable CNN accelerator) that reduces memory access without sacrificing performance.

Existing work typically improves either computation or memory access. However, computational parallelism and memory bandwidth affect each other, so both must be considered at the same time.

The convolution operations of CNNs exhibit several types of data reuse and a high degree of parallelism. We apply a highly parallel PE array to improve throughput. To minimize data access, we propose a dataflow that exploits data reuse opportunities and the local buffer inside each PE, so that data can be reused temporally without repeated transfers between high-level memory and the PEs. In addition, the large amount of intermediate data, which would otherwise put additional pressure on storage, can be accumulated immediately. Accordingly, we propose a tiling methodology that trades off performance against local buffer size. The larger the local buffer, the more data can be reused and the more intermediate data can be consumed, which alleviates the data streaming bottleneck and enables efficient computation. Furthermore, our dataflow and hardware adapt to layers of varying shapes, so throughput can be maximized in each layer of a CNN. For layer 2 of AlexNet, an additional 2.63 KB of buffer yields a 4.64 times speedup over the initial buffer size, which holds one row of input data. Compared to Eyeriss, we achieve a 1.07 times speedup using an additional 18.38 KB of buffer. For layer 3 of AlexNet, an additional 12.8 KB of buffer yields a 14.55 times speedup. Compared to Eyeriss, we achieve a 1.03 times speedup using an additional 28.55 KB of buffer. Overall, throughput reaches 41.7 fps (55.5 GOPS) at 250 MHz for the convolution layers of AlexNet. Compared to Eyeriss at the same 200 MHz frequency on AlexNet, we achieve speedups of 1.03 to 1.07 times in layers 2 to 4 using an additional 16 KB of on-chip buffer.
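The reported 41.7 fps and 55.5 GOPS can be cross-checked with a short calculation. The sketch below assumes the layer shapes of the original grouped AlexNet and counts each multiply-accumulate as two operations. Neither assumption is stated in the abstract, and the names `ALEXNET_CONV`, `conv_macs`, and `gops_from_fps` are hypothetical.

```python
# Assumed AlexNet convolution layer shapes (original grouped variant):
# (output width/height, output channels, kernel size, input channels per filter)
ALEXNET_CONV = [
    (55, 96, 11, 3),     # layer 1
    (27, 256, 5, 48),    # layer 2
    (13, 384, 3, 256),   # layer 3
    (13, 384, 3, 192),   # layer 4
    (13, 256, 3, 192),   # layer 5
]


def conv_macs(out_size, out_channels, kernel, in_channels):
    """Multiply-accumulate count for one convolution layer (square outputs and kernels)."""
    return out_size * out_size * out_channels * kernel * kernel * in_channels


def gops_from_fps(fps, layers=ALEXNET_CONV):
    """Throughput in GOPS, assuming 1 MAC = 2 operations."""
    total_macs = sum(conv_macs(*layer) for layer in layers)
    return 2 * total_macs * fps / 1e9


if __name__ == "__main__":
    total = sum(conv_macs(*layer) for layer in ALEXNET_CONV)
    print(total)                          # 665784864 MACs per frame
    print(round(gops_from_fps(41.7), 1))  # 55.5
```

Under these assumptions, roughly 666 million MACs per frame at 41.7 fps gives about 55.5 GOPS, which matches the figure quoted above.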
1 Introduction
1.1 Introduction of Convolution Neural Network (CNN) and Motivation
1.2 Previous Work
1.3 Contribution
1.4 Organization of Thesis
2 CNN Background
2.1 Primer
2.1.1 Convolution Layer
2.1.2 Non-Linearity Layer (ReLU)
2.1.3 Pooling Layer (Sub Sampling)
2.1.4 Fully-Connected Layer
2.2 State-of-the-Art CNN Models
2.3 Challenges and Breakthrough
3 Spatial Architecture and Specialized Dataflow
3.1 Overview of Spatial Architecture
3.2 Specialized Dataflow
3.2.1 Logical Mapping
3.2.2 Physical Mapping
3.3 Network-on-Chip (NoC)
3.4 Processing Engine (PE)
4 Tiling Methodology
5 Experiment Results and Analysis
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
