Author (in English): Chen, Chun-Chen
Thesis title (in English): Design Exploration Methodology for Deep Convolutional Neural Network Accelerator
Advisor (in English): Huang, Chih-Tsun
Oral defense committee (in English): Liou, Jing-Jia
Shieh, Ming-Der
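[...] performance can be estimated rapidly with less than 0.63% error. In order to apply the inference accelerator to [...]
[...] results, we propose an improved accelerator that, compared with the existing Eyeriss [1] architecture, can
achieve performance improvements of 1.34x on ResNet-50 and 2.39x on MobileNet-V2 [...]
Finally, an accelerator with 2016 processing elements is used as an example to demonstrate our method [...]
[...] capacity, the design architecture can be improved step by step, reaching 1849.89
MACOPS (multiply-and-accumulate operations per cycle) on ResNet-50. Such high computational efficiency (91.8%) demonstrates [...]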
Deep convolutional neural networks (DCNNs) play a key role in modern artificial
intelligence (AI) applications. Recently, a growing number of inference accelerators have been
proposed to cope with the enormous computational complexity of DCNNs. Dedicated accelerators
make real-time inference possible by exploiting the high degree of
computational parallelism. However, this computational complexity comes with a correspondingly
large data requirement. Storing the data of an entire convolutional layer in on-chip storage
would incur a prohibitive memory cost. As a result, the whole-layer
computation must be partitioned into small pieces so that the convolution can be processed efficiently.
The way the computation is partitioned and scheduled is called the dataflow. A complicated
dataflow with a massive bandwidth requirement makes optimizing the architectural design
a significant burden.
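
To give a rough sense of why whole-layer on-chip storage is costly, the sketch below computes the data footprint of a single convolutional layer. The layer shape (56x56 input, 64 input and output channels, 3x3 kernel) and the 16-bit value width are hypothetical choices for illustration and are not taken from the thesis; the function name `conv_layer_footprint` is likewise an assumption.

```python
def conv_layer_footprint(h, w, c_in, c_out, k, stride=1, pad=0, bytes_per_value=2):
    """Return (ifmap, weights, ofmap) sizes in bytes for one convolutional layer.

    All parameters are illustrative assumptions: bytes_per_value=2 models
    16-bit values, which the source does not specify.
    """
    # Standard output-size formula for a strided, padded convolution.
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    ifmap = h * w * c_in * bytes_per_value
    weights = k * k * c_in * c_out * bytes_per_value
    ofmap = h_out * w_out * c_out * bytes_per_value
    return ifmap, weights, ofmap


# Hypothetical layer: 56x56 input, 64 -> 64 channels, 3x3 kernel, padding 1.
ifmap, weights, ofmap = conv_layer_footprint(56, 56, 64, 64, 3, pad=1)
total = ifmap + weights + ofmap
print(f"ifmap={ifmap} B, weights={weights} B, ofmap={ofmap} B, total={total} B")
```

Under these assumptions, one mid-sized layer already needs several hundred kilobytes, which is why the computation is processed in smaller pieces rather than held on chip in full.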
To design an efficient DCNN accelerator, the dataflow and the hardware architecture should
be considered simultaneously. In this thesis, we propose an analytical model for fast and
accurate latency estimation during the early design phase, based on the regularity of
convolution. Using our model, the performance of an energy-efficient inference architecture can be
estimated rapidly with less than 0.63% error. To adapt the inference architecture to
different DCNN models, a parameter exploration flow is proposed to search for the optimized
workload arrangement.
With our exploration flow, the performance bottleneck of the target DCNN accelerator
can be easily identified. Based on the evaluation results, we propose an improved accelerator
architecture, which achieves speedups of 1.34x and 2.39x on ResNet-50 [2] and
MobileNet-V2 [3], respectively, compared with the existing Eyeriss [1] architecture.
Finally, an accelerator design with 2016 processing elements is studied to demonstrate in detail
the effective architecture exploration and specification enabled by our approach. Given
initial constraints on the number of processing elements and the size of the memories, the
design architecture can be optimized iteratively, achieving 1,849.89 MACOPS (multiply-and-accumulate
operations) per cycle on ResNet-50. The resulting high utilization (i.e., 91.8%)
justifies the effectiveness of the proposed exploration flow.
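
The reported utilization can be checked from the two figures above. The sketch below assumes that each processing element performs at most one multiply-and-accumulate operation per cycle, which the abstract does not state explicitly; the function name `pe_utilization` is hypothetical.

```python
def pe_utilization(macs_per_cycle, num_pes, macs_per_pe_per_cycle=1):
    """Fraction of peak MAC throughput achieved.

    Assumption (not stated in the source): each PE can complete at most
    `macs_per_pe_per_cycle` MAC operations per cycle, defaulting to one.
    """
    peak = num_pes * macs_per_pe_per_cycle
    return macs_per_cycle / peak


# Figures from the abstract: 1,849.89 MAC operations per cycle on 2016 PEs.
util = pe_utilization(1849.89, 2016)
print(f"{util:.1%}")  # prints "91.8%"
```

Under that assumption, 1,849.89 / 2016 is about 0.9176, which matches the stated 91.8%.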
1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Thesis Organization
2 Previous Works
2.1 Eyeriss
2.2 MAESTRO
3 Dataflow
3.1 Parameter Explanation
3.2 Per PE Computation
3.3 PE Set Mapping
3.4 Data Delivering Reuse
3.5 GLB Usage and Processing Pass Scheduling
4 Proposed Design Exploration Methodology
4.1 Latency Estimation
4.1.1 Target Architecture
4.1.2 Required Information for Latency Estimation
4.1.3 Three-stage Estimation
4.1.4 Processing Pass Composition
4.2 Exploration Flow
5 Proposed DNN Accelerator Architecture
5.1 Edge Propagation
5.2 Exchangeable Input Network
5.3 Kernel Split
5.4 PE Grouping
6 Experiment Result
6.1 Proposed DNN Accelerator Architecture
6.1.1 Edge Propagation
6.1.2 Exchangeable Input Network and Kernel Split
6.1.3 PE Grouping
6.2 Case Study
7 Conclusion and Future Works
7.1 Conclusion
7.2 Future Works
7.2.1 MAC Bit Width Exploration
7.2.2 PE Group Size Exploration
7.2.3 Memory Architecture
7.2.4 Latency Analysis for Zero-Aware Convolutional Computation
7.2.5 Other Type of Interconnection
7.2.6 Flexible Design Exploration Methodology
