Author: Ho, Pin-Hui
Thesis title: Design of an Inference Accelerator for Compressed Convolution Neural Networks
Advisor: Huang, Chih-Tsun
Oral defense committee: Liou, Jing-Jia; Shieh, Ming-Der
Keywords: convolution neural network; accelerator; hardware design; CSR compression
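In state-of-the-art neural networks, high-dimensional convolution often accounts for most of the computation time. Because high-dimensional convolution consumes so much computation time and energy, various algorithms for pruning neural network models, along with accelerators for convolutional neural networks, have emerged. Pruning convolution models and applying the rectified linear unit (ReLU) to the neurons produce a large number of zeros in the data. Many convolution operations therefore involve multiplication by zero, so most of the computation is ineffectual. This high data sparsity leaves room for data compression.
This thesis proposes a neural network accelerator that can perform sparse matrix multiplication, adapted from the state-of-the-art Eyeriss accelerator (Eyeriss, an energy-efficient reconfigurable CNN accelerator). We discuss our accelerator under limited hardware resources. Feeding the accelerator a model that has been pruned and compressed in advance, together with neuron data dynamically compressed into the CSR (compressed sparse row) format, considerably reduces off-chip data transfer and memory reads, and skips unnecessary computation.
With AlexNet, a well-known model, the neuron data has 35% sparsity and the model data has 89.5% sparsity. Our accelerator achieves a 1.12x speedup over Eyeriss and a 3.42x speedup over our baseline model.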
State-of-the-art convolutional neural networks (CNNs) are widely used in intelligent applications such as AI systems, natural language processing, and image recognition. The large and growing computation latency of high-dimensional convolution has become a critical issue. Most of the multiplications in the convolution layers are ineffectual because at least one of the two operands is zero. Pruning the redundant connections in CNN models and clamping features to zero with the rectified linear unit (ReLU) make the input data of the convolution layers sparser. This high data sparsity leaves considerable room for improvement through compressing the data and skipping the ineffectual multiplications.
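As a rough illustration of why sparsity makes multiplications ineffectual, the sketch below applies ReLU to a handful of features and counts the products in which both operands are nonzero. The data values and the helper names `relu` and `count_effectual` are hypothetical and are not taken from the thesis.

```python
def relu(values):
    """Clamp negative features to zero, as ReLU does."""
    return [max(0.0, v) for v in values]


def count_effectual(features, weights):
    """Count multiplications in which both operands are nonzero."""
    return sum(1 for f, w in zip(features, weights) if f != 0 and w != 0)


# Hypothetical toy data: raw features before ReLU and a pruned kernel,
# where zeros in `weights` stand for removed connections.
features = relu([0.5, -1.2, 0.0, 2.0, -0.3, 1.1])
weights = [0.0, 0.4, 0.0, -0.7, 0.9, 0.0]

total = len(features)                            # multiplications a dense engine performs
effectual = count_effectual(features, weights)   # multiplications that can change the sum
```

In this toy case only one of the six multiplications contributes to the output, which is the kind of waste that compression and zero skipping aim to remove.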
In this thesis, we propose a CNN accelerator design that accelerates the convolution layers by performing sparse matrix multiplications and reducing the amount of off-chip data transfer. The state-of-the-art Eyeriss architecture is an energy-efficient reconfigurable CNN accelerator with a specialized data flow, a multi-level memory hierarchy, and limited hardware resources. Improving on the Eyeriss architecture, our approach performs sparse matrix multiplications effectively. With pruned and compressed kernel data, and with dynamic encoding of the input features into the compressed sparse row (CSR) format, our accelerator significantly reduces off-chip data transfer, cuts memory accesses, and skips ineffectual multiplications.
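For readers unfamiliar with CSR, the following is a minimal software sketch of the textbook format and of a matrix-vector product that touches only nonzero entries. It is not the accelerator's hardware data flow; the function names `csr_encode` and `csr_matvec` and the sample matrix are assumptions made for illustration.

```python
def csr_encode(matrix):
    """Encode a dense 2-D list into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)      # keep only nonzero entries
                col_idx.append(col)   # remember the column of each kept entry
        row_ptr.append(len(values))   # row r spans values[row_ptr[r]:row_ptr[r + 1]]
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, vector):
    """Multiply a CSR matrix by a dense vector, skipping zero operands."""
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            x = vector[col_idx[k]]
            if x != 0:                # skip products whose vector operand is zero
                acc += values[k] * x
        out.append(acc)
    return out


# Hypothetical 3x4 sparse matrix and dense input vector.
dense = [[0, 3, 0, 0],
         [0, 0, 0, 0],
         [5, 0, 0, 2]]
values, col_idx, row_ptr = csr_encode(dense)
result = csr_matvec(values, col_idx, row_ptr, [1, 2, 0, 4])
```

Here the twelve dense entries shrink to three stored values plus index arrays, and the empty middle row costs no multiplications at all.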
One disadvantage of the compression scheme is that, after compression, the input workload becomes dynamically imbalanced. As a result, the technique lowers data reusability and increases off-chip data transfer. To analyze the data transfer further, we explore the relationship between the on-chip buffer size and the off-chip data transfer. Our design needs to re-access off-chip data when the on-chip buffer cannot store all the output features.
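The imbalance can be pictured with a small sketch. It assumes, purely for illustration, that each compressed row is handed to one processing element and that work is proportional to the number of nonzeros in the row; the thesis's actual dispatch policy is not described in this abstract, and the names below are hypothetical.

```python
def nonzeros_per_row(matrix):
    """Work per row after compression: the number of nonzero entries."""
    return [sum(1 for v in row if v != 0) for row in matrix]


def imbalance_ratio(work):
    """Busiest unit's work divided by the average (1.0 means perfectly balanced)."""
    mean = sum(work) / len(work)
    return max(work) / mean if mean else 1.0


# Hypothetical input: four rows of equal dense length but unequal sparsity.
rows = [[1, 0, 0, 0, 0, 0, 0, 0],
        [1, 2, 3, 4, 5, 6, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0, 1, 0, 1]]
work = nonzeros_per_row(rows)
ratio = imbalance_ratio(work)
```

In dense form every row costs the same eight multiplications. After compression the per-row work varies with the data, so under this assumed mapping the busiest unit does twice the average work while the others wait.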
Nevertheless, by significantly reducing the amount of computation and memory accesses, our accelerator still achieves better performance. With AlexNet, a popular CNN model, as the benchmark, at 35% input data sparsity and 89.5% kernel data sparsity, our accelerator achieves a 1.12X speedup over Eyeriss and a 3.42X speedup over our baseline architecture.
1 Introduction
2 Background and Motivation
2.1 CNN Basic
2.2 Previous Work
3 Sparse Matrix Compression and Convolution
3.1 Sparse Matrix Representation Introduction
3.2 Compression Rate Measurement
3.3 Sparse Matrix Convolution
4 Architecture
4.1 Design Overview and Architecture
4.2 PE of CSR Compressed Convolution
4.3 Entry Access Unit
4.4 Workload Distribution and Dispatcher
4.4.1 Workload Distribution
4.4.2 Dispatcher
4.5 Data Reuse and Memory Access Times
5 Experiment Results and Analysis
5.1 Experiment Setup
5.2 Experimental Result
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
