Detailed Record

Author (Chinese): 何品蕙
Author (English): Ho, Pin-Hui
Thesis Title (Chinese): 基於壓縮式卷積神經網路之推論加速器設計
Thesis Title (English): Design of an Inference Accelerator for Compressed Convolution Neural Networks
Advisor (Chinese): 黃稚存
Advisor (English): Huang, Chih-Tsun
Committee Members (Chinese): 劉靖家、謝明得
Committee Members (English): Liou, Jing-Jia; Shieh, Ming-Der
Degree: Master
University: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系所)
Student ID: 103062565
Publication Year (ROC calendar): 107 (2018)
Graduation Academic Year: 106
Language: English
Number of Pages: 42
Keywords (Chinese): 卷積神經網路; 加速器; 硬體設計; 壓縮
Keywords (English): convolution neural network; accelerator; hardware design; CSR compression
Abstract (Chinese, translated to English):
In state-of-the-art neural networks, high-dimensional convolution often accounts for most of the computation time. Because these convolutions consume a large amount of computation time and energy, various algorithms for pruning neural network models, as well as accelerators for convolutional neural networks, have emerged. Pruning the convolution model and passing the neurons through the rectified linear unit (ReLU) leave the data with a large number of zeros. Many convolution operations therefore involve multiplications by zero, making most of the computation ineffectual. This high data sparsity gives us room for data compression.
This thesis proposes a neural network accelerator that performs sparse matrix multiplication, adapted from the state-of-the-art Eyeriss accelerator (an energy-efficient, reconfigurable CNN accelerator); our accelerator is discussed under limited hardware resources. Feeding the accelerator with a model pruned and compressed offline, together with neurons dynamically compressed into the CSR (compressed sparse row) format, reduces a considerable amount of off-chip data transfer and memory accesses, and skips unnecessary computation.
On AlexNet, a well-known model whose neurons exhibit 35% data sparsity and whose model exhibits 89.5% data sparsity, our accelerator achieves a 1.12X speedup over Eyeriss and a 3.42X speedup over our baseline architecture.
Abstract (English):
State-of-the-art convolutional neural networks (CNNs) are widely used in intelligent applications such as AI systems, natural language processing, and image recognition. The huge and growing computation latency of high-dimensional convolution has become a critical issue. Most of the multiplications in the convolution layers are ineffectual, as one or both of their input operands are zero. By pruning the redundant connections in the CNN models and clamping the features to zero with the rectified linear unit (ReLU), the input data of the convolution layers achieve a higher sparsity. This high data sparsity leaves considerable room for improvement by compressing the data and skipping the ineffectual multiplications.
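The sparsity mechanism above is easy to reproduce. Below is a minimal Python sketch (our own toy illustration, not code from the thesis): for zero-mean random pre-activations, ReLU zeroes about half the values; the 35% AlexNet figure quoted in the results comes from real feature maps, not from this toy distribution.

```python
import numpy as np

# Toy illustration: ReLU clamps every negative pre-activation to zero,
# which is the source of the feature sparsity discussed above.
rng = np.random.default_rng(0)
pre_activation = rng.standard_normal(10_000)  # zero-mean toy pre-activations
features = np.maximum(pre_activation, 0.0)    # rectified linear unit (ReLU)

sparsity = float(np.mean(features == 0.0))
print(f"feature sparsity after ReLU: {sparsity:.0%}")  # roughly 50% here
```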
In this thesis, we propose a CNN accelerator design that accelerates the convolution layers by performing sparse matrix multiplications and reducing the amount of off-chip data transfer. The state-of-the-art Eyeriss architecture is an energy-efficient, reconfigurable CNN accelerator with a specialized data flow, a multi-level memory hierarchy, and limited hardware resources. Improving on the Eyeriss architecture, our approach performs sparse matrix multiplications effectively. With pruned and compressed kernel data, and with the input features dynamically encoded into the compressed sparse row (CSR) format, our accelerator reduces a significant amount of off-chip data transfer, minimizes memory accesses, and skips the ineffectual multiplications.
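For readers unfamiliar with CSR, the sketch below shows the encoding on a tiny feature map. It is illustrative only: the thesis performs this encoding dynamically in hardware, and the function and variable names here are ours, not the thesis's.

```python
def csr_encode(matrix):
    """Encode a dense 2-D list as CSR: (values, column indices, row pointers).
    Only the non-zero entries are stored, so highly sparse feature maps
    shrink substantially."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:                # store only the non-zero entries
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # cumulative non-zero count per row
    return values, col_idx, row_ptr

# A small ReLU-sparsified feature map:
feature = [[0, 3, 0, 0],
           [5, 0, 0, 2],
           [0, 0, 0, 0]]
print(csr_encode(feature))
# -> ([3, 5, 2], [1, 0, 3], [0, 1, 3, 3])
```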
One disadvantage of the compression scheme is that the input workload becomes dynamically imbalanced after compression; the technique therefore lowers data reusability and increases off-chip data transfer. To analyze the data transfer further, we explore the relationship between the on-chip buffer size and the off-chip data transfer: whenever the on-chip buffer cannot hold all the output features, our design must re-access the off-chip data.
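To make the buffer/transfer trade-off concrete, the following is a deliberately simple first-order model. The assumption of one extra off-chip round trip per overflowing output feature is ours, for illustration only; it is not the analytical model developed in the thesis.

```python
def extra_offchip_bytes(num_output_features, buffer_words, word_bytes=2):
    """First-order sketch: every output feature that does not fit in the
    on-chip buffer is assumed to cost one extra off-chip write plus one
    extra read (spill the partial sum, fetch it back to keep accumulating)."""
    overflow = max(0, num_output_features - buffer_words)
    return 2 * overflow * word_bytes

# Hypothetical example: 64K output features, buffer holds 48K 16-bit words.
print(extra_offchip_bytes(64 * 1024, 48 * 1024))  # -> 65536 extra bytes
```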
As a result, by eliminating a significant amount of computation and memory accesses, our accelerator still achieves better performance. With AlexNet, a popular CNN model with 35% input data sparsity and 89.5% kernel data sparsity, as the benchmark, our accelerator achieves a 1.12X speedup compared with Eyeriss and a 3.42X speedup compared with our baseline architecture.
Table of Contents:
1 Introduction 1
2 Background and Motivation 3
2.1 CNN Basics 3
2.2 Previous Work 5
3 Sparse Matrix Compression and Convolution 7
3.1 Sparse Matrix Representation Introduction 7
3.2 Compression Rate Measurement 9
3.3 Sparse Matrix Convolution 10
4 Architecture 13
4.1 Design Overview and Architecture 13
4.2 PE of CSR Compressed Convolution 14
4.3 Entry Access Unit 18
4.4 Workload Distribution and Dispatcher 21
4.4.1 Workload Distribution 21
4.4.2 Dispatcher 22
4.5 Data Reuse and Memory Access Times 25
5 Experiment Results and Analysis 29
5.1 Experiment Setup 29
5.2 Experimental Results 29
6 Conclusion and Future Work 38
6.1 Conclusion 38
6.2 Future Work 39
References:
[1] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," NIPS, vol. 1, pp. 1097-1105, 2012.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, 2015.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[5] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, 2014.
[6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," CoRR, vol. abs/1410.0759, 2014.
[7] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, "A Massively Parallel Coprocessor for Convolutional Neural Networks," IEEE ASAP, pp. 53-60, 2009.
[8] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, "Towards an embedded biologically-inspired machine vision processor," FPT, 2010.
[9] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, "A Dynamically Configurable Coprocessor for Convolutional Neural Networks," International Symposium on Computer Architecture (ISCA), vol. 38, no. 3, pp. 247-257, 2010.
[10] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric accelerator design for Convolutional Neural Networks," IEEE ICCD, 2013.
[11] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," IEEE CVPRW, 2014.
[12] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision," CoRR, vol. abs/1502.02551, 2015.
[13] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," FPGA, pp. 161-170, 2015.
[14] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning," ASPLOS, pp. 269-284, 2014.
[15] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor," International Symposium on Computer Architecture (ISCA), 2015.
[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," MICRO, 2014.
[17] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, "A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications," IEEE International Solid-State Circuits Conference (ISSCC), 2015.
[18] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, "Origami: A Convolutional Network Accelerator," IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[19] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE International Solid-State Circuits Conference (ISSCC), pp. 262-263, 2016.
[20] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," CoRR, vol. abs/1510.00149, 2015.
[21] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning Both Weights and Connections for Efficient Neural Networks," Proceedings of the International Conference on Neural Information Processing Systems (NIPS), pp. 1135-1143, 2015.
[22] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, "Faster CNNs with direct sparse convolutions and guided pruning," International Conference on Learning Representations (ICLR), 2017.
[23] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," International Symposium on Computer Architecture (ISCA), 2016.
[24] R. Dorrance, F. Ren, and D. Markovic, "A Scalable Sparse Matrix-Vector Multiplication Kernel for Energy-Efficient Sparse-BLAS on FPGAs," FPGA, pp. 161-170, 2014.
[25] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing," Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), 2016.
[26] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "ENVISION: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable CNN Processor in 28nm FDSOI," IEEE International Solid-State Circuits Conference (ISSCC), pp. 246-247, 2017.