
Detailed Record

Author (Chinese): 林彥廷
Author (English): Lin, Yen-Ting
Title (Chinese): 應用於卷積神經網路模型的心脈陣列加速器設計與分析
Title (English): Design and Analysis of Systolic Array-Based Accelerators for Convolutional Neural Networks
Advisor (Chinese): 吳誠文
Advisor (English): Wu, Cheng-Wen
Committee Members (Chinese): 黃稚存、黃錫瑜、謝明得
Committee Members (English): Huang, Chih-Tsun; Huang, Shi-Yu; Shieh, Ming-Der
Degree: Master
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 106061574
Publication Year (R.O.C.): 109 (2020 CE)
Graduation Academic Year: 108
Language: Chinese
Number of Pages: 86
Keywords (Chinese): 心脈陣列、卷積神經網路、人工智慧、電阻式隨機存取記憶體、量化、計算機結構
Keywords (English): Systolic Array, Convolutional Neural Networks, Artificial Intelligence, Resistive Random-Access Memory, Quantization, Computer Architecture
Abstract (Chinese, translated): In recent years, owing to the rapid growth of artificial intelligence (AI) applications, the demand for low-power deep neural network (DNN) accelerators in end-point devices keeps increasing. These accelerators must operate efficiently while maintaining acceptable computational accuracy. We propose a systematic methodology for designing and analyzing systolic arrays. Based on a previously proposed mapping algorithm for matrix-multiplication systolic arrays, this thesis extends the method to the design of systolic-array architectures for 2D convolution. We take several systolic-array architectures with different characteristics as examples, covering different degrees of computational parallelism and different layer operation types. We implement a DNN accelerator simulator that models these systolic arrays together with their peripheral circuits, and use it to evaluate their throughput, latency, hardware utilization, and DNN accuracy. For a given DNN model, the hardware architecture used by each layer in the simulator can be adjusted independently according to that layer's requirements, i.e., different layer configurations. Among layer configurations with different data-processing parallelism, increasing the parallelism speeds up the DNN model by 3.5x and improves hardware utilization by 3.0x, at the cost of at most about 37% additional area. These systolic arrays are designed around a weight-stationary dataflow, so their secondary processing-element (PE) arrays can be implemented with emerging non-volatile memories (NVM) such as resistive random-access memory (RRAM). Comparing different implementations of the secondary PE arrays, the RRAM-based implementation is two to three orders of magnitude smaller in area than the digital PE-array implementation. However, because the input and weight data of each operation must be applied to the RRAM cells over multiple steps, the computation latency increases by up to 64x. When different quantization bit widths for weights and activations are considered, the register files, the computational blocks, and the RRAM arrays can all be reduced considerably, with an area difference of at least 47%.
Abstract (English): In recent years, along with the growth of artificial intelligence (AI) applications, low-power deep neural network (DNN) accelerators are increasingly demanded by commercial end-point devices. These accelerators must achieve high efficiency while maintaining acceptable accuracy. In this thesis, we propose a systematic methodology to design and analyze systolic arrays. Based on a previously proposed mapping algorithm for matrix multiplication, we develop systolic arrays for 2D convolution. We take several feasible systolic-array architectures as examples; they have different hardware characteristics arising from different projection parallelisms and operation types. We propose a DNN accelerator simulator that models these systolic arrays and their peripheral circuits to evaluate throughput, latency, utilization, and test accuracy. For a given DNN model, the systolic-array architecture used by each layer in the simulator can be adjusted to that layer's requirements, i.e., a per-layer configuration. A layer configuration with larger parallelism can achieve a 3.5x speedup and a 3.0x improvement in utilization with only up to 37% additional area overhead. These systolic arrays follow the weight-stationary dataflow, so their secondary PE arrays are suitable for implementation with resistive random-access memory (RRAM), which reduces the hardware cost by about two to three orders of magnitude compared with digital PE-array implementations. However, the computation latency may increase by up to 64x because of the operating-latency overhead of RRAM. Considering different quantization levels, the size of the register files, the complexity of the computational block, and the size of the RRAM array can all be reduced to varying degrees; for many cases the area difference is at least 47%.
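To make the weight-stationary mapping described in the abstracts concrete, the following is a minimal functional sketch in Python (not the thesis's simulator): a 2D convolution is unrolled into a matrix multiplication, the filter weights stay resident (one filter per PE column), and unrolled activation rows stream through one per step. All function names and shapes are illustrative assumptions.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unroll (C, H, W) input patches so 2D convolution becomes a matrix multiplication."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((out_h * out_w, c * kh * kw), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[i * out_w + j] = patch.reshape(-1)
    return cols, out_h, out_w

def weight_stationary_conv(x, weights, stride=1):
    """Functional model of a weight-stationary pass: weights stay resident,
    unrolled activation rows stream through one per time step."""
    m, c, kh, kw = weights.shape
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    w_mat = weights.reshape(m, -1)            # stationary operands, one filter per PE column
    out = np.empty((m, out_h * out_w), dtype=np.float32)
    for t, row in enumerate(cols):            # activations stream in over time steps
        out[:, t] = w_mat @ row               # each PE column accumulates its filter's MACs
    return out.reshape(m, out_h, out_w)

# Example: a 3x8x8 input and sixteen 3x3x3 filters give a (16, 6, 6) output feature map.
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
y = weight_stationary_conv(x, w)
```

This models only the dataflow; a cycle-accurate model would also account for the skewed scheduling and delay registers of a real systolic array.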
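The quantization and RRAM figures in the abstracts can likewise be illustrated with a small, hedged sketch: uniform symmetric quantization of a tensor to a chosen bit width, plus a back-of-the-envelope count of primitive crossbar operations per MAC when activations are applied bit-serially and each weight is split across multi-level cells. The 64x case below is one plausible reading of the "up to 64x" latency figure (8-bit activations and weights on 1-bit cells), not a detail taken from the thesis.

```python
import numpy as np

def quantize_uniform(x, n_bits):
    """Uniform symmetric quantization of a float tensor to signed n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int64), scale

def crossbar_ops_per_mac(act_bits, weight_bits, cell_bits):
    """Primitive crossbar operations per MAC: bit-serial activation cycles times
    the number of cell slices needed to hold one weight."""
    weight_slices = -(-weight_bits // cell_bits)   # ceiling division
    return act_bits * weight_slices

q_w, s_w = quantize_uniform(np.random.randn(16, 27).astype(np.float32), 4)
print(crossbar_ops_per_mac(8, 8, 1))   # 64 -> matches the up-to-64x latency overhead
print(crossbar_ops_per_mac(8, 8, 2))   # 32 with 4-level (2-bit) cells
```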
Table of Contents:
Abstract ----- i
List of Figures ----- iv
List of Tables ----- vii
Chapter 1 Introduction ----- 1
Sec. 1.1 Motivation ----- 1
Sec. 1.2 Systolic Array ----- 3
Sec. 1.3 Processing in Memory Using Emerging Non-Volatile Memory ----- 7
Sec. 1.4 Quantization ----- 9
Sec. 1.5 Proposed Approach ----- 11
Sec. 1.6 Thesis Organization ----- 12
Chapter 2 Fundamentals of Systolic-Array Design for Matrix Multiplication ----- 14
Sec. 2.1 Mapping Algorithm ----- 14
Sec. 2.2 Complexity Analysis of Systolic Array ----- 17
Sec. 2.3 Dataflow and Scheduling of Systolic Array ----- 18
Sec. 2.4 Systolic Array Examples ----- 19
Chapter 3 Designs for 2D-Convolutions in DNN ----- 24
Sec. 3.1 Mapping Algorithm for Simple 2D-Convolutions ----- 24
Sec. 3.2 Mapping Algorithm for 2D-Convolutions with 3D I/O ----- 29
Sec. 3.3 (3s, 3s) Projection ----- 30
Sec. 3.4 Parallelism 1: (2s, 2s, 2t) Projection ----- 32
Sec. 3.5 Parallelism m: (2s, 2+s, 2-t) Projection ----- 36
Sec. 3.6 Parallelism m2: (2+s, 2+s, 2-t) Projection ----- 38
Sec. 3.7 Specific Architecture: Different CNN Layer Types ----- 41
Sec. 3.7.1 Stride 2 ----- 41
Sec. 3.7.2 Depth-Wise Separable Operations ----- 43
Sec. 3.7.3 FC Layers ----- 44
Sec. 3.8 Summary ----- 45
Chapter 4 Simulation and Evaluation ----- 47
Sec. 4.1 DNN Accelerator Simulator ----- 47
Sec. 4.1.1 General Convolution Operations ----- 48
Sec. 4.1.2 Skip Paths ----- 49
Sec. 4.2 Different Implementations of Synapse Block ----- 49
Sec. 4.2.1 Digital PE Array ----- 50
Sec. 4.2.2 Analog RRAM Pseudo-Crossbar (2n-level cell) ----- 50
Sec. 4.2.3 Digital RRAM Pseudo-Crossbar (2-level cell) ----- 51
Sec. 4.3 Quantization Flow ----- 52
Sec. 4.4 Area Evaluations ----- 56
Sec. 4.4.1 Area of PE Arrays ----- 56
Sec. 4.4.2 Area of Delay Units and Output Data Registers ----- 58
Sec. 4.4.3 Area of The Other Blocks: Act, Avg, SkMul ----- 60
Sec. 4.4.4 Layer Configurations for DNN Hardware ----- 63
Sec. 4.5 DNN Model Implementations ----- 64
Sec. 4.6 Experimental Results ----- 67
Sec. 4.6.1 Utilization and Latency Analysis ----- 67
Sec. 4.6.2 Area Breakdown of Components ----- 69
Sec. 4.6.3 Area with Different Quantization Levels ----- 72
Sec. 4.6.4 Trade-Off: Area vs. Throughput and Utilization ----- 75
Sec. 4.6.5 Summary ----- 78
Chapter 5 Conclusion and Future Work ----- 80
Sec. 5.1 Conclusion ----- 80
Sec. 5.2 Future Work ----- 80
References ----- 82