A Computing-In-Memory Based Input Bandwidth Scaling Architecture with Bitwise-Sparsity Detection and Memory Access Optimization

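Author (Chinese):	陳禹樵
Author (English):	CHEN, YU-CHIAO
Thesis title (Chinese):	適用於記憶體內運算之具有按位稀疏偵測和優化記憶體存取的輸入帶寬擴展架構
Thesis title (English):	A Computing-In-Memory Based Input Bandwidth Scaling Architecture with Bitwise-Sparsity Detection and Memory Access Optimization
Advisors (Chinese):	鄭桂忠、張孟凡
Advisors (English):	Tang, Kea-Tiong; Chang, Meng-Fan
Oral defense committee (Chinese):	黃朝宗、盧峙丞
Oral defense committee (English):	Huang, Chao-Tsung; Lu, Chih-Cheng
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Department of Electrical Engineering
Student ID:	109061627
Publication year (ROC calendar):	112 (2023)
Graduation academic year:	111
Language:	Chinese
Number of pages:	68
Keywords (Chinese):	非揮發記憶體內運算、按位稀疏化、加速器、資料複用
Keywords (English):	nonvolatile-computing-in-memory, bitwise-sparsity, accelerator, data-reuse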

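Thanks to the rise and development of deep neural networks in recent years, artificial intelligence has gradually appeared in edge-device applications in everyday life. However, because the large volume of weights and input feature map data in deep neural networks moves frequently between the compute cores and memory, the conventional von Neumann architecture is limited in energy efficiency and computing speed, which places a heavy burden on the energy consumption and speed of battery-powered, low-power AI edge devices. Computing-In-Memory (CIM) can largely overcome this bottleneck by performing parallel dot-product operations inside the memory cell array.
CIM based on analog current accumulation offers high energy efficiency and array density, but current deviation and the area constraints of the analog-to-digital converter limit the accumulated value that can be read on each bit line, which greatly reduces input bandwidth and computing efficiency. Nevertheless, if the high input sparsity of recent neural networks can be exploited to skip unnecessary zero-valued inputs, or even to skip zeros bit by bit, and if the data movement of the input feature map (IFMap) can be further reduced under the CIM architecture, then computing performance can be improved while system power consumption is reduced.
This work adopts a dataflow tailored to the CIM macro design and designs a hardware architecture that increases system performance by omitting computations involving zero values without affecting the accuracy of the CIM output. It also proposes a serial-in-parallel-out data reuse architecture that reduces the number of IFMap memory accesses while increasing the input bandwidth the system provides to the CIM, and applies the proposed architecture to an integrated chip based on RRAM CIM.
Compared with a design without sparsity awareness, the proposed architecture with the sparsity-aware mechanism improves computing speed by up to 9×, and the data reuse architecture designed in this work reduces the number of memory accesses by up to 85% while reducing the number of reuse buffers by 94%.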
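
As a rough illustration of why input data reuse cuts memory traffic, the following Python sketch counts IFMap reads for a single-channel sliding-window convolution, with and without an idealized reuse buffer that lets every pixel be fetched only once. The function name `ifmap_reads`, the 8×8 feature map, and the 3×3 kernel are assumptions chosen for illustration. This is not the thesis's grouped channel-first serial-in-parallel-out design, and the resulting figure is not one of its measured results.

```python
def ifmap_reads(height, width, kernel=3, stride=1, reuse=True):
    """Count IFMap pixel reads for one channel of a sliding-window convolution.

    reuse=False: every output window re-reads all kernel*kernel pixels.
    reuse=True:  idealized buffer, each covered pixel is read exactly once.
    """
    if stride > kernel:
        raise ValueError("sketch assumes overlapping or touching windows")
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    if reuse:
        # Rows and columns actually touched by at least one window.
        covered_h = (out_h - 1) * stride + kernel
        covered_w = (out_w - 1) * stride + kernel
        return covered_h * covered_w
    return out_h * out_w * kernel * kernel


naive = ifmap_reads(8, 8, reuse=False)   # 36 windows * 9 pixels = 324
buffered = ifmap_reads(8, 8, reuse=True)  # 8 * 8 = 64
print(f"reads without reuse: {naive}, with reuse: {buffered}")
print(f"reduction: {1 - buffered / naive:.1%}")  # 80.2%
```

The saving grows with the overlap between neighboring windows, which is why the buffering strategy matters most for small strides.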

Thanks to the recent rise and development of deep neural networks, artificial intelligence has gradually appeared in edge devices. However, conventional von Neumann architectures are limited in energy efficiency and performance by the frequent movement of neural network weights and input feature maps (IFMaps) between processing cores and memory, which places a heavy burden on battery-driven AI edge devices. Computing-In-Memory (CIM) can largely overcome this bottleneck by performing parallelized dot-product operations in the memory cell array.
Although CIM achieves high energy efficiency and array density through analog current accumulation, current deviation and the area limitation of the analog-to-digital converter limit the accumulated current value that CIM can read out on each bit line, significantly reducing the input bandwidth and operational efficiency. Even so, if we exploit the high input sparsity of neural networks to skip unnecessary zero-valued inputs, or even skip zeros bitwise, and further reduce the data movement of the input feature map under the CIM architecture, the performance of AI edge devices can be improved while system power consumption is reduced. The proposed architecture increases system performance by up to 32× by omitting zero-value calculations without affecting the accuracy of the CIM output. Furthermore, we propose a serial-in-parallel-out data reuse architecture that reduces the number of IFMap memory accesses by up to 85% while decreasing the number of local buffers by 94% and simultaneously increasing the input bandwidth to the CIM by up to 9×. We apply the proposed architecture design and tape out a chip based on RRAM CIM.
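
The bitwise zero skipping mentioned in the abstract can be sketched in Python as follows. The sketch assumes unsigned activations fed to the array one bit-plane at a time, and it skips any bit-plane in which every input bit is zero, since such a plane contributes nothing to the dot product. The function `bit_serial_dot`, the 8-bit width, and the sample values are hypothetical. The thesis's actual arbiter, emitter, and eliminator logic is not reproduced here.

```python
def bit_serial_dot(inputs, weights, bits=8, skip_zero_planes=True):
    """Bit-serial dot product; returns (result, bit-plane cycles issued).

    inputs:  unsigned integers that fit in `bits` bits (assumed)
    weights: signed integers
    """
    acc = 0
    cycles = 0
    for b in range(bits):
        # Bit b of every input forms one bit-plane.
        plane = [(x >> b) & 1 for x in inputs]
        if skip_zero_planes and not any(plane):
            continue  # all-zero plane: no contribution, no cycle spent
        partial = sum(p * w for p, w in zip(plane, weights))
        acc += partial * (1 << b)  # weight the partial sum by 2**b
        cycles += 1
    return acc, cycles


inputs = [0, 3, 0, 1, 2, 0, 0, 1]       # sparse, small-magnitude activations
weights = [2, -1, 4, 3, -2, 1, 5, -3]

dense = bit_serial_dot(inputs, weights, skip_zero_planes=False)
sparse = bit_serial_dot(inputs, weights, skip_zero_planes=True)
print(dense)   # (-7, 8)
print(sparse)  # (-7, 2)
```

Both calls return the same dot product, so skipping changes only the cycle count. In this toy case only the two lowest bit-planes contain a one, so six of the eight cycles are saved.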

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Chapter Overview
Chapter 2 Literature Review
2.1 Deep Learning Hardware Accelerators
2.1.1 Data Movement Energy
2.1.2 Data Reusability
2.2 Computing-In-Memory
2.2.1 Nonvolatile Computing-In-Memory
2.2.2 Computing-In-Memory Input Methods
2.3 Sparsity-Aware Design
2.4 Research Motivation
Chapter 3 Bitwise-Sparsity Detection Architecture Based on Computing-In-Memory
3.1 Computing-In-Memory Behavioral Model and Convolution Design
3.2 Self-aware Dynamic Bitwise Sparsity Detection Architecture
3.2.1 Dynamic Bitwise Activation Sparsity Speed-Up Method
3.2.2 Bitwise Sparsity Arbiter
3.2.3 Bitwise Sparsity Emitter
3.2.4 Self-adaptive Eliminator
3.3 Bit-Shift Weight-Aligning Router
Chapter 4 Grouped Channel-First Serial-in-Parallel-Out Data Reuse Architecture
4.1 Grouped Channel-First Serial-in-Parallel-Out Stride Buffer
4.2 Data Reuse Dataflows for Different Computing Modes
4.2.1 3D Convolution
4.2.2 DW Convolution
4.2.3 PW Convolution & FC
Chapter 5 Experimental Results
5.1 Environment Setup
5.2 Chip Specifications
5.3 Effectiveness of the Sparsity Detection Architecture Across Neural Network Applications
5.4 Memory Access Count and Power Analysis
5.5 Overall Area and Power Analysis and Comparison
5.6 Comparison with State-of-the-Art CIM-Based Accelerators
Chapter 6 Conclusion and Future Work
References
