Thesis Title (English): A Computing-In-Memory Based Input Bandwidth Scaling Architecture with Bitwise-Sparsity Detection and Memory Access Optimization
Advisors (English): Tang, Kea-Tiong
Chang, Meng-Fan
Oral Defense Committee (English): Huang, Chao-Tsung
Lu, Chih-Cheng
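Thanks to the rise and development of deep neural networks in recent years, artificial intelligence has gradually appeared in edge-device applications in everyday life. However, because the large volumes of weights and input feature map data in deep neural networks move frequently between the compute cores and memory, the conventional von Neumann computing architecture is limited in energy efficiency and computing speed, which places a heavy burden on the energy consumption and speed of battery-powered, low-power AI edge devices. Computing-In-Memory (CIM) can largely overcome this bottleneck by performing parallelized dot-product operations within the memory cell array.
This architecture adopts a dataflow dedicated to and optimized for the CIM macro design, and provides a hardware architecture that increases system performance by skipping computations that contain zero values without affecting the accuracy of the CIM output. It also proposes a serial-in-parallel-out data reuse architecture that reduces the number of IFMap memory accesses while increasing the input bandwidth the system supplies to the CIM, and applies the proposed architecture to an integrated chip based on RRAM CIM.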
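The serial-in-parallel-out data reuse idea can be illustrated with a minimal sketch. It assumes a one-dimensional sliding window with stride 1 over a single row of input values; the function names `windows_naive` and `windows_sipo` are hypothetical, and this toy model is not the grouped channel-first stride buffer described in the thesis. It only shows why shifting values in serially and tapping them in parallel lowers the number of memory reads.

```python
from collections import deque


def windows_naive(row, k):
    """Re-read every window from memory; returns (windows, memory reads)."""
    wins, reads = [], 0
    for start in range(len(row) - k + 1):
        wins.append(list(row[start:start + k]))
        reads += k  # each window fetches all k values again
    return wins, reads


def windows_sipo(row, k):
    """Shift values in serially and expose k taps in parallel (toy model)."""
    buf = deque(maxlen=k)  # the oldest value drops out automatically
    wins, reads = [], 0
    for value in row:
        buf.append(value)  # one memory read per input value
        reads += 1
        if len(buf) == k:
            wins.append(list(buf))  # parallel read-out of the k taps
    return wins, reads
```

For a row of 10 values and a window of 3, both functions return the same 8 windows, but the naive version performs 24 reads while the buffered version performs 10. These counts belong to this toy model only and are unrelated to the figures reported in the thesis.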
Artificial intelligence has gradually appeared in edge devices thanks to the recent rise and development of deep neural networks. However, conventional von Neumann architectures are limited in energy efficiency and performance because neural network weights and input feature maps (IFMaps) move frequently between processing cores and memory, which places a heavy burden on battery-powered AI edge devices. Computing-In-Memory (CIM) can largely overcome this bottleneck by performing parallelized dot-product operations in the memory cell array.
Although CIM achieves high energy efficiency and array density through analog current accumulation, current deviation and the area constraints of the analog-to-digital converter limit the accumulated current that CIM can read out on each bit line, which significantly reduces input bandwidth and operational efficiency. Even so, exploiting the high input sparsity of neural networks to skip unnecessary zero-valued inputs, or even individual zero bits, and further reducing the data movement of the input feature map under the CIM architecture can improve the performance of AI edge devices while reducing system power consumption. The proposed architecture increases system performance by up to 32× by omitting zero-value calculations without affecting the accuracy of the CIM output. Furthermore, we propose a serial-in-parallel-out data reuse architecture that reduces the number of IFMap memory accesses by up to 85% and the number of local buffers by 94% while increasing the input bandwidth to the CIM by up to 9×. We apply the proposed architecture design and tape out a chip based on RRAM CIM.
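The effect of bitwise zero skipping can be sketched with a small behavioral model. The assumptions here are not taken from the thesis: activations are unsigned integers below 2^`n_bits`, inputs are fed one bit plane at a time, and at most `max_active` rows may be activated per cycle (a stand-in for the read-out limit on each bit line). The names `bit_serial_dot` and `dense_cycles` are hypothetical.

```python
import math


def bit_serial_dot(acts, weights, n_bits=8, max_active=4):
    """Return (dot product, cycles) when only nonzero activation bits are issued.

    Toy model: acts are unsigned integers below 2**n_bits, and each cycle
    accumulates at most max_active rows on one bit line.
    """
    assert len(acts) == len(weights)
    total, cycles = 0, 0
    for b in range(n_bits):
        # Rows whose activation has a 1 at bit position b.
        active = [i for i, a in enumerate(acts) if (a >> b) & 1]
        # Issue only those rows, packed into groups of at most max_active.
        for start in range(0, len(active), max_active):
            group = active[start:start + max_active]
            partial = sum(weights[i] for i in group)  # one accumulation cycle
            total += partial << b  # shift-and-add by the bit weight
            cycles += 1
    return total, cycles


def dense_cycles(n_rows, n_bits=8, max_active=4):
    """Cycles needed when every bit of every row is issued, zeros included."""
    return n_bits * math.ceil(n_rows / max_active)
```

With activations `[0, 3, 0, 0, 16, 0, 1, 0]` and weights `[2, -1, 4, 5, 3, -2, 7, 1]`, the model returns the exact dot product 52 in 3 cycles, against 16 cycles for the dense schedule. The ratio depends entirely on the input data and on the assumed row limit, so it should not be read as a reproduction of the speed-up reported above.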
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Chapter Overview
Chapter 2 Literature Review
2.1 Deep Learning Hardware Accelerators
2.1.1 Data Movement Energy Consumption
2.1.2 Data Reusability
2.2 Computing-In-Memory
2.2.1 Nonvolatile Computing-In-Memory
2.2.2 Computing-In-Memory Input Methods
2.3 Sparsity-Aware Design
2.4 Research Motivation
Chapter 3 Computing-In-Memory Based Bitwise Sparsity Detection Architecture
3.1 Computing-In-Memory Behavioral Model and Convolution Operation Design
3.2 Self-aware Dynamic Bitwise Sparsity Detection Architecture
3.2.1 Dynamic Bitwise Activation Sparsity Speed-Up Method
3.2.2 Bitwise Sparsity Arbiter
3.2.3 Bitwise Sparsity Emitter
3.2.4 Self-adaptive Eliminator
3.3 Bit-Shift Weight-Aligning Router
Chapter 4 Grouped Channel-First Serial-in-Parallel-Out Data Reuse Architecture
4.1 Grouped Channel-First Serial-in-Parallel-Out Stride Buffer
4.2 Data Reuse Dataflows for Different Operation Modes
4.2.1 3D Convolution
4.2.2 DW Convolution
4.2.3 PW Convolution & FC
Chapter 5 Experimental Results
5.1 Environment Setup
5.2 Chip Specifications
5.3 Effectiveness of the Sparsity Detection Architecture Across Different Neural Network Applications
5.4 Memory Access Count and Power Consumption Analysis
5.5 Overall Area and Power Consumption Analysis and Comparison
5.6 Comparison with State-of-the-Art Computing-In-Memory Based Accelerators
Chapter 6 Conclusion and Future Work
