Detailed Record

Author (Chinese): 謝嘉祐
Author (English): Hsieh, Chia-Yu
Title (Chinese): 支援深度可分離卷積與稀疏化感知機制之可重構記憶體內運算核心深度學習硬體加速器
Title (English): Reconfigurable Computing-In-Memory-Core Deep Learning Accelerator With Depthwise Separable Convolution and Sparsity Aware Mechanism
Advisor (Chinese): 鄭桂忠
Advisor (English): Tang, Kea-Tiong
Committee Members (Chinese): 黃朝宗, 呂仁碩, 盧峙丞
Committee Members (English): Huang, Chao-Tsung; Liu, Ren-Shuo; Lu, Chih-Cheng
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061615
Year of Publication (ROC): 111 (2022)
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 72
Keywords (Chinese): 記憶體內運算, 深度學習加速器, 可重構, 稀疏化
Keywords (English): in-memory computing, deep learning accelerator, reconfigurable, sparsity
Computing-in-memory (CIM) performs computation inside the memory itself, which greatly reduces data movement and thus avoids the von Neumann bottleneck. Owing to its low-power characteristics, CIM holds great promise for mobile edge devices that demand high energy efficiency. Practical SRAM-based CIM, however, still faces hardware limitations. First, the storage capacity of a CIM macro is limited, so weights must be updated during inference; constrained by write bandwidth, weight updates can usually activate only one word line at a time, causing efficiency loss during computation. Second, the number of word lines and bit lines that can be turned on during computation is limited, and they cannot be selected arbitrarily, so the depthwise separable convolutions required by lightweight networks suffer from low throughput.
To address these problems, this work proposes a low-power, high-efficiency deep learning hardware accelerator and tapes out a chip based on the developed architecture. The design adopts a reconfigurable multi-core array of CIM macros, with the peripheral circuits designed around the characteristics of the reconfigurable array to achieve faster inference. Furthermore, by exploiting the computation characteristics of the CIM macros, a special weight mapping method and a sparsity-aware mechanism are designed, without any additional control circuitry, to support both depthwise separable convolution (Depthwise Separable Convolution) and standard convolution (Standard Convolution), achieving low power consumption while supporting more diverse convolution operations and a higher frame rate (FPS).
With 8-bit weights and 4-bit activations on the CIFAR-10 dataset, the proposed method and architecture achieve 109.275 GOPS and 17.315 TOPS/W on the VGG16 network model. With 8-bit weights and 8-bit activations on the ImageNet dataset, they achieve 50.268 GOPS and 2.51 TOPS/W on the MobileNet network model. Compared with a previous work [1] that also supports a sparsity-aware mechanism and uses the same number of CIM macros, the proposed reconfigurable architecture achieves up to a 1.808x speedup on VGG16 with the CIFAR-100 dataset.
Computing-in-memory (CIM) reduces large-scale data movement by performing computation within the memory itself, thereby avoiding the von Neumann bottleneck. Because of its low-power characteristics, CIM has demonstrated great potential for improving the energy efficiency of edge devices. However, practical CIM faces several hardware limitations. First, the storage capacity of a CIM macro is limited, so the weights need to be updated during inference; weight updates are constrained by the write bandwidth, which usually allows only one word line to be activated at a time, leading to performance loss. Second, the number of word lines and bit lines that can be turned on simultaneously is limited and cannot be selected arbitrarily, which leads to low throughput when performing the depthwise separable convolutions used in lightweight models.
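As a rough illustration of the depthwise separable convolution trade-off mentioned above, the following Python sketch (our own illustration, not code from the thesis) counts multiply-accumulate (MAC) operations for a standard convolution and its depthwise separable factorization; the layer sizes are arbitrary example values.

```python
# Minimal sketch: MAC counts for a standard convolution versus its
# depthwise separable factorization (depthwise k x k + pointwise 1x1).
def standard_conv_macs(h, w, cin, cout, k):
    # Each of the h*w*cout outputs needs k*k*cin MACs.
    return h * w * cout * k * k * cin

def depthwise_separable_macs(h, w, cin, cout, k):
    depthwise = h * w * cin * k * k   # one k x k filter per input channel
    pointwise = h * w * cin * cout    # 1x1 convolution mixes channels
    return depthwise + pointwise

h = w = 56; cin = cout = 128; k = 3   # example layer shape (assumed)
std = standard_conv_macs(h, w, cin, cout, k)
dsc = depthwise_separable_macs(h, w, cin, cout, k)
print(f"standard: {std:,} MACs, separable: {dsc:,} MACs, ratio: {std/dsc:.1f}x")
# Roughly 8.4x fewer MACs here, but the depthwise stage touches only one
# input channel per filter, which maps poorly onto CIM arrays that gain
# their efficiency from activating many word/bit lines in parallel --
# the throughput problem described above.
```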
This research proposes a deep learning accelerator with low power consumption and high performance, and we taped out a chip based on the proposed architecture. The architecture adopts a reconfigurable multi-core CIM array and is equipped with peripheral circuits designed around the characteristics of the reconfigurable array to achieve faster inference. In addition, by considering how the CIM macros operate, we design a dedicated weight mapping method and use a sparsity-aware mechanism to support both depthwise separable convolution and standard convolution without extra control circuitry, achieving low power consumption while supporting more convolution types.
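The thesis's Sparsity Index Code and weight mapping are hardware-specific (see Section 3.4); as a purely hypothetical software analogue of group-wise sparsity skipping, the sketch below flags each weight group with one index bit and accumulates only the nonzero groups. The group size `GROUP` and all function names are our own illustrative choices, not the thesis's scheme.

```python
# Hypothetical sketch of group-wise sparsity skipping; the actual
# Sparsity Index Code in the thesis is a hardware mechanism and differs.
import numpy as np

GROUP = 16  # weight-group size; an assumed parameter for illustration

def build_index_code(weights):
    """One bit per weight group: 1 = the group has any nonzero weight."""
    groups = weights.reshape(-1, GROUP)
    return (np.abs(groups).sum(axis=1) != 0).astype(np.uint8)

def sparse_dot(weights, activations):
    """Accumulate only the groups flagged nonzero by the index code."""
    w = weights.reshape(-1, GROUP)
    a = activations.reshape(-1, GROUP)
    code = build_index_code(weights)
    total = 0
    for g in np.nonzero(code)[0]:   # all-zero groups are skipped entirely
        total += int(w[g] @ a[g])
    return total

rng = np.random.default_rng(0)
w = rng.integers(-8, 8, 256)        # 4-bit-range weights, for example
w[rng.random(256) < 0.6] = 0        # prune ~60% of weights
a = rng.integers(0, 16, 256)        # 4-bit activations
assert sparse_dot(w, a) == int(w @ a)  # skipping zero groups is lossless
```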
With 4-bit activations and 8-bit weights on the CIFAR-10 dataset, the proposed method and architecture achieve 109.275 GOPS and 17.315 TOPS/W on VGG16; with 8-bit activations and 8-bit weights on the ImageNet dataset, they achieve 50.268 GOPS and 2.51 TOPS/W on MobileNet. Compared with a previous work [1] that supports a sparsity-aware mechanism and uses the same number of CIM macros, this work improves speed by up to 1.808 times on VGG16 with the CIFAR-10 dataset.
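As a sanity check on the quoted figures (our arithmetic, not a number reported in the abstract), dividing throughput by energy efficiency gives the power implied at each operating point, assuming both figures were measured under the same conditions:

```python
# Implied power from the quoted throughput and energy efficiency.
# GOPS / (TOPS/W) = (1e9 op/s) / (1e12 op/s per W) = 1e-3 W, i.e. milliwatts.
def implied_power_mw(gops: float, tops_per_w: float) -> float:
    return gops / tops_per_w

print(f"VGG16 on CIFAR-10:     {implied_power_mw(109.275, 17.315):.2f} mW")  # ~6.31 mW
print(f"MobileNet on ImageNet: {implied_power_mw(50.268, 2.51):.2f} mW")     # ~20.03 mW
```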
Abstract (Chinese)
ABSTRACT
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction----------------------1
1.1 Research Background--------------------1
1.2 Research Motivation and Objectives---------------5
1.3 Thesis Organization--------------------10
Chapter 2 Literature Review-------------------11
2.1 Deep Learning Hardware Accelerators-----------11
2.1.1 Low-Precision Arithmetic-------------11
2.1.2 Data Reuse-------------12
2.1.3 Energy Cost of Data Movement---------------15
2.1.4 Sparsity-Aware Mechanisms-------------16
2.1.5 Hardware-Software Co-Design-------------18
2.2 Computing-In-Memory Architectures-------------20
2.3 Hardware Limitations of Practical Computing-In-Memory----23
2.4 Research Motivation--------------------27
Chapter 3 Reconfigurable Multi-Core Computing-In-Memory Accelerator-----------28
3.1 CIM Macro Operation and Sparse Weight Group Terminology-----------28
3.2 Hardware Architecture and Dataflow--------------------------33
3.3 Parallelized and Pipelined Operation of the Reconfigurable Computing Cores-----------38
3.4 Sparsity Index Code and Depthwise Separable Convolution Mapping----42
3.4.1 Sparsity Index Code and Convolution Operation-------42
3.4.2 Depthwise Separable Convolution Mapping Method-----44
3.5 Reconfigurable Global Adder--------------46
3.6 Multi-Core Bus Arbitration (Arbiter)--------------48
Chapter 4 Experimental Results---------------------------50
4.1 Neural Network Application Performance in Simulation-------------50
4.2 Effectiveness of the Reconfigurable Multi-Core Architecture------------------57
4.3 Dual-Core Chip Version--------------------61
4.4 Comparison with Other CIM-Based Accelerators------67
Chapter 5 Conclusion and Future Work----------------------69
References---------------------------------70
[1] S.-H. Sie, et al., “MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with Co-designed Compressed Neural Networks.” In arXiv:2010.12861, 2020.
[2] Olga Russakovsky, et al., “Imagenet large scale visual recognition challenge.” In International Journal of Computer Vision, 115.3: 211-252, 2015.
[3] A. Krizhevsky, et al., “Imagenet classification with deep convolutional neural networks.” In NIPS, 2012.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition.” In ICLR, 2015.
[5] Howard, Andrew G., et al., "Mobilenets: Efficient convolutional neural networks for mobile vision applications." In arXiv:1704.04861, 2017.
[6] E. Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," In IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[7] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," In ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12, 2017.
[8] S. Han, et al., “Learning both Weights and Connections for Efficient Neural Networks.” In NIPS, 2015.
[9] Zhang, Xiangyu, et al. "Shufflenet: An extremely efficient convolutional neural network for mobile devices." In IEEE Conference on computer vision and pattern recognition, 2018.
[10] C. Szegedy et al., "Going deeper with convolutions." In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, 2015.
[11] R. Andri, et al., "YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights," In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016.
[12] S. Yin et al., "An Ultra-High Energy-Efficient Reconfigurable Processor for Deep Neural Networks with Binary/Ternary Weights in 28NM CMOS," In IEEE Symposium on VLSI Circuits, 2018.
[13] Y.-H. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.” In JSSC, ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, 2017.
[14] Kung, Hsiang-Tsung. "Why systolic architectures?." Computer 15.01 (1982): 37-46.
[15] Y. Chen et al., "DaDianNao: A Machine-Learning Supercomputer," In IEEE Micro, pp. 609-622, 2014.
[16] Z. Yuan, et al., "Sticker: A 0.41-62.1 TOPS/W 8bit neural network processor with multi-sparsity compatible convolution arrays and online tuning acceleration for fully connected layers." In VLSI, 2018.
[17] H. Ji, L. Song, et al., "ReCom: An efficient resistive accelerator for compressed deep neural networks." In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 237-240, 2018.
[18] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger and A. Moshovos, "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing," In International Symposium on Computer Architecture (ISCA), pp. 1-13, 2016.
[19] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler and Z. Zhang, "SNAP: A 1.67-21.55 TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS," In Symposium on VLSI Circuits, 2019.
[20] H. Li, et al., "Pruning Filters for Efficient ConvNets." In ICLR, 2017.
[21] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan and C. Zhang, "Learning Efficient Convolutional Networks through Network Slimming," In ICCV, 2017.
[22] R. Guo et al., "A 5.1pJ/Neuron 127.3us/Inference RNN-based Speech Recognition Processor using 16 Computing-in-Memory SRAM Macros in 65nm CMOS." In Symposium on VLSI Circuits, pp. C120-C121, 2019.
[23] J. Yue et al., "A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse," In International Solid-State Circuits Conference (ISSCC), pp. 234-236, 2020.
[24] H. Jia et al., "A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing," In International Solid-State Circuits Conference (ISSCC), pp. 236-238, 2021.
[25] J. Yue et al., "A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating," In International Solid-State Circuits Conference (ISSCC), pp. 238-240, 2021.
[26] X. Si et al., "A 28nm 64Kb 6T SRAM Computing-in-Memory Macro with 8b MAC Operation for AI Edge Chips." In International Solid-State Circuits Conference (ISSCC), pp. 246-248, 2020.
[27] J. -H. Kim, J. Lee, J. Lee, H. -J. Yoo and J. -Y. Kim, "Z-PIM: An Energy-Efficient Sparsity Aware Processing-In-Memory Architecture with Fully-Variable Weight Precision," In IEEE Symposium on VLSI Circuits, 2020.
[28] Z. Li et al., "A Miniature Electronic Nose for Breath Analysis," In IEEE International Electron Devices Meeting (IEDM), 2021.
[29] X. Si et al., "A Twin-8T SRAM Computation-In-Memory Macro for Multiple-Bit CNN-Based Machine Learning," In International Solid-State Circuits Conference (ISSCC), pp. 396-398, 2019.
[30] A. Biswas et al., “Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications,” ISSCC, pp. 488-489, Feb. 2018.
[31] W.-S. Khwa et al., "A 65nm 4Kb Algorithm-Dependent Computing-In-Memory SRAM Unit-Macro with 2.3ns and 55.8 TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors," ISSCC, pp. 496-498, Feb. 2018.
[32] S. K. Gonugondla et al., "A 42pJ/decision 3.12 TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training," ISSCC, pp. 490-492, Feb. 2018.