
Detailed Record

Author (Chinese): 鄭揚翰
Author (English): Zheng, Yang-Han
Title (Chinese): 基於有效位元組合機制及可重構高性能乘法器之高面積效率深度神經網路加速器
Title (English): An Area-Efficient DNN Accelerator with Effective Bit Combination Mechanism and a Reconfigurable High-Performance Multiplier
Advisor (Chinese): 鄭桂忠
Advisor (English): Tang, Kea-Tiong
Committee Members (Chinese): 黃朝宗、呂仁碩、盧峙丞
Committee Members (English): Huang, Chao-Tsung; Liu, Ren-Shuo; Lu, Chih-Cheng
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061470
Publication Year (ROC): 111 (2022)
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 51
Keywords (Chinese): 深度神經網路加速器、有效位元組合機制、乘法器、高面積效率
Keywords (English): DNN accelerator; effective bit combination mechanism; multiplier; area-efficient
Abstract (translated from Chinese): Deep neural networks are widely used in a broad range of tasks such as image classification and speech recognition. When a deep neural network is deployed on an edge device, its inputs and weights are usually quantized, so the data distribution typically shows clear regularities: most of the data contain many redundant bits, and these redundant bits lower the utilization of the computation resources. This thesis proposes an area-efficient deep neural network accelerator based on an effective bit combination mechanism and a reconfigurable high-performance multiplier, providing DNN acceleration support for mobile devices. Building on the modified Baugh-Wooley multiplier, this thesis proposes a multiplier that can perform two 4-bit multiplications at the same time while consuming only 1.57× the area and 2.31× the power of a conventional multiplier. Based on the data distribution characteristics of deep neural networks, this thesis proposes a gating method for weights of 0/-1/1 that reduces power consumption by 34.96%. This thesis also proposes an optimized dataflow that achieves better reuse of inputs and weights with smaller area and lower power consumption, reducing memory accesses, and further proposes an efficient convolution scheme with two strategies that effectively improves processing-element utilization under various layer configurations. With the proposed methods, the designed deep neural network accelerator achieves an area efficiency of 243.13 GOPS/mm².
Abstract (English): Deep neural networks (DNNs) are widely used in a variety of tasks, such as image classification and speech recognition. When a DNN is deployed on an edge device, its inputs and weights are usually quantized, so the data distribution exhibits clear regularities: most values contain a large number of redundant bits, which reduces the utilization of the computation resources. This thesis proposes an area-efficient DNN accelerator with an effective bit combination mechanism and a reconfigurable high-performance multiplier to support DNN acceleration on mobile devices. Based on the modified Baugh-Wooley multiplier, this thesis proposes a multiplier that can process two 4-bit multiplications in one cycle while consuming only 1.57× the area and 2.31× the power of a conventional multiplier. Exploiting the data distribution in DNNs, this thesis proposes a gating approach for weights of 0/-1/1 that reduces power consumption by 34.96%. This thesis also proposes an optimized dataflow that better reuses inputs and weights and reduces memory accesses with smaller area and lower power consumption, and further proposes an efficient convolution scheme with two strategies that effectively improves processing-element utilization under various layer configurations. With the proposed approaches, the designed DNN accelerator achieves an area efficiency of 243.13 GOPS/mm².
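
The record contains no code, but the two ideas at the core of the abstract, packing two narrow multiplications into a single multiplier and skipping the multiplier for trivial weights, can be illustrated with a small behavioral sketch. The Python snippet below is only an illustrative assumption, not the thesis's modified Baugh-Wooley design: the function names packed_dual_4bit_mul and gated_mac, the unsigned 4-bit operands, and the guard-bit packing are hypothetical choices made for demonstration.

# Behavioral sketch only: emulates (a) two 4-bit multiplications sharing one
# wide multiplier and (b) gating of trivial weights (0, +1, -1). It is not
# the modified Baugh-Wooley array described in the thesis.

def packed_dual_4bit_mul(a0: int, a1: int, b: int) -> tuple[int, int]:
    """Multiply two unsigned 4-bit activations a0 and a1 by the same unsigned
    4-bit weight b using a single wide multiplication.

    The activations are packed with guard bits so that their partial products
    cannot overlap, mimicking a multiplier array split into two 4-bit lanes.
    """
    assert 0 <= a0 < 16 and 0 <= a1 < 16 and 0 <= b < 16
    LANE = 8                        # each 4x4 product needs at most 8 bits
    packed = a0 | (a1 << LANE)      # one 8-bit lane per operand: [ a1 | a0 ]
    product = packed * b            # one multiplication yields both products
    p0 = product & 0xFF             # low lane  -> a0 * b
    p1 = (product >> LANE) & 0xFF   # high lane -> a1 * b
    return p0, p1

def gated_mac(acc: int, activation: int, weight: int) -> int:
    """Multiply-accumulate that skips the multiplier for weights 0, +1 and -1.

    In hardware this corresponds to gating the multiplier's operands or clock;
    here the early returns simply illustrate the idea.
    """
    if weight == 0:
        return acc                  # zero weight: no work at all
    if weight == 1:
        return acc + activation     # pass-through, no multiplication
    if weight == -1:
        return acc - activation     # negation only, no multiplication
    return acc + activation * weight

if __name__ == "__main__":
    print(packed_dual_4bit_mul(7, 12, 9))   # (63, 108): both lanes are correct
    print(gated_mac(10, 5, -1))             # 5: the multiplier was skipped

The guard bits are the key detail: a 4x4 product needs at most 8 bits, so the lanes are spaced 8 bits apart to keep the two results from interfering, which is the same bookkeeping a split multiplier array has to perform.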
Abstract (Chinese) I
Abstract (English) II
List of Figures IV
List of Tables V
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 4
1.3 Thesis Organization 5
Chapter 2 Literature Review 6
2.1 Development of DNN Accelerators 6
2.2 Multi-Precision Multiplication 8
2.2.1 Bit-Serial 8
2.2.2 Bit Fusion 9
2.3 Challenges and Solutions 10
Chapter 3 Effective Bit Combination Mechanism 15
3.1 Principle of the Effective Bit Combination Mechanism 15
3.2 Practical Application of the Effective Bit Combination Mechanism 17
Chapter 4 DNN Accelerator Hardware Design 21
4.1 System Architecture Design 21
4.2 Processing Element Group 24
4.2.1 Processing Element Design 24
4.2.2 Multiplier Design 25
4.3 Combination Encoder 30
4.4 Gating Method for Weights of 0/-1/1 31
4.5 Optimized Dataflow 32
4.6 Efficient Convolution Scheme 38
Chapter 5 Experimental Results and Discussion 40
5.1 Simulation Results 40
5.1.1 Multiplier Results 40
5.1.2 Gating Method Results 42
5.1.3 Optimized Dataflow Results 43
5.1.4 FPGA System Verification 45
5.2 Comparison with State-of-the-Art Works 47
Chapter 6 Conclusion and Future Work 48
Chapter 7 References 49
