
Detailed Record

Author (Chinese): 歐宇豐
Author (English): Ou, Yu-Feng
Title (Chinese): 用於卷積神經網路加速之能量和面積高效可重構四精度卷積引擎
Title (English): Energy- and Area-Efficient Reconfigurable Quad-Precision Convolution Engine for CNN Acceleration
Advisor (Chinese): 黃朝宗
Advisor (English): Huang, Chao-Tsung
Committee Members (Chinese): 呂仁碩、賴永康
Committee Members (English): Liu, Ren-Shou; Lai, Yeong-Kang
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 109061553
Publication Year (R.O.C. calendar): 111 (2022)
Graduation Academic Year: 111
Language: English
Pages: 41
Keywords (Chinese): 可重構卷積引擎、卷積神經網路加速
Keywords (English): Reconfigurable Convolution Engine; CNN Acceleration
CNN has achieved great success in computational imaging. However, its energy and area requirements must be lowered before CNNs can be deployed on edge devices. Layer reduction and uniform-precision quantization are often used to cut the enormous computational intensity, and thus the energy consumption, but they cause significant degradation of model quality. A better way to reduce the high energy demand while maintaining PSNR is multi-precision networks, which exploit the different quantization sensitivity of each model layer. However, previous hardware designs that support variable-precision model acceleration suffer a significant area increase because they use extended multipliers, which add a 1-bit sign extension to each basic computing unit. Moreover, most of them target classification or style transfer, so their effectiveness on quantization-sensitive and computation-intensive applications such as denoising remains unclear.
In this work, we develop an energy- and area-efficient reconfigurable quad-precision convolution engine by jointly considering the drawback of sign extension and multi-precision denoising models. First, we build a quad-precision denoising modeling flow that assigns the most suitable precision to each layer; this technique reduces computational complexity by 17-33% for four selected denoising models. Second, we propose the biased multiplier to avoid sign extension and show that it reduces the area overhead of a one-channel-to-one-channel convolution from 66% to 6% without any PSNR drop. Third, we establish a data packing method with small area overhead that lets the processing elements maintain 100% utilization when processing data of different bit widths. Putting these together, we implement a highly parallel convolution engine with the biased multiplier and the most area-efficient packing method. Synthesis results in TSMC 40nm technology show that our engine reduces the area overhead from 60% to 10% compared with a design using the extended multiplier. Combined with the reduction in computational complexity, our design offers 34%, 30%, and 32% energy savings over the conventional 8-bit layer-reduction method at UHD 30 fps, FHD 60 fps, and FHD 30 fps, respectively.
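The biased-multiplier circuit itself is detailed in Chapter 4 of the thesis and is not reproduced on this record page. As a minimal sketch of the kind of arithmetic that avoids per-unit sign extension, the standard offset-binary identity can be used: adding a bias B = 2^(b-1) maps a signed b-bit operand into the unsigned range [0, 2^b), so a plain unsigned b-by-b multiplier array plus shift-and-add correction terms recovers the signed product. The Python sketch below illustrates that identity only; the function name biased_mul and its structure are our assumptions, not the thesis's design.

    # Illustrative sketch only: signed multiplication via biased
    # (offset-binary) operands, so the partial-product array needs
    # no 1-bit sign extension.
    def biased_mul(x, w, bits=8):
        B = 1 << (bits - 1)            # bias mapping signed -> unsigned
        u, v = x + B, w + B            # biased operands, both in [0, 2**bits)
        assert 0 <= u < (1 << bits) and 0 <= v < (1 << bits)
        # (u - B) * (v - B) = u*v - B*(u + v) + B*B
        return u * v - B * (u + v) + B * B

    # Exhaustive check: matches the signed product for every 8-bit pair.
    B = 1 << 7
    assert all(biased_mul(x, w) == x * w
               for x in range(-B, B) for w in range(-B, B))

In hardware, the B*(u + v) and B*B corrections reduce to shifts and constant additions that can be folded into the accumulation stage, so the unsigned multiplier array dominates the cost; this is consistent with the reported drop in area overhead from 66% to 6%.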
Abstract (Chinese)
Abstract (English)
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Related Work
1.2.1 Dynamic Fixed Point
1.2.2 Reconfigurable Engine with Extended Multiplier
2 Quad-Precision Modeling for Denoising
2.1 Denoising Model
2.2 Decision of Basic Computing Unit
2.3 Quad-Precision Denoising Model
2.4 Experiment Results
2.5 Summary
3 Design of Processing Element for Reconfigurable Quad-Precision Engine
3.1 Data Processing
3.2 Brief Introduction of eCNN
3.3 Brief Introduction of Reconfigurable PE
3.4 Data Packing Methods
3.4.1 Basic Packing Concept
3.4.2 Introduction of Different Packing Methods
3.4.3 Area Overhead Comparison
4 Implementation of Highly-Parallel Quad-Precision Convolution Engine
4.1 Uniform-Precision Convolution Engine
4.2 Quad-Precision Convolution Engine with Extended Multiplier
4.3 Quad-Precision Convolution Engine with Biased Multiplier
4.4 Detailed Comparison of Packing Area Overhead
4.5 Implementation Results
5 Conclusion and Discussion
5.1 Conclusion
5.2 Discussion
