
Detailed Record

Author (Chinese): 蔡佩妘
Author (English): Cai, Pei Yun
Title (Chinese): 針對深度卷積神經網路之具彈性加速器設計
Title (English): Design of a Flexible Accelerator for Deep Convolution Neural Networks
Advisor (Chinese): 黃稚存
Advisor (English): Huang, Chih Tsun
Committee (Chinese): 金仲達; 劉靖家
Committee (English): King, Chung Ta; Liou, Jing Jia
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 103062560
Year of Publication (ROC calendar): 106 (2017)
Graduation Academic Year: 105
Language: Chinese, English
Number of Pages: 68
Keywords (Chinese): 卷積神經網路; 加速器
Keywords (English): Convolution Neural Network; Accelerator
Abstract (Chinese):
The convolutional neural network (CNN) is a deep learning algorithm for image recognition. Because it achieves highly accurate results, it is widely used in applications such as artificial intelligence and computer vision. However, its high-dimensional network structure and heavy computational load make both throughput and the resulting volume of data accesses a problem, which has made CNN acceleration a popular research topic. Even when we achieve highly parallel computation to improve throughput, we must also consider the orchestration of data movement to reduce the bandwidth requirement. We therefore propose a dataflow with a matching hardware architecture (extended from MIT's Eyeriss, an energy-efficient reconfigurable CNN accelerator) that removes a large amount of unnecessary data access without sacrificing performance.

Most existing accelerator architectures start from one of two viewpoints: tackling the highly complex computation, or scheduling data accesses effectively to reduce the bandwidth requirement. Because the two interact, both must be considered together to truly solve the problem.

Exploiting the highly parallelizable nature of convolution layers, we use a large array of processing elements (PEs) computing simultaneously to achieve high throughput. The core of our proposed dataflow, in turn, is to fully exploit the various forms of data sharing in a convolution layer and to stage data in each PE's local memory: while computing each convolution window, a PE re-reads data directly from its registers, so consecutive outputs can be accumulated immediately, reducing the burden of moving intermediate results and operands back and forth to DRAM. On this basis, we propose a tiling method that enlarges the PE memory to improve computational efficiency: the larger the local buffer, the more data can be reused directly inside the PE and the more partial sums can be accumulated, giving better performance. In addition, because each layer of a CNN has a different shape, our dataflow and hardware architecture are reconfigurable and, together with a network-on-chip built on a tag-based identification scheme, achieve efficient utilization of area and functional units in every layer. For the second convolution layer of AlexNet, using an additional 2.63 KB of registers yields a 4.64x performance improvement; compared with Eyeriss on the same layer, we are 1.07x faster while using an additional 18.38 KB of registers. For the third convolution layer, adding 12.8 KB of registers makes us 14.55x faster; compared with Eyeriss on that layer, we are 1.03x faster while using an additional 28.55 KB of memory. Over all the convolution layers of AlexNet, our design achieves 41.7 fps (55.5 GOPS) at a 250 MHz clock. Compared with Eyeriss at the same 200 MHz clock, we are 1.03x to 1.07x faster on convolution layers 2 through 4 of AlexNet while using an additional 16 KB of on-chip memory.
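
To make the data-reuse argument concrete, the sketch below (our own illustration under simplified assumptions, not the thesis's hardware or code; the function name and the example row width are hypothetical) counts external reads for one output row of a 1-D convolution, with and without a PE-local buffer that holds the filter taps and a sliding window of the input row.

```python
# Illustrative sketch only: count external (DRAM) reads needed to produce
# one output row of a 1-D convolution under two policies. Names and the
# example sizes are assumptions for illustration, not from the thesis.

def conv_row_accesses(width, taps):
    """External reads for one output row: input width `width`,
    filter width `taps`."""
    out_w = width - taps + 1
    # No local buffer: every window re-fetches its `taps` inputs and
    # `taps` weights from external memory.
    naive = out_w * 2 * taps
    # PE-local buffer: weights are loaded once; each input element is
    # loaded once and then reused across all overlapping windows.
    buffered = taps + width
    return naive, buffered

# Example: a 5-tap filter over a 31-element (padded) input row, an
# AlexNet-layer-2-like shape producing 27 outputs per row.
naive, buffered = conv_row_accesses(width=31, taps=5)
print(f"no buffer: {naive} reads; with local buffer: {buffered} reads "
      f"({naive / buffered:.1f}x fewer)")
```
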
Abstract (English):
The Convolution Neural Network (CNN) is a deep learning method for vision recognition. Its state-of-the-art accuracy makes it widely used in artificial intelligence, computer vision, self-driving cars, and other applications. However, CNNs are computationally complex and demand high memory bandwidth. Although highly parallel computation can be exploited for effective throughput, data movement must also be well orchestrated to keep the memory bandwidth requirement in check. To address these problems, we present a specialized dataflow with a spatial hardware architecture (extended from MIT's Eyeriss, an energy-efficient reconfigurable CNN accelerator) that reduces memory access without sacrificing performance.

Existing works typically improve either the computation or the memory-access aspect. However, computational parallelism and memory bandwidth interact with each other, so both should be considered at the same time.

The convolution operations of CNNs exhibit various types of data reuse and high parallelism. We apply a highly parallel PE array to improve throughput. To minimize data access, we propose a dataflow that leverages data-reuse opportunities and the local buffer inside each PE; data can then be reused temporally without iterative accesses between high-level memory and the PEs. In addition, a large amount of intermediate data can be accumulated immediately, which would otherwise put extra pressure on storage. For this reason, we propose a tiling methodology that trades off performance against local buffer size: the larger the local buffer, the more data can be reused and the more intermediate data consumed, alleviating the data-streaming bottleneck and enabling efficient computation. Furthermore, our dataflow and hardware adapt to layers with varying shapes, so throughput can be maximized in each layer of a CNN. For layer 2 of AlexNet, we achieve a 4.64x speedup with an additional 2.63 KB of buffer over the initial buffer size (one row of input data); compared with Eyeriss, a 1.07x speedup using an additional 18.38 KB of buffer. For layer 3 of AlexNet, we achieve a 14.55x speedup with an additional 12.8 KB of buffer; compared with Eyeriss, a 1.03x speedup using an additional 28.55 KB. As a result, the frame rate reaches 41.7 fps (55.5 GOPS) at 250 MHz for the convolution layers of AlexNet. Compared with Eyeriss at the same 200 MHz frequency, we perform 1.03x to 1.07x better on layers 2 through 4 of AlexNet while using an additional 16 KB of on-chip buffer.
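
As a quick consistency check on the reported throughput (our own arithmetic using the standard AlexNet convolution-layer shapes from [17], not a figure taken from the thesis), counting one multiply-accumulate as two operations shows that 41.7 frames per second over the five convolution layers indeed corresponds to roughly 55.5 GOPS:

```python
# Our own sanity check, not from the thesis: AlexNet convolution-layer
# MAC counts, with one MAC counted as two operations (multiply + add).
# Tuples: (out_h, out_w, out_channels, filter_h, filter_w, in_channels
# per group); layers 2, 4, and 5 use two groups, halving input channels.
layers = [
    (55, 55,  96, 11, 11,   3),  # conv1
    (27, 27, 256,  5,  5,  48),  # conv2
    (13, 13, 384,  3,  3, 256),  # conv3
    (13, 13, 384,  3,  3, 192),  # conv4
    (13, 13, 256,  3,  3, 192),  # conv5
]
macs = sum(oh * ow * oc * fh * fw * ic for oh, ow, oc, fh, fw, ic in layers)
gop_per_frame = 2 * macs / 1e9           # about 1.33 GOP per frame
print(f"{gop_per_frame:.2f} GOP per frame")
print(f"41.7 fps -> {41.7 * gop_per_frame:.1f} GOPS")   # about 55.5
```
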
1 Introduction 1
1.1 Introduction of Convolution Neural Network (CNN) and Motivation 1
1.2 Previous Work 1
1.3 Contribution 4
1.4 Organization of Thesis 5
2 CNN Background 6
2.1 Primer 6
2.1.1 Convolution Layer 8
2.1.2 Non-Linearity Layer (ReLU) 12
2.1.3 Pooling Layer (Subsampling) 14
2.1.4 Fully-Connected Layer 16
2.2 State-of-the-Art CNN Models 17
2.3 Challenges and Breakthrough 18
3 Spatial Architecture and Specialized Dataflow 21
3.1 Overview of Spatial Architecture 21
3.2 Specialized Dataflow 24
3.2.1 Logical Mapping 24
3.2.2 Physical Mapping 42
3.3 Network-on-Chip (NoC) 45
3.4 Processing Engine (PE) 49
4 Tiling Methodology 51
5 Experiment Results and Analysis 58
6 Conclusion and Future Work 64
6.1 Conclusion 64
6.2 Future Work 65
[1] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in IEEE ISSCC, 2016.
[2] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in IEEE ISCA, 2016.
[3] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning," in ASPLOS, 2014.
[4] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," in MICRO, 2014.
[5] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator," in ASPLOS, 2015, pp. 369–381.
[6] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor," in ISCA, 2015.
[7] J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim, "14.6 A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems," in IEEE ISSCC, 2016, pp. 264–265.
[8] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," CoRR, vol. abs/1410.0759, 2014.
[9] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based Processor for Convolutional Networks," in FPL, 2009, pp. 32–37.
[10] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, "A Massively Parallel Coprocessor for Convolutional Neural Networks," in IEEE ASAP, 2009.
[11] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, "A Dynamically Configurable Coprocessor for Convolutional Neural Networks," in ISCA, 2010.
[12] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision," in IEEE CVPRW, Jun. 2011, pp. 109–116.
[13] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric Accelerator Design for Convolutional Neural Networks," in IEEE ICCD, 2013.
[14] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," in IEEE CVPRW, 2014.
[15] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in FPGA, 2015.
[16] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," in FPGA, 2016.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012, pp. 1097–1105.
[18] M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks," in ECCV, 2014, pp. 818–833.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," arXiv preprint arXiv:1409.4842, 2014.
[20] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," arXiv preprint arXiv:1512.03385, 2015.
[22] J. Cong and B. Xiao, "Minimizing Computation in Convolutional Neural Networks," in ICANN, 2014, pp. 281–290.
[23] ujjwalkarn, "An Intuitive Explanation of Convolutional Neural Networks," 2016.
[24] S. Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," arXiv preprint arXiv:1506.01497, 2015.
[25] Clarifai, "Technology."
[26] Wikipedia, "Kernel (image processing)."
[27] R. Fergus, "Neural Networks," Machine Learning Summer School, 2015.
[28] CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University.
[29] Teledyne DALSA, "Image Filtering in FPGAs."
(Full text not authorized for public access)