Author (Chinese): 徐連志
Author (English): Hsu, Lien-Chih
Title (Chinese): 基於深度卷積神經網路之具功率意識位元序列串流加速器
Title (English): ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
Advisor (Chinese): 邱瀞德
Advisor (English): Chiu, Ching-Te
Committee Members (Chinese): 李政崑、黃朝宗
Committee Members (English): Lee, Jenq-Kuen; Huang, Chao-Tsung
Degree: Master
University: National Tsing Hua University
Department: Computer Science
Student ID: 105062591
Year of Publication: 2018 (R.O.C. year 107)
Graduation Academic Year: 107
Language: English
Number of Pages: 67
Keywords (Chinese): 卷積神經網路、硬體加速器、功率意識、精準度、位元序列處理單元、串流式資料流
Keywords (English): Convolutional Neural Networks (CNNs), Hardware Accelerator, Energy-Aware, Precision, Bit-Serial PE, Streaming Dataflow
Abstract:
Over the past decade, deep convolutional neural networks (CNNs) have been widely embraced in various visual recognition applications due to their extraordinary accuracy, which even surpasses that of human beings. However, high computational complexity and a massive amount of data storage are two challenges for CNN hardware design. Although GPUs can handle the high computational complexity, their large energy consumption, caused by huge external memory access, has pushed researchers toward dedicated CNN accelerator designs. Generally, the precision of modern CNN accelerators is set to 16-bit fixed-point. To reduce data storage, Sakr et al. [1] show that lower precision can be used under the constraint of at most 1% accuracy degradation in recognition. Moreover, a per-layer precision assignment can reach lower bit-width requirements than a uniform precision assignment across all layers.
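The per-layer idea can be sketched as follows. This is an illustrative uniform quantizer, not the analytical method of [1]; the function name, interface, and the bit-widths in the comment are hypothetical.

```python
import numpy as np

def quantize_per_layer(weights: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric fixed-point quantization of one layer's weights."""
    # Step size chosen so the largest magnitude maps to the largest code word.
    scale = np.max(np.abs(weights)) / (2 ** (bits - 1) - 1)
    return np.round(weights / scale) * scale  # quantize, then dequantize

# A hypothetical per-layer assignment uses fewer bits where a layer tolerates
# them, e.g. {"conv1": 9, "conv2": 7, "conv3": 6} instead of a uniform 16.
```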
In this paper, we propose ESSA, an energy-aware bit-serial streaming deep CNN accelerator, to tackle the computational complexity, data storage, and external memory access issues. With the ring streaming dataflow and the output reuse strategy to decrease data access, the amount of external DRAM access for the convolutional layers of AlexNet is reduced by 357.26x compared to the case without output reuse. In addition, we optimize hardware utilization and avoid unnecessary computations by loop tiling and by mapping non-unit strides of convolutional layers to unit strides, which enhances computational performance. Furthermore, the bit-serial processing element (PE) is designed to exploit the reduced number of bits in the weights, which lowers both the amount of computation and the external memory access.
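The bit-serial principle admits a minimal sketch: a multiply-accumulate is decomposed into shift-and-add steps, one weight bit per cycle, so a weight with fewer bits finishes in proportionally fewer cycles. The function below illustrates that principle only; it is not the exact PE design, and signed weights are omitted for clarity.

```python
def bit_serial_mac(activation: int, weight: int, weight_bits: int, acc: int = 0) -> int:
    """Accumulate activation * weight by streaming the weight one bit per cycle.

    `weight` is assumed to be an unsigned magnitude of `weight_bits` bits.
    """
    for b in range(weight_bits):        # one loop iteration models one cycle
        if (weight >> b) & 1:           # serialize the weight, LSB first
            acc += activation << b      # add the shifted partial product
    return acc

# A 4-bit weight takes 4 cycles; trimming it to 3 bits would take 3.
assert bit_serial_mac(activation=5, weight=0b1011, weight_bits=4) == 5 * 11
```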
We evaluate our design with the well-known roofline model [8], which is a more efficient way to evaluate a design than building the actual hardware. The design space is explored to find the configuration with the best computational performance and communication-to-computation (CTC) ratio. Assuming the same FPGA as Chen et al. [2], we reach a 1.36x speedup and reduce the energy consumption of external memory access by 41% compared to the design in [2].
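For reference, the roofline bound of [8] is the minimum of the compute roof and the memory roof, where the memory roof is the CTC ratio times the DRAM bandwidth. The sketch below uses placeholder platform numbers, not the FPGA parameters of [2].

```python
def attainable_perf(peak_gops: float, dram_bw_gbs: float,
                    total_ops: float, dram_bytes: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    ctc_ratio = total_ops / dram_bytes          # operations per DRAM byte
    return min(peak_gops, ctc_ratio * dram_bw_gbs)

# Placeholder numbers: a design doing 20 ops per DRAM byte on a 100 GOP/s,
# 4 GB/s platform is memory-bound at 80 GOP/s.
print(attainable_perf(peak_gops=100.0, dram_bw_gbs=4.0,
                      total_ops=20e9, dram_bytes=1e9))
```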
On the hardware implementation side, our PE-array architecture reaches an operating frequency of 119 MHz with a gate count of 68 k and a power consumption of 10.08 mW under TSMC 90 nm technology. Compared to the 15.4 MB of external memory access required by Eyeriss [3] on the convolutional layers of AlexNet, our work needs only 4.36 MB, dramatically reducing the most energy-consuming component of the power budget.
Table of Contents
1 Introduction
1.1 Background and Motivation
1.2 Goal and Contribution
1.3 Thesis Organization
2 Related Works
2.1 CNN Basics
2.2 Dataflow Types
2.3 Bit Reduction
2.4 Approaches
2.4.1 FPGA-Based Approach
2.4.2 ASIC-Based Approach
3 Proposed Energy-Aware System Architecture and Dataflow
3.1 System Overview
3.2 Computational Performance Enhancement
3.2.1 Loop Tiling
3.2.2 Bit Reduction
3.2.3 Non-Unit Stride to Unit Stride
3.3 Energy Efficiency
3.3.1 Reuse Strategy
3.4 Kernel Decomposition
4 Proposed Bit-Serial Streaming PE Array Architecture and Dataflow
4.1 Architecture
4.1.1 PE Array
4.1.2 PE Row
4.1.3 Bit-Serial PE
4.2 Ring Streaming Dataflow
5 System Evaluation
5.1 Theoretical Evaluation
5.2 Implementation Results
6 Conclusions
References
[1] C. Sakr and N. Shanbhag, “An analytical method to determine minimum per-layer precision of deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.
[2] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[3] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[4] Y.-J. Lin and T. S. Chang, “Data and hardware efficient design for convolutional neural network,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 5, pp. 1642–1651, 2018.
[5] Eyeriss dataflow talk at ACM/IEEE ISCA 2016. [Online]. Available: http://www.rle.mit.edu/eems/wp-content/uploads/2016/06/eyeriss_isca_2016_slides.pdf
[6] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016, pp. 1–12.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[8] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[13] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[15] T. He, W. Huang, Y. Qiao, and J. Yao, “Text-attentional convolutional neural network for scene text detection,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2529–2541, 2016.
[16] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
[17] D. Tomè, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, and S. Tubaro, “Deep convolutional neural networks for pedestrian detection,” Signal Processing: Image Communication, vol. 47, pp. 482–489, 2016.
[18] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
[19] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[20] G. Lin, A. Milan, C. Shen, and I. D. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in CVPR, vol. 1, no. 2, 2017, pp. 5168–5177.
[21] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, “Deep convolutional neural network architecture with reconfigurable computation patterns,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 8, pp. 2220–2233, 2017.
[22] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W convolutional network accelerator,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, 2017.
[23] A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349–1362, 2018.
[24] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization,” in Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 2016, pp. 1–6.
[25] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing the convolution operation to accelerate deep neural networks on FPGA,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, no. 99, pp. 1–14, 2018.
[26] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos, “Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, pp. 20:1–20:6.
[27] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based processor for convolutional networks,” in Field Programmable Logic and Applications, 2009 International Conference on. IEEE, 2009, pp. 32–37.
[28] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” ACM SIGPLAN Notices, vol. 49, no. 4, pp. 269–284, 2014.
[29] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
[30] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and Y. Chen, “DaDianNao: A neural network supercomputer,” IEEE Transactions on Computers, vol. 66, no. 1, pp. 73–88, 2017.
[31] N. Li, S. Takaki, Y. Tomioka, and H. Kitazawa, “A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition,” in Image Analysis and Interpretation (SSIAI), 2016 IEEE Southwest Symposium on. IEEE, 2016, pp. 165–168.
[32] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” in Proceedings of the 35th International Conference on Computer-Aided Design. ACM, 2016, pp. 12:1–12:8.
[33] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, “Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs,” in Proceedings of the 54th Annual Design Automation Conference. ACM, 2017, pp. 29:1–29:6.
[34] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017, pp. 1–12.
[35] Z. Liu, Y. Dou, J. Jiang, and J. Xu, “Automatic code generation of convolutional neural networks in FPGA implementation,” in Field-Programmable Technology (FPT), 2016 International Conference on. IEEE, 2016, pp. 61–68.
[36] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–35.
[37] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang, “A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 1, pp. 198–208, 2018.
[38] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, “Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2018.
[39] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, “Reduced-precision strategies for bounded memory in deep neural nets,” arXiv preprint arXiv:1511.05236, 2015.
[40] S. Anwar, K. Hwang, and W. Sung, “Fixed point optimization of deep convolutional neural networks for object recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 1131–1135.
[41] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 382–394.
[42] B. Moons and M. Verhelst, “An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 903–914, 2017.
[43] K. Bong, S. Choi, C. Kim, D. Han, and H.-J. Yoo, “A low-power convolutional neural network face recognition processor and a CIS integrated with always-on face detector,” IEEE Journal of Solid-State Circuits, vol. 53, no. 1, pp. 115–123, 2018.
[44] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “NeuFlow: A runtime reconfigurable dataflow processor for vision,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011, pp. 109–116.
[45] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s mobile coprocessor for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 682–687.
[46] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 367–379.
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[48] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[49] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[51] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, “Towards energy-proportional datacenter memory with mobile DRAM,” in Computer Architecture (ISCA), 2012 39th Annual International Symposium on. IEEE, 2012, pp. 37–48.