
Detailed Record

Author (Chinese): 吳浩寧
Author (English): Wu, Hao-Ning
Title (Chinese): 優化卷積神經網路加速器中高效能逐點卷積層之資料局部性
Title (English): Optimization of Data Locality for Energy-Efficient Pointwise Convolutional Layers in CNN Inference Accelerators
Advisor (Chinese): 黃稚存
Advisor (English): Huang, Chih-Tsun
Committee members (Chinese): 劉靖家、謝明得
Committee members (English): Liou, Jing-Jia; Shieh, Ming-Der
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 105062635
Publication year (ROC calendar): 108 (2019)
Graduation academic year: 107
Language: English
Number of pages: 36
Keywords (Chinese): tensor decomposition; convolution; deep learning accelerator; data locality; energy efficiency; loop optimization
Keywords (English): CP Decomposition; Convolution; DNN Accelerator; Data Locality; Energy Efficiency; Loop Optimization
Abstract (Chinese):
Deep learning has received growing attention in recent years, and its rapid development has found use in many fields, especially computer vision applications. At the core of deep learning is the deep convolutional network: through densely connected layers of neurons, the raw features fed into the network are transformed by a series of operations into higher-level representations, thereby approximating human-level intelligence. How to design powerful deep convolutional networks has therefore become a focus of attention.
As a wide variety of deep convolutional models have been invented, attention has turned to putting deep learning into practical use. To allow models to run on resource-constrained embedded systems, many of them replace standard convolutions with depthwise separable convolutions to shrink the model. Although the model itself does become smaller, the activations passed between layers still incur a large volume of memory accesses.
Based on these observations, we propose a framework for optimizing data locality. Through a distinctive computation order, data in 1×1 convolutions can be reused efficiently. In addition, we apply layer fusion to further optimize data access: a depthwise convolution can be fused with the 1×1 convolutions before and after it, while a standard convolution can first be decomposed by tensor decomposition and then fused. Layer fusion effectively saves reads and writes of activations and lowers the memory bandwidth requirement.
The proposed methods can be applied to any application built on 1×1 convolutions (matrix multiplications) and can be adopted in various deep learning accelerators to reduce memory requirements. We verify the strengths and weaknesses of these methods with a memory simulator and distill a set of efficient matrix tiling schemes. With a 64K-entry on-chip buffer, we save 20% and 67% of the energy consumption on the decomposed ResNet-50 and on MobileNet V2, respectively.
Abstract (English):
Deep learning has drawn significant attention recently due to its rapid advances in a myriad of intelligent reasoning applications, especially those related to computer vision. The core component that allows deep learning techniques to capture the information (i.e., features) embedded in the input data (e.g., images) is the deep convolutional neural network (CNN), in which a vast number of neurons manipulate the features they observe layer by layer. At the end of the network, a higher-level representation of the original input is generated. The design of powerful networks has therefore always been an active area of research.
Many complex CNN models have been developed to achieve high accuracy on image classification tasks. More recently, as the demand for real-time intelligent applications keeps growing, highly efficient CNN models have become an urgent need on edge embedded devices with limited resources. One essential trend in building compact neural networks is to adopt depthwise separable convolutions as substitutes for standard convolutions, which remarkably reduces the number of filter weights. However, even though the number of multiply-and-accumulate (MAC) operations is reduced by depthwise separable convolutions, the amount of intermediate activations stays the same, which creates a strong demand for memory bandwidth in addition to computation capability.
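The weight and MAC arithmetic behind this observation can be illustrated with a minimal Python sketch; the helper functions and the layer shape below are hypothetical examples for illustration, not figures from the thesis.

```python
# Back-of-the-envelope comparison of a standard KxK convolution and its
# depthwise separable counterpart. The layer shape is an arbitrary example.

def standard_conv_costs(h, w, c_in, c_out, k):
    weights = k * k * c_in * c_out
    macs = h * w * k * k * c_in * c_out
    return weights, macs

def depthwise_separable_costs(h, w, c_in, c_out, k):
    dw_weights = k * k * c_in          # one KxK depthwise filter per input channel
    pw_weights = c_in * c_out          # 1x1 pointwise filters
    dw_macs = h * w * k * k * c_in
    pw_macs = h * w * c_in * c_out
    return dw_weights + pw_weights, dw_macs + pw_macs

h, w, c_in, c_out, k = 56, 56, 128, 128, 3
std_w, std_m = standard_conv_costs(h, w, c_in, c_out, k)
sep_w, sep_m = depthwise_separable_costs(h, w, c_in, c_out, k)
activations = h * w * c_in             # intermediate activation volume is unchanged
print(f"weights {std_w} -> {sep_w}, MACs {std_m} -> {sep_m}, activations {activations}")
```

With this example shape the weights and MACs drop by roughly a factor of k*k (about 8x here), while the activation volume that must travel to and from memory is identical, which is exactly the bandwidth problem discussed above.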
Based on these observations, we propose an optimization framework that aims at maximizing data reuse in 1×1 (pointwise) convolutions. Previously unidentified data reuse is explored through an innovative Scan execution order. Furthermore, we enlarge the exploration space across two or more layers through layer fusion: a depthwise convolution can be fused with its nearest 1×1 convolutions, while a standard convolution is first decomposed into depthwise and 1×1 convolutions before being fused. Layer fusion eliminates the need to store intermediate activations back to DRAM. Combining the Scan order with layer fusion saves a remarkable amount of DRAM access.
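How a tile traversal order exposes extra reuse can be sketched in a few lines of Python. This is a hedged illustration only: the serpentine traversal, the single-tile buffer model, and the tile-grid dimensions below are simplifications standing in for the Scan order and buffer management described in the thesis.

```python
# Count operand-tile fetches from DRAM for a 1x1 convolution viewed as a
# matrix multiplication  O[Co, HW] = W[Co, Ci] @ X[Ci, HW], assuming only
# the most recently used W tile and X tile remain in the on-chip buffer.

def count_tile_loads(order):
    loads, last_w, last_x = 0, None, None
    for wt, xt in order:
        if wt != last_w:
            loads += 1                 # fetch a new tile of W
        if xt != last_x:
            loads += 1                 # fetch a new tile of X
        last_w, last_x = wt, xt
    return loads

W_TILES, X_TILES = 8, 8                # arbitrary tile-grid dimensions

row_major = [(w, x) for w in range(W_TILES) for x in range(X_TILES)]
serpentine = [(w, x if w % 2 == 0 else X_TILES - 1 - x)
              for w in range(W_TILES) for x in range(X_TILES)]

print("row-major tile loads :", count_tile_loads(row_major))   # 72
print("serpentine tile loads:", count_tile_loads(serpentine))  # 65
```

Reversing the traversal direction on every other row lets the last-used X tile be reused at the moment the W tile changes, which is the kind of previously unexploited reuse a scan-style execution order targets.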
Our methods can be used to optimize any application built on 1×1 convolutions (or matrix multiplications). We give a thorough elaboration of the pros and cons of our framework and verify the results against a DRAM simulator. The fruits of our work are several energy-efficient tiling patterns that can be integrated into CNN accelerators. With a 64K-entry on-chip buffer, fusion with the Scan execution order achieves DRAM energy savings of 20% on the decomposed ResNet-50 and 67% on MobileNet V2.
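As a rough picture of what searching for energy-efficient tiling patterns under a buffer budget looks like, the sketch below enumerates tile sizes for a generic tiled matrix multiplication and ranks them by a textbook DRAM-traffic estimate. The cost model, the candidate tile sizes, and the example layer shape are assumptions for illustration; they are not the cost model evaluated with DRAMSim2 in the thesis.

```python
import math
from itertools import product

# DRAM traffic (in elements) for O[M, N] = W[M, K] @ X[K, N] with an
# output-stationary tiling: X is re-read once per row of output tiles,
# W once per column of output tiles, and O is written once.
def dram_traffic(M, N, K, tm, tn):
    return math.ceil(M / tm) * K * N + math.ceil(N / tn) * M * K + M * N

def best_tiling(M, N, K, buffer_entries, candidates=(16, 32, 64, 128, 256)):
    best = None
    for tm, tn, tk in product(candidates, repeat=3):
        # tk only constrains buffer occupancy in this simplified model
        if tm * tk + tk * tn + tm * tn > buffer_entries:
            continue                   # all three tiles must fit on chip
        cost = dram_traffic(M, N, K, tm, tn)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best

# Hypothetical 1x1 convolution: 256 output channels (M), 256 input channels (K),
# a 56x56 feature map flattened to N = 3136, and a 64K-entry on-chip buffer.
print(best_tiling(256, 56 * 56, 256, buffer_entries=64 * 1024))
```

Sweeping the same kind of search over different buffer capacities mirrors, in spirit, the comparison on various buffer sizes in Section 5.3.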
1 Introduction 1
1.1 Motivation 1
1.2 Contribution 2
1.3 Thesis Organization 3
2 Preliminaries 4
2.1 Depthwise Separable Convolution 4
2.2 CP Decomposition 5
2.3 DRAMSim2 6
3 Scheduling for Matrix Multiplication 7
3.1 Loop Tiling 7
3.1.1 General Matrix Multiplication 7
3.1.2 Tiled Matrix Multiplication 8
3.1.3 Cost and Buffer Size Constraint 9
3.2 Proposed Tiling Scheme with Scan Order 10
3.2.1 Traveling Salesman Problem Model 10
3.2.2 Maximizing the Reusability with Scan Order 11
3.2.3 Effectiveness of Scan Order 12
3.2.4 Consideration of Fragmented Tiles 13
3.2.5 Cost and Buffer Size Constraint 14
4 Fused Depthwise Separable Convolution 15
4.1 Pointwise Layer Fusion 15
4.2 Layer Fusion with Intermediate Operations 17
5 Experimental Evaluation 19
5.1 Optimization of a Single Layer 21
5.2 Optimization of Layer Fusion 23
5.3 Comparison on Various Buffer Sizes 30
6 Conclusion and Future Work 32
6.1 Conclusion 32
6.2 Future Work 32
[1] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2014, pp. 609–622.
[2] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan 2017.
[3] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA ’17. New York, NY, USA: ACM, 2017, pp. 1–12.
[4] A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” ArXiv e-prints, Apr. 2017.
[5] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” ArXiv e-prints, Jan. 2018.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” ArXiv e-prints, Feb. 2016.
[7] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, Jan. 2011.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
[9] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 1800–1807.
[10] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Rev., vol. 51, no. 3, pp. 455–500, Aug. 2009.
[11] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition,” ArXiv e-prints, Dec. 2014.
[12] M. Astrid and S.-I. Lee, “CP-decomposition with tensor power method for convolutional neural networks compression,” in 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), Feb 2017, pp. 115–118.
[13] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’15. New York, NY, USA: ACM, 2015, pp. 161–170.
[14] M. Peemen, B. Mesman, and H. Corporaal, “Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators,” in 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2015, pp. 169–174.
[15] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2016, pp. 1–12.
[16] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
[17] C. Szegedy et al., “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI Conference on Artificial Intelligence, 2017.