Detailed Record

Author (Chinese): 郭達毅
Author (English): Guo, Da-Yi
Title (Chinese): 一個針對卷積神經網路推論具資料局部性感知之平行化方法
Title (English): A Data-Locality Aware Parallelization Approach for Convolution Neural Network Inference
Advisor (Chinese): 蔡仁松
Advisor (English): Tsay, Ren-Song
Committee Members (Chinese): 吳誠文、呂仁碩
Committee Members (English): Wu, Cheng-Wen; Liu, Ren-Shuo
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 104061550
Publication Year (R.O.C.): 108 (2019)
Graduation Academic Year: 107
Language: English
Number of Pages: 32
Keywords (Chinese): 嵌入式系統、異質排程、平行運算
Keywords (English): embedded system, heterogeneous scheduler, parallel computing
Parallelization is a common approach to improving the performance of multi-core systems. However, current convolution neural network (CNN) inference frameworks mostly adopt a multi-thread approach that distributes the computation of each convolution layer across different cores. We observe that this approach incurs substantial inter-core communication cost and degrades system performance. In this thesis, we apply the concept of pipeline execution to parallelize CNN inference and reduce the communication cost. Our experimental results show that our approach achieves 73% higher throughput than the conventional multi-thread approach.
Parallelization is a common design practice for improving throughput on multicore systems. However, existing schedulers for convolution neural network inference essentially divide the computational tasks of each convolution layer across different CPU cores. This scheduling approach induces heavy inter-core data movement and degrades overall performance. In this thesis, we propose a pipeline-based scheduler that parallelizes convolution neural network inference while reducing the overall latency. Optimizing the proposed pipeline-based scheduler, however, requires carefully balancing the workload of each stage so that the total latency is minimized. The experimental results show that our approach achieves a 73% throughput improvement over the existing multi-thread scheduler.
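The thesis's actual layer-to-stage allocation algorithm (Section III.C) is not reproduced on this record page, but the balancing idea stated in the abstract can be sketched: given estimated per-layer execution times, assign consecutive layers to pipeline stages so that the slowest stage, which bounds pipeline throughput, is as fast as possible. The Python sketch below is one assumed way to do this (minimizing the bottleneck over contiguous partitions via binary search); the function names and layer timings are hypothetical and not taken from the thesis.

# Hypothetical sketch, not the thesis's code: partition consecutive CNN layers
# into pipeline stages so that the slowest stage is as fast as possible.

def fits(layer_times, num_stages, limit):
    # Greedily pack consecutive layers into stages whose total time stays
    # within `limit`; report whether `num_stages` stages are enough.
    stages_used, current = 1, 0.0
    for t in layer_times:
        if t > limit:
            return False
        if current + t > limit:
            stages_used += 1
            current = t
        else:
            current += t
    return stages_used <= num_stages

def allocate_layers(layer_times, num_stages):
    # Binary-search the smallest feasible bottleneck (maximum stage time),
    # then rebuild the stage boundaries greedily under that bound.
    lo, hi = max(layer_times), sum(layer_times)
    for _ in range(50):
        mid = (lo + hi) / 2
        if fits(layer_times, num_stages, mid):
            hi = mid
        else:
            lo = mid
    stages, current, total = [], [], 0.0
    for t in layer_times:
        if current and total + t > hi + 1e-9:
            stages.append(current)
            current, total = [], 0.0
        current.append(t)
        total += t
    stages.append(current)
    return stages

# Example with made-up per-layer execution time estimates (milliseconds).
layer_times = [4.0, 7.5, 3.2, 6.1, 2.8, 5.4, 1.9, 8.3]
for i, stage in enumerate(allocate_layers(layer_times, num_stages=4), start=1):
    print(f"stage {i}: layer times {stage}, stage time {sum(stage):.1f} ms")

In a pipeline schedule each stage runs on its own core and only passes one layer boundary's feature map to the next stage, so steady-state throughput is set by the bottleneck stage; minimizing that bottleneck is what the balancing step above illustrates.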
Abstract-------------------------------------------------3
Contents-------------------------------------------------4
List of Figures------------------------------------------5
I. Introduction-------------------------------------6
II. Related work-------------------------------------10
III. Methodology-------------------------------------12
A. Pipeline Configuration Generation----------------15
B. Execution Time Estimating------------------------18
C. Layer-to-stage Allocation Algorithm--------------20
IV. Experimental results-----------------------------25
A. Performance Comparison---------------------------25
B. Execution Time Estimation Error------------------26
C. Optimal Pipeline Configuration Prediction--------27
V. Conclusion---------------------------------------29
References-----------------------------------------------30
