Detailed Record

Author (Chinese): 吳宗憲
Author (English): Wu, Tsung-Hsien
Title (Chinese): 在TVM中支持異質平台上的彈性平行計算
Title (English): Supporting Flexible Parallel Computing on Heterogeneous Platform in TVM
Advisor (Chinese): 金仲達
Advisor (English): King, Chung-Ta
Committee Members (Chinese): 黃稚存、董明智
Committee Members (English): Huang, Chih-Tsun; Tung, Ming-Chih
Degree: Master
University: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Student ID: 108062547
Year of Publication (ROC calendar): 111 (2022)
Academic Year of Graduation: 110
Language: English
Number of Pages: 39
Keywords (Chinese): 編譯器、深度學習、異質平台、平行計算
Keywords (English): Compiler; Deep learning; Heterogeneous platform; Parallel computing
TVM is a compiler framework that supports many machine learning frameworks and different types of processors/accelerators. It can optimize deep neural networks and generate high-performance code to run on various kinds of hardware devices. Although TVM supports computation on heterogeneous platforms, it can currently only dispatch code to heterogeneous processors for sequential execution. Moreover, TVM's schedule is static and cannot adapt dynamically to the load on each processor. In this thesis, we modify TVM to support dynamic parallel heterogeneous computing. The generated code can dispatch computations to multiple heterogeneous processors for simultaneous execution, and the allocation of computations to processors can be adjusted flexibly at run time. We compile GoogleNet with the modified TVM and run it on a PC and an embedded system under different scheduling and allocation strategies, reporting and comparing their performance to demonstrate the parallel execution and flexible scheduling capabilities of the modified TVM.
TVM is a compiler framework that supports many machine learning frameworks and hardware backends. It can optimize deep neural networks and generate efficient code to execute on different kinds of backend devices. Although TVM supports computation on heterogeneous platforms, so far it can only schedule the code to execute on the heterogeneous devices serially. Furthermore, the schedule is static and thus cannot adapt to the dynamic load on the devices. In this work, we modify TVM to support dynamic parallel heterogeneous computing, in which computations can be scheduled to execute simultaneously on multiple heterogeneous devices, and the allocation of computations to backend devices can be done flexibly at run time. We demonstrate the parallel execution and flexible scheduling capabilities of the modified TVM by compiling GoogleNet to run on a PC and an embedded system and comparing their performance under different scheduling strategies.
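As background for the abstract above, the following is a minimal sketch of the stock TVM compilation flow that this work modifies, written against TVM's standard Python API (relay.frontend.from_onnx, relay.build, graph_executor). The ONNX file name, the input tensor name "data", and the input shape are placeholders, not taken from the thesis; the sketch compiles for a single target so that all operators run sequentially on one device, i.e. the baseline behavior the thesis extends, not the modified parallel scheduler.

    import numpy as np
    import onnx
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # Import a GoogleNet model exported to ONNX (file name and input shape are placeholders).
    onnx_model = onnx.load("googlenet.onnx")
    mod, params = relay.frontend.from_onnx(onnx_model, shape={"data": (1, 3, 224, 224)})

    # Stock TVM compiles the whole graph for one target; every operator is then
    # executed one after another on that single device.
    target = tvm.target.Target("llvm")
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    # Run the compiled module on the CPU with a random input.
    dev = tvm.cpu(0)
    module = graph_executor.GraphModule(lib["default"](dev))
    module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
    module.run()
    output = module.get_output(0).numpy()

In this baseline, replacing the single target with another backend changes where the model runs, but the operators still execute one at a time; the thesis's modifications instead distribute independent operators across multiple devices and choose the assignment at run time.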
Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Table of Contents iv
Chapter 1 Introduction 1
Chapter 2 Related Works and Background 4
Chapter 3 System Design 8
Chapter 4 Method 12
4.1 Find Parallel Operators 12
4.2 Scheduling Methods 14
4.2.1 Schedule by operator 14
4.2.2 Schedule by path 16
Chapter 5 Experiment 18
5.1 Experiment Setup 18
5.2 Evaluation 19
5.2.1 Results on PC 19
5.2.2 Results on embedded system 21
Chapter 6 Conclusions 26
References 27
Appendix A 31
A.1 Details of the schedule results 31
A.2 Execution time of each block for our methods on the embedded system 32
A.3 Information of each operator in GoogleNet 36