
Detailed Record

Author (Chinese): 楊峻傑
Author (English): Yang, Chun-Chieh
Title (Chinese): 在RISC-V Packed SIMD Extension上以Fixed-Point Type對TVM的有效支援
Title (English): Efficient Support of Fixed-Point Type for TVM on RISC-V Packed SIMD Extension
Advisor (Chinese): 李政崑
Advisor (English): Lee, Jenq-Kuen
Committee members (Chinese): 賴尚宏, 邱瀞德, 楊武, 游逸平, 陳鵬升, 黃元欣, 洪明郁
Committee members (English): Lai, Shang-Hong; Chiu, Ching-Te; Yang, Wuu; You, Yi-Ping; Chen, Peng-Sheng; Hwang, Yuan-Shin; Hung, Ming-Yu
Degree: Doctoral
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 103062804
Year of publication (ROC calendar): 112 (2023)
Academic year of graduation: 111
Language: English
Number of pages: 71
Keywords (Chinese): Compiler, Open Computing Language, Low-Level Virtual Machine, Tensor Virtual Machine, Reduced Instruction Set Architecture, Single Instruction Multiple Data, Fixed-Point Arithmetic
Keywords (English): Compiler, OpenCL, LLVM, TVM, RISC-V, SIMD, Fixed-point
In recent years, the acceleration of artificial intelligence (AI) model execution on special-purpose processors such as CPUs and GPUs has made deep learning (DL) increasingly common in daily life.
As the technology develops, a variety of DL applications are being deployed on edge devices, such as embedded computers, mobile devices, and even smart TVs, that are inseparable from people's daily lives.
However, because DL involves heavy computation, providing DL solutions on edge devices remains a challenging problem.

This dissertation proposes a method that enables a compilation flow with the proposed fixed-point type, and that also enables the RISC-V Packed extension (P extension) in TVM.
In the proposed flow, a fixed-point type backed by a 16-bit integer is introduced to replace the original 32-bit float type, and saturation instructions are added for the fixed-point type. TVM, an open-source machine learning compiler framework, is gradually gaining attention. It provides the key infrastructure for neural network models, as well as an end-to-end compilation software architecture from front-end frameworks to back-end hardware. Moreover, TVM supports deep learning compilation not only for CPUs and GPUs but also for FPGAs, RISC-V, and other targets. The customizability and flexibility of RISC-V have attracted worldwide attention. RISC-V is an open-source instruction set architecture based on reduced instruction set computer (RISC) principles, and it is also a federation of ISA extensions. The RISC-V Packed-SIMD extension (P extension) is one of these extensions; it supports subword single instruction multiple data (SIMD) computation in the RISC-V architecture.
In the proposed flow, the fallback engines in DL computations are supported by the P extension.
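To make the arithmetic concrete, the following is a minimal scalar sketch in C of a 16-bit fixed-point type with saturation, assuming a Q8.8 layout (8 integer bits, 8 fraction bits). The helper names (fx_saturate, fx_add, fx_mul) are illustrative and not taken from the dissertation, whose TVM type and P-extension code generation are considerably more elaborate.

    #include <stdint.h>

    /* Illustrative Q8.8 fixed-point helpers: a real value x is stored as
     * round(x * 2^8) in an int16_t, and results saturate instead of
     * wrapping around, mimicking the saturation instructions above. */
    #define FX_FRAC 8

    static int16_t fx_saturate(int32_t v) {
        if (v > INT16_MAX) return INT16_MAX;   /* clamp on overflow */
        if (v < INT16_MIN) return INT16_MIN;   /* clamp on underflow */
        return (int16_t)v;
    }

    static int16_t fx_add(int16_t a, int16_t b) {
        return fx_saturate((int32_t)a + (int32_t)b);
    }

    static int16_t fx_mul(int16_t a, int16_t b) {
        /* Widen to 32 bits, multiply, then shift the binary point back. */
        return fx_saturate(((int32_t)a * (int32_t)b) >> FX_FRAC);
    }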

In addition to the method proposed above, another flow, based on the author's previous work, is provided. This flow compiles convolutional neural network (CNN) models via TVM, using LLVM to implement the features of OpenCL (Open Computing Language) 2.0, and is implemented for the GPU architecture. It mainly provides an alternative way to accelerate with OpenCL when the edge device has a GPU, and it can also use the proposed fixed-point type. The compiler flow is implemented by extending LLVM with the features needed for OpenCL 2.0: Clang serves as the frontend to compile OpenCL code to LLVM bitcode, and LLVM llc serves as the backend to target the GPU architecture.
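For concreteness, one class of OpenCL 2.0 features such a flow must lower is the work-group built-in functions (Section 5.2.2 in the table of contents below). The short OpenCL C kernel here uses the standard OpenCL 2.0 built-in work_group_reduce_add; the kernel itself is only an illustration and is not one of the dissertation's benchmarks.

    /* OpenCL 2.0 kernel: each work-group reduces its elements of `in`
     * with the 2.0 work-group built-in, and one work-item per group
     * writes out the partial sum. */
    __kernel void partial_sums(__global const float *in,
                               __global float *out) {
        float v = in[get_global_id(0)];
        float sum = work_group_reduce_add(v);  /* OpenCL 2.0 built-in */
        if (get_local_id(0) == 0)
            out[get_group_id(0)] = sum;
    }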

Finally, an auto-tuning method using a uniform selector mechanism (USM) is also proposed. The mechanism finds the binary point position used by the proposed fixed-point type. The tensorization feature of TVM can then be used to optimize for specific hardware, such as subword SIMD instructions from the RISC-V P extension. In our experiments on the Spike simulator, the proposed method with the USM improves performance by approximately 2.54 to 6.15 times in terms of instruction counts, with little accuracy loss.
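As a toy illustration of the idea only (the dissertation's USM algorithm, detailed in Chapter 6, differs), one uniform binary point for a whole tensor can be chosen by trying each candidate fraction width and keeping the one with the least quantization error:

    #include <math.h>
    #include <stdint.h>

    /* Toy uniform binary-point search: for each candidate fraction width f,
     * quantize every element to a saturated 16-bit value and accumulate the
     * squared error; the whole tensor then shares the best f. */
    static int pick_binary_point(const float *x, int n) {
        int best_f = 0;
        double best_err = INFINITY;
        for (int f = 0; f < 16; ++f) {
            double scale = (double)(1 << f);
            double err = 0.0;
            for (int i = 0; i < n; ++i) {
                double q = nearbyint(x[i] * scale);
                if (q > INT16_MAX) q = INT16_MAX;  /* saturate */
                if (q < INT16_MIN) q = INT16_MIN;
                double d = x[i] - q / scale;
                err += d * d;
            }
            if (err < best_err) { best_err = err; best_f = f; }
        }
        return best_f;  /* fraction bits shared by the whole tensor */
    }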
Abstract (Chinese)
Abstract
Acknowledgements
1 Introduction
1.1 The Proposed Flow and the Optimization
1.2 Publications
1.3 Synopsis
2 Background
2.1 TVM
2.2 LLVM
2.3 OpenCL
2.4 RISC-V
3 The Proposed Fixed-point Type
3.1 The Proposed Fixed-point Type
3.1.1 Equivalent Value
3.1.2 Fixed-Point Width
3.2 Advanced Features
3.2.1 Fixed-Point Conversion
3.2.2 Reinterpreting Data Library for the Fixed-Point
3.2.3 The Proposed Fixed-point Math Functions
3.3 SPIR-V Reference Design for the Fixed-point
4 The Flow that Enables a RISC-V Packed-SIMD Extension and the Proposed Fixed-point Type in TVM
4.1 The RISC-V Packed-SIMD Extension
4.2 Enabling the Fixed-Point Type on TVM
4.3 Realization of the RISC-V Packed-SIMD Extension on TVM and LLVM
4.4 Deep Learning Runtime
5 Support OpenCL 2.0 Compile Flow for PTX
5.1 The Proposed OpenCL 2.0 Compile Flow
5.2 The Implementation of OpenCL 2.0 Features
5.2.1 Platform Atomic
5.2.2 Workgroup Built-in Function and Program Scope Variable
5.2.3 Device-side Enqueue
5.3 Experimental Results
5.3.1 The Execution Time of the Benchmarks
5.3.2 The Different Setup between the Benchmarks
6 Uniform Selector Mechanism with the Fixed-Point Type
6.1 The Uniform Selector Mechanism
6.2 The Algorithm with Uniform Selector Mechanism
6.3 Running Example
7 Experimental Results
7.1 Experimental Results
7.2 Related Work and Discussions
8 Conclusions
Bibliography
VITA