Detailed Record

Author (Chinese): 黃明瀧
Author (English): Huang, Ming-Long
Title (Chinese): 利用 Relax IR 與 NNAPI 整合加速 TVM 之大型語言模型推理
Title (English): Accelerating TVM LLM Inference with Relax IR-NNAPI Integration
Advisor (Chinese): 李政崑
Advisor (English): Lee, Jenq-Kuen
Committee Members (Chinese): 洪明郁, 張元銘
Committee Members (English): Hung, Ming-Yu; Chang, Yuan-Ming
Degree: Master's
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 111062612
Publication Year: 2024 (ROC year 113)
Graduation Academic Year (ROC): 112
Language: English
Number of Pages: 34
Keywords (Chinese): TVM, 深度學習編譯器, 大型語言模型
Keywords (English): TVM, AI Compiler, Large Language Models
Abstract (Chinese):
Due to the recent rapid development of large language models (LLMs), demand for running inference of such models on edge computing devices has grown. In the current MLC LLM framework, built on the TVM deep learning compiler, large language models can already run on the Android platform via OpenCL; this approach, however, makes it difficult to exploit accelerator hardware beyond the GPU.

To run such models efficiently on Android, we integrate NNAPI with Relax IR from TVM Unity to perform deep learning inference, combining the rich expressiveness of TVM with the hardware advantages offered by NNAPI. In this work, we adopt TVM's BYOC flow and invoke NNAPI functions at the computation-graph level through the NNAPI runtime system we implemented. We further add dynamic-shape tensor support to the NNAPI runtime, enabling our system to handle large language models from the MLC LLM framework. Experiments show that our Relax-NNAPI LLM integration achieves a 3.5x to 12.4x speedup over the CPU backend in the prefill phase, while remaining effective for traditional computer vision models. We also examine the per-stage execution cost of single operations and analyze why the prefill and decode phases benefit differently from acceleration.
Abstract (English):
Recent advances in large language models (LLMs) have sparked significant interest in performing their inference on edge devices for privacy and connectivity reasons. While the MLC LLM framework has been successful in delivering LLMs on the Android platform via OpenCL, such an implementation may not be able to take advantage of non-GPU accelerators from hardware vendors. To enable efficient execution of such models across Android devices in general, this work integrates the Android Neural Networks API (NNAPI) with Relax IR from TVM Unity, enabling efficient hybrid inference of deep learning models that takes advantage of both the expressiveness of TVM and the performance of hardware-optimized operations via NNAPI. We embrace TVM's Bring Your Own Codegen (BYOC) framework to perform graph-level operator offloading through custom runtime modules. In addition, we extend the custom runtime to support dynamic-shape tensors on top of NNAPI, which makes our system capable of handling LLMs from the MLC LLM framework as well as other deep learning models that require dynamic shapes in general. Experiments show up to a 12.4x speedup for prefilling large language models with our Relax-NNAPI integration compared to the CPU backend, and additional experiments show that our work remains performant for traditional computer vision models. We also identify the cause of the speedup gap between the prefill and decode phases through a detailed performance breakdown of single operations.
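The graph-level offloading step described above can be pictured with TVM Unity's public partitioning passes. The following is a minimal sketch assuming a hypothetical pattern table and an external codegen registered under the name "nnapi"; the helper partition_for_nnapi and the two patterns are illustrative, not the thesis's actual code.

```python
# Minimal sketch of a Relax BYOC partitioning flow for a hypothetical
# "nnapi" external codegen, using TVM Unity's public passes.
import tvm
from tvm import relax
from tvm.relax.dpl import is_op, wildcard

# Patterns naming operators an NNAPI codegen could take over. The
# "nnapi." prefix routes matched subgraphs to that codegen.
patterns = [
    ("nnapi.matmul", is_op("relax.matmul")(wildcard(), wildcard())),
    ("nnapi.relu", is_op("relax.nn.relu")(wildcard())),
]

def partition_for_nnapi(mod: tvm.IRModule) -> tvm.IRModule:
    """Group supported subgraphs and hand them to the external codegen;
    everything unmatched stays on TVM's own compilation pipeline."""
    seq = tvm.transform.Sequential([
        relax.transform.FuseOpsByPattern(patterns),  # mark supported subgraphs
        relax.transform.MergeCompositeFunctions(),   # one region per backend
        relax.transform.RunCodegen(),                # emit external runtime modules
    ])
    # RunCodegen assumes a codegen named "nnapi" has been registered.
    return seq(mod)
```

A module partitioned this way would be passed to relax.build as usual, with the offloaded regions carried as custom runtime modules.

On the dynamic-shape side, NNAPI models are typically compiled for concrete operand shapes. One plausible way to reconcile this with Relax's dynamic shapes, in the spirit of the "Runtime Dynamic Shape Support" and "NNAPI Compilation Caching" sections listed in the table of contents below, is to specialize and cache one compiled model per concrete shape. The class and callback below are hypothetical illustrations, not the thesis's implementation.

```python
# Hypothetical illustration of shape-specialized compilation caching
# for dynamic-shape inputs on top of a static-shape NNAPI model API.
from typing import Callable, Dict, Tuple

Shape = Tuple[int, ...]

class NnapiShapeCache:
    def __init__(self, build_fn: Callable[[Shape], object]):
        # build_fn stands in for "construct and compile an NNAPI model
        # specialized to one concrete input shape".
        self._build_fn = build_fn
        self._cache: Dict[Shape, object] = {}

    def get(self, shape: Shape) -> object:
        """Return a compiled model for `shape`, compiling on first use."""
        model = self._cache.get(shape)
        if model is None:
            model = self._build_fn(shape)
            self._cache[shape] = model
        return model
```

Under such a scheme, repeated calls that share a shape reuse a cached compilation, while each newly seen shape pays the NNAPI compilation cost once.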
Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iv
1 Introduction 1
2 Background 5
2.1 NNAPI 5
2.2 TVM Relax and MLC LLM 6
3 TVM Relax NNAPI Integration 9
3.1 JSON Codegen for NNAPI 9
3.2 TVM Runtime for NNAPI 12
3.2.1 Overview 12
3.2.2 Runtime Dynamic Shape Support 13
3.2.3 Input-Based Device Selection 15
3.2.4 NNAPI Compilation Caching 16
3.3 Integration with MLC LLM 17
4 Experimental Results 19
4.1 Environment 19
4.2 Evaluation with Large Language Models 20
4.3 Evaluation with Computer Vision Models 22
4.4 Performance Characteristics with Respect to the Input Size 24
5 Conclusion 28
5.1 Conclusion 29
5.2 Future Work 30
Bibliography 32
(Full text available to external users after 2026-08-01)