Detailed Record

Author (Chinese): 王領崧
Author (English): Wang, Ling-Sung
Title (Chinese): Libra: Optimizing GPU memory usage in deep learning through Python-level tensor re-materialization
Title (English): Libra: Python-level tensor re-materialization for GPU memory optimization in deep learning
Advisor (Chinese): 周志遠
Advisor (English): Chou, Jerry
Committee members (Chinese): 賴冠州
李哲榮
Committee members (English): Lee, Che-Rung
Degree: Master
University: National Tsing Hua University
Department: Computer Science
Student ID: 111062584
Year of publication (ROC calendar): 113 (2024)
Graduating academic year: 112
Language: English
Number of pages: 40
Keywords (Chinese): tensor re-materialization, swapping, recomputation, deep learning, GPU memory optimization, Python
Keywords (English): Tensor Re-materialization, Swapping, Recomputation, Deep Learning, GPU Memory Optimization, Python
Statistics:
  • Recommendations: 0
  • Views: 0
  • Rating: *****
  • Downloads: 4
  • Bookmarks: 0
Abstract (Chinese):
To cope with increasingly complex and challenging problems, deep neural networks are becoming deeper and wider, causing model sizes to grow exponentially. Meanwhile, the memory on hardware devices grows only linearly, so limited memory resources have become a major constraint on the development of advanced neural networks.
Tensor re-materialization is a single-GPU memory optimization technique that aims to reduce the memory used by intermediate tensors. It evicts tensors after their forward computation and regenerates them before their backward use, relieving memory pressure on the GPU.
Swapping and recomputation are the two main techniques in tensor re-materialization. Swapping moves tensors back and forth to external memory, while recomputation replays the forward computation to regenerate them. However, prior work has not effectively addressed the significant training-time overhead introduced by swapping, or the chain structures in recomputation that can degrade memory-reduction performance. Moreover, with heterogeneous computing and deep neural networks becoming wider and deeper, generating plans efficiently across various memory budgets is also a key challenge.
We propose Libra, a Python-level deep learning framework that reduces the peak memory usage of deep learning training through a hybrid swapping/recomputation strategy, using a greedy refinement approach to generate memory-optimized re-materialization plans. To address the limitations of swapping and recomputation, Libra adopts a hybrid policy: because recomputation has a low regeneration cost, it is used as the initial policy, and swapping is used as a trade-off that adds training-time overhead but resolves recomputation chains that would otherwise create a new memory peak.
Evaluation on a variety of deep learning models shows that Libra delivers excellent memory-reduction performance, increasing the maximum batch size by 3.12x compared with PyTorch and by 1.42x and 2.57x on average compared with other state-of-the-art works.
Abstract (English):
To tackle more complex and challenging problems, Deep Neural Networks (DNNs) are becoming larger and deeper, leading to exponential growth in model sizes. In contrast, on-device memory grows only linearly, making limited memory resources a bottleneck that restricts the development of advanced DNN architectures.
Tensor re-materialization is a single GPU memory optimization technique that focuses on reducing the memory usage of intermediate tensors. It evicts tensors after their forward calculation and re-generates them before their backward use to reduce memory pressure on the GPU.
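As a concrete (non-Libra) illustration, the recomputation form of re-materialization corresponds to activation checkpointing, which PyTorch exposes through torch.utils.checkpoint; the minimal sketch below assumes a recent PyTorch version and is shown only to illustrate the evict-then-regenerate idea:

```python
# Minimal PyTorch sketch of re-materialization via recomputation
# (standard activation checkpointing, not Libra's own planner).
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

x = torch.randn(64, 1024, requires_grad=True)
# Forward: the intermediate activations inside `block` are evicted (not stored).
y = checkpoint(block, x, use_reentrant=False)
# Backward: the forward pass of `block` is replayed to regenerate them on demand.
y.sum().backward()
```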
Swapping and recomputation are the two major techniques in tensor re-materialization. Swapping utilizes external memory to move tensors back and forth, while recomputation replays the forward calculation to re-generate them. However, prior work has not effectively addressed the significant overhead associated with swapping, or the decline in memory-reduction performance caused by chain structures in recomputation, in their re-materialization plans. Moreover, with heterogeneous computing and DNNs becoming larger and deeper, generating the plan efficiently under various memory budgets is also a crucial challenge.
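The swapping form, by contrast, offloads a tensor to host memory and brings it back before its backward use. A minimal sketch of the idea follows; it only illustrates the mechanism, whereas Libra schedules such transfers automatically and overlaps them with computation:

```python
# Minimal sketch of re-materialization via swapping: offload an activation to
# CPU memory after the forward pass and copy it back before its backward use.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
activation = torch.randn(64, 1024, device=device)

# Evict: copy to the host and drop the GPU reference so its memory can be reused.
# (A fully overlapped transfer would also require a pinned host buffer.)
host_copy = activation.to("cpu", non_blocking=True)
del activation

# ... other forward/backward work runs here, using the freed GPU memory ...

# Re-materialize: prefetch the tensor back to the GPU before it is needed.
activation = host_copy.to(device, non_blocking=True)
```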
In this paper, we propose Libra, a Python-level deep learning framework that uses a greedy refinement approach to generate a memory-optimized re-materialization plan combining swapping and recomputation, reducing the peak memory usage of DNN training. To address the limitations of swapping and recomputation, Libra employs a hybrid policy: recomputation is the initial policy because of its low re-generation time, and swapping is used as a trade-off that adds training-time overhead but resolves recomputation chains that would otherwise create a new memory peak.
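The hybrid policy can be pictured as a greedy loop over candidate tensors. The following is a hypothetical sketch of that idea only; the names, cost model, and data structures are illustrative assumptions and are not Libra's actual implementation:

```python
# Hypothetical sketch of a greedy hybrid refinement (illustrative only).
# Start with recomputation for every candidate tensor, then greedily switch
# tensors to swapping when their recomputation chain would create a new peak,
# until the plan fits the memory budget.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    chain_overhead_mb: float  # extra memory its recomputation chain would add
    swap_cost_ms: float       # training-time cost of swapping it instead

def refine_plan(candidates, budget_mb, peak_mb):
    plan = {c.name: "recompute" for c in candidates}  # cheap initial policy
    # Prefer switching the tensors whose chains hurt the peak the most and
    # whose swap cost is the lowest.
    order = sorted(candidates, key=lambda c: (-c.chain_overhead_mb, c.swap_cost_ms))
    for c in order:
        if peak_mb <= budget_mb:
            break
        plan[c.name] = "swap"
        peak_mb -= c.chain_overhead_mb
    return plan, peak_mb
```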
Evaluations on various DNN models show that Libra achieves exceptional memory-reduction performance, increasing the maximum batch size by 3.12x compared to PyTorch, and by 1.42x and 2.57x on average compared to other state-of-the-art baselines.
Table of Contents
1 Introduction
2 Motivation
  2.0.1 Opportunity: Repetitive Tensor Access Pattern
  2.0.2 Challenge: Limitation of Swapping and Recomputation in Tensor Re-materialization
  2.0.3 Challenge: Dilemma of hybrid policy
  2.0.4 Challenge: Generate the plan for various memory budgets efficiently
3 Methodology
  3.0.1 Design Overview
  3.0.2 Peak memory profiling
  3.0.3 Memory reduction estimation
  3.0.4 The usage of swapping and recomputation
  3.0.5 Greedy refinement algorithm
4 Implementation
  4.0.1 System Architecture
  4.0.2 Components
  4.0.3 Applicable tool
5 Evaluation
  5.0.1 Methodology
  5.0.2 Peak Memory Usage Reduction Effectiveness
  5.0.3 Throughput performance
  5.0.4 Configuration Adaptability
  5.0.5 Refinement optimization
  5.0.6 Offline Solving Efficiency
6 Related works
  6.0.1 Graph Level Techniques
  6.0.2 GPU Memory Management
  6.0.3 Model Parallelism and Pipeline Parallelism
7 Conclusion
References