Author (Chinese): 陳煜翔
Author (English): Chen, Yu-Hsiang
Thesis title (Chinese): 利用資料與負載認知改善多GPU伺服器的效能與資源使用率
Thesis title (English): A data and load aware task migration strategy for improving resource utilization and performance of multi-GPU server
Advisor (Chinese): 周志遠
Advisor (English): Chou, Jerry
Committee members (Chinese): 李哲榮; 賴冠州; 鍾武君
Committee members (English): Lee, Che-Rung; Lai, Kuan-Chou; Chung, Wu-Chun
Degree: Master's
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 110062514
Publication year: 2023 (ROC 112)
Graduation academic year: 111 (2022–2023)
Language: English
Pages: 31
Keywords (Chinese): 多GPU; 共享GPU; 排程演算法
Keywords (English): Multi-GPU; GPU Sharing; Scheduling Algorithm
Statistics:
  • Recommendations: 0
  • Views: 131
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract: In the modern computing landscape, GPUs have gained widespread popularity due to their remarkable parallel processing capabilities, which have been harnessed for general-purpose computing tasks. To achieve even higher computational performance and internal network bandwidth, multiple GPUs are often combined within a multi-GPU server. The traditional dedicated and fixed binding between applications and GPUs restricts the number of applications that can run simultaneously on a multi-GPU server. This limitation leads to inefficient resource utilization, since not all applications can fully exploit the complete resources of a GPU at all times. To address this issue, GPU sharing has emerged as a practical solution. In this work, we introduce an innovative multi-GPU sharing approach called Multi-GPU Memory Pool, which treats the memory space of the CPU and all GPUs in a multi-GPU server as a unified memory pool. During runtime, memory can be dynamically migrated to any GPU or the CPU as needed, which enhances resource utilization and minimizes makespan. To achieve these objectives, we developed a specialized scheduling algorithm that handles the challenge of memory migration overhead while ensuring load balancing across the Multi-GPU Memory Pool. Experiments show that our approach reduces makespan by up to 61% compared to an existing solution based on NVIDIA unified memory and round-robin scheduling. Server utilization also improves by around 30% with our approach.
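The abstract describes a scheduler that trades off memory-migration overhead against load balance when placing work across devices. As a rough illustration of that general idea (not the thesis's actual algorithm), the following Python sketch greedily scores each candidate device by estimated migration cost (data blocks not already resident) plus queued load; the names `Device`, `pick_device`, and `schedule`, and the per-block cost model, are all illustrative assumptions.

```python
# Toy data- and load-aware placement policy: pick the device minimizing
# (estimated migration cost + current load). Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    load: float = 0.0                            # queued work, arbitrary time units
    resident: set = field(default_factory=set)   # data blocks already on this device

def pick_device(devices, task_blocks, block_cost=1.0):
    """Choose the device minimizing migration cost plus current load.

    Migration cost is modeled as block_cost per data block that is not
    already resident on the candidate device.
    """
    def score(dev):
        missing = len(task_blocks - dev.resident)
        return missing * block_cost + dev.load
    return min(devices, key=score)

def schedule(devices, tasks):
    """Greedily place each task, given as (data_blocks, runtime)."""
    plan = []
    for blocks, runtime in tasks:
        dev = pick_device(devices, blocks)
        dev.resident |= blocks   # model the blocks migrating to the chosen device
        dev.load += runtime
        plan.append(dev.name)
    return plan

gpus = [Device("gpu0"), Device("gpu1")]
tasks = [({"a", "b"}, 2.0), ({"a"}, 1.0), ({"c"}, 2.0)]
print(schedule(gpus, tasks))  # → ['gpu0', 'gpu1', 'gpu1']
```

Note how the second task lands on `gpu1` even though its block `a` is resident on `gpu0`: the migration saving there is outweighed by `gpu0`'s queued load, which is the data-versus-load tension the thesis's scheduling strategy addresses.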
1 Introduction 1
2 Motivation 4
2.1 Multi-GPU Memory Pool . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Kernel-Level Scheduling . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Proposed Approach 10
3.1 Scheduling Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . 10
3.1.2 Kernel Launch . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Setups 18
4.1 Evaluation Workloads . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Compared Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 Experimental Results 21
5.1 Makespan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Memory Migration . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 GPU Utilization and Memory Loading . . . . . . . . . . . . . . . . 26
5.4 Scheduling Overhead . . . . . . . . . . . . . . . . . . . . . . . . . 27
6 Conclusions 29
References 30