Author: Hsieh, Tsung-Tso
Thesis Title: Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Cluster
Advisor: Lee, Che-Rung
Oral Defense Committee: Chou, Jerry; Chung, I-Hsin
Keywords: Deep Learning Platform; Elastic Training; Cluster Scheduling; Distributed Computing; Kubernetes
Elastic deep learning, which dynamically adjusts the resource allocation for a group of training jobs, can effectively improve the utilization of accelerators, which are essential for training today's large-scale deep learning models. Although many scheduling algorithms for elastic training have been proposed, they lack an easy-to-use yet efficient platform on which to run. In this thesis, we present Voda, a GPU scheduling platform for elastic training. Unlike previous work that uses a parameter server for elastic training, Voda is designed for AllReduce communication, which is more effective but also more complicated to adjust. Voda, built on top of Kubernetes, consists of a set of loosely coupled components that collect run-time information, dynamically alter the resource allocation, and decide job placement to optimize job execution based on the communication cost among the underlying GPUs. We compared four elastic scheduling algorithms on Voda, three existing methods and one newly proposed, under different workloads, job distributions, and arrival times. Experimental results show that no single algorithm dominates on all performance metrics, such as makespan, average job completion time, and average running time. However, some algorithms do work better than others for particular workloads and job distributions. The experiments also show that job placement is critical to performance on GPU clusters, and that the proposed job placement algorithm can effectively optimize the communication cost among the workers of a job.
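To make two of the metrics named in the abstract concrete, the following is a minimal sketch using their conventional definitions: makespan as the time from the first job arrival to the last job completion, and average job completion time (JCT) as the mean of finish time minus arrival time. The `JobRecord` type and its field names are hypothetical and are not taken from Voda, and the thesis may define these metrics slightly differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """Hypothetical per-job timing record (not Voda's actual data model)."""
    arrival: float  # time the job was submitted, in seconds
    finish: float   # time the job completed, in seconds


def makespan(jobs):
    """Time from the earliest arrival to the latest completion."""
    return max(j.finish for j in jobs) - min(j.arrival for j in jobs)


def average_jct(jobs):
    """Mean job completion time: finish minus arrival, averaged over jobs."""
    return mean(j.finish - j.arrival for j in jobs)


if __name__ == "__main__":
    trace = [JobRecord(arrival=0, finish=10), JobRecord(arrival=2, finish=20)]
    print("makespan:", makespan(trace))
    print("average JCT:", average_jct(trace))
```

A scheduler can lower one of these metrics while raising the other, which is consistent with the abstract's finding that no algorithm wins on every metric.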
Chinese Abstract
Abstract
List of Figures
List of Tables
Listings
1 Introduction
2 Background and Motivation
2.1 Elastic Deep Learning
2.2 Scheduling Algorithms for Elastic Deep Learning
2.2.1 Elastic-Tiresias
2.2.2 FfDL Optimizer
2.2.3 AFS-L
2.3 Challenges and Design Principle
2.4 Used Open Source
2.4.1 Horovod
2.4.2 Kubeflow MPI Operator
2.5 Related Work
2.5.1 Scheduling Algorithm for Deep Learning
2.5.2 System for Elastic Training
2.5.3 ML Platform
3 Voda Design and Implementation
3.1 System Overview
3.2 Job Submission
3.3 Scheduler
3.3.1 Mechanism
3.3.2 Resource Allocator
3.3.3 Thrashing Avoidance
3.4 Placement Manager
3.4.1 Mechanism
3.4.2 Placement Algorithm
3.5 Continuous Throughput Measurement
4 Experiments
4.1 Methodology
4.1.1 Testbed
4.1.2 Software
4.1.3 Workloads
4.1.4 Scheduling Algorithms
4.1.5 Metrics
4.2 Results and Analysis
4.2.1 Performance Metrics
4.2.2 Resource Utilization
4.2.3 Placement Management
4.2.4 Overhead of Elastic Training
5 Conclusion and Future Work
References
