Author (Chinese): 謝宗佐
Author (English): Hsieh, Tsung-Tso
Title (Chinese): 在Kubernetes集群中用於深度學習彈性訓練的GPU調度平台
Title (English): Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Cluster
Advisor (Chinese): 李哲榮
Advisor (English): Lee, Che-Rung
Committee Members (Chinese): 周志遠、鐘一新
Committee Members (English): Chou, Jerry; Chung, I-Hsin
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 108062632
Year of Publication (ROC calendar): 110 (2021)
Graduation Academic Year: 109
Language: English
Number of Pages: 57
Keywords (Chinese): 深度學習平台、彈性訓練、集群調度、分散式計算
Keywords (English): Deep Learning Platform; Elastic Training; Cluster Scheduling; Distributed Computing; Kubernetes
Abstract:
Elastic deep learning, which dynamically adjusts the resource allocation of a group of training jobs, can effectively improve the utilization of hardware accelerators, which is essential for training today's large-scale deep learning models. Although many scheduling algorithms for elastic training have been proposed, they lack an easy-to-use yet efficient platform on which to run. In this thesis, we present Voda, a GPU scheduling platform for elastic training. Unlike previous work that relies on parameter servers for elastic training, Voda is designed for AllReduce communication, which is more efficient but also more complicated to adjust. Built on top of Kubernetes, Voda consists of a set of loosely coupled components that collect run-time information, dynamically alter resource allocations, and decide job placements that optimize job execution based on the communication cost among the underlying GPUs. We compare four elastic scheduling algorithms on Voda, three existing methods and one newly proposed, under different workloads, job distributions, and arrival times. Experimental results show that no single algorithm dominates all performance metrics, such as makespan, average job completion time, or average running time; however, some algorithms do work better than others for particular workloads and job distributions. The experiments also show that job placement is critical to performance on GPU clusters, and that the proposed placement algorithm can effectively optimize the communication cost among the different workers of a job.
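As context for the elastic AllReduce training that Voda schedules, the sketch below shows what an elastic worker looks like from the job's side, using Horovod's elastic PyTorch API (the thesis builds on Horovod and the Kubeflow MPI Operator, per Sections 2.4.1 and 2.4.2). The model, data, and hyperparameters are placeholders; this is an illustrative sketch, not code from the thesis.

```python
# Minimal sketch of an elastic Horovod (PyTorch) worker; model, data, and
# hyperparameters are placeholders, not code from the thesis.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(784, 10).cuda()                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # Horovod re-enters this function with the last committed state whenever
    # the worker set changes, e.g. when the scheduler adds or removes GPUs.
    for state.epoch in range(state.epoch, 10):
        for state.batch in range(state.batch, 100):          # placeholder data loop
            data = torch.randn(32, 784).cuda()
            optimizer.zero_grad()
            loss = model(data).sum()
            loss.backward()
            optimizer.step()
        state.batch = 0
        state.commit()      # checkpoint state so it survives rescaling

state = hvd.elastic.TorchState(model, optimizer, epoch=0, batch=0)
train(state)
```

Such a worker is typically launched under the Kubeflow MPI Operator or with `horovodrun`, whose elastic mode discovers hosts dynamically (e.g. via `--host-discovery-script`); that discovery hook is what allows an external scheduler to grow or shrink a running job.

The abstract also states that job placement is decided from the communication cost among the underlying GPUs. One plausible way to phrase such a decision is as an assignment of a job's workers to free GPU slots that minimizes the total cost, solvable with the Hungarian (Kuhn-Munkres) method. The sketch below uses SciPy's solver with a made-up cost matrix purely for illustration; it is not claimed to be the placement algorithm of Section 3.4.2.

```python
# Illustrative placement sketch (an assumption, not the thesis's algorithm):
# assign each worker of a job to a free GPU slot so that the summed
# communication cost is minimized, using the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: estimated communication cost of placing worker i on GPU slot j,
# e.g. low within a node, high across nodes. Values are made up.
cost = np.array([
    [1.0, 1.0, 4.0, 4.0],   # worker 0
    [1.0, 1.0, 4.0, 4.0],   # worker 1
    [4.0, 4.0, 1.0, 1.0],   # worker 2
])

workers, slots = linear_sum_assignment(cost)
for w, s in zip(workers, slots):
    print(f"worker {w} -> GPU slot {s} (cost {cost[w, s]})")
print("total communication cost:", cost[workers, slots].sum())
```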
Table of Contents:
Abstract (Chinese)
Abstract
List of Figures
List of Tables
Listings
1 Introduction
2 Background and Motivation
2.1 Elastic Deep Learning
2.2 Scheduling Algorithms for Elastic Deep Learning
2.2.1 Elastic-Tiresias
2.2.2 FfDL Optimizer
2.2.3 AFS-L
2.3 Challenges and Design Principle
2.4 Used Open Source
2.4.1 Horovod
2.4.2 Kubeflow MPI Operator
2.5 Related Work
2.5.1 Scheduling Algorithm for Deep Learning
2.5.2 System for Elastic Training
2.5.3 ML Platform
3 Voda Design and Implementation
3.1 System Overview
3.2 Job Submission
3.3 Scheduler
3.3.1 Mechanism
3.3.2 Resource Allocator
3.3.3 Thrashing Avoidance
3.4 Placement Manager
3.4.1 Mechanism
3.4.2 Placement Algorithm
3.5 Continuous Throughput Measurement
4 Experiments
4.1 Methodology
4.1.1 Testbed
4.1.2 Software
4.1.3 Workloads
4.1.4 Scheduling Algorithms
4.1.5 Metrics
4.2 Results and Analysis
4.2.1 Performance Metrics
4.2.2 Resource Utilization
4.2.3 Placement Management
4.2.4 Overhead of Elastic Training
5 Conclusion and Future Work
References