Detailed Record

Author (Chinese): 林湧
Author (English): Bijon Setyawan Raya
Thesis Title (Chinese): Schedulearn: 使用微服務的彈性學習平台
Thesis Title (English): Schedulearn: An Elastic Learning Platform Using Microservices
Advisor (Chinese): 李哲榮
Advisor (English): Lee, Che-Rung
Committee Members (Chinese): 周志遠, 賴冠州
Committee Members (English): Chou, Chi-Yuan; Lai, Kuan-Chiou
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science (資訊工程學系)
Student ID: 109062710
Year of Publication (ROC calendar): 112 (2023)
Academic Year of Graduation: 111
Language: English
Number of Pages: 47
Keywords (Chinese): 深度學習平台, 彈性訓練, 集群調度, 分散式計算
Keywords (English): Deep Learning Platform, Elastic Training, Cluster Scheduling, Distributed Computing
Usage statistics:
  • Recommendations: 0
  • Views: 39
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0

Abstract (Chinese):
In this thesis, we present a new approach to a lightweight distributed deep learning system that allows users to train their models on multiple GPUs. The system provides customizable scheduling algorithms, and users can change the scheduling algorithm without restarting the system. Compared with existing distributed deep learning systems, ours is lightweight and considerably simpler to install on any machine equipped with GPUs. In addition, we created two scheduling algorithms to demonstrate the flexibility of our system and to show how they affect the average makespan and turnaround time of models built with different deep learning frameworks.

To build this lightweight distributed deep learning system, we carefully selected the tools it requires in order to avoid poor design choices. To keep the system simple, we chose mature, easy-to-use technologies that have stood the test of time.

Because the system is composed of separate microservices, we found it easy to make changes, and a change to one service does not affect the system as a whole. Not only can minor or major customizations be made easily, but users can also pick the scheduling algorithm that best suits their needs without restarting the entire system. Moreover, each scheduling technique affects the average makespan and turnaround time of models differently: the Elastic FIFO algorithm maintains maximal resource utilization, while the Round Robin algorithm reduces the average makespan of some models.

Through this project, we hope to provide a lightweight distributed deep learning system that is easy to install and use, especially for newcomers to deep learning, so that they can focus on developing models rather than troubleshooting the system. We also hope to allow incremental changes to the system and to provide a system flexible enough to be customized to users' needs.

Abstract (English):
In this thesis, we demonstrate a novel approach to building a lightweight distributed deep learning system that enables users to train their models on multiple GPUs. The system offers customizable scheduling algorithms, and users can switch the scheduling algorithm without restarting the system. Compared to existing distributed deep learning systems, ours is lightweight and considerably simpler to install on any machine equipped with GPUs. Furthermore, we created two scheduling algorithms to demonstrate the flexibility of our system and to show how they affect the average makespan and turnaround time of models built with different deep learning frameworks.
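
The abstract does not reproduce the scheduler code, but the claim that the scheduling algorithm can be changed without a restart is easiest to see with a small sketch. The following is a minimal, hypothetical Python example of a hot-swappable scheduling policy; the names (Job, Policy, Scheduler, fifo, set_policy, tick) are illustrative assumptions and are not the actual Schedulearn API.

    import time
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Job:
        name: str
        requested_gpus: int
        submitted_at: float

    # A policy maps (pending queue, free GPUs) to the jobs to launch now.
    Policy = Callable[[List[Job], int], List[Job]]

    def fifo(queue: List[Job], free_gpus: int) -> List[Job]:
        """Launch jobs strictly in arrival order while enough GPUs remain."""
        launched = []
        for job in sorted(queue, key=lambda j: j.submitted_at):
            if job.requested_gpus <= free_gpus:
                launched.append(job)
                free_gpus -= job.requested_gpus
        return launched

    class Scheduler:
        """Keeps the pending queue and the currently active policy."""

        def __init__(self, policy: Policy) -> None:
            self.queue: List[Job] = []
            self.policy = policy

        def submit(self, name: str, requested_gpus: int) -> None:
            self.queue.append(Job(name, requested_gpus, time.time()))

        def set_policy(self, policy: Policy) -> None:
            # Swapping the policy only replaces a reference: running jobs and
            # the pending queue are untouched, so no restart is required.
            self.policy = policy

        def tick(self, free_gpus: int) -> List[Job]:
            launched = self.policy(self.queue, free_gpus)
            for job in launched:
                self.queue.remove(job)
            return launched

In a microservice deployment, the analogous switch could be exposed through the scheduler service's API, so an operator changes the active policy with a single request rather than a redeployment.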

To build this lightweight distributed deep learning system, we carefully chose the tools it requires in order to avoid poor design choices. To keep the system simple, we selected mature, easy-to-use technologies that have stood the test of time.

Since the system consists of separate microservices, we found it easy to make changes, and a change to one service does not affect the entire system. Not only can minor or major customizations be made easily, but users can also select the scheduling algorithm that best suits their needs without restarting the entire system. In addition, each scheduling technique affects the average makespan and turnaround time of models differently: the Elastic FIFO algorithm maintains maximal resource utilization, while the Round Robin algorithm reduces the average makespan of some models.
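
To make the contrast concrete, here is a hedged sketch of the idea behind Elastic FIFO: instead of letting GPUs idle while the head-of-queue job waits for its full request, the job is started with however many GPUs are currently free. It reuses the hypothetical Job type from the earlier sketch; the function name and return shape are assumptions, not the thesis implementation.

    from typing import List, Tuple

    def elastic_fifo(queue: List[Job], free_gpus: int) -> List[Tuple[Job, int]]:
        """Return (job, granted_gpus) pairs for one scheduling round."""
        launched: List[Tuple[Job, int]] = []
        for job in sorted(queue, key=lambda j: j.submitted_at):
            if free_gpus == 0:
                break
            # Grant the full request when possible; otherwise shrink the job
            # to the GPUs that remain so that no GPU sits idle.
            granted = min(job.requested_gpus, free_gpus)
            launched.append((job, granted))
            free_gpus -= granted
        return launched

A Round Robin policy, by contrast, would rotate the available GPUs among pending jobs in fixed time slices, which is what allows shorter jobs to finish earlier and lowers the average makespan for some models.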

With this project, we hope to provide a lightweight distributed deep learning system that is easy to install and use, especially for those who are new to deep learning, so that they can focus on developing models instead of troubleshooting system faults. We also hope to enable incremental changes to the system and to provide a system flexible enough to be customized to users' needs.

Table of Contents
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Motivation
3 Related Works
3.1 Traditional Cluster Managers
3.2 Deep Learning Schedulers
3.3 Distributed Deep Learning Framework
3.3.1 Horovod
3.3.2 Distributed TensorFlow
3.3.3 TensorFlowOnSpark
3.3.4 BigDL
3.3.5 PyTorch
3.4 Tools
3.4.1 Containerization
3.4.2 Distributed Deep Learning
3.4.3 Web Service
3.4.4 Relational Database
4 Design and Implementation
4.1 Implementation
4.2 System Overview
4.3 Job Submission
4.4 Scheduling Algorithm
4.4.1 FIFO
4.4.2 Elastic FIFO
4.5 Advantages
4.5.1 Easy Setup
4.5.2 Easy Job Submission
4.5.3 Multiple Scheduling Algorithms
5 Experimentation
5.1 Experiment Setup
5.2 Experiment Result
5.3 Comparison
6 Conclusion and Future Work
Bibliography