Author (Chinese): 林 湧
Author (English): Bijon Setyawan Raya
Thesis title (Chinese): Schedulearn: 使用微服務的彈性學習平台
Thesis title (English): Schedulearn: An Elastic Learning Platform Using Microservices
Advisor (English): Lee, Che-Rung
Oral defense committee (English): Chou, Chi-Yuan; Lai, Kuan-Chiou
Keywords (English): Deep Learning Platform; Elastic Training; Cluster Scheduling; Distributed Computing
In this dissertation, we present a lightweight distributed deep learning system that lets users train their models on multiple GPUs. The system supports customizable scheduling algorithms, and users can switch the scheduling algorithm without restarting the system. Compared with existing distributed deep learning systems, ours is lightweight and considerably simpler to install on any machine equipped with GPUs. We also implemented two scheduling algorithms to demonstrate the system's flexibility and to show how they affect the average makespan and turnaround time of models built with different deep learning frameworks.

To build this system, we chose our tools carefully to keep the design parsimonious. To preserve its simplicity, we relied on easy-to-use, proven technologies that have stood the test of time.

Because the system is composed of separate microservices, we found it easy to modify: a change to one service does not affect the entire system. Both minor and major customizations are straightforward, and users can select the scheduling algorithm that best suits their needs without restarting the entire system. In addition, each scheduling algorithm affects the average makespan and turnaround time of models differently. The Elastic FIFO algorithm maintains maximal resource utilization, while the Round Robin algorithm reduces the average makespan of some models.

With this project, we hope to provide a lightweight distributed deep learning system that is easy to install and use, especially for newcomers to deep learning, so that they can focus on developing models rather than troubleshooting system faults. We also hope to enable incremental changes to the system and to make it flexible enough to be customized to users' needs.
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Motivation
3 Related Works
3.1 Traditional Cluster Managers
3.2 Deep Learning Schedulers
3.3 Distributed Deep Learning Framework
3.3.1 Horovod
3.3.2 Distributed TensorFlow
3.3.3 TensorFlowOnSpark
3.3.4 BigDL
3.3.5 PyTorch
3.4 Tools
3.4.1 Containerization
3.4.2 Distributed Deep Learning
3.4.3 Web Service
3.4.4 Relational Database
4 Design and Implementation
4.1 Implementation
4.2 System Overview
4.3 Job Submission
4.4 Scheduling Algorithm
4.4.1 FIFO
4.4.2 Elastic FIFO
4.5 Advantages
4.5.1 Easy Setup
4.5.2 Easy Job Submission
4.5.3 Multiple Scheduling Algorithms
5 Experimentation
5.1 Experiment Setup
5.2 Experiment Result
5.3 Comparison
6 Conclusion and Future Work
Bibliography
