
Detailed Record

Author (Chinese): 林展逸
Author (English): Lin, Chan-Yi
Thesis Title (Chinese): 針對深度學習計算研析容器化雲平台之自動部署及執行管理技術
Thesis Title (English): Building a Container-based Cloud Platform for Deep Learning from Service Deployment to Runtime Management
Advisor (Chinese): 周志遠
Advisor (English): Chou, Jerry
Committee Members (Chinese): 金仲達、李哲榮
Committee Members (English): King, Chung-Ta; Lee, Che-Rung
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 105062617
Year of Publication (ROC calendar): 108 (2019)
Graduation Academic Year: 108
Language: English
Number of Pages: 41
Keywords (Chinese): 深度學習、資源編排、工作排程、自動擴展
Keywords (English): Deep Learning, Resource Orchestration, Job Scheduling, Autoscaling
Statistics:
  • Recommendations: 0
  • Views: 106
  • Rating: *****
  • Downloads: 21
  • Bookmarks: 0
Abstract (Chinese, translated): With the rapid growth of AI services driven by deep learning over the past decade, deep learning, and in particular its resource-intensive and time-consuming model training jobs, has become one of the main workloads in today's production clusters. However, owing to the characteristics of deep learning, such as complex workloads and shared-resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster is challenging. This thesis addresses these issues by designing and implementing a scheduling and scaling controller that dynamically manages distributed training jobs on a Kubernetes (K8s) cluster, a platform widely used to manage containerized workloads and services. Our proposed approach aims to enhance K8s with three capabilities: (1) detecting inter-task dependencies for gang scheduling to avoid idle resources; (2) detecting task placement locality to minimize communication overhead; and (3) detecting workload levels to dynamically scale resources for better cost efficiency. Our approach is evaluated with a set of TensorFlow jobs on a real testbed and in a simulated environment. Compared with the default K8s scheduler, it improves resource utilization by 20% ∼ 30% and reduces job elapsed time by 65%.
Abstract (English): With the fast-growing trend of deep-learning-driven AI services over the past decade, deep learning, especially its resource-intensive and time-consuming training jobs, has become one of the main workloads in today's production clusters. However, due to the complex workload characteristics of deep learning and the dynamic nature of shared-resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster can be challenging. This work addresses these issues by developing and implementing a scheduling and scaling controller to dynamically manage distributed training jobs on a Kubernetes (K8s) cluster, a platform broadly used for managing containerized workloads and services. The objective of our proposed approach is to enhance K8s with three capabilities: (1) task-dependency-aware gang scheduling to avoid idle resources; (2) locality-aware task placement to minimize communication overhead; and (3) load-aware job scaling to improve cost efficiency. Our approach is evaluated on a real testbed and in a simulator using a set of TensorFlow jobs. Compared to the default K8s scheduler, our approach improves resource utilization by 20% ∼ 30% and reduces job elapsed time by over 65%.
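The gang-scheduling and locality ideas above are only described at a high level in this record, so the following minimal Python sketch illustrates them under stated assumptions: the function name, the GPU-count capacity model, and the best-fit packing heuristic are illustrative choices, not the thesis's actual DRAGON controller. A job is admitted only if every one of its tasks can be placed at once (so no partially scheduled job holds idle resources), and tasks are packed onto as few nodes as possible to limit cross-node communication.

    # Illustrative sketch only (assumed names and heuristics, not the thesis's
    # actual DRAGON implementation).
    from typing import Dict, List, Optional

    def gang_place(task_gpu_demands: List[int],
                   free_gpus: Dict[str, int]) -> Optional[Dict[int, str]]:
        """Return a task -> node placement, or None if the whole gang cannot fit."""
        remaining = dict(free_gpus)
        placement: Dict[int, str] = {}
        # Place the largest tasks first so fragmentation is detected early.
        for task_id, demand in sorted(enumerate(task_gpu_demands),
                                      key=lambda t: t[1], reverse=True):
            # Best fit: pick the node with the least free capacity that still
            # fits, which tends to pack communicating tasks onto fewer nodes.
            candidates = [n for n, cap in remaining.items() if cap >= demand]
            if not candidates:
                return None  # gang scheduling: admit nothing if any task cannot fit
            node = min(candidates, key=lambda n: remaining[n])
            remaining[node] -= demand
            placement[task_id] = node
        return placement

    if __name__ == "__main__":
        # A 1-PS / 3-worker TensorFlow job, 1 GPU per task, on two 2-GPU nodes.
        print(gang_place([1, 1, 1, 1], {"node-a": 2, "node-b": 2}))
        # A job that cannot fit as a whole is rejected outright (None) instead
        # of leaving partially scheduled tasks holding idle resources.
        print(gang_place([2, 2, 2], {"node-a": 2, "node-b": 2}))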
Contents
Abstract
1. Introduction --------------------------------------- 1
2. Background ----------------------------------------- 5
3. Performance Study of Resource Orchestration -------- 10
4. Challenges of Job Runtime Management --------------- 18
5. DRAGON --------------------------------------------- 24
6. Experiment ----------------------------------------- 29
7. Related Work --------------------------------------- 36
8. Conclusion ----------------------------------------- 38
References -------------------------------------------- 39