
Detailed Record

Author (Chinese): 李泓諭
Author (English): Li, Hong-Yu
Title (Chinese): Elastic TensorFlow: 可擴展之深度學習計算框架的設計與實作 (Design and Implementation of a Scalable Deep Learning Computing Framework)
Title (English): Elastic TensorFlow: A Novel Network Overlay Design and Implementation to Support Elastic Deep Learning Computing
Advisor (Chinese): 周志遠
Advisor (English): Chou, Jerry
Committee Members (Chinese): 李哲榮、賴冠州
Committee Members (English): Lee, Che-Rung; Lai, Kuan-Chou
Degree: Master's
University: National Tsing Hua University
Department: Computer Science
Student ID: 106062641
Year of Publication (ROC): 108 (2019)
Academic Year of Graduation: 107 (2018-2019)
Language: English
Number of Pages: 35
Keywords (Chinese): 分散式深度學習、深度學習、彈性運算、分散式計算、平行系統
Keywords (English): Distributed Deep Learning, Deep Learning, Elastic Computing, Distributed Computing, Parallel Systems
Abstract (Chinese): TensorFlow is a popular deep learning framework. Besides the single-machine version, it also supports a distributed version with multiple machines and multiple devices, and it scales well. Because deep learning is computationally intensive, training often takes hours to days. Through distributed computing, users can bring more available computing resources into the job and accelerate training. However, most deep learning frameworks, TensorFlow included, can only train on a cluster of machines that is known in advance. Under this restriction, adding machines to speed up training requires stopping the training job, updating the cluster, and restarting the job. This design also makes resource usage inflexible and prevents the system from fully exploiting the benefits of cloud computing. We therefore extend TensorFlow into a dynamically scalable framework, ElasticTF, which can elastically add or remove compute nodes without interrupting training. With this capability, under limited resources we can dynamically adjust the resources devoted to training and keep overall system utilization high. On public clouds, we can use training-performance estimates to add machines to speed up training or to reduce cost. In our experiments, compared with the traditional restart-based approach and given the same amount of computation within a limited time, ElasticTF saves nearly 18.6% of the cost, and its cost is close to that of the ideal static training configuration.
Abstract (English): TensorFlow is one of the most popular deep learning frameworks. With good scalability, it provides not only single-node training but also distributed training across multiple devices and multiple nodes. Due to the computational cost of deep learning, training often takes hours to days to finish. Through distributed computing, users can bring more resources into the job to speed up the training process. However, most deep learning frameworks, including TensorFlow, assume a static cluster. Under this constraint, scaling up the number of workers requires shutting down the training job, updating the cluster specification, and restarting the job from checkpoint files. As a result, resource usage is inflexible, which also limits the advantages of cloud computing. We therefore introduce ElasticTF, a framework based on TensorFlow that can scale workers dynamically at runtime without suspending execution. With this elasticity, we can maintain high utilization on a system with limited resources. Furthermore, on a public cloud, we can derive scaling strategies from historical performance logs to speed up training and/or reduce its cost. Compared to checkpoint-restart performing the same amount of computation before a deadline, ElasticTF reduces cost by up to 18.6%, and its cost is comparable to that of an ideal static training setting.
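For concreteness, the sketch below shows the stock TensorFlow 1.x distributed setup that the abstract calls a static cluster: every parameter-server and worker address is fixed in a ClusterSpec when the job starts, so scaling out falls back to checkpoint-restart. The hostnames, task index, step count, and checkpoint directory are placeholders; this illustrates plain TensorFlow, not the ElasticTF interface proposed in the thesis.

    import tensorflow as tf  # TensorFlow 1.x distributed API

    # Static cluster specification: every ps/worker address has to be known
    # before the job starts (hostnames below are placeholders).
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Each process starts a server bound to one slot of the ClusterSpec.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Stand-in for a real replicated training step.
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)

    # Conventional scaling is checkpoint-restart: stop every process, rewrite
    # the ClusterSpec with the new worker address, relaunch, and let the
    # session restore the latest checkpoint from checkpoint_dir.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=True,
            checkpoint_dir="/tmp/train_ckpt",
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)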
Table of Contents:
1 Introduction ---------------- 1
2 Distributed Deep Learning ---------------- 4
2.1 Distributed Architecture ---------------- 4
2.2 Existing solution: Checkpoint-Restart ---------------- 6
2.3 Goal ---------------- 6
3 Distributed TensorFlow ---------------- 7
3.1 TensorFlow Introduction ---------------- 7
3.2 Workflow from Client to Internal TensorFlow ---------------- 9
3.3 SendOp/RecvOp Communication ---------------- 12
3.4 Challenges ---------------- 14
4 Design and Implementation of Elastic TensorFlow ---------------- 15
4.1 Methodology ---------------- 15
4.2 SendTensor Service ---------------- 16
4.3 StSend/StRecv Operations ---------------- 17
4.3.1 Insertion ---------------- 18
4.3.2 Kernel Implementation ---------------- 19
5 Evaluation ---------------- 21
5.1 Environment Setup ---------------- 21
5.2 Correctness Validation ---------------- 22
5.3 Restart Penalty ---------------- 22
5.4 Private Cloud Scenario ---------------- 23
5.5 Public Cloud Scenario ---------------- 26
5.6 Performance Degradation ---------------- 28
6 Related Work ---------------- 30
7 Conclusion ---------------- 32
References ---------------- 33