|
[1] Amazon Spot Instance. ”https://aws.amazon.com/ec2/spot/”. [2] Google Preemptive VMs. ”https://cloud.google.com/preemptible-vms/”. [3] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. [4] Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799, 2018. [5] Shijian Li, Robert J Walls, Lijie Xu, and Tian Guo. Speeding up deep learning with transient servers. arXiv preprint arXiv:1903.00045, 2019. [6] Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, and Phillip B. Gibbons. Proteus: Agile ml elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys ’17, pages 589–604, New York, NY, USA, 2017. ACM. [7] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016. [8] François Chollet et al. Keras. https://keras.io, 2015. [9] Facebook Research. ”Caffe2”. ”https://caffe2.ai”. [10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017. [11] A Scalable Deep Learning Framework MXNet.”https://mxnet.apache.org”. [12] Introducing Dynamic Training for deep learning with Amazon EC2. [Online]. Available: https://aws.amazon.com/blogs/machine-learning/ introducing-dynamic-training-for-deep-learning-with-amazon-ec2/. [13] Philipp Moritz, Robert Nishihara, Ion Stoica, and Michael I Jordan. Sparknet: Training deep networks in spark. arXiv preprint arXiv:1511.06051, 2015. [14] gRPC. A high performance universal rpc framework. https://grpc.io/. [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. [16] Angelia Nedich, Dimitri P Bertsekas, and Vivek S Borkar. Distributed asyn- chronous incremental subgradient methods. Studies in Computational Mathe- matics, 8(C):381–407, 2001. [17] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale dis- tributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012. [18] Protocol Buffer. A language-neutral, platform-neutral extensible mech- anism for serializing structured data. https://developers.google.com/ protocol-buffers. [19] TensorFlow: Add a New Op.https://www.tensorflow.org/guide/extend/op. [20] TensorFlow. A benchmark framework for tensorflow. [Online]. Available: https://github.com/tensorflow/benchmarks. [21] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefow- icz. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016. [22] Andrew Gibiansky. Bringing hpc techniques to deep learning, 2017. http: //research.baidu.com/bringing-hpc-techniques-deep-learning. [23] Andrew Gibiansky and Joel Hestness. Baidu allreduce, 2017. https://github. com/baidu-research/tensorflow-allreduce. [24] Aaron Harlap, Gregory R Ganger, and Phillip B Gibbons. Tierml: Using tiers of reliability for agile elasticity in machine learning. Carnegie Mellon University, Tech. Rep., 2016. [25] Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, and Edward Y Chang. Distributed training large-scale deep architectures. In International Conference on Advanced Data Mining and Applications, pages 18–32. Springer, 2017. [26] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. Solving the straggler problem with bounded staleness. In Presented as part of the 14th Workshop on Hot Topics in Operating Systems, Santa Ana Pueblo, NM, 2013. USENIX. [27] J. Keuper and F. Preundt. Distributed training of deep neural networks: The- oretical and practical limits of parallel scalability. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pages 19–26, Nov 2016. [28] Víctor Campos, Francesc Sastre, Maurici Yagües, Míriam Bellver, Xavier Giró-i Nieto, and Jordi Torres. Distributed training strategies for a computer vision deep learning algorithm on a distributed gpu cluster. Procedia Computer Science, 108:315–324, 2017. |