[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on OSDI (2016), pp. 265–283.
[2] Amaral, M., Polo, J., Carrera, D., Seelam, S. R., and Steinder, M. Topology-aware GPU scheduling for learning workloads in cloud environments. In Proceedings of SuperComputing (SC) (2017), pp. 17:1–17:12.
[3] Bao, Y., Peng, Y., Wu, C., and Li, Z. Online job scheduling in distributed machine learning clusters. In 2018 IEEE Conference on Computer Communications, INFOCOM 2018, Honolulu, HI, USA, April 16-19, 2018 (2018), pp. 495–503.
[4] Lin, C.-Y. DRAGON: Deep Learning with Auto-scale and Gang-schedule On Kubernetes. https://github.com/ChanYiLin/tf-operator-Dragon/, 2019.
[5] De Sa, C., Feldman, M., Ré, C., and Olukotun, K. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual ISCA (2017), pp. 561–574.
[6] Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (USA, 2012), NIPS'12, Curran Associates Inc., pp. 1223–1231.
[7] Docker. Docker swarm. https://docs.docker.com/engine/swarm/, 2017.
[8] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
[9] Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and Gibbons, P. B. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth EuroSys Conference (April 2017), pp. 589–604.
[10] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).
[11] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015).
[12] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Conference on NSDI (2011), pp. 295–308.
[13] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29, 6 (Nov 2012), 82–97.
[14] IBM. Fabric for Deep Learning (FfDL). https://github.com/IBM/FfDL.
[15] Jeon, M., Venkataraman, S., Phanishayee, A., Qian, J., Xiao, W., and Yang, F. Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications. Microsoft Research Technical Report MSR-TR-2018-13 (May 2018).
[16] Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014).
[17] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (2012), pp. 1097–1105.
[18] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, United States (2012), pp. 1106–1114.
[19] Kubeflow. The Machine Learning Toolkit for Kubernetes. https://www.kubeflow.org/.
[20] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436–444.
[21] Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on OSDI (2014), pp. 583–598.
[22] Mayer, R., Mayer, C., and Laich, L. The TensorFlow partitioning and scheduling problem: It's the critical path! In Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning (2017), pp. 1–6.
[23] Microsoft. Open Platform for AI (OpenPAI). https://github.com/Microsoft/pai.
[24] Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. CoRR abs/1706.04972 (2017).
[25] Niu, F., Recht, B., Re, C., and Wright, S. J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems (2011), pp. 693–701.
[26] OpenStack. Open source software for creating private and public clouds. https://www.openstack.org/, 2017.
[27] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch, 2017.
[28] Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (April 2018), pp. 3:1–3:14.
[29] RiseML. Machine Learning Platform for Kubernetes. https://riseml.com/.
[30] Sergeev, A., and Balso, M. D. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018).
[31] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (2015).
[32] Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (2014), pp. 3104–3112.
[33] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CoRR abs/1512.00567 (2015).
[34] Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra, N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q., Yang, F., and Zhou, L. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX OSDI (2018), pp. 595–610.
[35] You, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32K for ImageNet training. CoRR abs/1708.03888 (2017).
[36] Yu, D., Eversole, A., Seltzer, M., Yao, K., Guenter, B., Kuchaiev, O., Seide, F., Wang, H., Droppo, J., Huang, Z., Zweig, G., Rossbach, C., and Currey, J. An introduction to computational networks and the computational network toolkit. Microsoft Technical Report (2014).
[37] Zhang, W., Gupta, S., Lian, X., and Liu, J. Staleness-aware async-SGD for Distributed Deep Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016), IJCAI'16, pp. 2350–2356.