[1] Kubernetes Device plugin. https://github.com/kubernetes/design-proposals-archive/blob/main/resource-management/device-plugin.md.
[2] NVIDIA container runtime. https://github.com/NVIDIA/nvidia-container-runtime.
[3] NVIDIA Device plugin. https://github.com/NVIDIA/k8s-device-plugin.
[4] registry of kube-scheduler. https://github.com/kubernetes/kubernetes/blob/v1.18.10/pkg/scheduler/algorithmprovider/registry.go.
[5] Chen, H.-H., Lin, E.-T., Chou, Y.-M., and Chou, J. Gemini: Enabling multitenant gpu sharing based on kernel burst estimation. IEEE Transactions on Cloud Computing (2021).
[6] Gu, J., Song, S., Li, Y., and Luo, H. Gaiagpu: sharing gpus in container clouds. In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) (2018), IEEE, pp. 469–476.
[7] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778.
[8] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[9] Le, Y., and Yang, X. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3.
[10] Mahajan, K., Balasubramanian, A., Singhvi, A., Venkataraman, S., Akella, A., Phanishayee, A., and Chawla, S. Themis: Fair and efficient GPU cluster scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (2020), pp. 289–304.
[11] Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision (July 2001), vol. 2, pp. 416–423.
[12] Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016).
[13] Merkel, D., et al. Docker: lightweight linux containers for consistent development and deployment. Linux j 239, 2 (2014), 2.
[14] NVIDIA. NVIDIA GPU Specification. https://www.nvidia.com/zh-tw/geforce/graphics-cards/compare.
[15] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
[16] Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (2018), pp. 1–14.
[17] Pytorch. TorchElastic. https://github.com/pytorch/elastic.
[18] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., and Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 1874–1883.
[19] Song, S., Deng, L., Gong, J., and Luo, H. Gaia scheduler: A Kubernetes-based scheduler framework. In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) (2018), IEEE, pp. 252–259.
[20] Thinakaran, P., Gunasekaran, J. R., Sharma, B., Kandemir, M. T., and Das, C. R. Kube-knots: Resource harvesting through dynamic container orchestration in gpu-based datacenters. In 2019 IEEE International Conference on Cluster Computing (CLUSTER) (2019), IEEE, pp. 1–13.
[21] VMware. The State of Kubernetes 2021. https://tanzu.vmware.com/content/ebooks/the-state-of-kubernetes-2021.
[22] Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra, N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q., et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (2018), pp. 595–610.
[23] Xiao, W., Ren, S., Li, Y., Zhang, Y., Hou, P., Li, Z., Feng, Y., Lin, W., and Jia, Y. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), pp. 533–548.
[24] Yeh, T.-A., Chen, H.-H., and Chou, J. Kubeshare: A framework to manage gpus as first-class and shared resources in container cloud. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (2020), pp. 173–184.
[25] Zhao, H., Han, Z., Yang, Z., Zhang, Q., Yang, F., Zhou, L., Yang, M., Lau, F. C., Wang, Y., Xiong, Y., et al. HiveD: Sharing a GPU cluster for deep learning with guarantees. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), pp. 515–532.
[26] Zhu, X., Gong, L., Zhu, Z., and Zhou, X. Vapor: A gpu sharing scheduler with communication and computation pipeline for distributed deep learning. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom) (2021), IEEE, pp. 108–116.