|
[1] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. [2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105. [3] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9). [4] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). [5] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780. [6] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. [7] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. nature, 518(7540), 529-533. [8] Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR. [9] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014, January). Deterministic policy gradient algorithms. In International conference on machine learning (pp. 387-395). PMLR. [10] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., ... & Yoon, D. H. (2017, June). In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture (pp. 1-12). [11] Liao, H., Tu, J., Xia, J., & Zhou, X. (2019, August). Davinci: A scalable architecture for neural network computing. In 2019 IEEE Hot Chips 31 Symposium (HCS) (pp. 1-44). IEEE Computer Society. [12] Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., ... & Temam, O. (2014, December). Dadiannao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 609-622). IEEE. [13] Chen, Y. H., Krishna, T., Emer, J. S., & Sze, V. (2016). Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE journal of solid-state circuits, 52(1), 127-138. [14] D. Schor, (2020). Arm Ethos is for Ubiquitous AI At the Edge — WikiChip Fuse. Retrieved from ttps://fuse.wikichip.org/news/ 3282/arm-ethos-is-for-ubiquitous-ai-at-the-edge/ [15] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 8026-8037. [16] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Zheng, X. (2016). Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16) (pp. 265-283). [17] Chollet, F., & others. (2015). Keras. GitHub. Retrieved from https://github.com/fchollet/keras [18] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., ... & Darrell, T. (2014, November). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia (pp. 675-678). [19] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., ... & Zhang, Z. (2015). Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. [20] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., ... & Krishnamurthy, A. (2018). {TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18) (pp. 578-594). [21] Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8697-8710). [22] Tang, L., Wang, Y., Willke, T. L., & Li, K. (2018). Scheduling computation graphs of deep learning models on manycore cpus. arXiv preprint arXiv:1807.09667. [23] Hu, T. C. (1961). Parallel sequencing and assembly line problems. Operations research, 9(6), 841-848. [24] Ding, Y., Zhu, L., Jia, Z., Pekhimenko, G., & Han, S. (2021). IOS: Inter-Operator Scheduler for CNN Acceleration. Proceedings of Machine Learning and Systems, 3. [25] Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., ... & Zhou, L. (2020). Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In 14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20) (pp. 881-897). [26] Chakaravarthy, R. V., & Jiang, H. (2020, October). Special Session: XTA: Open Source eXtensible, Scalable and Adaptable Tensor Architecture for AI Acceleration. In 2020 IEEE 38th International Conference on Computer Design (ICCD) (pp. 53-56). IEEE. [27] Wu, H. I., Guo, D. Y., Chin, H. H., & Tsay, R. S. (2020). A Pipeline-Based Scheduler for Optimizing Latency of Convolution Neural Network Inference over Heterogeneous Multicore Systems. In 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) (pp. 46-49). IEEE. [28] Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., ... & Wang, M. (2018). Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907. [29] Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., ... & Webb, T. J. (2018). Intel ngraph: An intermediate representation, compiler, and executor for deep learning. arXiv preprint arXiv:1801.08058. [30] Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., ... & Cohen, A. (2018). Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730. [31] Chris Leary and Todd Wang. (2017). XLA: TensorFlow, compiled. [32] Li, M., Liu, Y., Liu, X., Sun, Q., You, X., Yang, H., ... & Qian, D. (2020). The deep learning compiler: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems, 32(3), 708-727. [33] An Overview of TVM and Model Optimization. https://tvm.apache.org/docs/tutorials/get_started/introduction.html#an-overview-of-tvm-and-model-optimization [34] Bai, Junjie and Lu, Fang and Zhang, Ke and others. (2019). ONNX: Open Neural Network Exchange. GitHub. Retrieved from https://github.com/onnx/onnx [35] Chen, T., Zheng, L., Yan, E., Jiang, Z., Moreau, T., Ceze, L., ... & Krishnamurthy, A. (2018). Learning to optimize tensor programs. arXiv preprint arXiv:1805.08166. [36] Zheng, L., Jia, C., Sun, M., Wu, Z., Yu, C. H., Haj-Ali, A., ... & Stoica, I. (2020). Ansor: Generating high-performance tensor programs for deep learning. In 14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20) (pp. 863-879). [37] AMD Ryzen 5 2600 GFLOPS performance. https://gadgetversus.com/processor/amd-ryzen-5-2600-gflops-performance/ [38] NVIDIA GeForce GTX 1080 FLOPS performance. https://www.techpowerup.com/gpu-specs/geforce-gtx-1080.c2839 [39] ARM-A53 and ARM-A72 FLOPS performance. http://web.eece.maine.edu/~vweaver/group/green_machines.html [40] ARM Mali-T860 MP4 FLOPS performance. https://wikimovel.com/index.php/Rockchip |