[1] NVIDIA, “NVIDIA’s next generation CUDA compute architecture: Fermi,” 2009.
[2] J. Sanders and E. Kandrot, CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional, 2010.
[3] J. Nickolls and W. Dally, “The GPU computing era,” Micro, IEEE, vol. 30, no. 2, pp. 56–69, 2010.
[4] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, “GPU computing,” Proceedings of the IEEE, vol. 96, no. 5, pp. 879–899, 2008.
[5] J. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming standard for heterogeneous computing systems,” Computing in science & engineering, vol. 12, no. 3, p. 66, 2010.
[6] J. Yin, P. Zhou, A. Holey, S. Sapatnekar, and A. Zhai, “Energy-efficient non-minimal path on-chip interconnection network for heterogeneous systems,” in Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design. ACM, 2012, pp. 57–62.
[7] A. Bakhoda, J. Kim, and T. Aamodt, “Throughput-effective on-chip networks for manycore accelerators,” in Proceedings of the 2010 43rd Annual IEEE/ACM international symposium on Microarchitecture. IEEE Computer Society, 2010, pp. 421–432.
[8] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, “Memory access scheduling,” in Computer Architecture, 2000. Proceedings of the 27th International Symposium on. IEEE, 2000, pp. 128–138.
[9] O. Mutlu and T. Moscibroda, “Stall-time fair memory access scheduling for chip multiprocessors,” in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007, pp. 146–160.
[10] G. Yuan, A. Bakhoda, and T. Aamodt, “Complexity effective memory access scheduling for many-core accelerator architectures,” in Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on. IEEE, 2009, pp. 34–44.
[11] W. Dally, “Virtual-channel flow control,” Parallel and Distributed Systems, IEEE Transactions on, vol. 3, no. 2, pp. 194–205, 1992.
[12] W. Dally and B. Towles, Principles and practices of interconnection networks. Morgan Kaufmann, 2004.
[13] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on. IEEE, 2009, pp. 163–174.
[14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 2009, pp. 44–54.
[15] “Parboil benchmark suite.” http://impact.crhc.illinois.edu/parboil.php.
[16] Pcchen, “N-queens solver,” http://forums.nvidia.com/index.php?showtopic=76893.
[17] “NVIDIA GPU computing SDK suite.” https://developer.nvidia.com/gpu-computing-sdk.
[18] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems,” in ACM SIGARCH Computer Architecture News, vol. 36, no. 3. IEEE Computer Society, 2008, pp. 63–74.
[19] K. Nesbit, N. Aggarwal, J. Laudon, and J. Smith, “Fair queuing memory systems,” in Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on. IEEE, 2006, pp. 208–222.
[20] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu, “Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems,” in Proceedings of the 39th International Symposium on Computer Architecture. IEEE Press, 2012, pp. 416–427.
[21] Y. Kim, H. Lee, and J. Kim, “An alternative memory access scheduling in manycore accelerators,” in Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on. IEEE, 2011, pp. 195–196.
[22] R. Das, O. Mutlu, T. Moscibroda, and C. Das, “Aérgia: A network-on-chip exploiting packet latency slack,” Micro, IEEE, vol. 31, no. 1, pp. 29–41, 2011.