|
[1] M. Hegde, \Heterogeneous systems architectures and its implications for the software ecosystem," in Proc. of the 13th international Forum on Embedded MPSoC and Multicore, 2013. [2] Z. Wang et al., \Using machine learning to partition streaming programs," ACM Trans. Archit. Code Optim., vol. 10, no. 3, pp. 20:1{20:25, Sep. 2008. [3] Y. Kim et al., \Cumapz: a tool to analyze memory access patterns in cuda," in Proc. of the 48th Design Automation Conf., 2011, pp. 128{133. [4] S. Hong et al., \An analytical model for a gpu architecture with memory- level and thread-level parallelism awareness," in Proc. of the 36th annual international symposium on Computer architecture, 2009, pp. 152{163. [5] B. Jang et al., \Exploiting memory access patterns to improve mem- ory performance in data-parallel architectures," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 105{118, Jan. 2011. [6] I.-J. Sung et al., \Dl: A data layout transformation system for heterogeneous computing," in Proc. IEEE Conf. Innovative Parallel Computing(InPar 12). IEEE, 2012. [7] M. Maggioni et al., \Adell: An adaptive warp-balancing ell format for efficient sparse matrix-vector multiplication on gpus," in Proceedings of the 2013 42Nd International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2013, pp. 11{20. [8] Monakov et al., \Automatically tuning sparse matrix-vector multiplication for gpu architectures," in Proc. of the 5th international conference on High Performance Embedded Architectures and Compilers, 2010, pp.111 [9] N. Bell and M. Garland, \Implementing sparse matrix-vector multiplication on throughput-oriented processors," in High Performance Computing Networking, Storage and Analysis, 2009. [10] Baskaran et al., \A compiler framework for optimization of ane loop nests for gpgpus," in Proc. of the 22nd Int'l conf. on Supercomputing(ICS), 2008, pp. 225{234. [11] J. W. Choi et al., \Model-driven autotuning of sparse matrix-vector multiply on gpus," in Proc. of the 15th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming(PPoPP), 2010, pp. 115{126. [12] S. Che et al., \Dymaxion: optimizing memory access patterns for heterogeneous systems," in Proc. of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, pp. 13:1{13:11. [13] S. Saidi et al., \Optimizing explicit data transfers for data parallel applications on the cell architecture," ACM Trans. Archit. Code Optim.,vol. 8, no. 4, pp. 37:1{37:20, Jan. 2012. [14] M. K. Jeong et al., \A qos-aware memory controller for dynamically balancing gpu and cpu bandwidth use in an mpsoc," in Proc. of the 49th Annual Design Automation Conf., 2012, pp. 850 [15] \Cusp: Generic parallel algorithms for sparse matrix and graph computations," 2012. [16] S. S. Baghsorkhi et al., \Ecient performance evaluation of memory hierarchy for highly multithreaded graphics processors," SIGPLAN Not., vol. 47, no. 8, pp. 23{34, Feb. 2012. [17] J. M. Anderson et al., \Data and computation transformations for multiprocessors," SIGPLAN Not., vol. 30, no. 8, pp. 166{178, Aug. 1995. [18] M. Bauer et al., \Cudadma: optimizing gpu memory bandwidth via warp specialization," in Proc. of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, New York, USA, 2011, pp. 12:1 [19] V. Sathish et al., \Lossless and lossy memory i/o link compression for improving performance of gpgpu workloads," in Proc. of the 21st international conference on Parallel architectures and compilation techniques, 2012, pp. 325{334. [20] W. Jia et al., \Characterizing and improving the use of demand-fetched caches in gpus," in Proc. of the 26th ACM international conference on Supercomputing, 2012, pp. 15{24. [21] I.-J. Sung et al., \Data layout transformation exploiting memory-levelparallelism in structured grid many-core applications," International Journal of Parallel Programming, pp. 4{24, 2012. [22] Y. Yang, Xiang et al., \A gpgpu compiler for memory optimization and parallelism management," SIGPLAN Not., vol. 45, no. 6, pp. 86,Jun. 2010. [23] B.Wu et al., \Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on gpu," SIGPLAN Not., vol. 48, no. 8, pp. 57{68, Feb. 2013. [24] W. Cesario, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu,Y. Paviot, S. Yoo, A. A. Jerraya, and M. Diaz-Nava, \Component-based design approach for multicore socs," in Proceedings of the 39th AnnualDesign Automation Conference. New York, NY, USA: ACM, 2002, pp.789 [25] F. R. Wagner, W. Cesario, and A. A. Jerraya, \Hardware/software ip integration using the roses design environment," ACM Trans. Embed.Comput. Syst., vol. 6, no. 3, Jul. 2007. [26] \Intel math kernel library," 2011. [27] S. Che et al., \Rodinia: A benchmark suite for heterogeneous computing," in IISWC'09, 2009, pp. 44 [28] P. project, \Matrix market," in available on line at:http://math.nist.gov/MatrixMarket/. [29] P. Viola and M. J. Jones, \Robust real-time face detection," Int. J.Comput. Vision, vol. 57, no. 2, pp. 137{154, May 2004. [30] \The hsa foundationcompubench." [Online]. Available:https://compubench.com/result.jsp |