[1] B.-W. Cheng, E.-M. Haung, C.-H. Chao, W.-F. Sun, T.-T. Yeh, and C.-Y. Lee, "CoLab: Collaborative and efficient processing of replicated cache requests in GPU," in Proceedings of the 28th Asia and South Pacific Design Automation Conference, ASP-DAC '23, forthcoming.
[2] NVIDIA, "NVIDIA Ampere GA102 GPU architecture," Sept. 2020.
[3] J. Wang, L. Jiang, J. Ke, X. Liang, and N. Jing, "A sharing-aware L1.5D cache for data reuse in GPGPUs," in Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASP-DAC '19, (New York, NY, USA), pp. 388–393, Association for Computing Machinery, 2019.
[4] M. A. Ibrahim, O. Kayiran, Y. Eckert, G. H. Loh, and A. Jog, "Analyzing and leveraging decoupled L1 caches in GPUs," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 467–478, Feb. 2021.
[5] M. A. Ibrahim, O. Kayiran, Y. Eckert, G. H. Loh, and A. Jog, "Analyzing and leveraging shared L1 caches in GPUs," in Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, PACT '20, (New York, NY, USA), pp. 161–173, Association for Computing Machinery, 2020.
[6] K. Choo, W. Panlener, and B. Jang, "Understanding and optimizing GPU cache memory performance for compute workloads," in 2014 IEEE 13th International Symposium on Parallel and Distributed Computing, pp. 189–196, 2014.
[7] S. Dublish, V. Nagarajan, and N. Topham, "Cooperative caching for GPUs," ACM Trans. Archit. Code Optim., vol. 13, Dec. 2016.
[8] M. A. Ibrahim, H. Liu, O. Kayiran, and A. Jog, "Analyzing and leveraging remote-core bandwidth for enhanced performance in GPUs," in 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 258–271, 2019.
[9] B.-W. Cheng, E.-M. Haung, C.-H. Chao, W.-F. Sun, T.-T. Yeh, and C.-Y. Lee, "Remote access tag array for efficient GPU intra-cluster data sharing," in Proceedings of the 24th Workshop on Synthesis And System Integration of Mixed Information Technologies, SASIMI '22, pp. 221–222, 2022.
[10] D. Tarjan and K. Skadron, "The sharing tracker: Using ideas from cache coherence hardware to reduce off-chip memory traffic with non-coherent caches," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, (USA), pp. 1–10, IEEE Computer Society, 2010.
[11] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, "Accel-Sim: An extensible simulation framework for validated GPU modeling," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 473–486, 2020.
[12] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 3–14, 2007.
[13] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, (New York, NY, USA), pp. 487–498, Association for Computing Machinery, 2013.
[14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54, 2009.
[15] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and H. Jeon, "Detailed characterization of deep neural networks on GPUs and FPGAs," in Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, GPGPU '19, (New York, NY, USA), pp. 12–21, Association for Computing Machinery, 2019.
[16] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," in 2012 Innovative Parallel Computing (InPar), pp. 1–10, 2012.
[17] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu, "Adaptive cache management for energy-efficient GPU computing," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (USA), pp. 343–355, IEEE Computer Society, 2014.
[18] G. Koo, Y. Oh, W. W. Ro, and M. Annavaram, "Access pattern-aware cache management for improving data utilization in GPU," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, (New York, NY, USA), pp. 307–319, Association for Computing Machinery, 2017.
[19] Y. Oh, G. Koo, M. Annavaram, and W. W. Ro, "Linebacker: Preserving victim cache lines in idle register files of GPUs," in Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19, (New York, NY, USA), pp. 183–196, Association for Computing Machinery, 2019.
[20] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, "Locality-driven dynamic GPU cache bypassing," in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS '15, (New York, NY, USA), pp. 67–77, Association for Computing Machinery, 2015.
[21] T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 72–83, 2012.
[22] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, (New York, NY, USA), pp. 395–406, Association for Computing Machinery, 2013.
[23] T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware warp scheduling," in 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 99–110, 2013.
[24] O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither more nor less: Optimizing thread-level parallelism for GPGPUs," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 157–166, 2013.