
Detailed Record

Author (Chinese): 許家禎
Author (English): Hsu, Chia-Chen
Title (Chinese): 異質多核心系統架構之記憶體存取優化技術
Title (English): Optimized Memory Access Support for Data Layout Conversion on Heterogeneous Multi-core Systems
Advisor: 李政崑
Committee Members: 劉志尉, 陳呈瑋
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 101062594
Year of Publication (ROC calendar): 103 (2014)
Graduation Academic Year: 102
Language: English
Pages: 44
Keywords (Chinese): heterogeneous multi-core; GPGPU; performance; data layout conversion; sparse matrix
Heterogeneous multi-core systems have become a key development focus in response to the large and complex computational demands of modern technology, and integrating a general-purpose graphics processing unit (GPGPU) with a central processing unit (CPU) is a popular direction. However, in a heterogeneous multi-core system, large amounts of data must be transferred between processors, and because those processors differ in hardware architecture and characteristics, the memory locality of the same data layout differs across them, degrading overall system performance. The data layout therefore needs to be reorganized to suit the target hardware architecture: for example, converting between the array-of-structures (AOS) layout suited to CPUs and the structure-of-arrays (SOA) layout suited to GPGPUs, or between the Coordinate (COO) sparse-matrix format and the ELLPACK (ELL) format suited to parallel architectures. Although some existing work performs data layout conversion in software, its performance still leaves room for improvement.
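The AOS/SOA conversion mentioned above can be illustrated with a minimal software sketch. The thesis performs this conversion in a hardware transpose unit; the Python version below (function names are illustrative, not from the thesis) only shows the layout change itself and why each side favors a different processor.

```python
# Sketch of the AOS <-> SOA data layout conversion described above.
# AOS keeps each record contiguous, which favors CPU cache lines;
# SOA keeps each field contiguous, which favors coalesced GPGPU loads.

def aos_to_soa(aos, num_fields):
    """Convert an array-of-structures (flat list with record fields
    interleaved) into a structure-of-arrays (one list per field)."""
    return [aos[f::num_fields] for f in range(num_fields)]

def soa_to_aos(soa):
    """Inverse conversion: interleave the field arrays back into records."""
    return [value for record in zip(*soa) for value in record]

# Three records with two fields (x, y) each.
aos = [1, 10, 2, 20, 3, 30]          # (x0,y0), (x1,y1), (x2,y2)
soa = aos_to_soa(aos, 2)
assert soa == [[1, 2, 3], [10, 20, 30]]   # x-field array, y-field array
assert soa_to_aos(soa) == aos             # round-trip recovers the AOS layout
```

In the thesis this rearrangement happens on the fly while data is moved between the CPU and the GPGPU, rather than as a separate software pass.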

In this thesis, we develop memory access optimization techniques to address the performance degradation described above. A hardware design, together with a data-movement control mechanism, governs where data is placed, so that while data moves between processors its layout is simultaneously converted to match the characteristics of the target processor, substantially improving application performance. We propose a ping-pong transpose architecture to address the data locality problem, as well as a design that controls sparse-matrix data layout. Overall, our approach achieves an improvement of 68.5% to 2.19x over related work.
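The COO-to-ELL conversion handled by the sparse converter can likewise be sketched in software. This is a generic illustration of the two formats, not the thesis's hardware design; the function name, padding value, and the use of -1 as a padding column index are assumptions for the example.

```python
def coo_to_ell(rows, cols, vals, num_rows, pad_val=0.0):
    """Convert COO triplets (row, col, value) into ELL format: fixed-width
    per-row arrays of column indices and values, padded to the longest row.
    The uniform row width is what makes ELL friendly to parallel hardware,
    since every row can be processed with the same access pattern."""
    per_row = [[] for _ in range(num_rows)]
    for r, c, v in zip(rows, cols, vals):
        per_row[r].append((c, v))
    width = max(len(entries) for entries in per_row)
    ell_cols, ell_vals = [], []
    for entries in per_row:
        pad = width - len(entries)
        ell_cols.append([c for c, _ in entries] + [-1] * pad)
        ell_vals.append([v for _, v in entries] + [pad_val] * pad)
    return ell_cols, ell_vals

# 3x3 sparse matrix: row 0 has two nonzeros, rows 1 and 2 one each.
rows = [0, 0, 1, 2]
cols = [0, 2, 1, 2]
vals = [5.0, 1.0, 2.0, 3.0]
ell_cols, ell_vals = coo_to_ell(rows, cols, vals, 3)
assert ell_cols == [[0, 2], [1, -1], [2, -1]]
assert ell_vals == [[5.0, 1.0], [2.0, 0.0], [3.0, 0.0]]
```

The padding makes every row the same width, trading some wasted storage for regular, branch-free parallel access.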
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Introduction
1.2 Related Work
2 Architecture of Memory Manager
2.1 System Overview
2.2 Transpose Unit
2.3 Control Unit
2.4 Sparse Converter
2.5 Software Interface
3 Advanced Design Issues
3.1 The Design of Transpose Unit
3.2 Synchronous Pipeline by PPTU
3.3 Out-of-Order Data Flow
3.4 Finite State Machine
3.5 Details of Sparse Converter
4 CUDA Kernel Adaptation
5 Experiment Results
5.1 Coalesced Transpose Evaluation
5.2 Application-level Analysis for Coalescing Converter
5.3 Analysis of Sparse Converter
5.4 Potential Application Discussion
6 Conclusion
(Full text not released for public access)