
Detailed Record

Author (Chinese): 高魁良
Author (English): Kao, Quey-Liang
Title (Chinese): 多GPU架構之Strassen矩陣乘法
Title (English): Strassen's Matrix Multiplication on Multi-GPU
Advisor (Chinese): 李哲榮
Advisor (English): Lee, Che-Rung
Committee Members (Chinese): 周志遠, 洪哲倫
Committee Members (English): Chou, Jerry; Hung, Che-Lun
Degree: Master's
University: National Tsing Hua University
Department: Computer Science
Student ID: 100062539
Publication Year (ROC calendar): 101 (2012)
Graduation Academic Year: 100
Language: Chinese
Pages: 54
Keywords (Chinese): 矩陣乘法 (matrix multiplication), 快速矩陣乘法 (fast matrix multiplication), 通用圖形處理器 (GPGPU)
Keywords (English): matrix multiplication, Strassen, multi-GPU, GPGPU, GPU
Abstract (Chinese, translated):
General-purpose graphics processing units (GPGPUs), equipped with large numbers of parallel processing units, have become indispensable devices in the field of high-performance computing. To handle larger problems more efficiently, multi-GPU architectures have also come into use in high-performance machines. Designing algorithms well suited to multi-GPU platforms has therefore become an important research topic.
In this thesis, we implement a high-performance Strassen matrix multiplication on a multi-GPU architecture. Strassen's algorithm is recursive in nature: at each step, the input matrices are partitioned into 2x2 blocks of submatrices, 7 submatrix multiplications are carried out recursively together with 18 submatrix additions, and the final result is assembled from them. Compared with the conventional O(n^3) matrix multiplication, Strassen's algorithm attains sub-cubic time complexity.
We propose three implementations of fast matrix multiplication on multi-GPU platforms:
Traditional method: distributes the work to multiple GPUs with the conventional block-form matrix multiplication, then runs the O(n^3) sgemm of cublas 4.2 on each GPU.
Hybrid method: distributes the work to multiple GPUs with the conventional block-form matrix multiplication, then runs a fast matrix multiplication on each GPU.
Strassen method: distributes the work by unrolling Strassen's submatrix partitioning recursively for two levels, then runs a fast matrix multiplication on each GPU.
We tested on a platform equipped with four Tesla C2070 compute cards. Compared with the parallel O(n^3) sgemm, our Strassen method achieves a 28.5% performance improvement; taking single-GPU sgemm as the baseline, it reaches a speedup of 4.37.
Abstract (English):
General-Purpose Graphics Processing Units (GPGPUs), equipped with massively parallel processing units, have become an indispensable building block in High-Performance Computing (HPC). To achieve better scalability, multi-GPU architectures are used in modern HPC machines, so designing efficient algorithms for multi-GPU platforms has become an important research topic.
In this thesis, we present high-performance implementations of Strassen's matrix multiplication algorithm on multi-GPU platforms. Strassen's algorithm is recursively defined: at each level, the matrices are evenly partitioned into 2x2 submatrices, and the algorithm uses 7 matrix multiplications and 18 matrix additions on those submatrices to assemble the final result. Compared with general O(n^3) matrix multiplication algorithms, Strassen's algorithm achieves subcubic time complexity.
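For reference, the seven products of one Strassen level and the assembly of the 2x2 result blocks (the standard formulas, which the abstract only counts) are:

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

Forming the operands takes 10 submatrix additions and assembling the result takes 8, which gives the 18 additions per level mentioned above.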
Three implementations of matrix multiplication for multi-GPU platforms are presented; a host-side sketch of the first follows the list.
Traditional method: uses the block matrix multiplication algorithm to distribute the submatrix multiplications across multiple GPUs, and runs sgemm on each GPU.
Hybrid method: uses the block matrix multiplication algorithm to distribute the submatrix multiplications across multiple GPUs, and runs a fast matrix multiplication on each GPU.
Strassen method: applies Strassen's partitioning twice (yielding 49 sub-multiplications and 126 sub-additions) to distribute the workload across multiple GPUs, and runs a fast matrix multiplication on each GPU.
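To make the data decomposition of the traditional method concrete, here is a minimal host-side sketch in C using the cublas v2 API: C = A*B is split into a 2x2 grid of blocks, and GPU g computes block C(bi,bj) with two cublasSgemm calls. This is our own illustrative reconstruction, not the thesis code; the names block_sgemm_4gpu and blk are hypothetical, the loop drives the GPUs one at a time for brevity, and error checking is omitted.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    /* Pointer to block (bi,bj) of an n x n column-major matrix, half size h. */
    static float *blk(float *M, int n, int h, int bi, int bj) {
        return M + (size_t)bj * h * n + (size_t)bi * h;
    }

    void block_sgemm_4gpu(int n, float *A, float *B, float *C) {
        const float one = 1.0f, zero = 0.0f;
        int h = n / 2;                        /* n assumed even */
        for (int g = 0; g < 4; ++g) {         /* GPU g owns block C(bi,bj) */
            int bi = g / 2, bj = g % 2;
            cudaSetDevice(g);
            cublasHandle_t ctx;
            cublasCreate(&ctx);
            float *dA, *dB, *dC;
            cudaMalloc((void **)&dA, 2 * (size_t)h * h * sizeof(float));
            cudaMalloc((void **)&dB, 2 * (size_t)h * h * sizeof(float));
            cudaMalloc((void **)&dC,     (size_t)h * h * sizeof(float));
            /* Upload A(bi,0), A(bi,1) and B(0,bj), B(1,bj) as h x h tiles. */
            for (int k = 0; k < 2; ++k) {
                cublasSetMatrix(h, h, sizeof(float), blk(A, n, h, bi, k), n,
                                dA + (size_t)k * h * h, h);
                cublasSetMatrix(h, h, sizeof(float), blk(B, n, h, k, bj), n,
                                dB + (size_t)k * h * h, h);
            }
            /* C(bi,bj) = A(bi,0)*B(0,bj) + A(bi,1)*B(1,bj) */
            cublasSgemm(ctx, CUBLAS_OP_N, CUBLAS_OP_N, h, h, h,
                        &one, dA, h, dB, h, &zero, dC, h);
            cublasSgemm(ctx, CUBLAS_OP_N, CUBLAS_OP_N, h, h, h,
                        &one, dA + (size_t)h * h, h, dB + (size_t)h * h, h,
                        &one, dC, h);
            cublasGetMatrix(h, h, sizeof(float), dC, h, blk(C, n, h, bi, bj), n);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
            cublasDestroy(ctx);
        }
    }

A real implementation would drive the four devices concurrently (one host thread or CUDA stream per GPU) and overlap the cublasSetMatrix transfers with computation; the serial loop above only shows the block-to-GPU mapping.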
These implementations were evaluated on a platform equipped with four Tesla C2070 GPUs and compared against the sgemm kernel for a single GPU in cublas 4.2. The Strassen implementation improves on the parallel sgemm by about 28.5% and obtains about a 4.37x speedup over single-GPU sgemm.
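For perspective (our own arithmetic, not a result from the thesis): each Strassen level trades 8 block multiplications for 7, so two levels perform 49 sub-multiplications where the block algorithm at the same granularity would perform 64. The ideal gain from the multiplication count alone is therefore 64/49 ≈ 1.306, about 30.6%; the reported 28.5% sits just under this bound, the gap being the cost of the extra sub-additions and inter-GPU data movement.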
Table of Contents:

1. Introduction
1.1. GPGPU
1.2. Matrix Multiplication
1.3. Algorithms and Implementations
1.3.1. Single GPU
1.3.2. Multi-GPU
1.4. Contributions
1.5. Outline
2. Background
2.1. Terminology
2.2. GPU Architecture
2.3. CUDA Programming Model
2.4. Matrix Multiplication
2.5. Strassen’s Matrix Multiplication
3. Single GPU
3.1. Algorithm
3.2. Choosing the recursive level
3.3. Result
4. Multi-GPU Implementations
4.1. Multi-GPU: traditional method & hybrid method
4.2. Multi-GPU: Strassen method
4.2.1. Level-1 Strassen
4.2.2. Level-2 Strassen
5. Experiments & Discussions
5.1. Performance Model
5.2. Experiments
6. Conclusion
6.1. Summary
6.2. Future work
References
A. Appendix
A.1 Implementation Detail of Strassen’s and Winograd’s Algorithm
A.2 Brief Introduction to Hypergraphs
A.3 Work assignment of level-2 Strassen
A.4 Proof of 4.2.2