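Augmented Block Cimmino Distributed Algorithm for solving a tridiagonal Matrix on GPU - National Tsing Hua University Electronic Theses and Dissertations System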


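Author (Chinese):	陳又權
Author (English):	Chen, Yu Chuan
Thesis title (Chinese):	GPU實作增廣塊狀西門諾分散式演算法解三對角矩陣
Thesis title (English):	Augmented Block Cimmino Distributed Algorithm for solving a tridiagonal Matrix on GPU
Advisor (Chinese):	李哲榮
Advisor (English):	Lee, Che Rung
Committee members (Chinese):	王偉仲、林俊淵
Committee members (English):	Wang, Wei chung; Lin, Chun-Yuan
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Department of Computer Science
Student ID:	102062533
Year of publication (ROC calendar):	104 (2015)
Graduation academic year:	103
Language:	English
Number of pages:	60
Keywords (Chinese):	ABCD、GPU、平行、三對角矩陣、演算法、solver
Keywords (English):	ABCD, GPU, parallel, tridiagonal matrix, algorithm, solver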

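Solving tridiagonal systems has become an indispensable part of problems in modern natural science and engineering, such as differential equations, partial differential equations, and fluid dynamics. Because of the special sparse structure of tridiagonal matrices, many algorithms tailored to this structure have been proposed. Until now, the mainstream approach has used diagonal pivoting to address accuracy. However, diagonal pivoting has its limitations and yields incorrect answers on certain test matrices. The need for alternatives to, or improvements on, the mainstream algorithms is therefore growing, and the Augmented Block Cimmino Distributed (ABCD) algorithm offers one such alternative.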
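In this thesis, we study the Augmented Block Cimmino Distributed algorithm and implement it on the GPU. To suit the GPU's computing architecture, we apply the basic principles of CUDA programming, such as converting the data storage format so that GPU memory accesses are contiguous (coalesced), which improves performance. We also propose a boundary padding correction method. This method eliminates the branch divergence that severely degrades GPU performance, or reduces its impact, and thereby yields a significant improvement in GPU performance. In the experiments, we compare the accuracy of ABCD in Matlab, the accuracy of ABCD on the GPU, and the accuracy of the mainstream algorithms, and we also compare the computation times of the different algorithms.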
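The storage-format conversion mentioned above can be illustrated with a small index-mapping sketch. It is not taken from the thesis: it assumes one GPU thread per system (or partition), and the names `to_interleaved`, `SYSTEMS`, and `N` are hypothetical. Plain Python is used only to show which addresses neighbouring threads would touch before and after the conversion.

```python
# Illustrative sketch only; names and layout are assumptions, not the thesis code.

def to_interleaved(batched, num_systems, n):
    """Convert system-major storage to element-major (interleaved) storage.

    Input:  batched[s * n + i] holds element i of system s.
    Output: out[i * num_systems + s] holds the same value, so threads
            s = 0, 1, 2, ... reading element i touch adjacent addresses.
    """
    out = [0.0] * (num_systems * n)
    for s in range(num_systems):
        for i in range(n):
            out[i * num_systems + s] = batched[s * n + i]
    return out


SYSTEMS, N = 3, 4
# Diagonal entries of three small systems, stored one system after another.
diag = [float(10 * s + i) for s in range(SYSTEMS) for i in range(N)]
interleaved = to_interleaved(diag, SYSTEMS, N)

# Addresses touched when "threads" 0..2 each read element i = 1 of their system.
before = [s * N + 1 for s in range(SYSTEMS)]       # N elements apart (strided)
after = [1 * SYSTEMS + s for s in range(SYSTEMS)]  # consecutive (coalescable)
```

In the system-major layout the three reads are `N` elements apart, whereas in the interleaved layout they are adjacent, which is the access pattern a GPU can serve in a single memory transaction.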

Tridiagonal solvers are a fundamental component of scientific and engineering problems, such as Alternating Direction Implicit (ADI) methods, fluid simulation, and Poisson's equation. Because of the particular sparse structure of tridiagonal matrices, many algorithms for solving such systems have been devised. Until now, the mainstream approach has been to use diagonal pivoting to mitigate accuracy problems. However, diagonal pivoting has its limits and produces erroneous solutions as the condition number increases. The Augmented Block Cimmino Distributed (ABCD) algorithm offers another option for solving the problem accurately.
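For orientation, the classical Cimmino method that gives ABCD its name is a row-projection scheme: the current iterate is projected onto the hyperplane of each equation, and the projections are averaged. The sketch below shows only that classical iteration on a tiny tridiagonal system. It is not the ABCD algorithm and not the thesis implementation; the function name `cimmino`, the relaxation factor `1/m`, and the iteration count are assumptions chosen for simplicity.

```python
# Classical Cimmino row projection (illustrative; not the ABCD algorithm itself).

def cimmino(A, b, iterations=2000):
    """Iterate x <- x + omega * sum_i (b_i - a_i.x) / ||a_i||^2 * a_i."""
    m, n = len(A), len(A[0])
    omega = 1.0 / m  # conservative relaxation factor (an assumption)
    x = [0.0] * n
    for _ in range(iterations):
        update = [0.0] * n
        for row, rhs in zip(A, b):
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            scale = residual / sum(a * a for a in row)
            for j in range(n):
                update[j] += scale * row[j]  # projection onto this row's hyperplane
        x = [xi + omega * u for xi, u in zip(x, update)]
    return x


# Small tridiagonal test system whose exact solution is [1, 2, 3].
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
b = [0.0, 0.0, 4.0]
x = cimmino(A, b)
```

The plain iteration converges slowly, which is part of why a direct variant is attractive. As described in the ABCD technical report listed in the references, the augmentation is designed to make the row partitions mutually orthogonal so that the sum of projections yields the solution without iterating.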
In this thesis, we study the ABCD algorithm and implement it on the GPU. Because of the special structure of tridiagonal matrices, we investigate a boundary padding technique that eliminates execution branches on the GPU for better performance. In addition, our implementation incorporates various performance optimization techniques, such as memory coalescing, to further enhance performance. In the experiments, we evaluate the accuracy and performance of our GPU implementation against a CPU implementation and analyze the effectiveness of each performance optimization technique. The GPU version is about 15 times faster than the CPU version.
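The general idea behind boundary padding can be shown on a tridiagonal matrix-vector product. This is a generic illustration under stated assumptions, not the specific padding of the thesis: the function names and the choice of padding the vector with one zero at each end are hypothetical. The first and last rows of a tridiagonal matrix have fewer entries, so a straightforward loop needs conditionals at the boundaries. Padding lets every row run the same instructions.

```python
# Generic boundary-padding illustration; not the padding scheme of the thesis.

def matvec_branching(a, b, c, x):
    """y = A x with sub-diagonal a, diagonal b, super-diagonal c (with branches)."""
    n = len(b)
    y = []
    for i in range(n):
        acc = b[i] * x[i]
        if i > 0:            # first row has no sub-diagonal entry
            acc += a[i] * x[i - 1]
        if i < n - 1:        # last row has no super-diagonal entry
            acc += c[i] * x[i + 1]
        y.append(acc)
    return y


def matvec_padded(a, b, c, x):
    """Same product, but x is padded with zeros so no row needs a branch."""
    n = len(b)
    xp = [0.0] + list(x) + [0.0]  # xp[i + 1] == x[i]
    # a[0] and c[n - 1] multiply the zero padding, so their values do not matter.
    return [a[i] * xp[i] + b[i] * xp[i + 1] + c[i] * xp[i + 2] for i in range(n)]


a = [0.0, -1.0, -1.0, -1.0]   # sub-diagonal (a[0] unused)
b = [2.0, 2.0, 2.0, 2.0]      # main diagonal
c = [-1.0, -1.0, -1.0, 0.0]   # super-diagonal (c[-1] unused)
x = [1.0, 2.0, 3.0, 4.0]

y_branching = matvec_branching(a, b, c, x)
y_padded = matvec_padded(a, b, c, x)
```

On a GPU, threads in the same warp that take different branches are serialized, so trading a few extra multiplications by zero for a uniform instruction stream is usually worthwhile.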

Abstract (Chinese) i
Abstract ii
Table of Contents iii
1. Introduction 1
2. Preliminaries 2
2.1. Notation 2
2.1.1. Letters presented in uppercase style: A 2
2.1.2. Letters presented in lowercase style: X 2
2.1.3. Letters presented in uppercase style: F 2
2.1.4. Letters presented in uppercase style: V and W 2
2.1.5. Kernel functions 2
2.2. Tridiagonal Solvers 2
2.2.1. Cyclic Reduction 3
2.2.2. Parallel Cyclic Reduction 3
2.2.3. Recursive doubling 4
2.2.4. Diagonal Pivoting 4
2.3. SPIKE algorithm 4
3. Algorithm 8
3.1. ABCD algorithm 8
3.1.1. Cimmino Method 8
3.1.2. Augmented Block Cimmino Distributed algorithm 12
3.2. ABCD algorithm for solving many small tridiagonal systems 16
3.3. ABCD algorithm for solving a huge tridiagonal system 18
3.4. Operation counts for Givens rotation and direct multiplication 21
4. Implementation and performance optimization 24
4.1. ABCD implementation 24
4.1.1. Coalesce implementation on GPU 24
4.1.2. Augmented matrix 25
4.1.3. Givens rotation and direct multiplication for solving $(\bar{A}_i \bar{A}_i^T)^{-1}$ 26
4.1.4. Calculation of S and solving by substitution 27
4.1.5. Calculating v and x for u+v 28
4.2. Mixed version of ABCD and SPIKE on GPU 29
4.2.1. SPIKE implementation 29
4.2.2. SPIKE-ABCD algorithm implementation 30
4.3. Boundary padding technique 31
4.3.1. Padding of $C_{i,j}$ and negative identity for bottom and top partitions 31
4.3.2. Padding of first partition for Givens rotation 32
4.3.3. Padding and using sparse storage on u 35
5. Experiments 38
5.1.1. Coalesce speedup evaluation 38
5.1.2. $C_{i,j}$ and I padding and additional padding on first partition evaluation 39
5.1.3. Sparse storage and padding on u evaluation 40
5.2. Numerical stability test on different matrices 41
5.3. GPU calculation time evaluation 47
6. Conclusions 50
7. References 51
8. Appendix 53

[1] Nikolai Sakharnykh, “Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation,” San Jose Convention Center, San Jose, CA, September 21, 2010
[2] R. W. Hockney, “A fast direct solution of Poisson’s equation using Fourier analysis,” J. ACM, vol. 12, pp. 95–113, January 1965.
[3] Y. Zhang, J. Cohen, and J. D. Owens, “Fast tridiagonal solvers on the GPU,” in Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’10, (New York, NY, USA), pp. 127–136, ACM, 2010.
[4] E. Polizzi and A. H. Sameh, “A parallel hybrid banded system solver: The SPIKE algorithm,” Parallel Computing, vol. 32, no. 2, pp. 177–194, 2006.
[5] Walter Gander and Gene H. Golub, “Cyclic Reduction – History and Applications,” Proceedings of the Workshop on Scientific Computing, Hong Kong, 10-12 March, 1997.
[6] Harold S. Stone, “An efficient parallel algorithm for the solution of a tridiagonal linear system of equations,” Journal of ACM, Vol. 20, No. 1, pp. 27-38, January 1973.
[7] J. B. Erway, R. F. Marcia, and J. Tyson, “Generalized diagonal pivoting methods for tridiagonal systems without interchanges,” IAENG International Journal of Applied Mathematics, vol. 40, no. 4, pp. 269–275, 2010.
[8] Ali Cevahir, Akira Nukada, Satoshi Matsuoka, “High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning,” Computer Science - Research and Development, May 2010, Volume 25, Issue 1-2, pp. 83-91.
[9] I. S. Duff, R. Guivarch, D. Ruiz, and M. Zenadi, “The Augmented Block Cimmino Distributed Method,” Technical Report TR/PA/13/11, CERFACS, Toulouse, France, 2013.
[10] Chang Li-Wen, John A. Stratton, Hee-Seok Kim, and Wen-Mei W. Hwu, “A scalable, numerically stable, high performance tridiagonal solver using GPUs,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp. 27:1–27:11.
[11] CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[12] cuSPARSE, CUDA toolkit documentation http://docs.nvidia.com/cuda/cusparse/#abstract
[13] CUDA Batching Kernels http://docs.nvidia.com/cuda/cublas/#batching-kernels
