|
[1] Volker Strassen. Gaussian Elimination is not Optimal. NUMERISCHE MATHEMATIK Volume 13, Number 4, 354-356. [2] Jacques Cohen and Martin Roth. On the implementation of Strassen's fast multiplication algorithm. ACTA INFORMATICA Volume 6, Number 4, 341-355. [3] David H. Bailey. Extra-High Speed Matrix Multiplication on the Cray-2. SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3 , pg. 603 607. [4] Craig C. Douglas , Michael Heroux , Gordon Slishman , Roger M. Smith , Roger M. A. GEMMW: Portable Level 3 Blas Winograd Variant Of Strassen's Matrix-Matrix Multiply Algorithm. Yale University, Department of Computer Scuence. [5] Steven Huss-Lederman, Elaine M. Jacobson, Jeremy R. Johnson, Anna Tsao, Thomas Turnbull. Implementation of Strassen's Algorithm for Matrix Multiplication. [6] Thomas H. Cormen,Charles E. Leiserson,Ronald L. Rivest,Clifford Stein. Introduction to Algorithms 3rd Edition. P. 795-796. [7] Thomas H. Cormen,Charles E. Leiserson,Ronald L. Rivest,Clifford Stein. Introduction to Algorithms 3rd Edition. P. 111-112. [8] Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, Wei Zhou. Memory efficient scheduling of Strassen-Winograd’s matrix multiplication algorithm. International Symposium on Symbolic and Algebraic Computation 2009. [9] Jesse D. Hall, Nathan A. Carr, John C. Hart. Cache and Bandwidth Aware Matrix Multiplication on the GPU [10] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the Efciency of GPU Algorithms for Matrix-Matrix Multiplication. Graphics Hardware 2004. [11] N. P. Karunadasa & D. N. Ranasinghe. On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters. HiPC 2009. [12] Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips, Yungang Bao, Ninghui Sun. Fast Implementation of DGEMM on Fermi GPU. SC11. [13] Cesur Baransel, Kayhan M. Imre. A parallel implementation of Strassen’s matrix multiplication algorithm for wormhole-routed all-port 2D torus networks. The Journal of Supercomputing. DOI 10.1007/s11227-011-0730-1. [14] Junjie Li, Sanjay Ranka, Sartaj Sahni. Strassen’s Matrix Multiplication on GPUs. Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference. [15] Fengguang Song, Jack Dongarra, and Shirley Moore. EXPERIMENTS WITH STRASSEN’S ALGORITHM FROM SEQUENTIAL TO PARALLEL. Proceedings of Parallel and Distributed Computing and Systems (PDCS). ACTA, Nov. 2006. [16] MAGMA, http://icl.cs.utk.edu/magma/ [17] cublas, http://developer.NVIDIA.com/cublas [18] CULA, http://www.culatools.com/ [19] Jakub Kurzak, Piotr Luszczek, Mathieu Faverge, Jack Dongarra, Lapack Working Notes 266, http://www.netlib.org/lapack/lawnspdf/lawn266.pdf [20] CUDA, http://www.NVIDIA.com.tw/object/cuda_home_new_tw.html [21] CUDA Best Practice Guide P38, 39 http://developer.download.NVIDIA.com/compute/DevZone/docs/html/C/doc/CUDA_C_Best_Practices_Guide.pdf [22] Vasily Volkov, James W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra [23] Rajib Nath , Stanimire Tomov , and Jack Dongarra. An Improved MAGMA GEMM for Fermi GPUs. [24] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms 2nd , ch. 23 Fast Matrix Multiplication.
|