Detailed Record

Author (Chinese): 鄭凱榮
Author (English): Cheng, Kai-Jung
Thesis Title (Chinese): 高維度張量在GPU上的原地轉置與排列分解方法
Thesis Title (English): ITTPD: In-place Tensor Transposition with Permutation Decomposition on GPUs
Advisor (Chinese): 李哲榮
Advisor (English): Lee, Che-Rung
Committee Members (Chinese): 韓永楷、陳柏安
Committee Members (English): Hon, Wing-Kai; Chen, Po-An
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 111062589
Publication Year (ROC): 113 (2024)
Graduation Academic Year: 112
Language: English
Number of Pages: 54
Keywords (Chinese): 張量、張量轉置、原地轉置、排列分解、演算法、最佳化
Keywords (English): Tensor, Tensor transposition, In-place transposition, Permutation rearrangement, Algorithm, Optimization
Abstract (Chinese): Tensor transposition is a fundamental operation widely used in high-performance tensor computation, for example in tensor contraction. However, most existing tensor transposition methods are out-of-place: they need memory equal to twice the size of the original tensor to complete the transposition. For most accelerators this memory requirement is a serious problem; the memory available on graphics processing units (GPUs), for instance, is relatively limited. This thesis proposes ITTPD, an in-place transposition algorithm for high-order tensors. To make effective use of already optimized matrix transposition kernels, ITTPD first partitions the high-order tensor into smaller sub-tensors. It then treats high-order tensor transposition as a sorting problem and uses the corresponding sorting techniques to lower the transposition into smaller units, such as matrix transpositions, 3rd-order tensor transpositions, or 4th-order tensor transpositions, depending on the target permutation. ITTPD combines the results of these lower-order transpositions to reach the final target. Theoretical analysis shows that, compared with out-of-place methods, ITTPD saves at least 95% of the extra memory usage for sufficiently large tensors. This thesis also presents an efficient GPU implementation of ITTPD. Experimental results show that the performance of ITTPD is close to that of the current GPU tensor transposition library CUDA Tensor Transpose (cuTT). When the data-movement cost required to cope with insufficient memory is taken into account, ITTPD completes the transposition in comparatively little time and can handle tensors twice as large as cuTT can, showing that ITTPD is an effective parallel solution for tensor transposition.
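As a concrete illustration of the linearization idea mentioned above (reusing an optimized matrix transpose kernel for a higher-order transposition), the following minimal NumPy sketch, which is out-of-place, CPU-only, and not the thesis implementation, shows that moving the leading k modes of a row-major tensor to the back amounts to a single 2D matrix transpose of the linearized tensor; the helper name rotate_modes_via_matrix_transpose is hypothetical.

import numpy as np

def rotate_modes_via_matrix_transpose(t, k):
    # Moving the first k modes of a row-major tensor to the back equals one
    # 2D transpose of the tensor linearized into a
    # (prod(first k extents), prod(remaining extents)) matrix.
    lead = int(np.prod(t.shape[:k]))   # product of the leading k extents
    rest = int(np.prod(t.shape[k:]))   # product of the trailing extents
    flat = t.reshape(lead, rest)       # linearization: no data movement
    out = flat.T.copy()                # the single 2D matrix transpose
    return out.reshape(t.shape[k:] + t.shape[:k])

# Check against NumPy's generic transposition for a 4-order tensor.
t = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
assert np.array_equal(rotate_modes_via_matrix_transpose(t, 2),
                      np.transpose(t, (2, 3, 0, 1)))

Unlike this sketch, which allocates a transposed copy, ITTPD performs the corresponding low-order transposes in place on the GPU.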
Abstract (English): Tensor transposition is a crucial operation in tensor calculations with diverse applications. Most algorithms rely on out-of-place transposition, which duplicates the tensor in memory and copies elements to their transposed positions, leading to high memory demand. However, on memory-limited devices such as Graphics Processing Units (GPUs), doubling the tensor size may not be feasible. In this thesis, we introduce ITTPD, an algorithm and its implementation that addresses the challenge of insufficient extra memory space.

ITTPD leverages permutation decomposition, which breaks the permutation down into simpler transpositions called primitives and determines a sequence that fits within the memory constraints. Additionally, ITTPD handles permutations exceeding the memory limit by partitioning the tensor into smaller tensors and transposing each one separately. The GPU implementation optimizes memory access performance and reduces the execution time of each primitive. Experimental results demonstrate that ITTPD achieves performance comparable to state-of-the-art out-of-place GPU implementations while using less extra memory. Furthermore, ITTPD can handle tensors nearly twice as large as out-of-place methods can, making it suitable for a wide range of transpositions of N-order tensors.
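To sketch what breaking a permutation into simpler primitives can look like, the short NumPy example below greedily decomposes an arbitrary mode permutation into adjacent block swaps and realizes each swap as a batched low-order transpose. It is only an out-of-place CPU illustration under these assumptions: the helper names block_swap_steps and apply_block_swap are hypothetical, and ITTPD's actual decomposition (sorting by transpositions with pre-/post-order index reordering) and its in-place GPU primitives are described in Chapters 3 and 4.

import numpy as np

def block_swap_steps(target):
    # Greedily decompose an axis permutation into adjacent block swaps:
    # step (i, p) moves the axis currently at position p in front of the
    # block of axes occupying positions [i, p).
    cur = list(range(len(target)))
    steps = []
    for i, want in enumerate(target):
        p = cur.index(want)
        if p != i:
            steps.append((i, p))
            cur = cur[:i] + [cur[p]] + cur[i:p] + cur[p + 1:]
    return steps

def apply_block_swap(t, i, p):
    # Realize one adjacent block swap as a batched 4-order primitive:
    # reshape to (batch, block1, block2, rest) and swap the middle two modes.
    s = t.shape
    batch  = int(np.prod(s[:i], dtype=int))
    block1 = int(np.prod(s[i:p], dtype=int))
    rest   = int(np.prod(s[p + 1:], dtype=int))
    v = t.reshape(batch, block1, s[p], rest)
    out = np.ascontiguousarray(v.transpose(0, 2, 1, 3))
    return out.reshape(s[:i] + (s[p],) + s[i:p] + s[p + 1:])

# Example: a 5-order transposition realized by three low-order primitives.
t = np.arange(2 * 3 * 4 * 5 * 6).reshape(2, 3, 4, 5, 6)
target = (3, 1, 4, 0, 2)
out = t
for i, p in block_swap_steps(target):
    out = apply_block_swap(out, i, p)
assert np.array_equal(out, np.transpose(t, target))

Each primitive reorders the data through one small, regularly shaped view, which is what makes this kind of step amenable to an in-place GPU kernel.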
Chinese Abstract
List of Figures
List of Tables
1 Introduction
2 Background
2.1 In-place and Out-of-place Transposition
2.2 Permutation Rearrangement
2.3 Tensor Contraction with BLAS
3 Algorithm
3.1 Definition
3.2 Algorithm Overview
3.3 Low-order Transposition
3.3.1 Linearization
3.3.2 Catanzaro's algorithm
3.3.3 Implementation of Primitives
3.4 Memory Reduction Algorithm
3.4.1 Tensor Partition
3.4.2 Tensor Join
3.4.3 Dimension Padding
3.5 Transpose Permutation Decomposition
3.5.1 Sorting By Transpositions
3.5.2 Pre-order Index Reordering Method
3.5.3 Post-order Index Reordering Method
3.5.4 Index Reordering Method Selection
4 GPU Implementation
4.1 Implementation of Catanzaro's Algorithm
4.1.1 Shared Memory
4.1.2 Global Memory
4.2 Implementations of Low-order Transpose
4.2.1 Linearization methods
4.2.2 Extension of Catanzaro's algorithm
4.2.3 Combination methods
4.3 Implementation of Memory Reducing Algorithm
4.3.1 Partition and Join
4.3.2 Dimension Padding
4.4 Optimization
4.4.1 General Linearization Methods
5 Performance Evaluation
5.1 Low-order Transposition
5.2 Memory Reduction Overhead
5.3 Benchmark Set of TTC
5.4 Reordering Algorithm
5.5 Hyperparameter α
5.6 Input Tensor Size
5.7 Comparison with Insufficient Space Out-of-place Methods
6 Conclusion and Future Work
[1] Chetan Nayak et al. “Non-Abelian anyons and topological quantum computation”. In: Reviews of Modern Physics 80.3 (2008), pp. 1083–1159. issn: 1539-0756. doi: 10.1103/RevModPhys.80.1083.
[2] Hans-Joachim Werner et al. “Molpro: a general-purpose quantum chemistry program package”. In: WIREs Computational Molecular Science 2.2 (2012), pp. 242–253. doi: 10.1002/wcms.82. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/wcms.82.
[3] M. Alex O. Vasilescu and Demetri Terzopoulos. “Multilinear Analysis of Image Ensembles: TensorFaces”. In: Computer Vision – ECCV 2002. Ed. by Anders Heyden et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 447–460. isbn: 978-3-540-47969-7.
[4] Alexander Novikov et al. Tensorizing Neural Networks. 2015. arXiv: 1509.06569 [cs.LG].
[5] Tamara G. Kolda and Brett W. Bader. “Tensor Decompositions and Applications”. In: SIAM Review 51.3 (2009), pp. 455–500. doi: 10.1137/07070111X.
[6] So Hirata. “Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories”. In: The Journal of Physical Chemistry A 107 (Nov. 2003), pp. 9887–9897. doi: 10.1021/jp034596z.
[7] A. Abdelfattah et al. “High-performance Tensor Contractions for GPUs”. In: Procedia Computer Science 80 (2016). International Conference on Computational Science 2016, ICCS 2016, 6–8 June 2016, San Diego, California, USA, pp. 108–118. issn: 1877-0509. doi: 10.1016/j.procs.2016.05.302. url: http://www.sciencedirect.com/science/article/pii/S1877050916306536.
[8] Edgar Solomonik et al. “A massively parallel tensor contraction framework for coupled-cluster computations”. In: Journal of Parallel and Distributed Computing 74.12 (2014). Domain-Specific Languages and High-Level Frameworks for High-Performance Computing, pp. 3176–3190. issn: 0743-7315. doi: 10.1016/j.jpdc.2014.06.002. url: http://www.sciencedirect.com/science/article/pii/S074373151400104X.
[9] Yang Shi et al. “Tensor Contractions with Extended BLAS Kernels on CPU and GPU”. In: 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) (2016). doi: 10.1109/HiPC.2016.031.
[10] Paul Springer and Paolo Bientinesi. “Design of a High-Performance GEMM-like Tensor–Tensor Multiplication”. In: ACM Trans. Math. Softw. 44.3 (Jan. 2018). issn: 0098-3500. doi: 10.1145/3157733.
[11] Devin A. Matthews. “High-Performance Tensor Contraction without Transposition”. In: SIAM Journal on Scientific Computing 40.1 (2018), pp. C1–C24. doi: 10.1137/16M108968X.
[12] Glen Evenbly. “A Practical Guide to the Numerical Implementation of Tensor Networks I: Contractions, Decompositions, and Gauge Freedom”. In: Frontiers in Applied Mathematics and Statistics 8 (2022). issn: 2297-4687. doi: 10.3389/fams.2022.806549. url: https://www.frontiersin.org/articles/10.3389/fams.2022.806549.
[13] L. Susan Blackford et al. “An updated set of basic linear algebra subprograms (BLAS)”. In: ACM Transactions on Mathematical Software 28.2 (2002), pp. 135–151.
[14] Dmitry I. Lyakh. “An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU”. In: Computer Physics Communications 189 (Jan. 2015). doi: 10.1016/j.cpc.2014.12.013.
[15] Antti-Pekka Hynninen and Dmitry I. Lyakh. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. 2017. arXiv: 1705.01598 [cs.MS].
[16] Paul Springer, Tong Su, and Paolo Bientinesi. “HPTT: A High-Performance Tensor Transposition C++ Library”. In: ARRAY 2017. Barcelona, Spain: Association for Computing Machinery, 2017, pp. 56–62. isbn: 9781450350693. doi: 10.1145/3091966.3091968.
[17] J. Vedurada et al. “TTLG - An Efficient Tensor Transposition Library for GPUs”. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2018, pp. 578–588. doi: 10.1109/IPDPS.2018.00067.
[18] NVIDIA Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html. 2022.
[19] Fred Gustavson, Lars Karlsson, and Bo Kågström. “Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion”. In: ACM Trans. Math. Softw. 38.3 (Apr. 2012). issn: 0098-3500. doi: 10.1145/2168773.2168775.
[20] Fred G. Gustavson and David W. Walker. “Algorithms for in-place matrix transposition”. In: Concurrency and Computation: Practice and Experience 31.13 (2019), e5071. doi: 10.1002/cpe.5071. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5071.
[21] I-Jui Sung et al. “In-Place Transposition of Rectangular Matrices on Accelerators”. In: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’14. Orlando, Florida, USA: Association for Computing Machinery, 2014, pp. 207–218. isbn: 9781450326568. doi: 10.1145/2555243.2555266.
[22] J. Gómez-Luna et al. “In-Place Matrix Transposition on GPUs”. In: IEEE Transactions on Parallel and Distributed Systems 27.3 (2016), pp. 776–788. doi: 10.1109/TPDS.2015.2412549.
[23] Bryan Catanzaro, Alexander Keller, and Michael Garland. “A Decomposition for In-Place Matrix Transposition”. In: SIGPLAN Not. 49.8 (Feb. 2014), pp. 193–206. issn: 0362-1340. doi: 10.1145/2692916.2555253.
[24] A. A. Tretyakov and E. E. Tyrtyshnikov. “Optimal in-place transposition of rectangular matrices”. In: Journal of Complexity 25.4 (2009), pp. 377–384. issn: 0885-064X. doi: 10.1016/j.jco.2009.02.008. url: http://www.sciencedirect.com/science/article/pii/S0885064X09000120.
[25] Fred Gehrung Gustavson and John A. Gunnels. “Method and structure for cache aware transposition via rectangular subsections”. Feb. 2014.
[26] Ali El-Moursy, Ahmed El-Mahdy, and Hisham El-Shishiny. “An Efficient In-Place 3D Transpose for Multicore Processors with Software Managed Memory Hierarchy”. In: Proceedings of the 1st International Forum on Next-Generation Multicore/Manycore Technologies. IFMT ’08. Cairo, Egypt: Association for Computing Machinery, 2008. isbn: 9781605584072. doi: 10.1145/1463768.1463781.
[27] Muhammad Elsayed, Saleh El-shehaby, and Mohamed Abougabal. “NDPA: A generalized efficient parallel in-place N-Dimensional Permutation Algorithm”. In: Alexandria Engineering Journal 32 (Apr. 2015). doi: 10.1016/j.aej.2015.03.024.
[28] Jose L. Jodra, Ibai Gurrutxaga, and Javier Muguerza. “Efficient 3D Transpositions in Graphics Processing Units”. In: Int. J. Parallel Program. 43.5 (Oct. 2015), pp. 876–891. issn: 0885-7458. doi: 10.1007/s10766-015-0366-5.
[29] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. “TTC: a tensor transposition compiler for multiple architectures”. In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming - ARRAY 2016 (2016). doi: 10.1145/2935323.2935328.
[30] Sangeeta Bhatia, Pedro Feijão, and Andrew R. Francis. “Position and Content Paradigms in Genome Rearrangements: The Wild and Crazy World of Permutations in Genomics”. In: Bulletin of Mathematical Biology 80.12 (2018), pp. 3227–3246. issn: 1522-9602. doi: 10.1007/s11538-018-0514-3.
[31] Laurent Bulteau and Mathias Weller. “Parameterized Algorithms in Bioinformatics: An Overview”. In: Algorithms 12.12 (2019). issn: 1999-4893. doi: 10.3390/a12120256. url: https://www.mdpi.com/1999-4893/12/12/256.
[32] Ron Zeira and Ron Shamir. “Genome Rearrangement Problems with Single and Multiple Gene Copies: A Review”. In: Bioinformatics and Phylogenetics: Seminal Contributions of Bernard Moret. Ed. by Tandy Warnow. Cham: Springer International Publishing, 2019, pp. 205–241. isbn: 978-3-030-10837-3. doi: 10.1007/978-3-030-10837-3_10.
[33] Andre Rodrigues Oliveira et al. “Rearrangement Distance Problems: An updated survey”. In: ACM Comput. Surv. 56.8 (2024). issn: 0360-0300. doi: 10.1145/3653295.
[34] John Kececioglu and David Sankoff. “Exact and Approximation Algorithms for Sorting by Reversals, with Application to Genome Rearrangement”. In: Algorithmica 13 (Feb. 1995), pp. 180–210. doi: 10.1007/BF01188586.
[35] Vineet Bafna and Pavel A. Pevzner. “Sorting by Transpositions”. In: SIAM Journal on Discrete Mathematics 11.2 (1998), pp. 224–240. doi: 10.1137/S089548019528280X.
[36] Vineet Bafna and Pavel A. Pevzner. “Genome Rearrangements and Sorting by Reversals”. In: SIAM Journal on Computing 25.2 (1996), pp. 272–289. doi: 10.1137/S0097539793250627.
[37] David A. Christie. “A 3/2-approximation algorithm for sorting by reversals”. In: Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’98. San Francisco, California, USA: Society for Industrial and Applied Mathematics, 1998, pp. 244–252. isbn: 0898714109.
[38] Piotr Berman, Sridhar Hannenhalli, and Marek Karpinski. “1.375-Approximation Algorithm for Sorting by Reversals”. In: Algorithms – ESA 2002. Ed. by Rolf Möhring and Rajeev Raman. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 200–210. isbn: 978-3-540-45749-7.
[39] Alberto Caprara. “Sorting by reversals is difficult”. In: Proceedings of the First Annual International Conference on Computational Molecular Biology. RECOMB ’97. Santa Fe, New Mexico, USA: Association for Computing Machinery, 1997, pp. 75–83. isbn: 0897918827. doi: 10.1145/267521.267531.
[40] Isaac Elias and Tzvika Hartman. “A 1.375-Approximation Algorithm for Sorting by Transpositions”. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 3.4 (2006), pp. 369–379. doi: 10.1109/TCBB.2006.44.
[41] Laurent Bulteau, Guillaume Fertin, and Irena Rusu. “Sorting by Transpositions Is Difficult”. In: SIAM Journal on Discrete Mathematics 26.3 (2012), pp. 1148–1180. doi: 10.1137/110851390.
[42] Ulisses Dias and Zanoni Dias. “Heuristics for the Transposition Distance Problem”. In: Journal of Bioinformatics and Computational Biology 11.05 (2013), p. 1350013. PMID: 24131057. doi: 10.1142/S0219720013500133.
[43] Luís Felipe I. Cunha et al. “A Faster 1.375-Approximation Algorithm for Sorting by Transpositions”. In: Journal of Computational Biology 22.11 (2015), pp. 1044–1056. PMID: 26383040. doi: 10.1089/cmb.2014.0298.
[44] Luiz Augusto G. Silva et al. “A new 1.375-approximation algorithm for sorting by transpositions”. In: Algorithms for Molecular Biology 17.1 (2022), p. 1. issn: 1748-7188. doi: 10.1186/s13015-022-00205-z.
[45] Alexsandro Oliveira Alexandrino et al. “A 1.375-Approximation Algorithm for Sorting by Transpositions with Faster Running Time”. In: Advances in Bioinformatics and Computational Biology. Ed. by Nicole M. Scherer and Raquel C. de Melo-Minardi. Cham: Springer Nature Switzerland, 2022, pp. 147–157. isbn: 978-3-031-21175-1.
[46] Brett W. Bader and Tamara G. Kolda. “Algorithm 862: MATLAB tensor classes for fast algorithm prototyping”. In: ACM Trans. Math. Softw. 32.4 (2006), pp. 635–653. issn: 0098-3500. doi: 10.1145/1186785.1186794.
[47] Devin A. Matthews. “High-Performance Tensor Contraction without Transposition”. In: SIAM Journal on Scientific Computing 40.1 (2018), pp. C1–C24. doi: 10.1137/16M108968X.
 
 
 
 