Cheng, Kai-Jung
ITTPD: In-place Tensor Transposition with Permutation Decomposition on GPUs
Lee, Che-Rung
Hon, Wing-Kai
Chen, Po-An
Tensor, Tensor transposition, In-place transposition, Permutation rearrangement, Algorithm, Optimization
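Tensor transposition is a fundamental operation widely used in high-performance tensor computation, for example in tensor contraction. However, most existing tensor transposition methods are out-of-place: they need memory equal to twice the size of the original tensor to complete the transposition. Memory usage is a serious concern on most accelerators; for example, the memory available on graphics processing units (GPUs) is relatively limited. This thesis proposes ITTPD, an in-place transposition algorithm for high-order tensors. To make effective use of already optimized matrix transposition kernels, ITTPD first decomposes the high-order tensor into smaller sub-tensors. It then treats high-order tensor transposition as a reordering (permutation) problem and applies the corresponding reordering methods to reduce the transposition to smaller units, such as matrix transposition, third-order tensor transposition, or fourth-order tensor transposition, depending on the target transposition. ITTPD combines the results of these tensor transpositions to reach the final goal. Theoretical analysis shows that, compared with out-of-place methods, ITTPD saves at least 95% of the extra memory usage for sufficiently large tensors. This thesis also presents an efficient GPU implementation of ITTPD. Experimental results show that the performance of ITTPD is close to that of the current GPU tensor transposition library, CUDA Tensor Transpose (cuTT). When the data movement cost incurred by insufficient memory is taken into account, ITTPD completes the transposition in a relatively short time and can handle tensors twice as large as those cuTT can handle. These results show that ITTPD is an effective parallel solution for tensor transposition.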
Tensor transposition is a crucial operation in tensor calculations with diverse applications. Most algorithms rely on out-of-place transposition, which allocates a second copy of the tensor and copies each element to its transposed position, leading to high memory demand. However, on memory-limited devices such as Graphics Processing Units (GPUs), doubling the memory footprint may not be feasible. In this thesis, we introduce ITTPD, an in-place algorithm and its implementation that address the challenge of insufficient extra memory space.

ITTPD leverages permutation decomposition, which breaks the target permutation down into simpler transpositions called primitives and determines a sequence of primitives that fits within the memory constraints. Additionally, ITTPD handles permutations that exceed the memory limits by partitioning the tensor into smaller tensors and transposing each one separately. The GPU implementation optimizes memory access and reduces the execution time of each primitive. Experimental results demonstrate that ITTPD achieves performance comparable to state-of-the-art out-of-place GPU implementations while using less extra memory.
Furthermore, ITTPD can handle tensors nearly twice as large as out-of-place methods can, making it suitable for various transpositions of N-order tensors.
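
The idea of permutation decomposition can be illustrated with a minimal Python sketch. It is not the ITTPD algorithm: it uses adjacent-axis swaps as a stand-in for the thesis's primitives, ignores memory constraints when choosing the sequence, and applies each step out-of-place for readability. The function names (`decompose_into_adjacent_swaps`, `transpose`, `apply_swaps`) are hypothetical, and the axis convention (output axis k takes input axis `perm[k]`) is an assumption. The sketch only shows that a sequence of simple transpositions reproduces the full permutation.

```python
from itertools import product


def decompose_into_adjacent_swaps(perm):
    """Return a list of indices i, each meaning 'swap axes i and i+1'.

    Applying the swaps in order to the identity axis order yields `perm`.
    Adjacent swaps are an illustrative stand-in, not ITTPD's primitives.
    """
    order = list(range(len(perm)))  # current axis order
    swaps = []
    for pos, axis in enumerate(perm):
        cur = order.index(axis)
        # Bubble the required axis leftwards into position `pos`.
        while cur > pos:
            order[cur - 1], order[cur] = order[cur], order[cur - 1]
            swaps.append(cur - 1)
            cur -= 1
    return swaps


def transpose(data, shape, perm):
    """Out-of-place transpose of a row-major flat list (reference only)."""
    new_shape = [shape[p] for p in perm]
    out = []
    for idx in product(*(range(n) for n in new_shape)):
        src = [0] * len(shape)
        for k, p in enumerate(perm):
            src[p] = idx[k]  # output axis k reads input axis perm[k]
        offset = 0
        for s, n in zip(src, shape):
            offset = offset * n + s  # row-major flat offset
        out.append(data[offset])
    return out, new_shape


def apply_swaps(data, shape, swaps):
    """Apply each adjacent-axis swap in turn as a small transposition."""
    for i in swaps:
        step = list(range(len(shape)))
        step[i], step[i + 1] = step[i + 1], step[i]
        data, shape = transpose(data, shape, step)
    return data, shape


shape = [2, 3, 4]
data = list(range(2 * 3 * 4))
perm = (2, 0, 1)

swaps = decompose_into_adjacent_swaps(perm)
composed = apply_swaps(data, shape, swaps)
direct = transpose(data, shape, perm)
print(swaps, composed == direct)
```

In this sketch every step is a transposition of two neighbouring axes, so each one could be served by a single specialised kernel. The thesis's primitives, and its choice of a sequence that fits within the available memory, are more elaborate than this.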
