
Detailed Record

Author (Chinese): 陳建廷
Author (English): Chen, Chien-Ting
Thesis title (Chinese): 根據晶片網路之行為模式增進晶片外之記憶體存取速度之架構
Thesis title (English): Exploiting the NoC Traffic Behavior of GPGPU for Improving the Off-Chip Memory Access
Advisor (Chinese): 金仲達
Advisor (English): King, Chung-Ta
Committee members (Chinese): 劉靖家、黃婷婷、金仲達
Committee members (English): Liou, Jing-Jia; Huang, Ting-Ting; King, Chung-Ta
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 100065508
Year of publication (ROC calendar): 102 (2013)
Graduation academic year: 101
Language: English
Number of pages: 42
Keywords (Chinese): 晶片網路、記憶體存取
Keywords (English): Network-on-chip, Memory Access
Statistics:
  • Recommendations: 0
  • Views: 328
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese):
Modern general-purpose graphics processing units (GPGPUs) are widely used for high-performance computing workloads. Their massively multithreaded architecture makes them well suited to data-parallel programming, but it also places far greater demands on memory and on the network-on-chip than conventional chip multiprocessors (CMPs). Given the hardware behavior of data-parallel programs, one clear way to relieve this pressure is to raise the hit rate of the DRAM row buffer: a higher row buffer hit rate means the data held in the row buffer is replaced far less often, which improves overall memory performance. Existing approaches raise the row buffer hit rate by reordering memory requests inside the memory controller, but the window of requests they can consider is small, and the added complexity of the memory controller can hurt overall performance.

Because the massively multithreaded architecture of GPGPUs offers more opportunity to hide memory access latency, this thesis exploits that property to reorder memory request packets within the network-on-chip before they reach the memory controller. By gathering packets that target the same DRAM row and coalescing them in the NoC routers, the packets arrive at the memory controller in an order that yields consecutive row buffer hits, which speeds up memory access. To verify the feasibility of this approach, we designed an extended NoC router that supports packet coalescing and evaluated it extensively. The evaluation shows that this NoC-assisted design does raise the DRAM row buffer hit rate; unfortunately, for the benchmarks we used, the design improves overall performance only under specific conditions. The thesis therefore investigates the likely causes in depth and discusses ways to overcome them.
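To make the row buffer argument concrete, the following is a minimal sketch, not taken from the thesis: a single DRAM bank with one open row, with illustrative row IDs and request streams, showing how grouping requests by row turns misses (which require a precharge and activation) into hits.

    // Minimal sketch (illustrative assumptions, not the thesis's DRAM model):
    // one bank, one open row, counting row buffer hits and misses.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Bank {
        int64_t open_row = -1;   // row currently held in the row buffer (-1 = none)
        int hits = 0, misses = 0;

        void access(int64_t row) {
            if (row == open_row) ++hits;        // row buffer hit: no activation needed
            else { ++misses; open_row = row; }  // miss: precharge + activate the new row
        }
    };

    int main() {
        // Interleaved rows, as requests might arrive from many GPGPU cores.
        std::vector<int64_t> unordered = {0, 7, 0, 7, 0, 7};
        // Same requests grouped by row, the order NoC-side coalescing aims to deliver.
        std::vector<int64_t> grouped   = {0, 0, 0, 7, 7, 7};

        Bank a, b;
        for (auto r : unordered) a.access(r);
        for (auto r : grouped)   b.access(r);

        std::printf("unordered: %d hits, %d misses\n", a.hits, a.misses); // 0 hits, 6 misses
        std::printf("grouped:   %d hits, %d misses\n", b.hits, b.misses); // 4 hits, 2 misses
        return 0;
    }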
Abstract (English):
Modern General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing. The massive multithreading architecture of these GPGPUs makes them ideal for data-parallel programming. However, it also stresses the memory and the interconnect to and from the memory well beyond what state-of-the-art Chip Multiprocessors (CMPs) impose. Considering the memory access behaviors in data-parallel computing, a viable direction for relieving this memory stress is to improve the row buffer hit rate of the DRAM: a high hit rate implies fewer replacements of the row buffer and thus higher DRAM performance. Previous approaches rearrange the memory requests at the memory controller to increase the row buffer hit rate, but the window of requests that can be considered for rearrangement is narrow, and the added complexity of the memory controller affects its critical path.

Since the massive multithreading architecture of GPGPUs can better hide memory access latencies, this thesis exploits the idea of rearranging memory requests in the network-on-chip (NoC). By coalescing memory requests destined for the same DRAM row in the routers of the NoC, the requests are already in the proper order for consecutive row buffer hits when they arrive at the memory controller. To study the feasibility of this idea, we have designed an extended NoC router that supports packet coalescing and evaluated its performance extensively. The results show that this NoC-assisted design strategy can indeed improve the DRAM row buffer hit rate. Unfortunately, for the benchmarks studied, our design improves whole-system performance only under specific conditions. We therefore conduct in-depth investigations into the possible causes and study ways to mitigate the problems.
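As a rough illustration of the coalescing idea described above, the sketch below models a router-side buffer that groups request packets by their target DRAM (bank, row) before forwarding them. The address mapping, structure names, and drain policy are assumptions made for illustration; the thesis's actual router additionally involves a detector, scheduler, and grant holder (see the table of contents below).

    // Minimal sketch (assumed design, not the thesis's router microarchitecture):
    // buffer incoming memory-request packets and release them grouped by their
    // target DRAM (bank, row), so same-row requests reach the controller back-to-back.
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct RequestPacket {
        uint64_t addr;      // physical address carried by the packet
        int      src_core;  // issuing GPGPU core (needed for the reply route)
    };

    // Illustrative address mapping: which DRAM bank and row an address falls in.
    constexpr uint64_t ROW_BYTES = 2048;
    constexpr int      NUM_BANKS = 8;
    static std::pair<int, uint64_t> bank_row(uint64_t addr) {
        uint64_t row_id = addr / ROW_BYTES;
        return { static_cast<int>(row_id % NUM_BANKS), row_id / NUM_BANKS };
    }

    class Coalescer {
    public:
        // Buffer a packet under its (bank, row) key.
        void push(const RequestPacket& p) { groups_[bank_row(p.addr)].push_back(p); }

        // Drain all buffered packets, one (bank, row) group at a time.
        std::vector<RequestPacket> drain() {
            std::vector<RequestPacket> out;
            for (auto& kv : groups_)
                out.insert(out.end(), kv.second.begin(), kv.second.end());
            groups_.clear();
            return out;  // same-row requests are now contiguous
        }

    private:
        std::map<std::pair<int, uint64_t>, std::vector<RequestPacket>> groups_;
    };

Draining one (bank, row) group at a time is what lets the memory controller see same-row requests consecutively and service them as row buffer hits; the cost is the extra buffering delay in the router, which the thesis argues GPGPU multithreading can hide.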
Table of Contents:
1 Introduction
2 Background and Motivating Example
  2.1 GPGPU Architecture
  2.2 Motivating Example
  2.3 Practicability of Inter-Core Coalescence
    2.3.1 Performance Impact of Delays in NoC
    2.3.2 Probability of Coalescence
3 System Design
  3.1 Router Microarchitecture
    3.1.1 Detector
    3.1.2 Coalescer
    3.1.3 Scheduler
    3.1.4 Grant Holder
  3.2 Allocation of Delays in Routers
4 Evaluation
  4.1 Environment Setup
  4.2 Fast Evaluation
  4.3 Accurate Evaluation
    4.3.1 Row Buffer Miss Rate
    4.3.2 Overall Performance
  4.4 Performance Impact Factors
    4.4.1 Application Characteristics Related Factors
      4.4.1.1 Available Warps
      4.4.1.2 Coalescing Probability and Quantity of Memory Requests
    4.4.2 Micro-architecture Related Factors
      4.4.2.1 Head-of-Line Problem in DRAM
      4.4.2.2 Banked FIFO
      4.4.2.3 Slack Distribution Strategies
  4.5 Detailed Analysis of Performance Gain
  4.6 Analysis of Overhead
5 Related Work
  5.1 Memory Access Scheduling
  5.2 Slack Time
6 Conclusion and Future Work