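Author (Chinese): 彭德欣
Author (English): Peng, Te-Hsin
Thesis title (Chinese): 利用高階合成進行加速器之記憶體分區與最佳化技術
Thesis title (English): Memory Partitioning and Optimization of On-Chip Accelerators with High-Level Synthesis
Advisor (Chinese): 黃稚存
Advisor (English): Huang, Chih-Tsun
Committee members (Chinese): 劉靖家; 謝明得
Committee members (English): Liou, Jing-Jia; Shieh, Ming-Der
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 104062547
Year of publication (ROC calendar): 107
Academic year of graduation: 106
Language: English
Number of pages: 56
Keywords (Chinese): 高階合成; 記憶體分區; 加速器; 設計空間探索
Keywords (English): high-level synthesis; memory partitioning; accelerator; design space exploration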
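Design space exploration for accelerators today is mainly carried out with either a Register-Transfer Level (RTL) design flow or a High-Level Synthesis (HLS) design flow, but either flow on its own is quite time-consuming. Pre-RTL tools such as Aladdin can directly analyze designs written in a high-level language and need less time to estimate the timing, area, and energy of different design parameters (micro-architectures).

Our previous work proposed an assisted design flow that combines the HLS design flow with Aladdin to explore the accelerator design space. The HLS tool used in that flow is Vivado HLS, which is designed for the FPGA flow; if users want to adopt an ASIC flow, the results estimated with Vivado HLS will deviate. In this thesis we therefore use Stratus HLS, an HLS tool oriented toward the ASIC flow, and obtain the overall design space of the accelerator.

In addition, previous work noted that during design space exploration, conventional memory partitioning methods such as block, cyclic, and block-cyclic partitioning may all cause performance bottlenecks as the number of memory partitions grows, and it therefore proposed a new memory partitioning algorithm (the remapping algorithm) to optimize performance. In its implementation, however, that algorithm performs many checks for discontinuous memory-partition addresses and many data swaps in order to find the regularity of the data mapping. In this thesis we therefore improve the way the regularity is found, without changing the core concept, making the algorithm more complete.

In the experiments, we apply the optimized memory partitioning algorithm (the remapping algorithm) and the conventional cyclic partitioning approach to six applications with different access patterns, and explore the design space with different loop unrolling and memory partitioning parameters. We then classify the six applications by access pattern type and compare and analyze their performance and area. The experimental results show that, compared with the cyclic partitioning approach, our algorithm achieves better performance in all six applications at a relatively small area cost.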
Current research on design space exploration for accelerators relies mainly on either an RTL-based flow or a High-Level Synthesis (HLS) flow. However, both are very time-consuming. Pre-RTL tools, such as Aladdin, can directly analyze designs written in high-level languages and take less time to estimate the timing, area, and power of different micro-architectures.

Our previous work proposed an assisted design flow that combines the HLS flow with Aladdin to explore the design space. That flow uses Vivado HLS, which targets the FPGA design flow. If users want to adopt an ASIC design flow, the results may be inaccurate. Therefore, in this thesis, we extend the exploration flow to use an ASIC-oriented HLS tool, Stratus HLS, resulting in a more accurate design space exploration.

In addition, conventional partitioning approaches, such as the block, cyclic, and block-cyclic techniques, cannot always distribute the data elements evenly among the memory banks. The resulting memory conflicts become a performance bottleneck. Our previous work proposed a novel remapping algorithm to solve this problem.
However, the original remapping scheme introduces irregular data padding or unnecessary data swapping, which adds area or latency overhead. In this thesis, we improve the remapping algorithm with a more general and efficient approach to finding the regularity of the data mapping.
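To make the bank-conflict problem concrete, the following is a minimal sketch of the three conventional partitioning schemes named above and of how a strided access pattern can collide in one bank. It is not the thesis's remapping algorithm; the function names, the 16-element array, the four banks, and the stride-4 access pattern are all assumptions chosen for illustration.

```python
def block_bank(i, n, banks):
    """Block partitioning: consecutive chunks of ceil(n / banks) elements share a bank."""
    chunk = -(-n // banks)  # ceiling division
    return i // chunk


def cyclic_bank(i, banks):
    """Cyclic partitioning: element i goes to bank i mod banks."""
    return i % banks


def block_cyclic_bank(i, banks, block):
    """Block-cyclic partitioning: blocks of `block` elements are dealt out round-robin."""
    return (i // block) % banks


def max_conflicts(indices, bank_of):
    """Largest number of simultaneous accesses that land in the same bank."""
    counts = {}
    for i in indices:
        b = bank_of(i)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())


# Hypothetical pattern: four parallel reads with stride 4 into a 16-element array.
accesses = [0, 4, 8, 12]
n, banks = 16, 4

block_result = max_conflicts(accesses, lambda i: block_bank(i, n, banks))
cyclic_result = max_conflicts(accesses, lambda i: cyclic_bank(i, banks))
block_cyclic_result = max_conflicts(accesses, lambda i: block_cyclic_bank(i, banks, 2))

print(block_result, cyclic_result, block_cyclic_result)  # 1 4 2
```

For this particular pattern the cyclic scheme sends all four reads to bank 0, so they would have to be serialized on a single-port bank, while block partitioning happens to spread them evenly. A different stride would favor a different scheme, which is why no single conventional scheme suits every access pattern.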

We compare the optimized remapping algorithm with the conventional cyclic approach on six benchmark applications with different access patterns, applying different combinations of loop unrolling and memory partitioning factors to explore the design space. We then classify the six benchmarks by access pattern and analyze their performance and area. The experimental results show that our optimized remapping approach effectively improves performance over the cyclic approach at a small area overhead.
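The exploration over loop unrolling and memory partitioning factors can be pictured as a sweep followed by a Pareto filter on latency and area. The sketch below uses a made-up cost model (`toy_cost`) purely to have numbers to filter; it does not reproduce estimates from Aladdin, Vivado HLS, or Stratus HLS, and the factor list and area weights are assumptions.

```python
from itertools import product


def toy_cost(unroll, banks):
    """Hypothetical cost model, for illustration only (not from the thesis)."""
    # Parallel accesses per cycle are limited by the smaller of the two factors.
    parallelism = min(unroll, banks)
    latency = 1024 // parallelism      # cycles for a 1024-iteration loop
    area = 10 * unroll + 25 * banks    # arbitrary area units
    return latency, area


def pareto_front(points):
    """Keep the points that no other point beats in both latency and area."""
    front = []
    for p in points:
        dominated = any(
            q["latency"] <= p["latency"]
            and q["area"] <= p["area"]
            and (q["latency"] < p["latency"] or q["area"] < p["area"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


factors = [1, 2, 4, 8]
points = []
for unroll, banks in product(factors, factors):
    latency, area = toy_cost(unroll, banks)
    points.append({"unroll": unroll, "banks": banks, "latency": latency, "area": area})

front = pareto_front(points)
for p in front:
    print(p)
```

Under this toy model only the configurations where the unrolling factor equals the number of banks survive the filter, because extra unrolling without matching banks (or the reverse) adds area without reducing latency. A real flow would replace `toy_cost` with estimates from the pre-RTL or HLS tool.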
1. Introduction
2. Previous Work
3. Proposed Methodology
4. Experimental Results and Analysis
5. Conclusion and Future Work
[1] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," in ACM/IEEE International Symposium on Computer Architecture (ISCA), pp. 97-108, Jun. 2014.
[2] W.-C. Chang, "Assisted Design Optimization using High-Level Synthesis Flow," 2016.
[3] Y. S. Shao and D. Brooks, "ISA-Independent Workload Characterization and its Implications for Specialized Architectures," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2013.
[4] J. Cong, W. Jiang, B. Liu, and Y. Zou, "Automatic Memory Partitioning and Scheduling for Throughput and Power Optimization," in ACM, 2011.
[5] Y. Wang, P. Zhang, X. Cheng, and J. Cong, "An Integrated and Automated Memory Optimization Flow for FPGA Behavioral Synthesis," in ASPDAC, 2012.
[6] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, "Memory Partitioning and Scheduling Co-optimization in Behavioral Synthesis," in ICCAD, 2012.
[7] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, "Memory Partitioning for Multidimensional Arrays in High-Level Synthesis," in DAC, 2013.
[8] Y. Wang, P. Li, and J. Cong, "Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis," in FPGA, 2014.
[9] P. Li, P. Zhang, L.-N. Pouchet, and J. Cong, "Resource-Aware Throughput Optimization for High-Level Synthesis," in FPGA, 2015.
[10] J. Su, F. Yang, X. Zeng, and D. Zhou, "Efficient Memory Partitioning for Parallel Data Access via Data Reuse," in FPGA, 2016.
[11] J. Su, F. Yang, X. Zeng, and D. Zhou, "Interplay of loop unrolling and multidimensional memory partitioning in HLS," in DATE, 2015.
[12] B. Reagen, Y. S. Shao, G.-Y. Wei, and D. Brooks, "Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware," in ISLPED, 2013.
[13] M. Li, P. Zhang, C. Zhu, H. Jia, X. Xie, J. Cong, and W. Gao, "High Efficiency VLSI Implementation of an Edge-directed Video Up-scaler Using High Level Synthesis," in IEEE International Conference on Consumer Electronics (ICCE), 2015.
[14] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks, "MachSuite: Benchmarks for Accelerator Design and Customized Architectures," in IEEE International Symposium on Workload Characterization (IISWC), 2014.
[15] B. Carrion Schafer and A. Mahapatra, "S2CBench: Synthesizable SystemC Benchmark Suite for High-Level Synthesis," in IEEE Embedded Systems Letters, vol. 6, no. 3, 2014.
[16] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable Domain Specific Computing," in IEEE Design & Test of Computers, 2011.
[17] C.-T. Huang and H.-T. Tsai, "Performance Optimization of Accelerators using C-based High-Level Synthesis Flow," 2016.
[18] Available: http://accelerator.eecs.harvard.edu/isca14tutorial/isca2014-tutorial-cad-benchmarks.pdf
 
 
 
 