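Author (Chinese): 張瑋君
Author (English): Wei-Chun Chang
Thesis title (Chinese): 應用高階合成之設計效能改善輔助技術
Thesis title (English): Assisted Design Optimization using High-Level Synthesis Flow
Advisor (Chinese): 黃稚存
Advisor (English): Huang, Chih Tsun
Committee members (Chinese): 劉靖家, 黃俊達
Committee members (English): Liou, Jing Jia; Huang, Juinn Dar
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Student ID: 103062524
Year of publication (ROC calendar): 105
Academic year of graduation: 105
Language: English
Number of pages: 42
Keywords (Chinese): 高階合成; 記憶體; 架構探索
Keywords (English): High-level synthesis; Memory; Design Space Exploration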
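Abstract (translated from the Chinese):
Current accelerator research relies mainly on register-transfer level (RTL) flows to obtain
accurate estimates of timing, power, and area. Pre-RTL tools such as Aladdin can produce
approximately accurate estimates in very little time without generating RTL code. As designs
grow more complex, however, exploration with either RTL or pre-RTL tools becomes very
time-consuming.
In this thesis, we propose an assisted design flow that effectively reduces the number of
design points formed by combinations of loop unrolling and memory partitioning. First, we use
Aladdin to quickly find the loop-unrolling parameters without considering memory partitioning,
and obtain the dynamic data dependence graph (DDDG). We then analyze the DDDG to find the
memory-partitioning parameters. Conventional data-distribution schemes are mainly block,
cyclic, and block-cyclic, but these can create performance bottlenecks. We therefore also
propose a memory-partitioning method that differs from the conventional ones: by analyzing the
DDDG, it spreads the data evenly across the memory partitions. At the end of the flow, we use
a high-level synthesis tool to synthesize the C program under exploration into RTL code and
obtain accurate estimates.
Existing high-level synthesis tools, such as Vivado HLS, accept programs written in
C/C++/SystemC and produce different designs according to different parameters. To be usable
with such tools, however, the way the program is written is quite restricted. We therefore
provide three patches that improve memory partitioning, loop unrolling, and input buffering.
Experimental data show that we can greatly reduce simulation time. Our memory-partitioning
method also improves performance substantially while using fewer partitions.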
Current research on accelerator design relies mainly on register-transfer level (RTL)-based
flows to obtain accurate timing, power, and area estimates. Pre-RTL tools such as
Aladdin [1] can also produce approximately accurate estimates without generating RTL code.
However, design exploration of large or complex designs remains time-consuming with either
RTL or pre-RTL tools.
In this thesis, we propose an assisted design flow that efficiently reduces the number of
points searched during design exploration with a pre-synthesis tool, considering
micro-architectural factors such as loop unrolling and memory partitioning. First, we use
Aladdin [1] to quickly explore the unrolling factor without considering memory partitioning
and to generate dynamic data dependence graphs (DDDGs). After an unrolling factor is chosen,
the DDDG is analyzed to explore the memory partitioning. Conventional memory-partitioning
schemes are mainly block, cyclic, or block-cyclic. Because the partitioning strongly affects
performance, it can become the performance bottleneck. In our flow, we propose a
memory-remapping methodology that improves the source code with better data placement across
memory partitions, based on the DDDG. Finally, we use a high-level synthesis tool to generate
RTL code and obtain accelerator designs with their performance, area, and power.
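To make the three conventional partitioning schemes concrete, the following is a minimal sketch of how each one maps an array index to a memory bank, and how many same-cycle accesses collide in one bank. It is an illustration only, not the thesis's DDDG-based remapping: the function names, the array size, the bank count, the block size, and the access patterns are all assumptions chosen for the example.

```python
from collections import Counter


def block_bank(i, n, banks):
    """Block partition: consecutive chunks of ceil(n / banks) elements share a bank."""
    chunk = -(-n // banks)  # ceiling division
    return i // chunk


def cyclic_bank(i, n, banks):
    """Cyclic partition: element i goes to bank i mod banks."""
    return i % banks


def block_cyclic_bank(i, n, banks, block=2):
    """Block-cyclic partition: blocks of `block` elements are dealt out round-robin."""
    return (i // block) % banks


def worst_conflict(indices, bank_of, n, banks):
    """Largest number of same-cycle accesses that land in a single bank."""
    counts = Counter(bank_of(i, n, banks) for i in indices)
    return max(counts.values())


# Hypothetical setup: a 16-element array split over 4 banks.
N, BANKS = 16, 4
SCHEMES = [("block", block_bank), ("cyclic", cyclic_bank),
           ("block-cyclic", block_cyclic_bank)]

# Two hypothetical access patterns issued in one cycle by a loop unrolled by 4.
PATTERNS = {"contiguous": [0, 1, 2, 3], "stride-4": [0, 4, 8, 12]}

for pattern_name, indices in PATTERNS.items():
    for scheme_name, fn in SCHEMES:
        print(pattern_name, scheme_name, worst_conflict(indices, fn, N, BANKS))
```

In this toy setup no single scheme is conflict-free for both patterns: block partitioning serializes the contiguous accesses, cyclic partitioning serializes the strided ones. That is the kind of mismatch between access pattern and data placement that motivates choosing the placement from the observed accesses.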
Existing high-level synthesis (HLS) tools, such as Vivado HLS, can generate different
architectures of an application from different user configurations applied to
C/C++/SystemC code. However, the coding style they accept is quite restricted. Therefore,
we provide three patch methods, which improve the memory partitioning, loop unrolling, and
input buffering of the high-level hardware description, respectively.
Experimental results show that we can dramatically reduce the simulation time. Our
memory-remapping methodology also improves the performance of the design with an
optimized number of BRAMs.
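One way to see why exploring the unrolling factor first and the memory partitioning second shrinks the search is to count evaluations. The sketch below compares a joint sweep over every (unroll, partition) pair with a staged search that fixes the unrolling factor before sweeping partitions. The candidate lists are hypothetical, and the staged count assumes exactly one unrolling factor is carried into the second stage; the thesis's actual search may differ.

```python
from itertools import product


def joint_points(unroll_factors, partition_configs):
    """Evaluations needed when every (unroll, partition) pair is simulated."""
    return len(list(product(unroll_factors, partition_configs)))


def staged_points(unroll_factors, partition_configs):
    """Evaluations needed when unrolling is explored first and the partition
    sweep is run only for the single unrolling factor that was chosen."""
    return len(unroll_factors) + len(partition_configs)


# Hypothetical candidates, chosen only for illustration.
unroll_factors = [1, 2, 4, 8, 16]
partition_configs = [(scheme, factor)
                     for scheme in ("block", "cyclic", "block-cyclic")
                     for factor in (1, 2, 4, 8)]

print("joint sweep:  ", joint_points(unroll_factors, partition_configs))
print("staged search:", staged_points(unroll_factors, partition_configs))
```

With these assumed candidates the joint sweep needs 60 evaluations and the staged search 17. The gap widens as either list grows, since one count is a product and the other a sum.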
1 Introduction
2 Preliminary
2.1 Introduction to Aladdin
2.1.1 Framework
2.1.2 Dynamic Data Dependency Graphs
2.2 H.-T. Tsai’s Work
3 Proposed Methodology
3.1 Proposed Flow
3.1.1 Specification
3.1.2 Unroll Exploration
3.1.3 Partition Exploration
3.1.4 Data Remapping
3.1.5 New Data Remapping
3.1.6 Case Study - Sobel Gy
3.1.7 Code Modification
3.1.8 HLS & Select
4 Experiment
4.1 Experiment Setup
4.2 Experiment Result of Our Remapping Approach
4.3 Experiment Result of Our Proposed Flow
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
[1] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, “Aladdin: A Pre-RTL, Power-Performance
Accelerator Simulator Enabling Large Design Space Exploration of Customized
Architectures,” in ACM/IEEE International Symposium on Computer Architecture
(ISCA), pp. 97-108, Jun. 2014.
[2] J. Cong, W. Jiang, B. Liu, and Y. Zou, “Automatic Memory Partitioning and Scheduling
for Throughput and Power Optimization,” in ACM, 2011.
[3] Y. Wang, P. Zhang, X. Cheng, and J. Cong, “An Integrated and Automated Memory
Optimization Flow for FPGA Behavioral Synthesis,” in ASPDAC, 2012.
[4] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, “Memory Partitioning and
Scheduling Co-optimization in Behavioral Synthesis,” in ICCAD, 2012.
[5] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, “Memory Partitioning for Multidimensional
Arrays in High-Level Synthesis,” in DAC, 2013.
[6] Y. Wang, P. Li, and J. Cong, “Theory and Algorithm for Generalized Memory Partitioning
in High-Level Synthesis,” in FPGA, 2014.
[7] P. Li, P. Zhang, L.-N. Pouchet, and J. Cong, “Resource-Aware Throughput Optimization
for High-Level Synthesis,” in FPGA, 2015.
[8] M. Li, P. Zhang, C. Zhu, H. Jia, X. Xie, J. Cong, and W. Gao, “High Efficiency VLSI
Implementation of an Edge-directed Video Up-scaler Using High Level Synthesis,” in
IEEE International Conference on Consumer Electronics (ICCE), 2015.
[9] C.-T. Huang and H.-T. Tsai, “Performance Optimization of Accelerators using C-based
High-Level Synthesis Flow.”
[10] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks,
“MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” in
IEEE International Symposium on Workload Characterization (IISWC), 2014.
[11] B. Carrion Schafer and A. Mahapatra, “S2CBench: Synthesizable SystemC Benchmark
Suite for High-Level Synthesis,” in IEEE Embedded Systems Letters, vol. 6, no. 3,
2014.
[12] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable Domain Specific Computing,”
in IEEE Design & Test of Computers, 2011.
[13] Available: http://accelerator.eecs.harvard.edu/isca14tutorial/isca2014-tutorial-cadbenchmarks.pdf
(Full text is restricted to internal access)