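Assisted Design Optimization using High-Level Synthesis Flow (應用高階合成之設計效能改善輔助技術), National Tsing Hua University Theses and Dissertations Full-Text Image System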

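Author (Chinese):	張瑋君
Author (English):	Wei-Chun Chang
Title (Chinese):	應用高階合成之設計效能改善輔助技術
Title (English):	Assisted Design Optimization using High-Level Synthesis Flow
Advisor (Chinese):	黃稚存
Advisor (English):	Huang, Chih Tsun
Committee members (Chinese):	劉靖家、黃俊達
Committee members (English):	Liou, Jing Jia; Huang, Juinn Dar
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Department of Computer Science
Student ID:	103062524
Publication year (ROC calendar):	105 (2016)
Graduation academic year (ROC calendar):	105 (2016-2017)
Language:	English
Pages:	42
Keywords (translated from Chinese):	high-level synthesis, memory, architecture exploration
Keywords (English):	High-level synthesis, Memory, Design Space Exploration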

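Abstract (translated from Chinese)

Current accelerator research relies mainly on register-transfer level (RTL) flows to obtain accurate estimates of execution time, power, and area. Pre-RTL tools such as Aladdin can produce approximately accurate estimates in little time without generating any RTL code. However, as designs grow more complex, exploration with either RTL or pre-RTL tools becomes very time-consuming.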
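In this thesis, we propose an assisted design flow that effectively reduces the number of design points formed by combining loop unrolling and memory partitioning. First, we use Aladdin to quickly find the loop-unrolling factor without considering memory partitioning, and obtain the dynamic data dependence graph (DDDG). Next, we analyze the DDDG to determine the memory-partitioning parameters. Conventional data-distribution methods are mainly block, cyclic, and block-cyclic partitioning, but these can create performance bottlenecks. We therefore also propose a memory-partitioning method that departs from these conventional schemes: by analyzing the DDDG, it distributes data evenly across the memory partitions. At the end of the flow, we use a high-level synthesis tool to synthesize the C program under exploration into RTL code and obtain accurate estimates.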
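To make the three conventional schemes concrete, the sketch below maps an array index to a (bank, offset) pair under block, cyclic, and block-cyclic partitioning, then counts the cycles needed when several reads are issued in the same cycle. It is a minimal Python model rather than HLS code; the function names, the 4x4 row-major tile, the block size of 2, and the single-port banks are all illustrative assumptions. It shows how each fixed scheme can become a bottleneck: block partitioning serializes row-wise reads, cyclic partitioning serializes column-wise reads, and block-cyclic partitioning halves both without removing either conflict.

```python
import math
from collections import Counter
from functools import partial

def block_bank(i, n, banks):
    """Block: bank k holds the k-th contiguous chunk of ceil(n / banks) elements."""
    size = -(-n // banks)  # ceiling division
    return i // size, i % size  # (bank, offset within bank)

def cyclic_bank(i, banks):
    """Cyclic: element i goes to bank i mod banks."""
    return i % banks, i // banks

def block_cyclic_bank(i, banks, block):
    """Block-cyclic: blocks of `block` elements dealt round-robin to the banks."""
    b = i // block
    return b % banks, (b // banks) * block + i % block

def access_cycles(indices, bank_fn, ports=1):
    """Cycles to serve reads issued together if each bank has `ports` ports."""
    per_bank = Counter(bank_fn(i)[0] for i in indices)
    return max(math.ceil(c / ports) for c in per_bank.values())

# A 4x4 tile stored row-major in a 16-element array, split into 4 banks.
by_block = partial(block_bank, n=16, banks=4)
by_cyclic = partial(cyclic_bank, banks=4)
by_block_cyclic = partial(block_cyclic_bank, banks=4, block=2)
row = [0, 1, 2, 3]    # one row, e.g. read by a loop unrolled by 4
col = [0, 4, 8, 12]   # one column, e.g. the taps of a vertical stencil
for name, fn in [("block", by_block), ("cyclic", by_cyclic),
                 ("block-cyclic", by_block_cyclic)]:
    print(name, access_cycles(row, fn), access_cycles(col, fn))
# block 4 1
# cyclic 1 4
# block-cyclic 2 2
```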
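Existing high-level synthesis tools, such as Vivado HLS, accept programs written in C/C++/SystemC and produce different designs according to different parameters. However, they accept only a very restricted coding style. We therefore provide three patches that improve memory partitioning, loop unrolling, and input buffering.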
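Experimental results show that we can greatly reduce simulation time. Our memory-partitioning method also substantially improves performance while using fewer partitions.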

English Abstract

Current research on accelerator design mainly relies on register-transfer level (RTL) flows to obtain accurate timing, power, and area estimates. Pre-RTL simulators such as Aladdin [1] can also provide approximately accurate estimates without generating RTL code. However, exploring the design space of large or complex designs remains time-consuming even with RTL or pre-RTL tools.
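A pre-RTL estimator avoids RTL generation by scheduling the DDDG directly under resource constraints. The sketch below is a deliberately simplified, hypothetical model of that idea and not Aladdin's actual scheduler: every node is assumed to take one cycle, nodes are placed greedily in topological order, and each memory bank serves at most `ports_per_bank` accesses per cycle (two here, as for a dual-port BRAM). The names `schedule_dddg`, `bank_of`, and the small reduction-tree graph are illustrative. Spreading the same four loads over two banks shortens the schedule from 4 to 3 cycles, which is the kind of effect memory partitioning is meant to capture.

```python
from collections import defaultdict

def schedule_dddg(nodes, edges, bank_of, ports_per_bank=2):
    """Greedy resource-constrained schedule of a DDDG.

    nodes:   node names in topological order
    edges:   (src, dst) data dependences
    bank_of: memory bank of each load/store node; other nodes are compute
    Every node is assumed to take one cycle (a simplification).
    Returns (cycle_of, total_cycles).
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    cycle_of = {}
    port_use = defaultdict(int)  # (bank, cycle) -> accesses already scheduled
    for n in nodes:
        # Earliest cycle allowed by data dependences (ASAP).
        c = max((cycle_of[p] + 1 for p in preds[n]), default=0)
        if n in bank_of:
            # Push the access later until its bank has a free port.
            while port_use[(bank_of[n], c)] >= ports_per_bank:
                c += 1
            port_use[(bank_of[n], c)] += 1
        cycle_of[n] = c
    return cycle_of, max(cycle_of.values(), default=-1) + 1

# Reduction tree over four loaded values: ld0..ld3 -> two adds -> one add.
nodes = ["ld0", "ld1", "ld2", "ld3", "add01", "add23", "sum"]
edges = [("ld0", "add01"), ("ld1", "add01"), ("ld2", "add23"),
         ("ld3", "add23"), ("add01", "sum"), ("add23", "sum")]
one_bank = {f"ld{k}": 0 for k in range(4)}
two_banks = {f"ld{k}": k % 2 for k in range(4)}
print(schedule_dddg(nodes, edges, one_bank)[1])   # 4
print(schedule_dddg(nodes, edges, two_banks)[1])  # 3
```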
In this thesis, we propose an assisted design flow that efficiently reduces the number of design points to be explored with pre-RTL tools when micro-architectural factors such as loop unrolling and memory partitioning are considered. First, we use Aladdin [1] to quickly explore the unrolling factor without considering memory partitioning, generating dynamic data dependence graphs (DDDGs). After an unrolling factor is chosen, its DDDG is analyzed to explore memory partitioning. Conventional memory-partitioning schemes are mainly block, cyclic, or block-cyclic; because partitioning strongly affects performance, a poorly matched scheme can become the performance bottleneck. Our flow therefore includes a memory-remapping methodology that modifies the source code so that data are placed across memory partitions according to the DDDG. Finally, we use a high-level synthesis tool to generate RTL code and obtain accelerator designs with accurate performance, area, and power estimates.
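The abstract does not detail the remapping algorithm, so the following is only an illustrative stand-in and not the thesis's method. The hypothetical `remap` helper takes groups of array indices read in the same cycle (as could be extracted from a scheduled DDDG) and greedily steers elements that share a group into different banks, while capping every bank at the same size to keep BRAM usage even. `total_cycles` assumes single-port banks. On a 4x4 tile read both by rows and by columns, cyclic and block partitioning each need 20 cycles, while the greedy layout serves every group in one cycle, 8 cycles in total.

```python
import math
from collections import Counter, defaultdict

def remap(groups, n, banks):
    """Greedy, hypothetical remapping of n array elements onto `banks` banks.

    groups: sets of element indices read in the same cycle (e.g. taken from a
    scheduled DDDG). Elements that share a group are steered to different
    banks, and each bank is capped at ceil(n / banks) elements.
    """
    cap = -(-n // banks)                    # ceiling division
    member_of = defaultdict(list)           # element -> groups containing it
    for g, grp in enumerate(groups):
        for i in grp:
            member_of[i].append(g)
    load = defaultdict(int)                 # (group, bank) -> elements placed
    size = [0] * banks                      # elements stored in each bank
    bank_of = [0] * n
    # Most frequently accessed elements first; ties keep index order.
    for i in sorted(range(n), key=lambda e: -len(member_of[e])):
        def cost(b):
            # Fewest same-cycle conflicts first, then the emptiest bank.
            worst = max((load[(g, b)] for g in member_of[i]), default=0)
            return (worst, size[b], b)
        b = min((b for b in range(banks) if size[b] < cap), key=cost)
        bank_of[i] = b
        size[b] += 1
        for g in member_of[i]:
            load[(g, b)] += 1
    return bank_of

def total_cycles(groups, bank_of, ports=1):
    """Cycles to serve every group, each bank serving `ports` reads per cycle."""
    return sum(
        max(math.ceil(c / ports) for c in Counter(bank_of[i] for i in grp).values())
        for grp in groups
    )

# 4x4 row-major tile read both row-wise and column-wise, 4 single-port banks.
rows = [set(range(4 * r, 4 * r + 4)) for r in range(4)]
cols = [set(range(c, 16, 4)) for c in range(4)]
groups = rows + cols
cyclic = [i % 4 for i in range(16)]
block = [i // 4 for i in range(16)]
remapped = remap(groups, 16, 4)
print(total_cycles(groups, cyclic), total_cycles(groups, block),
      total_cycles(groups, remapped))  # 20 20 8
```

On this input the greedy result happens to be bank = row XOR column, a Latin-square layout in which no row or column places two elements in the same bank.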
Existing high-level synthesis (HLS) tools, such as Vivado HLS, can generate different architectures for an application written in C/C++/SystemC by applying different user configurations. However, the coding styles these tools accept are quite limited. We therefore provide three patch methods, which improve memory partitioning, loop unrolling, and input buffering in the high-level hardware description, respectively.
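Loop unrolling only pays off if the partitioned array can feed all unrolled copies in the same cycle. As a hedged illustration, the hypothetical helper below searches for the smallest cyclic partition factor that lets `unroll` strided reads `A[base + k*stride]` complete in one cycle, assuming each bank offers two ports as a true dual-port BRAM does. For unit stride the answer is ceil(unroll / 2); for strided reads the smallest factor can be a non-power of two, which costs extra modulo logic in the generated address path.

```python
from collections import Counter

def min_cyclic_factor(unroll, stride=1, ports=2):
    """Smallest cyclic partition factor that lets the `unroll` reads
    A[base + k*stride], k = 0..unroll-1, issue in a single cycle.

    The base offset only rotates bank numbers, so it is taken as 0.
    """
    if unroll < 1 or stride < 1 or ports < 1:
        raise ValueError("unroll, stride and ports must be positive")
    # With unroll*stride banks every read hits a distinct bank, so the loop returns.
    for banks in range(1, unroll * stride + 1):
        per_bank = Counter(k * stride % banks for k in range(unroll))
        if max(per_bank.values()) <= ports:
            return banks

for u in (2, 4, 8, 16):
    print(u, min_cyclic_factor(u))        # unit stride: ceil(u / 2) banks
print(min_cyclic_factor(4, stride=4))     # 3
```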
Experimental results show that our flow dramatically reduces simulation time. Our memory-remapping methodology also improves design performance with an optimized number of BRAMs.

Table of Contents
1 Introduction
2 Preliminary
2.1 Introduction to Aladdin
2.1.1 Framework
2.1.2 Dynamic Data Dependency Graphs
2.2 H.-T. Tsai’s Work
3 Proposed Methodology
3.1 Proposed Flow
3.1.1 Specification
3.1.2 Unroll Exploration
3.1.3 Partition Exploration
3.1.4 Data Remapping
3.1.5 New Data Remapping
3.1.6 Case Study - Sobel Gy
3.1.7 Code Modification
3.1.8 HLS & Select
4 Experiment
4.1 Experiment Setup
4.2 Experiment Result of Our Remapping Approach
4.3 Experiment Result of Our Proposed Flow
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

References
[1] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," in ACM/IEEE International Symposium on Computer Architecture (ISCA), pp. 97-108, Jun. 2014.
[2] J. Cong, W. Jiang, B. Liu, and Y. Zou, "Automatic Memory Partitioning and Scheduling for Throughput and Power Optimization," in ACM, 2011.
[3] Y. Wang, P. Zhang, X. Cheng, and J. Cong, "An Integrated and Automated Memory Optimization Flow for FPGA Behavioral Synthesis," in ASP-DAC, 2012.
[4] P. Li, Y. Wang, P. Zhang, G. Luo, T. Wang, and J. Cong, "Memory Partitioning and Scheduling Co-optimization in Behavioral Synthesis," in ICCAD, 2012.
[5] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong, "Memory Partitioning for Multidimensional Arrays in High-Level Synthesis," in DAC, 2013.
[6] Y. Wang, P. Li, and J. Cong, "Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis," in FPGA, 2014.
[7] P. Li, P. Zhang, L.-N. Pouchet, and J. Cong, "Resource-Aware Throughput Optimization for High-Level Synthesis," in FPGA, 2015.
[8] M. Li, P. Zhang, C. Zhu, H. Jia, X. Xie, J. Cong, and W. Gao, "High Efficiency VLSI Implementation of an Edge-directed Video Up-scaler Using High Level Synthesis," in IEEE International Conference on Consumer Electronics (ICCE), 2015.
[9] C.-T. Huang and H.-T. Tsai, "Performance Optimization of Accelerators using C-based High-Level Synthesis Flow."
[10] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks, "MachSuite: Benchmarks for Accelerator Design and Customized Architectures," in IEEE International Symposium on Workload Characterization (IISWC), 2014.
[11] B. Carrion Schafer and A. Mahapatra, "S2CBench: Synthesizable SystemC Benchmark Suite for High-Level Synthesis," in IEEE Embedded Systems Letters, vol. 6, no. 3, 2014.
[12] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable Domain Specific Computing," in IEEE Design & Test of Computers, 2011.
[13] Available: http://accelerator.eecs.harvard.edu/isca14tutorial/isca2014-tutorial-cadbenchmarks.pdf
