
Detailed Record

Author: Wang, Huan-Ching (王渙清)
Title: Look-ahead and Rotation-Based Parameter Decoding Engine for Highly-Parallel CNN Acceleration (應用於高吞吐量卷積神經網路加速之預看與旋轉式的參數解碼引擎)
Advisor: Huang, Chao-Tsung (黃朝宗)
Oral Defense Committee: Chiu, Ching-Te (邱瀞德); Wang, Jia-Ching (王家慶)
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 106061572
Year of Publication: 2020 (ROC year 109)
Academic Year of Graduation: 108 (2019-2020)
Language: English
Number of Pages: 55
Keywords: highly-parallel; rotation-based; compression; decoding; high-throughput; acceleration
Convolutional neural networks (CNNs) have recently made great progress in computational imaging applications. However, it is difficult for conventional CNN accelerators to support real-time computational imaging on edge devices due to their insufficient throughput. Huang et al. proposed a block-based, highly-parallel CNN accelerator, eCNN [1], which supports convolution with massive parallelism to reach high computing throughput. To meet this throughput requirement, we keep all of the model parameters in on-chip memories to avoid spending external bandwidth on parameter retransmission, and we apply entropy coding to compress the parameters, thereby increasing the capacity for supported models. Hence, we designed a parameter decoding engine that includes many decompress units to decode the encoded bitstreams in parallel and reach a sufficiently high parameter-decoding throughput.
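The entropy-coding idea above can be sketched in a few lines. This is only an illustration of the principle, not the thesis's actual codec: the quantized parameter values and the use of a Huffman-style prefix code here are assumptions for the sketch.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman prefix code from symbol frequencies.
    Shorter codewords go to more frequent parameter values, so the
    compressed bitstream shrinks when the value distribution is skewed."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical quantized CNN parameters: small values dominate.
params = [0, 0, 0, 1, 0, -1, 0, 2, 1, 0, 0, -1]
code = huffman_code(params)
bitstream = "".join(code[p] for p in params)
assert len(bitstream) < len(params) * 8  # beats raw 8-bit storage
```

In the engine described here, one such bitstream would be produced per decompress unit so that all units can decode in parallel.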

However, relying on parallelism alone to boost decoding throughput is not practical because of the area and power overheads, so accelerating the decoding procedure within one decompress unit is another vital problem. Hence, we propose the look-ahead architecture to increase the maximum operating frequency of the parameter decoding engine. In addition, we found that the way parameters are packed in the baseline version introduces bitstream-length imbalance after parallel encoding, which wastes on-chip memory when zeros are padded to align the restart addresses across parallel decompress units. Hence, we propose a rotation-based parameter allocation mechanism to address this problem.
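The look-ahead idea can be modeled in software (a sketch only: the prefix-code table and dataflow below are hypothetical, not the thesis's RTL). In a direct variable-length decoder, the length of symbol k must be resolved before the window for symbol k+1 is known, chaining two dependent lookups on the critical path. Look-ahead instead speculatively decodes symbol k+1 at every possible start offset and selects the right candidate with a mux once the length of symbol k is resolved:

```python
# Hypothetical prefix code: codeword -> symbol (max length 3 bits).
TABLE = {"0": "A", "10": "B", "110": "C", "111": "D"}
MAXLEN = 3

def match(bits, pos):
    """Return (symbol, length) for the codeword starting at pos, or None."""
    for n in range(1, MAXLEN + 1):
        sym = TABLE.get(bits[pos:pos + n])
        if sym:
            return sym, n
    return None

def decode_lookahead(bits):
    """Software model of the look-ahead dataflow: while symbol k is being
    resolved, speculatively decode symbol k+1 at each candidate offset
    (one per possible length of symbol k), then mux-select by the resolved
    length. In hardware this shortens the serial length->shift dependency;
    in this Python model it only mirrors the dataflow, not the timing."""
    out, pos = [], 0
    while pos < len(bits):
        sym, n = match(bits, pos)
        # Speculative decodes for the next symbol, one per candidate length.
        candidates = {}
        for k in range(1, MAXLEN + 1):
            if pos + k < len(bits):
                m = match(bits, pos + k)
                if m:
                    candidates[k] = m
        out.append(sym)
        pos += n
        if n in candidates:            # mux select by the resolved length
            sym2, n2 = candidates[n]
            out.append(sym2)
            pos += n2
    return out

bits = "0" + "10" + "110" + "111" + "0"   # A B C D A
assert decode_lookahead(bits) == ["A", "B", "C", "D", "A"]
```

The hardware payoff is that the speculative decoders and the final mux run in parallel within one cycle, which is what raises the maximum operating frequency.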

In this thesis, we propose the look-ahead architecture for one decompress unit; it boosts decoding throughput by 19.5% while increasing the area of the parameter decoding engine by only 2.29% compared to the direct architecture. In addition, the rotation-based parameter allocation mechanism, which distributes parameters evenly to eliminate the overhead introduced by padding zeros, raises the effective parameter memory utilization from 87.6-91.45% to 95.89-99.69% across the test patterns at an area overhead of only 1.8% for the parameter decoding engine. Finally, we implement a VLSI circuit for high-throughput CNN parameter decoding in a TSMC 40nm technology process with 1288 KB of on-chip memory and 396 K gate count.
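The utilization gain from rotation can be illustrated with a toy model (the numbers and grouping below are hypothetical, not the thesis's exact scheme): if the slice of each parameter group that compresses worst always lands on the same decompress unit, that unit's bitstream grows long and every other stream must be zero-padded up to the shared restart address. Rotating the starting unit per group spreads the heavy slices round-robin:

```python
def padded_size(group_slices, n_units, rotate):
    """group_slices: per parameter group, the encoded bit-length of each of
    its n_units slices. Returns total memory after padding every unit's
    bitstream up to the longest one, which models aligning the restart
    addresses across parallel decompress units."""
    totals = [0] * n_units
    for g, slices in enumerate(group_slices):
        shift = g % n_units if rotate else 0  # rotation-based allocation
        for i, bits in enumerate(slices):
            totals[(i + shift) % n_units] += bits
    return max(totals) * n_units  # every stream padded to the longest

# Hypothetical skewed costs: slice 0 always compresses worst (16 bits vs 4).
groups = [[16, 4, 4, 4]] * 8           # 8 groups, 28 encoded bits each
used = sum(sum(s) for s in groups)     # 224 bits of real data
assert used / padded_size(groups, 4, rotate=False) < 0.5   # heavy padding
assert used / padded_size(groups, 4, rotate=True) == 1.0   # perfectly balanced
```

In this toy case the fixed allocation wastes more than half the parameter memory on padding, while rotation balances the four streams exactly; the thesis's measured 87.6-91.45% to 95.89-99.69% improvement reflects the same effect on real, less extreme distributions.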
1 Introduction
1.1 Motivation
1.2 Related Work
1.2.1 Block-based Inference and Pipeline Scheme in eCNN
1.2.2 Model Structure and the Leaf Module in eCNN
1.2.3 Compression Strategies for CNN
1.3 Thesis Organization
2 Highly-Parallel Entropy Coding for CNN Parameter
2.1 Entropy Analysis
2.1.1 Raw Value
2.1.2 Predicted Value
2.1.3 Symbol Encoding Method for Parallel Bitstream
2.2 Bitstream Format
3 Architecture of Parameter Decoding Engine
3.1 System Overview
3.2 Highly-Parallel Parameter Decompress Unit
3.2.1 FIFO Register
3.2.2 Symbol Decoder
3.2.3 Direct Architecture
3.2.4 Look-Ahead Architecture
3.3 Rotation-Based Parameter Allocation
4 Implementation of Parameter Decoding Engine
4.1 Throughput Analysis
4.2 Utilization Analysis for the Parameter Memory
4.3 Synthesis Result
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
[1] Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, and Li-De Chen, "eCNN," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019.
[2] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
[3] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, and Jonathan Ross, "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), 2017.
[4] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in Proceedings of the 42nd Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), June 2015.
[5] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the 43rd Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), June 2016.
[6] Song Han, Huizi Mao, and William J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in International Conference on Learning Representations (ICLR), 2015.
[7] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd Annual ACM/IEEE International Symposium on Computer Architecture (ISCA), 2016.
[8] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018.
[9] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.
[10] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861, 2017.
[11] Philipp Gysel, "Ristretto: Hardware-oriented approximation of convolutional neural networks," arXiv:1605.06402, 2016.
[12] Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy, "Fixed point quantization of deep convolutional networks," arXiv:1511.06393, 2015.
[13] Claude Elwood Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, July 1948.
[14] Ian H. Witten, Radford M. Neal, and John G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, no. 6, pp. 520-540, 1987.