Detailed Record

Author (Chinese): 汪立偉
Author (English): Wang, Li-Wei
Title (Chinese): 應用於風格轉換之具粗粒度位靈活性可重構卷積引擎
Title (English): Reconfigurable Convolution Engine with Coarse-Grained Bit-Level Flexibility for Style Transfer
Advisor (Chinese): 黃朝宗
Advisor (English): Huang, Chao-Tsung
Committee Members (Chinese): 呂仁碩、王家慶
Committee Members (English): Liu, Ren-Shuo; Wang, Jia-Ching
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 107061551
Publication Year (ROC calendar): 110 (2021)
Graduation Academic Year: 109
Language: English
Number of Pages: 65
Keywords (Chinese): 可重構設計、風格轉換、粗粒度技術、位靈活性
Keywords (English): reconfigurable design, style transfer, coarse-grained technique, bit-level flexibility
In recent years, convolutional neural networks (CNNs) have achieved great success in both classification and computational imaging applications. Owing to the rising popularity of resource-constrained edge devices, research on reducing the model size and computational cost of CNNs has become prevalent. Quantization, which reduces the parameters and activations of a CNN from 32-bit floating point to a lower bitwidth, is one of the principal approaches. Models for classification tasks can usually be quantized to fewer than 8 bits with little loss of accuracy; however, many computational imaging applications such as denoising or super-resolution suffer large quality degradation under the same treatment. In this field, Judd et al. quantized each layer's data to a different bitwidth to gain energy and performance improvements without losing classification accuracy, and their team also implemented several hardware accelerators that handle such variable-bitwidth data in a bit-serial manner. Nevertheless, bit-serial computation is a fine-grained technique and can incur larger hardware overhead than techniques with coarser granularity of bit-level computation, and this overhead grows as the degree of hardware parallelism increases. Based on these observations, we aim in this thesis to design a reconfigurable convolution engine with a coarse-grained technique.
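As a concrete illustration of the quantization step described above, the following Python sketch performs per-tensor uniform symmetric quantization to a chosen bitwidth. It is a minimal, generic example, and the function name and scaling rule are ours rather than the thesis's: the thesis itself builds on Ristretto-style dynamic fixed point (see Section 1.2.3), which chooses scales differently.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Quantize a float tensor to signed `bits`-bit fixed point (per-tensor, symmetric).

    Illustrative only: the thesis relies on Ristretto-style dynamic fixed point,
    whose scaling rule differs in detail from this generic scheme.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit, 127 for 8-bit
    scale = float(np.max(np.abs(x))) / qmax    # one scale for the whole tensor
    if scale == 0.0:                           # all-zero tensor: any scale works
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Example: one layer's weights quantized to 4-bit and to 8-bit.
w = np.random.randn(64, 64, 3, 3).astype(np.float32)
w4, s4 = quantize_symmetric(w, bits=4)   # integer values in [-8, 7]
w8, s8 = quantize_symmetric(w, bits=8)   # integer values in [-128, 127]
```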
First, we focus on one particular application, style transfer, a computational imaging application with a relatively high tolerance to quantization. A style transfer model can be quantized to 4 bits while the transferred images remain acceptable, and 8-bit quantization still yields high quality, so the precision range we discuss spans 4 to 8 bits. Since the quality of a model is positively correlated with its computational complexity, we can obtain models of various quality by quantizing with different precision configurations. For a coarse-grained design, we quantize the data to either 4 bits or 8 bits only. For instance, the parameters of the first layer can be quantized to 4 or 8 bits, and likewise its activations, so one layer's data has four possible bitwidth configurations; a model composed of many layers therefore offers a wide range of precision configurations with corresponding computational complexities. Experiments show that quantizing with only these two bitwidths is enough to obtain models covering various quality levels and computational complexities. Because computational complexity is roughly inversely proportional to throughput, this coarse-grained technique lets us trade off quality against throughput. These results motivate us to design a reconfigurable convolution engine with coarse-grained bit-level flexibility that supports 4bx4b, 4bx8b, 8bx4b, and 8bx8b multiplications.
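The abstract does not detail the datapath, but the arithmetic behind this kind of coarse-grained bit-level flexibility can be sketched in a few lines: a signed 8b x 8b product decomposes into four 4b x 4b partial products that are shifted and summed, so an array of 4b x 4b computing units can serve all four precision modes. The Python sketch below is a behavioural illustration under that assumption, with function names of our own choosing; it is not the thesis's actual hardware design.

```python
def split_nibbles(x):
    """Split a signed 8-bit integer into a signed high nibble and an unsigned low nibble."""
    assert -128 <= x <= 127
    hi = x >> 4      # arithmetic shift keeps the sign: range -8..7
    lo = x & 0xF     # low 4 bits, unsigned: range 0..15
    return hi, lo

def mul_8b_8b(a, b):
    """Compose a signed 8b x 8b product from four 4b x 4b partial products."""
    a_hi, a_lo = split_nibbles(a)
    b_hi, b_lo = split_nibbles(b)
    # Each term is one job for a 4b x 4b computing unit; the shifts perform
    # the coarse-grained recombination of the partial products.
    return ((a_hi * b_hi) << 8) + ((a_hi * b_lo + a_lo * b_hi) << 4) + (a_lo * b_lo)

# Sanity check over the full signed 8-bit range.
assert all(mul_8b_8b(a, b) == a * b
           for a in range(-128, 128) for b in range(-128, 128))
```

In the 4-bit modes the engine would feed the 4-bit operand to the computing units directly, so fewer partial products are needed per output, which is where the higher throughput of the low-precision configurations comes from.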
Since the data fed to the engine can be either 4-bit or 8-bit, transmitting it without any processing wastes bandwidth. To make data transmission efficient, we propose two methods for packing 4-bit data: channel-wise packing (CWP) and space-wise packing (SWP). After detailed analysis and evaluation, we find that a processing element using SWP has 9% smaller area than one using CWP when it performs convolution over one channel of a 6x4 byte-tile-in activation, where byte-tile-in means every cell of the input tile is 8 bits wide. Finally, we implement two highly parallel convolution engines with the same computing capability: a fixed-bitwidth convolution engine modified from eCNN, and a reconfigurable convolution engine using the SWP method. Our reconfigurable convolution engine can trade quality for throughput: it achieves up to 99.5 Mpixel/s with acceptable style transfer quality, drops to 24.9 Mpixel/s when high-quality style transfer is requested, and supports intermediate quality levels at throughputs between 24.9 and 99.5 Mpixel/s. Compared with the fixed-bitwidth convolution engine, this flexibility costs 78% additional area.
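The abstract names CWP and SWP without defining their layouts, so the sketch below is only one plausible reading of the two names, written to show why packing matters at all: pairing two 4-bit values into one byte halves the volume of data moved, whether the pair comes from adjacent channels (channel-wise) or adjacent pixels (space-wise). The functions and tile shape here are illustrative assumptions, not the thesis's definitions.

```python
import numpy as np

def pack_channel_wise(act4):
    """CWP as interpreted here: pair 4-bit values from two adjacent channels
    at the same pixel into one byte.  act4 has shape (C, H, W), C even,
    values in 0..15."""
    hi, lo = act4[0::2], act4[1::2]                  # even / odd channels
    return ((hi << 4) | lo).astype(np.uint8)         # shape (C//2, H, W)

def pack_space_wise(act4):
    """SWP as interpreted here: pair 4-bit values from two horizontally
    adjacent pixels of the same channel into one byte.  W must be even."""
    hi, lo = act4[:, :, 0::2], act4[:, :, 1::2]
    return ((hi << 4) | lo).astype(np.uint8)         # shape (C, H, W//2)

# A 6x4 activation tile with four channels of 4-bit data:
# either packing halves the number of bytes transferred.
act4 = np.random.randint(0, 16, size=(4, 6, 4), dtype=np.uint8)
print(pack_channel_wise(act4).shape)   # (2, 6, 4)
print(pack_space_wise(act4).shape)     # (4, 6, 2)
```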
1 Introduction
1.1 Motivation
1.2 Related Work
1.2.1 Style Transfer
1.2.2 Dynamic Fixed Point Precision
1.2.3 Network Approximation Framework with Ristretto
1.2.4 Precision Flexibility in CNNs
1.2.5 Bit-Serial Works
1.2.6 Bit Fusion
1.3 Thesis Organization
2 Analysis of Coarse-Grained Bit-Level Flexibility for Style Transfer
2.1 Style Transfer Model
2.1.1 Network Architecture
2.1.2 Transfer Images with Our Image Transformation Network
2.1.3 Quantize Image Transformation Network to Low-Bitwidth
2.2 Coarse-Grained Technique
2.2.1 Decision of Computing Unit
2.2.2 Coarse-Grained Bit-Level Flexibility with 4bx4b Computing Unit
2.3 Summary
3 Design of Processing Element for Reconfigurable Convolution Engine
3.1 Methodology of Data Packing
3.1.1 Channel-Wise Packing (CWP)
3.1.2 Space-Wise Packing (SWP)
3.1.3 Usage of CWP and SWP
3.2 Data Processing
3.3 Execute Convolution with Computing Units
3.4 Analysis of CWP and SWP
3.4.1 Weight Register
3.4.2 Usage of Computing Units
3.4.3 Bandwidth
3.4.4 Area Comparison
3.5 Summary
4 Implementation of Highly-Parallel Reconfigurable Convolution Engine
4.1 Brief Introduction to eCNN
4.2 Baseline Convolution Engine
4.3 Reconfigurable Convolution Engine
4.4 Implementation Results
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going Deeper with Convolutions,” arXiv e-prints, p. arXiv:1409.4842, Sept. 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
[5] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[6] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid, “RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation,” arXiv e-prints, p. arXiv:1611.06612, Nov. 2016.
[7] Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang, Cesar Laurent, Yoshua Bengio, and Aaron Courville, “Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks,” arXiv e-prints, p. arXiv:1701.02720, Jan. 2017.
[8] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2808–2817.
[9] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand, “Deep joint demosaicking and denoising,” ACM Trans. Graph., vol. 35, no. 6, Nov. 2016.
[10] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[11] Kai Zhang, Wangmeng Zuo, and Lei Zhang, “FFDNet: Toward a Fast and Flexible Solution for CNN-Based Image Denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, Sept. 2018.
[12] Ayan Chakrabarti, “A Neural Approach to Blind Motion Deblurring,” arXiv e-prints, p. arXiv:1603.04771, Mar. 2016.
[13] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 257–265.
[14] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, Eds., Cham, 2014, pp. 184–199, Springer International Publishing.
[15] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[16] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi, “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” arXiv e-prints, p. arXiv:1609.04802, Sept. 2016.
[17] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, “Enhanced Deep Residual Networks for Single Image Super-Resolution,” arXiv e-prints, p. arXiv:1707.02921, July 2017.
[18] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Trans. Graph., vol. 35, no. 6, Nov. 2016.
[19] Junyuan Xie, Ross Girshick, and Ali Farhadi, “Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks,” arXiv e-prints, p. arXiv:1604.03650, Apr. 2016.
[20] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
[21] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” arXiv e-prints, p. arXiv:1603.08155, Mar. 2016.
[22] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” arXiv e-prints, p. arXiv:1703.10593, Mar. 2017.
[23] Xun Huang and Serge Belongie, “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization,” arXiv e-prints, p. arXiv:1703.06868, Mar. 2017.
[24] Qifeng Chen, Jia Xu, and Vladlen Koltun, “Fast Image Processing with Fully-Convolutional Networks,” arXiv e-prints, p. arXiv:1709.00643, Sept. 2017.
[25] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv e-prints, p. arXiv:1704.04861, Apr. 2017.
[26] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv e-prints, p. arXiv:1602.07360, Feb. 2016.
[27] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[28] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
[29] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” arXiv e-prints, p. arXiv:1603.05279, Mar. 2016.
[30] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou, “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients,” arXiv e-prints, p. arXiv:1606.06160, June 2016.
[31] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, Raquel Urtasun, and Andreas Moshovos, “Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets,” arXiv e-prints, p. arXiv:1511.05236, Nov. 2015.
[32] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[33] J. Albericio, P. Judd, A. Delmás, S. Sharify, and A. Moshovos, “Bit-pragmatic Deep Neural Network Computing,” arXiv e-prints, p. arXiv:1610.06920, Oct. 2016.
[34] Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, and Andreas Moshovos, “Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks,” arXiv e-prints, p. arXiv:1706.07853, June 2017.
[35] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, “Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 764–775.
[36] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, “Training deep neural networks with low precision multiplications,” arXiv e-prints, p. arXiv:1412.7024, Dec. 2014.
[37] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi, “Hardware-oriented Approximation of Convolutional Neural Networks,” arXiv e-prints, p. arXiv:1604.03168, Apr. 2016.
[38] Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, and Li-De Chen, “eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference,” arXiv e-prints, p. arXiv:1910.05680, Oct. 2019.
 
 
 
 