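The Optimization of VP8 Decoder through OpenCL Flow on Heterogeneous Multi-Core Systems__National Tsing Hua University Master's and Doctoral Theses Full-Text Image System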


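Author (Chinese):	邱奕昕
Author (English):	Chiu, Yi-Hsin
Thesis title (Chinese):	在異質多核心系統上通過OpenCL流程優化VP8解碼器
Thesis title (English):	The Optimization of VP8 Decoder through OpenCL Flow on Heterogeneous Multi-Core Systems
Advisor (Chinese):	李政崑
Advisor (English):	Lee, Jenq-Kuen
Committee members (Chinese):	洪明郁、游逸平
Committee members (English):	Hung, Ming-Yu; You, Yi-Ping
Degree:	Master's
Institution:	National Tsing Hua University
Department:	Department of Computer Science
Student ID:	104062611
Year of publication (ROC calendar):	107 (2018)
Academic year of graduation:	106
Language:	Chinese
Number of pages:	31
Keywords (Chinese, translated):	vectorization, inverse quantization, inverse discrete cosine transform
Keywords (English):	OpenCL, VP8, Mali GPUs, AMD APU, vectorization, inverse quantization, inverse transform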

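With the rapid growth of network communication, video information is delivered ever faster. When video is transmitted in compressed form, transmission speed is only part of the picture: compression at the sender and decoding at the receiver mostly rely on high-end video coding such as MPEG. Since Google released VP8 under an open-source license, there has been one more option for video coding and compression. VP8's royalty-free license and open source code have attracted wide attention. Because VP8 is a portable program, its performance can be tuned for different architectural models as needed.

As graphics processors grow more capable, applications that use OpenCL to improve performance have become commonplace. Large numbers of threads and vector parallel operations increase parallelism, which improves performance and raises overall hardware utilization. Improving the performance of GPU computing applications has become one focus of research activity.

In this thesis, we use OpenCL to change the flow of the VP8 decoder and test the speed of two of its routines on a Mali GPU and an AMD APU. In this case study, we designed several optimization techniques for the whole flow. They include vectorization and experimental tests such as copying the data of multiple blocks into the GPU at once. In addition, we propose a supplementary algorithm test for the decoding prediction problem we encountered. Experimental results show that, compared with the VP8 OpenCL version currently available online, our scheme optimizes two stages of the flow: inverse quantization and the inverse discrete cosine transform. On average we achieve a 1.7x speedup over the sequential version of VP8 and a 332.043x speedup over the online VP8 OpenCL version.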

With the rapid development of internet communications, the delivery of video information has become faster. Beyond transmission speed, compression at the sending end and decoding at the receiving end mostly rely on MPEG and other high-end video coding. Since Google released VP8 under a BSD license, there has been an additional option for video encoding and compression. VP8's open source code and free license have attracted wide attention. Whether the target platform is digital television or mobile communications, VP8 can be ported to it and encode or decode in software. Although VP8 is a portable program, its performance can still be tuned for different architectural models as needed.
Today, with increasingly powerful graphics processors, it is common to use a large number of threads to raise throughput, and applications increasingly use OpenCL (Open Computing Language) to improve performance. Large numbers of threads combined with vector parallel computing improve efficiency and increase the overall utilization of the hardware. The performance of image processing applications has therefore become a focus of research activity.
In this thesis, we use OpenCL to restructure the VP8 decoder flow and use two of its routines, inverse quantization and inverse transform, to test speed on the Mali GPU and AMD APU, respectively. In this case study, we designed several optimization techniques for both routines. The techniques include vectorization, copying data to the GPU in larger units, and thread parallelism. In addition, we propose a supplementary algorithm test for the decoding prediction problems encountered. The experimental results show that our scheme optimizes both the inverse quantization and inverse transform routines compared with the existing OpenCL version of VP8 available online. On average, our version is 1.7 times faster than the sequential version of VP8 and 332.043 times faster than the existing OpenCL version of the VP8 decoder.
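As a rough illustration of the inverse quantization routine and the idea of handing the device many blocks at once, the sketch below dequantizes a batch of 4x4 coefficient blocks in one call. It is not the thesis's implementation: the function name `dequantize_blocks`, the flat block layout, and the single pair of DC/AC factors per batch are assumptions made for this example. It only reflects the general VP8 scheme in which each coefficient is multiplied by a dequantization factor, with the first (DC) coefficient using a different factor from the remaining (AC) ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COEFFS_PER_BLOCK 16 /* one 4x4 block of coefficients */

/* Hypothetical helper: dequantize n_blocks consecutive 4x4 blocks.
 * dc_factor and ac_factor are assumed to be constant for the batch;
 * a real decoder would choose them per segment and per plane. */
void dequantize_blocks(const int16_t *quantized, int16_t *dequantized,
                       size_t n_blocks, int16_t dc_factor, int16_t ac_factor)
{
    assert(quantized != NULL && dequantized != NULL);

    for (size_t b = 0; b < n_blocks; ++b) {
        const int16_t *q = quantized + b * COEFFS_PER_BLOCK;
        int16_t *dq = dequantized + b * COEFFS_PER_BLOCK;

        /* Coefficient 0 is the DC term and uses its own factor. */
        dq[0] = (int16_t)(q[0] * dc_factor);

        /* The 15 AC terms share one factor. The fixed trip count keeps
         * this loop simple for a compiler to vectorize. */
        for (size_t i = 1; i < COEFFS_PER_BLOCK; ++i)
            dq[i] = (int16_t)(q[i] * ac_factor);
    }
}
```

Processing the whole batch in one call mirrors, on the CPU side, the coarse-grained data copy named in the abstract: the per-call overhead is paid once per batch rather than once per block. In an OpenCL port, each iteration of the outer loop would be a natural candidate for one work-item, though that mapping is likewise an assumption here.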

Abstract
Contents
List of Figures
1 Introduction
1.1 Motivation
1.2 Overview of the thesis
2 Background
2.1 VP8 codec flow
2.1.1 Inverse Transform
2.1.2 Inverse Quantization
2.1.3 Intra Prediction
2.1.4 Motion Compensation
2.1.5 Loop Filter
2.1.6 Color Space Conversion
2.2 VP8 existing OpenCL version
2.3 OpenCL benefit
3 Optimization Methods
3.1 Copy data with coarse granularity
3.2 Vectorization
3.3 Thread algorithm
4 Problem
4.1 Algorithm software flow
4.2 Submit time
5 Advanced Algorithm Designs
5.1 Slanted algorithm
6 Experimental results
6.1 Experimental environment
6.2 Experimental Results
6.2.1 Odroid-XU4
6.2.2 AMD HSA
7 Conclusion
7.1 Summary
7.2 Future work
