

Author: Chiu, Yi-Hsin
Thesis title: The Optimization of VP8 Decoder through OpenCL Flow on Heterogeneous Multi-Core Systems
Advisor: Lee, Jenq-Kuen
Oral defense committee: Hung, Ming-Yu; You, Yi-Ping
Keywords: OpenCL; VP8; Mali GPUs; AMD APU; vectorization; inverse quantization; inverse transform


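In this thesis, we use OpenCL to restructure the flow of the VP8 decoder and measure the speed of two of its routines on a Mali GPU and an AMD APU, respectively. In this case study, we designed several optimization techniques for the overall flow, including experimental tests of vectorization and of copying the data of multiple blocks to the GPU. In addition, we propose and test a supplementary algorithm for the decoding prediction problem we encountered. The experimental results show that our approach optimizes the inverse quantization and inverse discrete cosine transform stages relative to the VP8 OpenCL version currently available online. On average, we achieve a 1.7x speedup over the sequential version of VP8 and a 332.043x speedup over the online VP8 OpenCL version.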
With the rapid development of internet communications, the delivery of video information has become faster. Besides transmission speed, video delivery depends on compression at the sending end and decoding at the receiving end, which mostly use MPEG and other high-end video codecs. Since Google released VP8 under a BSD license, there has been an additional option for video encoding and compression. With its open source code and free license, VP8 has attracted wide attention. Whether the target platform is digital television or mobile communications, VP8 can be ported so that encoding and decoding run in software. Although VP8 is portable, its performance can still be tuned for different architectures as needed.
Today, graphics processors are increasingly powerful, and using a large number of threads to raise throughput is common. It is therefore desirable for applications to use OpenCL (Open Computing Language) to improve performance: a large number of threads combined with vector parallel computation improves efficiency and raises overall hardware utilization. The performance of image processing applications has thus become a focus of research.
In this thesis, we use OpenCL to restructure the VP8 decoder's flow and use two of its routines, inverse quantization and inverse transform, to measure speed on the Mali GPU and the AMD APU, respectively. In this case study, we designed several optimization techniques for both routines: vectorization, copying the data of multiple blocks to the GPU at once, and thread parallelism. In addition, we propose and test a supplementary algorithm for the decoding prediction problems we encountered. The experimental results show that our scheme optimizes both the inverse quantization and the inverse transform routines compared with the existing OpenCL version of VP8 available online. On average, our decoder is 1.7 times faster than the sequential version of VP8 and 332.043 times faster than the existing OpenCL version of the VP8 decoder.
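To make the two data-level techniques named above concrete, here is a minimal sketch in plain C (not OpenCL, and not the thesis's code). It assumes the dequantization form described in the VP8 decoding guide, where each coefficient of a 4x4 block is multiplied by a DC or an AC factor. The function names, the flat buffer layout, and the four-wide inner step (standing in for a `short4`-style vector operation) are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define COEFFS_PER_BLOCK 16 /* a VP8 residual block is 4x4 */

/* Hypothetical helper: position 0 takes the DC factor, 1..15 the AC factor. */
static void fill_factors(int16_t *factors, int16_t dc, int16_t ac)
{
    factors[0] = dc;
    for (size_t i = 1; i < COEFFS_PER_BLOCK; ++i) {
        factors[i] = ac;
    }
}

/* Baseline: dequantize one block, one coefficient at a time. */
static void dequantize_block(const int16_t *q, const int16_t *factors,
                             int16_t *out)
{
    for (size_t i = 0; i < COEFFS_PER_BLOCK; ++i) {
        out[i] = (int16_t)(q[i] * factors[i]);
    }
}

/*
 * Coarse-grained variant: many blocks sit in one contiguous buffer, so a
 * single transfer and a single call cover all of them. The inner loop
 * handles four coefficients per step, the shape a 4-wide vector multiply
 * would take in an OpenCL kernel.
 */
static void dequantize_batch(const int16_t *q, const int16_t *factors,
                             int16_t *out, size_t num_blocks)
{
    for (size_t b = 0; b < num_blocks; ++b) {
        const int16_t *src = q + b * COEFFS_PER_BLOCK;
        int16_t *dst = out + b * COEFFS_PER_BLOCK;
        for (size_t i = 0; i < COEFFS_PER_BLOCK; i += 4) {
            dst[i + 0] = (int16_t)(src[i + 0] * factors[i + 0]);
            dst[i + 1] = (int16_t)(src[i + 1] * factors[i + 1]);
            dst[i + 2] = (int16_t)(src[i + 2] * factors[i + 2]);
            dst[i + 3] = (int16_t)(src[i + 3] * factors[i + 3]);
        }
    }
}
```

In a GPU port, the outer loop over blocks would typically become the work-item index, and batching the blocks amortizes the host-to-device copy and kernel submission cost over more data.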
Abstract
Contents
List of Figures
1 Introduction
1.1 Motivation
1.2 Overview of the thesis
2 Background
2.1 VP8 codec flow
2.1.1 Inverse Transform
2.1.2 Inverse Quantization
2.1.3 Intra Prediction
2.1.4 Motion Compensation
2.1.5 Loop Filter
2.1.6 Color Space Conversion
2.2 VP8 existing OpenCL version
2.3 OpenCL benefit
3 Optimization Methods
3.1 Copy data with coarse granularity
3.2 Vectorization
3.3 Thread algorithm
4 Problem
4.1 Algorithm software flow
4.2 Submit time
5 Advanced Algorithm Designs
5.1 Slanted algorithm
6 Experimental results
6.1 Experimental environment
6.2 Experimental Results
6.2.1 Odroid-XU4
6.2.2 AMD HSA
7 Conclusion
7.1 Summary
7.2 Future work