Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation (pdf)

Article PDF cannot be displayed. You can download it here:

http://downloads.hindawi.com/journals/tswj/2014/716020.pdf

Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

Hindawi Publishing Corporation e Scientiﬁc World Journal Volume 2014, Article ID 716020, 19 pages http://dx.doi.org/10.1155/2014/716020 Research Article Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation Huayou Su, Mei Wen, Nan Wu, Ju Ren, and Chunyuan Zhang School of Computer Science and Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, Hunan 410073, China Correspondence should be addressed to Huayou Su; Received 27 November 2013; Accepted 16 January 2014; Published 16 March 2014 Academic Editors: J. Shu and F. Yu Copyright © 2014 Huayou Su et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Through reorganizing the execution order and optimizing the data structure, we proposed an efficient parallel framework for H.264/AVC encoder based on massively parallel architecture. We implemented the proposed framework by CUDA on NVIDIA’s GPU. Not only the compute intensive components of the H.264 encoder are parallelized but also the control intensive components are realized effectively, such as CAVLC and deblocking filter. In addition, we proposed serial optimization methods, including the multiresolution multiwindow for motion estimation, multilevel parallel strategy to enhance the parallelism of intracoding as much as possible, component-based parallel CAVLC, and direction-priority deblocking filter. More than 96% of workload of H.264 encoder is offloaded to GPU. Experimental results show that the parallel implementation outperforms the serial program by 20 times of speedup ratio and satisfies the requirement of the real-time HD encoding of 30 fps. The loss of PSNR is from 0.14 dB to 0.77 dB, when keeping the same bitrate. Through the analysis to the kernels, we found that speedup ratios of the compute intensive algorithms are proportional with the computation power of the GPU. However, the performance of the control intensive parts (CAVLC) is much related to the memory bandwidth, which gives an insight for new architecture design. 1. Introduction Video encoding plays an increasingly larger role in the multimedia processing community, which aims to reduce the size of the video sequence by exploiting spatial and temporal redundancy, as well as keeping the quality as good as possible. H.264/AVC [1] is currently the widely used video coding standard, which constitutes the basis of the emerging High Efficiency Video Coding (HEVC) [2]. It achieves about 39% and 49% bit-rate saving over that of MPEG-4 and H.263, respectively [3, 4]. The high compression efficiency is mainly attributed to several introduced new features, including variable block-size motion compensation, multiple reference frames, quarter pixel motion estimation, integer transform, in-the-loop deblocking filtering, and advanced entropy coding [5–8]. These new features imply that more computational power is needed for H.264 encoder [9]. It is almost impossible to achieve real-time High-Definition (HD) H.264 encoding in serial programming technologies, which restricts its usage in many areas [10–13]. In order to satisfy the requirement of real-time encoding, many research works focused on hardware-based encoders design [14–17]. Though high efficiency can be gained, dedicated ASIC designs are inflexible, time consuming, and expensive. Due to the high peak performance, high-speed bandwidth, and efficient programming environments, such as NVIDIA’s CUDA [18] and OpenCL [19], GPU has been at the leading edge of high performance computing era. Recently, many researchers are attracted to the topic of parallelizing video processing with multicore or many-core architecture, especially on the GPU-based systems [8–12, 20–27]. However, most of the research has mainly focused on accelerating the computational components, such as the motion estimation (ME) [12, 21, 22], motion compensation [10], and intraprediction [23]. For the irregular algorithms, such as deblocking filter and Context-based adaptive variable-length code (CAVLC), research about these aspects is seldom [24]. To the best of our knowledge, there is no research about GPU-based CAVLC, except our work [28]. There are several disadvantages by only accelerating some parts of video encoder. On the one hand, for each frame, the data size 2 transferred between CPU and GPU will be very huge. For example, when offloading the ME and transform coding to GPU only, the data size of the input frame, the quantized coefficients, and the auxiliary information are more than 30 MB for 1080 p video format. On the other hand, after parallelizing the compute intensive parts of the encoder, the control intensive algorithms occupy a larger fraction of execution time [29]. Though NVIDIA provides a GPU-based encoder library, the detailed information is insufficient, let alone open source. In this paper, we focused on developing a GPU-based parallel framework for H.264/AVC encoder and the efficient parallel implementation. The main contributions of this paper are as follows. After carefully reviewing and profiling the program, we proposed a fully parallel framework for H.264 encoder based on GPU. We introduced the loop partition technology to divide the whole pipeline into four steps (ME, intracoding, CAVLC, and deblocking filter) in terms of frame. All the four components are offloaded to GPU hardware in our framework. The CPU is only responsible for some simple transactions, such as I/O process. In order to improve the memory bandwidth efficiency, array of structure (AOS) to structure of array (SOA) transformation is performed. The transformed small and regular structures are more suitable for taking the advantage of coalesced accessing mechanism. In addition, the proposed framework exploits the producerconsumer locality between different parts of the encoder, which avoids unnecessary data copy between CPU and GPU. For the compute intensive component motion estimation, a scalable parallel algorithm has been proposed targeting massively parallel architecture, named multiresolutions multiwindows (MRMW) motion estimation. It calculates the optimal motion vector (MV) for each macroblock (MB) through several steps. Firstly, the original input frame and reference frame are concentrated into small resolution ones. Accordingly, there is a concentrated MB in the dedicated frame corresponding to the normal MB in the original frame. Secondly, based on the concentrated lower resolution frames, a full search in an assigned window space is performed for each concentrated MB and it produced a primary MV. Finally, a refinement search for the MBs of the original frame will be performed; the search window is centered with the produced MV in the second step. In order to overcome the limitations from the irregular componen (...truncated)