Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Hindawi Publishing Corporation
The Scientific World Journal
Volume 2014, Article ID 716020, 19 pages
http://dx.doi.org/10.1155/2014/716020
Research Article
Huayou Su, Mei Wen, Nan Wu, Ju Ren, and Chunyuan Zhang
School of Computer Science and Science and Technology on Parallel and Distributed Processing Laboratory,
National University of Defense Technology, Changsha, Hunan 410073, China
Correspondence should be addressed to Huayou Su;
Received 27 November 2013; Accepted 16 January 2014; Published 16 March 2014
Academic Editors: J. Shu and F. Yu
Copyright © 2014 Huayou Su et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
By reorganizing the execution order and optimizing the data structures, we propose an efficient parallel framework for an
H.264/AVC encoder on massively parallel architectures. We implemented the proposed framework with CUDA on NVIDIA
GPUs. Not only are the compute-intensive components of the H.264 encoder parallelized, but the control-intensive components,
such as CAVLC and the deblocking filter, are also realized effectively. In addition, we propose a series of optimization methods,
including multiresolution multiwindow motion estimation, a multilevel parallel strategy that enhances the parallelism of
intracoding as much as possible, component-based parallel CAVLC, and a direction-priority deblocking filter. More than 96% of
the workload of the H.264 encoder is offloaded to the GPU. Experimental results show that the parallel implementation achieves a
20-fold speedup over the serial program and satisfies the requirement of real-time HD encoding at 30 fps. The PSNR loss ranges
from 0.14 dB to 0.77 dB at the same bitrate. By analyzing the kernels, we found that the speedup ratios of the compute-intensive
algorithms are proportional to the computational power of the GPU, whereas the performance of the control-intensive parts
(CAVLC) depends strongly on memory bandwidth, which offers insight for new architecture design.
1. Introduction
Video encoding plays an increasingly important role in the
multimedia processing community. It aims to reduce
the size of a video sequence by exploiting spatial and
temporal redundancy while keeping the quality as high
as possible. H.264/AVC [1] is currently a widely used video
coding standard and constitutes the basis of the emerging
High Efficiency Video Coding (HEVC) [2]. It achieves about
39% and 49% bit-rate savings over MPEG-4 and
H.263, respectively [3, 4]. The high compression efficiency
is mainly attributed to several newly introduced features,
including variable block-size motion compensation, multiple
reference frames, quarter-pixel motion estimation, integer
transform, in-the-loop deblocking filtering, and advanced
entropy coding [5–8]. These features mean that an H.264
encoder needs more computational power [9]. It is
almost impossible to achieve real-time High-Definition (HD)
H.264 encoding with serial programming techniques, which
restricts its usage in many areas [10–13]. To satisfy
the requirement of real-time encoding, much research has
focused on hardware-based encoder design [14–17]. Although
high efficiency can be gained, dedicated ASIC designs are
inflexible, time consuming, and expensive.
Owing to their high peak performance, high memory bandwidth, and efficient programming environments, such as
NVIDIA's CUDA [18] and OpenCL [19], GPUs have been at the
leading edge of high-performance computing. Recently,
many researchers have been attracted to the topic of parallelizing
video processing on multicore or many-core architectures,
especially on GPU-based systems [8–12, 20–27]. However,
most of this research has focused on accelerating
the computational components, such as motion estimation (ME) [12, 21, 22], motion compensation [10], and
intraprediction [23]. For irregular algorithms, such as the
deblocking filter and context-based adaptive variable-length
coding (CAVLC), research is scarce [24].
To the best of our knowledge, there is no research on
GPU-based CAVLC except our own work [28]. Accelerating
only some parts of the video encoder has several
disadvantages. On the one hand, for each frame, the amount of data
transferred between the CPU and the GPU is very large. For
example, when only ME and transform coding are offloaded to
the GPU, the input frame, the quantized
coefficients, and the auxiliary information together amount to more than
30 MB for the 1080p video format. On the other hand, after
the compute-intensive parts of the encoder are parallelized,
the control-intensive algorithms occupy a larger fraction of
the execution time [29]. Although NVIDIA provides a GPU-based
encoder library, little detailed information about it is available, and
it is not open source. In this paper, we focus on developing a
GPU-based parallel framework for the H.264/AVC encoder and
on its efficient parallel implementation. The main contributions
of this paper are as follows.
After carefully reviewing and profiling the program, we
propose a fully parallel GPU-based framework for the H.264
encoder. We introduce loop partitioning to
divide the whole pipeline into four frame-level steps (ME,
intracoding, CAVLC, and deblocking filter). All
four components are offloaded to the GPU in our
framework. The CPU is responsible only for simple
tasks, such as I/O. To improve
memory bandwidth efficiency, an array-of-structures (AOS) to
structure-of-arrays (SOA) transformation is performed. The
resulting small, regular structures are better suited to
taking advantage of the coalesced memory access mechanism.
In addition, the proposed framework exploits the producer-consumer locality between different parts of the encoder,
which avoids unnecessary data copies between the CPU and the GPU.
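The AOS-to-SOA transformation can be illustrated with a minimal sketch. It is written in Python for brevity rather than CUDA, and the record fields (mv_x, mv_y, sad) and the function name aos_to_soa are hypothetical choices for illustration, not the data structures used in the encoder. The point is only that, after the transformation, the values of one field for consecutive macroblocks sit next to each other in memory, which is the layout that coalesced accesses by neighboring GPU threads rely on.

```python
from array import array

# Hypothetical per-macroblock record; field names and types are illustrative only.
FIELDS = {"mv_x": "h", "mv_y": "h", "sad": "I"}  # array typecode for each field


def aos_to_soa(records):
    """Turn a list of per-macroblock dicts (AOS) into one flat array per field (SOA)."""
    soa = {name: array(code) for name, code in FIELDS.items()}
    for rec in records:
        for name in FIELDS:
            soa[name].append(rec[name])
    return soa


# AOS layout: all fields of one macroblock are stored together.
macroblocks = [
    {"mv_x": 1, "mv_y": -2, "sad": 310},
    {"mv_x": 0, "mv_y": 4, "sad": 95},
    {"mv_x": -3, "mv_y": 0, "sad": 12},
]

# SOA layout: soa["sad"][i] is the SAD of macroblock i, stored contiguously.
soa = aos_to_soa(macroblocks)
```

In a GPU kernel, thread i would read element i of one such array, so a group of adjacent threads touches one contiguous memory segment instead of strided fields inside larger records.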
For the compute-intensive motion estimation component,
we propose a scalable parallel algorithm targeting
massively parallel architectures, named multiresolution multiwindow (MRMW) motion estimation. It calculates the
optimal motion vector (MV) for each macroblock (MB)
in several steps. First, the original input frame and the
reference frame are concentrated into lower-resolution frames.
Accordingly, each normal MB in the original frame has a
corresponding concentrated MB in the concentrated frame.
Second, on the concentrated lower-resolution frames,
a full search within an assigned window is performed for
each concentrated MB, which produces a primary MV. Finally,
a refinement search is performed for the MBs of the original
frame, with the search window centered on the MV produced
in the second step.
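The three steps above can be sketched as a small serial program. This is a minimal illustration in Python, not the parallel CUDA implementation: it assumes that "concentrating" a frame means averaging blocks of pixels, uses the sum of absolute differences (SAD) as the matching cost, and picks a downsampling factor of 2 and small search radii arbitrarily. All function and parameter names are hypothetical.

```python
def downsample(frame, factor=2):
    """Step 1: average each factor x factor block into one pixel (assumed scheme)."""
    h, w = len(frame) // factor, len(frame[0]) // factor
    area = factor * factor
    return [
        [
            sum(frame[y * factor + j][x * factor + i]
                for j in range(factor) for i in range(factor)) // area
            for x in range(w)
        ]
        for y in range(h)
    ]


def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size) for i in range(size)
    )


def full_search(cur, ref, cx, cy, size, center, radius):
    """Test every displacement within +/- radius of center; return (mv, cost)."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > w or ry + size > h:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def mrmw_motion_vector(cur, ref, mb_x, mb_y, mb=16, factor=2,
                       coarse_radius=4, refine_radius=1):
    """Coarse full search on downsampled frames, then refinement at full resolution."""
    cur_lo, ref_lo = downsample(cur, factor), downsample(ref, factor)
    # Step 2: primary MV for the concentrated MB, window centered on zero motion.
    coarse_mv, _ = full_search(cur_lo, ref_lo, mb_x // factor, mb_y // factor,
                               mb // factor, (0, 0), coarse_radius)
    # Step 3: scale the primary MV back up and refine around it.
    center = (coarse_mv[0] * factor, coarse_mv[1] * factor)
    return full_search(cur, ref, mb_x, mb_y, mb, center, refine_radius)
```

Calling mrmw_motion_vector(cur, ref, mb_x, mb_y) for every MB is independent work, which is what makes the scheme a natural fit for one thread block per MB on a GPU; the coarse search costs far fewer SAD operations than a full-resolution search over the equivalent window.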
In order to overcome the limitations from the irregular
componen (...truncated)