An efficient versatile video coding motion estimation hardware
Journal of Real-Time Image Processing (2024) 21:25
https://doi.org/10.1007/s11554-023-01402-8
RESEARCH
Waqar Ahmad1,2 · Hossein Mahdavi1 · Ilker Hamzaoglu1,3
Received: 11 November 2023 / Accepted: 14 December 2023 / Published online: 29 January 2024
© The Author(s) 2024
Abstract
Versatile Video Coding (VVC) is the latest video coding standard. It provides higher compression efficiency than previous video coding standards at the cost of a significant increase in computational complexity. Motion estimation (ME) is the
most time-consuming and memory-intensive module in a VVC encoder. Therefore, in this paper, we propose an efficient VVC
ME hardware. It is the first VVC ME hardware in the literature. It achieves real-time performance with a small hardware area. This
efficiency is achieved by using a 64 × 64 systolic processing element array to support the maximum coding tree unit (CTU)
size of 128 × 128 and by using a novel memory-based sum of absolute differences (SAD) adder tree to calculate the SADs of
128 × 128 CTUs. The proposed VVC ME hardware reduces memory accesses significantly by using an efficient data reuse
method. It can process up to 30 4K (3840 × 2160) video frames per second.
Keywords Video compression · VVC · Motion estimation · Hardware
1 Introduction
As the amount of video data is increasing significantly, more
efficient video compression is needed to transmit and store
this video data with limited available bandwidth and storage space [1]. Therefore, the Joint Video Experts Team (JVET)
of the ITU-T and ISO standardization organizations developed
the Versatile Video Coding (VVC) standard in 2020 [2]. VVC
provides 50% higher compression efficiency than its predecessor, the High Efficiency Video Coding (HEVC) standard
developed in 2013 [3, 4]. VVC is designed to encode diverse
video content such as high dynamic range, 360° video and
virtual reality [5].
VVC uses several new encoding tools to achieve better
compression than HEVC, such as a new block partitioning
structure called quadtree plus multi-type tree (QTMT), affine
motion estimation and multiple transforms [6]. VVC divides
a video frame into blocks called coding tree units (CTUs)
and encodes each CTU separately. Each CTU can be further
divided into coding units (CUs) using QTMT. QTMT allows
more partitions than the simple quadtree (QT) partitioning used
in HEVC. The maximum CTU size is 128 × 128 in VVC and
64 × 64 in HEVC.
* Ilker Hamzaoglu
Waqar Ahmad
Hossein Mahdavi
1 Sabanci University, Tuzla, 34956 Istanbul, Turkey
2 Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi 23460, Pakistan
3 Ozyegin University, Çekmeköy, 34794 Istanbul, Turkey
VVC achieves higher compression efficiency than HEVC
at the cost of a significant increase in computational complexity. The VVC encoder is 5 times and 31 times more complex than
the HEVC encoder under the Low-Delay and All-Intra configurations, respectively [7]. The encoding time of the VVC reference
software encoder (VTM) is about 10 times longer than the
encoding time of the HEVC reference software encoder (HM)
[8]. Therefore, dedicated hardware implementations are
needed for processing high-resolution videos in real time [9].
Successive frames in a video sequence have temporal
redundancy. Video coding standards remove this temporal
redundancy by performing motion estimation (ME). ME is
the most time-consuming and memory-intensive module in
video encoding [10]. More than 50% of the encoding time of
a VVC encoder is spent on ME [7]. Up to 60% of the memory
accesses of a VVC encoder come from the ME module [11].
There are several HEVC ME hardware designs in the literature
[12–20]. Several sum of absolute differences (SAD) hardware designs that can be used for ME have also been proposed in the literature
[21, 22]. There are several VVC intra prediction, fractional
interpolation and transform hardware designs in the literature
[23–26]. However, to the best of our knowledge, there is no
VVC ME hardware in the literature.
VVC ME has higher computational complexity than
HEVC ME because it uses a larger maximum CTU size
and a more complex block partitioning structure, QTMT. Two types of SAD adder trees are proposed in
the literature for HEVC ME hardware. A fully parallel SAD
adder tree processes all the pixels in the largest CU in parallel. A sequential SAD adder tree divides the largest CU
into smaller blocks and processes these blocks in successive
clock cycles. Using a fully parallel SAD adder tree in VVC
ME hardware results in a very large hardware area. Using a
sequential SAD adder tree in VVC ME hardware results in
low data reuse and low throughput. Therefore, the methods
proposed in the literature for designing HEVC ME hardware
are inefficient for designing VVC ME hardware.
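The contrast between the two adder tree styles can be illustrated with a small software model. The sketch below is only an analogy, not a hardware description: the function names, the toy pixel data, the 8 × 8 block and the 4 × 4 tile are all assumptions made for illustration, and each loop iteration of the sequential version merely stands in for one clock cycle. It shows that both styles produce the same SAD, while the sequential one needs one step per sub-block.

```python
def make_block(size, a, b, m):
    """Deterministic toy pixel data (illustrative only, not real video)."""
    return [[(a * i + b * j) % m for j in range(size)] for i in range(size)]


def sad_fully_parallel(cur, ref):
    """Model of a fully parallel tree: all pixel differences summed in one step."""
    return sum(
        abs(c - r)
        for row_c, row_r in zip(cur, ref)
        for c, r in zip(row_c, row_r)
    )


def sad_sequential(cur, ref, tile):
    """Model of a sequential tree: one tile x tile sub-block per step.

    Assumes square blocks whose size is a multiple of `tile`.
    Returns (total SAD, number of steps).
    """
    size = len(cur)
    total, steps = 0, 0
    for top in range(0, size, tile):
        for left in range(0, size, tile):
            # Partial SAD of one sub-block, accumulated into the running total.
            total += sum(
                abs(cur[i][j] - ref[i][j])
                for i in range(top, top + tile)
                for j in range(left, left + tile)
            )
            steps += 1
    return total, steps


cur = make_block(8, 3, 5, 17)
ref = make_block(8, 7, 2, 19)
total, steps = sad_sequential(cur, ref, tile=4)
print(sad_fully_parallel(cur, ref) == total, steps)  # True 4
```

In this model the trade-off is visible directly: the single-step version needs every pixel difference available at once, whereas the tiled version needs only one tile of arithmetic but takes (size / tile)² steps.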
In this paper, we propose the first VVC ME hardware in
the literature. The proposed hardware uses the Full Search
ME algorithm with the SAD metric to find the best motion
vector for a wide range of QTMT partition sizes, ranging
from 8 × 4 (4 × 8) to 128 × 128, within a CTU. The proposed
hardware calculates the SADs of a 128 × 128 CTU using a 64 × 64
systolic processing element array and a novel memory-based
SAD adder tree to achieve real-time performance with a small
hardware area. It reduces memory accesses significantly by
using an efficient data reuse method.
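For readers unfamiliar with Full Search ME, the following is a minimal software sketch of the algorithm with the SAD metric. It is not the proposed hardware: the function names, the `search_range` parameter, the (dy, dx) motion vector convention and the toy 8 × 8 reference frame are assumptions chosen for illustration. Every candidate position inside the search range is evaluated and the one with the smallest SAD is kept.

```python
def sad_at(cur, ref, top, left):
    """SAD between block `cur` and the same-sized region of `ref` at (top, left)."""
    return sum(
        abs(cur[i][j] - ref[top + i][left + j])
        for i in range(len(cur))
        for j in range(len(cur[0]))
    )


def full_search(cur, ref, block_top, block_left, search_range):
    """Exhaustively test every displacement in [-search_range, search_range].

    Returns (best_mv, best_sad), where best_mv = (dy, dx) is relative to the
    block position. Candidates that fall outside the reference frame are skipped.
    """
    h, w = len(cur), len(cur[0])
    best_mv, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = block_top + dy, block_left + dx
            if top < 0 or left < 0 or top + h > len(ref) or left + w > len(ref[0]):
                continue
            cost = sad_at(cur, ref, top, left)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dy, dx), cost
    return best_mv, best_sad


# Toy reference frame with unique pixel values: ref[r][c] = 8 * r + c.
ref = [[8 * r + c for c in range(8)] for r in range(8)]
# The current 2 x 2 block is a copy of the reference region at row 3, column 4.
cur = [row[4:6] for row in ref[3:5]]
# The block sits at (2, 2) in the current frame, so the true displacement is (1, 2).
print(full_search(cur, ref, block_top=2, block_left=2, search_range=2))  # ((1, 2), 0)
```

The nested loops make the cost of Full Search explicit: the number of SAD evaluations grows with the square of the search range, and each evaluation touches every pixel of the block.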
The proposed novel memory-based SAD adder tree combines the features of fully parallel and sequential SAD adder
trees. It is highly efficient as it achieves high data reuse and
high throughput, and it uses a smaller hardware area than a
fully parallel 128 × 128 SAD adder tree.
The proposed VVC ME hardware is implemented using
Verilog HDL. It works at 253 MHz on a Xilinx Virtex 7
FPGA, and it can process up to 30 4K (3840 × 2160) video
frames per second (fps).
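The reported clock frequency and frame rate imply an average cycle budget per CTU, which can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a figure from the paper: it assumes that every frame is covered by 128 × 128 CTUs, that partial CTUs at the frame border cost as much as full ones, and that the hardware is busy on every cycle. The function names are hypothetical.

```python
import math


def ctus_per_frame(width, height, ctu_size=128):
    """Number of CTUs needed to cover a frame, counting partial border CTUs."""
    return math.ceil(width / ctu_size) * math.ceil(height / ctu_size)


def cycle_budget_per_ctu(clock_hz, fps, width, height, ctu_size=128):
    """Average clock cycles available per CTU at the given clock and frame rate."""
    return clock_hz / (fps * ctus_per_frame(width, height, ctu_size))


n = ctus_per_frame(3840, 2160)  # 30 columns x 17 rows
budget = cycle_budget_per_ctu(253e6, 30, 3840, 2160)
print(n, round(budget))  # 510 16536
```

Under these assumptions, a 4K frame contains 510 CTUs, so sustaining 30 fps at 253 MHz leaves roughly 16,500 cycles per CTU on average.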
The rest of the paper is organized as follows. In Sect. 2,
VVC ME is explained. Section 3 describes the proposed
VVC ME hardware. Its implementation results and comparison with HEVC ME hardware in the literature are given in
Sect. 4. Finally, Sect. 5 concludes the paper.
2 VVC motion estimation
VVC uses block matching for translational ME. In block
matching, the current video frame is divided into blocks. As
shown in Fig. 1, for each block in the current frame, the best
matching block in a search window (SW) in the reference
frame and the corresponding motion vector (MV) are determined. The SAD metric is typically used to determine the best
matching block. The SAD between blocks A and B is calculated
as shown in Eq. (1), where W × H is the block size, and A(i, j)
and B(i, j) are the pixels in the ith row and jth column of A and B,
respectively.
$$\mathrm{SAD} = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| A(i, j) - B(i, j) \right| \tag{1}$$
Fig. 1 Block matching motion estimation (labels: Current Frame, Current Block, Reference Frame, Search Window, Best Match)
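Eq. (1) translates directly into a few lines of code. The sketch below is a plain software reference, with blocks represented as lists of rows; the function name and the 2 × 2 sample values are assumptions made purely for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    if len(block_a) != len(block_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(block_a, block_b)
    ):
        raise ValueError("blocks must have the same dimensions")
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


A = [[10, 12], [20, 22]]
B = [[11, 10], [20, 25]]
print(sad(A, B))  # 1 + 2 + 0 + 3 = 6
```

A SAD of zero means the two blocks are identical, and larger values indicate a worse match, which is why the candidate with the smallest SAD is chosen as the best matching block.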
Video coding standards perform variable block size ME.
Large block sizes achieve higher compres (...truncated)