An efficient versatile video coding motion estimation hardware


Journal of Real-Time Image Processing (2024) 21:25
https://doi.org/10.1007/s11554-023-01402-8

RESEARCH

Waqar Ahmad (1,2) · Hossein Mahdavi (1) · Ilker Hamzaoglu (1,3)

Received: 11 November 2023 / Accepted: 14 December 2023 / Published online: 29 January 2024
© The Author(s) 2024

Abstract

Versatile Video Coding (VVC) is the latest video coding standard. It provides higher compression efficiency than previous video coding standards at the cost of a significant increase in computational complexity. Motion estimation (ME) is the most time-consuming and memory-intensive module in a VVC encoder. Therefore, in this paper, we propose an efficient VVC ME hardware. It is the first VVC ME hardware in the literature. It achieves real-time performance with a small hardware area. This efficiency is achieved by using a 64 × 64 systolic processing element array to support the maximum coding tree unit (CTU) size of 128 × 128 and by using a novel memory-based sum of absolute differences (SAD) adder tree to calculate the SADs of 128 × 128 CTUs. The proposed VVC ME hardware reduces memory accesses significantly by using an efficient data reuse method. It can process up to 30 4K (3840 × 2160) video frames per second.

Keywords Video compression · VVC · Motion estimation · Hardware

1 Introduction

As the amount of video data increases significantly, more efficient video compression is needed to transmit and store it within the limited available bandwidth and storage space [1]. Therefore, the Joint Video Experts Team (JVET) of the ITU-T and ISO standardization organizations developed the Versatile Video Coding (VVC) standard in 2020 [2]. VVC provides 50% higher compression efficiency than its predecessor, the High Efficiency Video Coding (HEVC) standard developed in 2013 [3, 4]. VVC is designed to encode diverse video content such as high dynamic range, 360° video and virtual reality [5].
* Ilker Hamzaoglu · Waqar Ahmad · Hossein Mahdavi
1 Sabanci University, Tuzla, 34956 Istanbul, Turkey
2 Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi 23460, Pakistan
3 Ozyegin University, Çekmeköy, 34794 Istanbul, Turkey

VVC uses several new encoding tools to achieve better compression than HEVC, such as a new block partitioning structure called quadtree plus multi-type tree (QTMT), affine motion estimation and multiple transforms [6]. VVC divides a video frame into blocks called coding tree units (CTUs) and encodes each CTU separately. Each CTU can be further divided into coding units (CUs) using QTMT. QTMT allows more partitions than the simple quadtree (QT) partitioning used in HEVC. The maximum CTU size is 128 × 128 in VVC and 64 × 64 in HEVC.

VVC achieves higher compression efficiency than HEVC at the cost of a significant increase in computational complexity. The VVC encoder is 5 times and 31 times more complex than the HEVC encoder under the Low-Delay and All-Intra configurations, respectively [7]. The encoding time of the VVC reference software encoder (VTM) is about 10 times that of the HEVC reference software encoder (HM) [8]. Therefore, dedicated hardware implementations are needed for processing high resolution videos in real time [9].

Successive frames in a video sequence have temporal redundancy. Video coding standards remove this temporal redundancy by performing motion estimation (ME). ME is the most time-consuming and memory-intensive module in video encoding [10]. More than 50% of the encoding time of a VVC encoder is spent on ME [7]. Up to 60% of the memory accesses of a VVC encoder come from the ME module [11]. Several HEVC ME hardware designs have been proposed in the literature [12–20]. Several sum of absolute differences (SAD) hardware designs that can be used for ME have also been proposed [21, 22].
There are several VVC intra prediction, fractional interpolation and transform hardware designs in the literature [23–26]. However, to the best of our knowledge, there is no VVC ME hardware in the literature. VVC ME has higher computational complexity than HEVC ME because it uses a larger maximum CTU size and the more complex QTMT block partitioning structure.

Two types of SAD adder trees have been proposed in the literature for HEVC ME hardware. A fully parallel SAD adder tree processes all the pixels in the largest CU in parallel. A sequential SAD adder tree divides the largest CU into smaller blocks and processes each block in successive clock cycles. Using a fully parallel SAD adder tree in VVC ME hardware results in a very large hardware area. Using a sequential SAD adder tree in VVC ME hardware results in low data reuse and low throughput. Therefore, the methods proposed in the literature for designing HEVC ME hardware are inefficient for designing VVC ME hardware.

In this paper, we propose the first VVC ME hardware in the literature. The proposed hardware uses the Full Search ME algorithm with the SAD metric to find the best motion vector for a wide range of QTMT partition sizes, ranging from 8 × 4 (4 × 8) to 128 × 128, within a CTU. The proposed hardware calculates the SADs of a 128 × 128 CTU using a 64 × 64 systolic processing element array and a novel memory-based SAD adder tree to achieve real-time performance with a small hardware area. It reduces memory accesses significantly by using an efficient data reuse method. The proposed novel memory-based SAD adder tree combines the features of fully parallel and sequential SAD adder trees. It is highly efficient as it achieves high data reuse and high throughput, and it uses a smaller hardware area than a fully parallel 128 × 128 SAD adder tree. The proposed VVC ME hardware is implemented using Verilog HDL.
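To make the Full Search ME with the SAD metric and the two adder tree styles described above more concrete, here is a minimal software sketch in Python. It is not the paper's Verilog design: the function names (`block_sad`, `tiled_sad`, `full_search`), the tile size and the toy 6 × 6 search window are assumptions made purely for illustration. `block_sad` adds all absolute pixel differences in one pass, loosely mirroring a fully parallel adder tree, while `tiled_sad` accumulates partial SADs of smaller tiles one after another, loosely mirroring a sequential adder tree. Both return the same value; the hardware trade-off between area and clock cycles is not something a software model can show.

```python
# Illustrative model of the arithmetic only (hypothetical names, toy data).

def block_sad(cur, ref, ref_top, ref_left):
    """SAD between block `cur` and the same-sized block of `ref`
    whose top-left corner is at (ref_top, ref_left)."""
    return sum(
        abs(cur[r][c] - ref[ref_top + r][ref_left + c])
        for r in range(len(cur))
        for c in range(len(cur[0]))
    )


def tiled_sad(cur, ref, ref_top, ref_left, tile):
    """Same SAD, accumulated one tile at a time, as a sequential adder
    tree would add one partial SAD per step (tile size is an assumption)."""
    total = 0
    for r0 in range(0, len(cur), tile):
        for c0 in range(0, len(cur[0]), tile):
            sub = [row[c0:c0 + tile] for row in cur[r0:r0 + tile]]
            total += block_sad(sub, ref, ref_top + r0, ref_left + c0)
    return total


def full_search(cur, search_window):
    """Evaluate every candidate position in the search window and
    return (best_sad, (row, col)) of the first minimum-SAD position."""
    h, w = len(cur), len(cur[0])
    best = None
    for row in range(len(search_window) - h + 1):
        for col in range(len(search_window[0]) - w + 1):
            cost = block_sad(cur, search_window, row, col)
            if best is None or cost < best[0]:
                best = (cost, (row, col))
    return best


# Toy data: a 6 x 6 search window and a 2 x 2 block copied from (3, 2).
search_window = [[10 * r + c * c for c in range(6)] for r in range(6)]
current_block = [row[2:4] for row in search_window[3:5]]

print(full_search(current_block, search_window))  # prints (0, (3, 2))
```

In this sketch the returned position is an offset inside the search window rather than a signed motion vector; a real encoder would subtract the position of the co-located block to obtain the MV.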
The proposed hardware works at 253 MHz on a Xilinx Virtex 7 FPGA, and it can process up to 30 4K (3840 × 2160) video frames per second (fps).

The rest of the paper is organized as follows. In Sect. 2, VVC ME is explained. Section 3 describes the proposed VVC ME hardware. Its implementation results and a comparison with HEVC ME hardware in the literature are given in Sect. 4. Finally, Sect. 5 concludes the paper.

2 VVC motion estimation

VVC uses block matching for translational ME. In block matching, the current video frame is divided into blocks. As shown in Fig. 1, for each block in the current frame, the best matching block in a search window (SW) in the reference frame and the corresponding motion vector (MV) are determined. The SAD metric is typically used to determine the best matching block. The SAD between blocks A and B is calculated as shown in Eq. (1), where W × H is the block size, and A(i, j) and B(i, j) are the pixels in the ith row and jth column of A and B, respectively.

$$\mathrm{SAD} = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| A(i, j) - B(i, j) \right| \tag{1}$$

Fig. 1 Block matching motion estimation (labels: Reference Frame with Search Window and Best Match; Current Frame with Current Block)

Video coding standards perform variable block size ME. Large block sizes achieve higher compres (...truncated)