EP-4740468-A1 - ENCODER FOR ENCODING A MEDIA SIGNAL


Abstract

An encoder and method for encoding a media signal is presented. The encoder is configured to perform a first-pass encoding of the media signal with restricting the first-pass encoding onto a set of frames out of a sequence of immediately consecutive frames of the media signal and using a probe quantization parameter, QP, for each of the set of frames of the media signal, so as to obtain a first-pass bitrate for each of the set of frames, determine, based on the first-pass bitrate for each of the set of frames, a start QP for each of the sequence of immediately consecutive frames; and perform a second-pass encoding of the media signal with using the start QP for each of the sequence of immediately consecutive frames, wherein the sequence of immediately consecutive frames is a group of pictures having a hierarchical referencing structure with temporal layers including a temporal base layer 0 up to a highest temporal layer N-1 and the encoder is configured to select the set of frames out of the sequence of immediately consecutive frames so that the set of frames includes all frames of temporal layers 0 to k-1, and, for each temporal layer k to N-1, only a proper subset of the frames of the sequence of immediately consecutive frames which belong to the respective temporal layer, wherein 1 < k < N, and/or the encoder is configured to perform scene detection to detect scene changes, and select the set of frames out of the sequence of immediately consecutive frames depending on whether any of the scene changes falls into the sequence of immediately consecutive frames so that the set of frames represents a sequentially sub-sampled subset of a sequence of immediately consecutive frames of the media signal in case of none of the scene changes falling into the sequence of immediately consecutive frames, and/or the encoder is configured to perform scene detection to detect scene changes separating scenes, and determine, based on the first-pass bitrate for each of the set of frames, the start QP for each of the sequence of immediately consecutive frames in a manner depending on the scene changes so that, for each of the sequence of immediately consecutive frames, the start QP is exclusively determined based on the first-pass bitrate of one or more frames within the set of frames, which fall into a scene into which the respective frame falls.

Inventors

HENKEL, ANASTASIA
Helmrich, Christian
HINZ, TOBIAS
Brandenburg, Jens
BARTNIK, CHRISTIAN
WIECKOWSKI, Adam
BROSS, BENJAMIN
MARPE, DETLEV
WIEGAND, THOMAS

Assignees

Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.

Dates

Publication Date: 20260513
Application Date: 20240702

Claims (20)

1. Encoder (14) for encoding a media signal (11), configured to perform a first-pass encoding (50) of the media signal (11) with restricting the first-pass encoding onto a set (52) of frames (10) out of a sequence (54) of immediately consecutive frames (10) of the media signal (11) and using a probe QP (56) for each of the set (52) of frames (10) of the media signal (11), so as to obtain a first-pass bitrate for each of the set (52) of frames (10), determine, based on the first-pass bitrate for each of the set (52) of frames (10), a start QP (58) for each of the sequence (54) of immediately consecutive frames (10); and perform a second-pass encoding (60) of the media signal (11) with using the start QP (58) for each of the sequence (54) of immediately consecutive frames (10), wherein the sequence (54) of immediately consecutive frames (10) is a group of pictures having a hierarchical referencing structure with temporal layers (62) including a temporal base layer 0 up to a highest temporal layer N-1 and the encoder (14) is configured to select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) so that the set (52) of frames (10) includes all frames (10) of temporal layers 0 to k-1, and, for each temporal layer k to N-1, only a proper subset of the frames (10) of the sequence (54) of immediately consecutive frames (10) which belong to the respective temporal layer (62), wherein 1 < k < N, and/or the encoder (14) is configured to perform scene detection to detect scene changes (64), and select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) depending on whether any of the scene changes (64) falls into the sequence (54) of immediately consecutive frames (10) so that the set (52) of frames (10) represents a sequentially sub-sampled subset of a sequence (54) of immediately consecutive frames (10) of the media signal (11) in case of none of the scene changes (64) falling into the sequence (54) of immediately consecutive frames (10), and/or the encoder (14) is configured to perform scene detection to detect scene changes (64) separating scenes (66a, 66b), and determine, based on the first-pass bitrate for each of the set (52) of frames (10), the start QP (58) for each of the sequence (54) of immediately consecutive frames (10) in a manner depending on the scene changes (64) so that, for each of the sequence (54) of immediately consecutive frames (10), the start QP (58) is exclusively determined based on the first-pass bitrate of one or more frames (10) within the set (52) of frames (10), which fall into a scene (66a, 66b) into which the respective frame (10) falls.
2. Encoder (14) of claim 1, wherein the first-pass encoding (50) involves an encoder search space which is reduced compared to the second-pass encoding (60).
3. Encoder (14) of claim 1 or 2, wherein the first-pass encoding (50) operates using rate-distortion optimization at variable rate and the second-pass encoding (60) operates using rate-distortion optimization in a rate-controlled manner.
4. Encoder (14) of any previous claim, wherein the encoder (14) is configured to perform the first-pass encoding (50) onto consecutive sequences of immediately consecutive frames (10) of the media signal (11) before performing the second-pass encoding (60) onto each of the consecutive sequences of immediately consecutive frames (10) or perform the first-pass encoding (50) and the second-pass encoding (60) onto consecutive sequences of immediately consecutive frames (10) of the media signal (11) in an interleaved manner.
5. Encoder (14) of any previous claim, configured to determine, based on the first-pass bitrate for each of the set (52) of frames (10), the start QP (58) for each of the sequence (54) of immediately consecutive frames (10) by determining, for each frame of the sequence (54) of immediately consecutive frames (10), which is not comprised by the set (52) of frames (10), a first-pass bitrate based on the first-pass bitrate for each of the set (52) of frames (10), and determining, for each frame of the sequence (54) of immediately consecutive frames (10), the start QP (58) based on the first-pass bitrate of the respective frame.
6. Encoder (14) of claim 5, configured to determine, for each frame of the sequence (54) of immediately consecutive frames (10), which is not comprised by the set (52) of frames (10), the first-pass bitrate based on the first-pass bitrate for each of the set (52) of frames (10) by selecting one or more frames (10) out of the set (52) of frames (10) having a temporal layer (62) associated therewith which equals the temporal layer (62) of the respective frame.
7. Encoder (14) of any previous claim, wherein N = 6 or 5 and k = 3.
8. Encoder (14) of any previous claim, wherein N = 6 and k = 3 and a size of the GOP is 32 or N = 5 and k = 3 and a size of the GOP is 16.
9. Encoder (14) of any previous claim, wherein the sequence (54) of immediately consecutive frames (10) is a group of pictures, GOP, having a hierarchical referencing structure with temporal layers (62) including a temporal base layer 0 up to a highest temporal layer N-1 and the encoder (14) is configured to select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) so that the set (52) of frames (10) includes all frames (10) of temporal layers 0 to k-1, and, for each temporal layer k to N-1, only a proper subset of the frames (10) of the sequence (54) of immediately consecutive frames (10) which belong to the respective temporal layer (62), and the encoder (14) is configured to perform scene detection to detect scene changes (64), and select k and/or select the proper subset for each of the temporal layers k to N-1 depending on whether any of the scene changes (64) falls into the GOP.
10. Encoder (14) of claim 9, wherein the encoder (14) is configured to select k = N-1 if any of the scene changes (64) falls into the GOP, and set k so that 1 < k < N if none of the scene changes (64) falls into the GOP.
11. Encoder (14) of any previous claim, wherein the sequence (54) of immediately consecutive frames (10) is a group of pictures (GOP) having a hierarchical referencing structure with temporal layers (62) including a temporal base layer 0 up to a highest temporal layer N-1 and the encoder (14) is configured to select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) so that the set (52) of frames (10) includes all frames (10) of temporal layers 0 to k-1, and, for each temporal layer k to N-1, only a proper subset of the frames (10) of the sequence (54) of immediately consecutive frames (10) which belong to the respective temporal layer (62), and the encoder (14) is configured to perform scene detection to detect scene changes (64), and select k and/or select the proper subset for each of the temporal layers k to N-1 depending on which frame within the GOP any of the scene changes (64) coincides with.
12. Encoder (14) of claim 11, wherein the encoder (14) is configured to select k and/or select the proper subset for each of the temporal layers k to N-1 so that the set (52) of frames (10) comprises, for each of the temporal layers (62), at least one frame of the respective temporal layer (62) which is within a scene preceding the scene change (64) in the GOP, and at least one frame of the respective temporal layer (62) which is within a scene (66a, 66b) extending from the scene change (64) onwards in the GOP.
13. Encoder (14) of claim 11 or 12, wherein the encoder (14) is configured to select k and/or select the proper subset for each of the temporal layers k to N-1 so that the set (52) of frames (10) comprises all of one or more frames (10) affected by the scene change (64) and all frames (10) referenced, by way of inter-frame prediction, by the one or more frames (10) affected by the scene change (64).
14. Encoder (14) of any of claims 11 to 13, wherein the encoder (14) is configured to determine one or more coding complexity measures for each of the frames (10) of the sequence (54) of immediately consecutive frames (10), determine the probe QP (56) for each of the set (52) of frames (10) based on the one or more coding complexity measures determined for the respective frame, and the encoder (14) is configured to select k and/or select the proper subset for each of the temporal layers k to N-1 so that the set (52) of frames (10) comprises all of one or more frames (10) whose one or more coding complexity measures fulfill a predetermined criterion with respect to the one or more coding complexity measures of a reference set (52) of frames (10) including one or more of, all of, or all of remaining frames (10) of the sequence (54) of immediately consecutive frames (10), or one or more, or all of frames (10) of one or more preceding sequences of immediately consecutive frames (10).
15. Encoder (14) of any previous claim, wherein the encoder (14) is configured so that the proper subset of the frames (10) of the sequence (54) of immediately consecutive frames (10) which belong to the respective temporal layer (62), exclusively, or at least, comprises the earliest frame among the frames (10) of the sequence (54) of immediately consecutive frames (10) which belong to the respective temporal layer (62), or exclusively, or at least, comprises the earliest frame and the latest frame among the frames (10) of the sequence (54) of immediately consecutive frames (10) which belong to the respective temporal layer (62).
16. Encoder (14) of any previous claim, wherein the encoder (14) is configured to perform scene detection to detect scene changes (64), and select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) depending on whether any of the scene changes (64) falls into the sequence (54) of immediately consecutive frames (10) so that the set (52) of frames (10) represents the sequentially sub-sampled subset of a sequence (54) of immediately consecutive frames (10) of the media signal (11) in case of none of the scene changes (64) falling into the sequence (54) of immediately consecutive frames (10), and so that the set utive frames (10) in case of any of the scene changes (64) falling into the sequence (54) of immediately consecutive frames (10).
17. Encoder (14) of any previous claim, wherein the encoder (14) is configured to perform scene detection to detect scene changes (64), and select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) depending on which frame within the sequence (54) of immediately consecutive frames (10) any of the scene changes (64) coincides with so that the set (52) of frames (10) comprises, for each of a set of different frame types [e.g., temporal layer ID], at least one frame of the respective frame type, and the set (52) of frames (10) represents a sequentially sub-sampled subset of a sequence (54) of immediately consecutive frames (10) of the media signal (11), which comprises, for each frame type of which at least one frame exists in sequence (54) of immediately consecutive frames (10), which temporally precedes the frame of the scene change, and at least one frame exists in sequence (54) of immediately consecutive frames (10), which temporally follows, or coincides with, the frame of the scene change, a subset of one or more frames (10) of the at least one frame in the sequence (54) of immediately consecutive frames (10), which temporally precedes the frame of the scene change, and a subset of one or more frames (10) of the at least one frame in the sequence (54) of immediately consecutive frames (10), which temporally follows, or coincides with, the frame of the scene change.
18. Encoder (14) of any previous claim, wherein the encoder (14) is configured to perform scene detection to detect scene changes (64), and select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) depending on which frame within the sequence (54) of immediately consecutive frames (10) any of the scene changes (64) coincides with so that the set (52) of frames (10) comprises all of one or more frames (10) affected by the scene change (64) and all frames (10) referenced, by way of inter-frame prediction, by the one or more frames (10) affected by the scene change (64).
19. Encoder (14) of any previous claim, wherein the encoder (14) is configured to determine one or more coding complexity measures for each of the frames (10) of the sequence (54) of immediately consecutive frames (10), determine the probe QP (56) for each of the set (52) of frames (10) based on the one or more coding complexity measures determined for the respective frame, and perform scene detection to detect scene changes (64), and select the set (52) of frames (10) out of the sequence (54) of immediately consecutive frames (10) depending on which frame within the GOP any of the scene changes (64) coincides with so that the set (52) of frames (10) comprises all of one or more frames (10) whose one or more coding complexity measures fulfill a predetermined criterion with respect to the one or more coding complexity measures of a reference set (52) of frames (10) including one or more of, all of, or all of remaining frames (10) of the sequence (54) of immediately consecutive frames (10), or one or more, or all of frames (10) of one or more preceding sequences of immediately consecutive frames (10).
20. Encoder (14) of any previous claim, wherein the media signal (11) is a video.
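
As an illustration of the frame selection recited in claims 1 and 15 and of the same-layer bitrate estimate recited in claims 5 and 6, the following is a minimal sketch, not the claimed encoder itself. The function names, the dictionary-based GOP description, the averaging of same-layer bitrates, and the bit counts in the example are all assumptions made for illustration; the claims only require that one or more probed frames of the same temporal layer be used.

```python
from collections import defaultdict
from statistics import mean


def select_probe_set(layers, k, keep_latest=False):
    """Pick the frames to encode in the first pass (hypothetical helper).

    layers maps a frame's position in the GOP to its temporal layer ID
    (IDs assumed contiguous, starting at 0). All frames of layers 0..k-1
    are kept; each layer k..N-1 contributes only its earliest frame and,
    optionally, its latest frame.
    """
    by_layer = defaultdict(list)
    for pos in sorted(layers):
        by_layer[layers[pos]].append(pos)
    num_layers = max(by_layer) + 1  # N
    if not 1 < k < num_layers:
        raise ValueError("k must satisfy 1 < k < N")
    probe = []
    for tl, positions in by_layer.items():
        if tl < k:
            probe.extend(positions)
            continue
        if len(positions) < 2:
            raise ValueError("a sub-sampled layer needs at least two frames")
        probe.append(positions[0])
        # Add the latest frame only if the result stays a proper subset.
        if keep_latest and len(positions) > 2:
            probe.append(positions[-1])
    return sorted(probe)


def fill_first_pass_bits(layers, probed_bits):
    """Estimate bits of unprobed frames from probed frames of the same layer.

    Averaging is an assumption of this sketch.
    """
    per_layer = defaultdict(list)
    for pos, bits in probed_bits.items():
        per_layer[layers[pos]].append(bits)
    filled = {}
    for pos, tl in layers.items():
        if pos in probed_bits:
            filled[pos] = probed_bits[pos]
        else:
            filled[pos] = mean(per_layer[tl])
    return filled


# Example: a GOP of 8 frames with N = 4 temporal layers and k = 2.
gop_layers = {1: 3, 2: 2, 3: 3, 4: 1, 5: 3, 6: 2, 7: 3, 8: 0}
probe = select_probe_set(gop_layers, k=2)
measured = {1: 100, 2: 300, 4: 900, 8: 2000}  # made-up first-pass bit counts
all_bits = fill_first_pass_bits(gop_layers, measured)
```

In this example only four of the eight frames are encoded in the first pass; the remaining frames inherit a bitrate estimate from the probed frame of their own temporal layer, from which a start QP could then be derived per frame.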

Description

Encoder for encoding a media signal

Embodiments according to the invention relate to an encoder for encoding a media signal, e.g., by performing a first pass restricted to a set of frames and/or a spatially sub-sampled version of the media signal. Embodiments according to the invention relate to a two-pass video encoding concept, e.g., realizing a fast first pass in two-pass video encoding, e.g., using sub-sampling.

Introduction and problem statement

Rate control (RC) methods are mandatory in real-world encoding applications. Instead of encoding with a fixed quantization parameter (QP), where the final bitrate cannot be predicted, RC enables targeting a specific rate during encoding. The VVC software encoder VVenC, for example, supports so-called “one-pass” and “two-pass” rate control modes. In the following, the rate control of VVenC will be used as an exemplary embodiment of the present invention.

The RC solutions in VVenC consist of two stages. The first stage is an analysis stage in which coding statistics, specifically the rate of encoding the frame with a fixed QP, are collected for each frame, over a sliding window for one-pass rate control, or over the entire video input for two-pass rate control. This first analysis stage, in the remainder also designated as the first pass, is followed by the encoding stage, also called the second pass.

The first pass is faster than the second pass because of modifications in the encoder configuration, which result in a reduced encoder search space for the first pass. For instance, the interval of block sizes might be restricted in the first pass compared to the second pass. Additionally or alternatively, a reduced set of coding tools might be tested in terms of rate-distortion optimization in the first pass compared to the second pass. For instance, certain tools, such as dependent quantization, might be excluded from being used in the first pass.
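
The two analysis scopes described above (the entire input versus a window of GOP length) differ mainly in the order in which the passes visit the input. The following minimal sketch lists that order for a sequence of GOPs. The function names and the "pass1"/"pass2" labels are hypothetical and do not correspond to any VVenC interface; the strictly alternating order in the second function is only one possible interleaving.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # (pass label, GOP index)


def two_pass_schedule(num_gops: int) -> List[Step]:
    """Analyse the whole input first, then encode it (two-pass RC)."""
    first = [("pass1", g) for g in range(num_gops)]
    second = [("pass2", g) for g in range(num_gops)]
    return first + second


def lookahead_schedule(num_gops: int) -> List[Step]:
    """Analyse and encode GOP by GOP (look-ahead or GOP-wise RC)."""
    steps: List[Step] = []
    for g in range(num_gops):
        steps.append(("pass1", g))  # analysis of the current GOP
        steps.append(("pass2", g))  # final encoding of the same GOP
    return steps
```

With the GOP-wise order, the final encoding of a GOP can start as soon as that GOP has been analysed, which is what makes the scheme usable with lower latency.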
In the second pass, the video is encoded again with the unmodified encoder configuration using the coding statistics of the first pass. The second pass might be rate controlled while pass 1 is not. Both passes may inherit a block-wise modification of the underlying frame QP, i.e., the probe QP in case of pass 1 and the start QP in case of pass 2, such as depending on certain coding complexity measures such as an activity measure or the like. However, the modification operates relative to the underlying frame QP. Optionally, both pass 1 and pass 2 may allow for a micro modification of the block-wise modified frame QP in a rate-distortion sense, but the range of modification might, for instance, be lower than the range of modification realized depending on the coding complexity measure. In addition, the encoder may perform an input picture pre-filter stage, e.g., motion compensated temporal filtering (MCTF), before the first pass, to improve coding efficiency.

The VVenC RC method uses the video dimensions (width and height) and the target bitrate to determine the overall QP for the first pass. When re-encoding the input in the second pass, the target bitrate is approximated by a frame-wise adjustment of the QP, based on the statistics from the first pass and the actual bits used to encode the previous frames in the current pass.

Although it uses a fast configuration, the additional encoding in the first analysis pass requires additional time, which adds to the overall processing time of the video encoding. To enable the RC scheme to operate in on-the-fly applications with lower latency, the first pass can be applied not to the entire input, but to a group of pictures (GOP) [1]. Here, a GOP is defined as a group of consecutive pictures with a fixed picture referencing structure and, if hierarchical referencing is used, various temporal layers (TLs) (as exemplarily illustrated in Fig. 4).
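
To make the notion of temporal layers in a hierarchical GOP concrete, the following minimal sketch assigns a temporal layer ID to each picture position. It assumes a dyadic hierarchy and a power-of-two GOP size, which the text above does not state explicitly; the function name is hypothetical and this is not VVenC code. Under these assumptions a GOP of 32 pictures yields 6 layers and a GOP of 16 pictures yields 5 layers, matching the N and GOP-size pairs named in claim 8.

```python
from collections import Counter


def temporal_layer(poc: int, gop_size: int) -> int:
    """Temporal layer of picture `poc` in an assumed dyadic GOP.

    Pictures at multiples of the GOP size form base layer 0; every halving
    of the picture distance adds one layer.
    """
    if gop_size <= 0 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    if poc < 0:
        raise ValueError("poc must be non-negative")
    if poc % gop_size == 0:
        return 0
    trailing_zeros = (poc & -poc).bit_length() - 1
    log2_gop = gop_size.bit_length() - 1
    return log2_gop - trailing_zeros


def layer_histogram(gop_size: int) -> Counter:
    """Number of pictures per temporal layer within one GOP."""
    return Counter(temporal_layer(p, gop_size) for p in range(1, gop_size + 1))
```

In such a structure the highest layers hold most of the pictures (half of the GOP sits in the top layer), so skipping pictures there in an analysis pass removes the largest share of the work.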
The default in VVenC is a GOP with 32 pictures and a hierarchical referencing structure with 6 TLs. Such an approach is called “one-pass” encoding in VVenC because the first, GOP-based look-ahead pass, which executes the analysis stage, is never exposed to the user, even though two full passes are still executed, in an interleaved manner. In the one-pass RC application, the coding statistics can be collected using a short look-ahead window, whose length is usually set equal to the GOP size, with the results being directly applied to the final encoding. Hence, in the following, this type of RC will be called "look-ahead based RC" or "GOP-wise RC". In other words, according to the look-ahead based RC, the first-pass and second-pass encodings are performed in an interleaved manner, such as in a manner so that the application of the second-pass encoding onto a current sequence of immediately consecutive frames of the media signal has begun before the application of the first-pass encoding onto a next one of the consecutive sequences of immediately consecutive frames. Nevertheless, two-pass or look-ahead based RC, both applications occur with encoding time incr