US-12621492-B2 - Method and apparatuses for using face video generative compression SEI message

Abstract

A method of decoding a bitstream to output one or more pictures for a video stream, includes: receiving a bitstream; and decoding, using coded information of the bitstream, one or more pictures. The decoding includes: determining, based on an identifying number, whether a face video generative compression scheme is used; in response to a determination that the face video generative compression scheme is used, decoding a supplemental enhancement information (SEI) message, the SEI message comprising facial information; and reconstructing a face picture based on the facial information and a base picture associated with the SEI message.
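The decoding flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the constant `FVGC_ID`, the function `decode_face_picture`, and the `parse_sei` and `generate` callables are hypothetical names introduced here, and the source does not specify the value of the identifying number or the form of the facial information.

```python
from typing import Callable, List, Optional

# Hypothetical value: the source only says an "identifying number" signals
# whether the face video generative compression scheme is used.
FVGC_ID = 1


def decode_face_picture(
    identifying_number: int,
    sei_payload: bytes,
    base_picture: List[float],
    parse_sei: Callable[[bytes], List[float]],
    generate: Callable[[List[float], List[float]], List[float]],
) -> Optional[List[float]]:
    """Reconstruct a face picture, or return None when the scheme is not used."""
    # Step 1: decide from the identifying number whether the scheme applies.
    if identifying_number != FVGC_ID:
        return None
    # Step 2: decode the SEI message to obtain the facial information.
    facial_information = parse_sei(sei_payload)
    # Step 3: reconstruct the face picture from the facial information and
    # the base picture associated with the SEI message.
    return generate(base_picture, facial_information)
```

Passing the parser and the generator in as callables keeps the sketch independent of any particular SEI syntax or generative model, neither of which is fixed by the abstract.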

Inventors

Bolin Chen
Jie Chen
Shurun WANG
Yan Ye
Shiqi Wang

Assignees

ALIBABA (CHINA) CO., LTD.

Dates

Publication Date: 20260505
Application Date: 20231221

Claims (17)

1. A method of decoding a bitstream to output one or more pictures for a video stream, the method comprising: receiving a bitstream; and decoding, using coded information of the bitstream, one or more pictures; wherein the decoding comprises: determining, based on an identifying number, whether a face video generative compression scheme is used; in response to a determination that the face video generative compression scheme is used, decoding a supplemental enhancement information (SEI) message, the SEI message comprising facial information; and reconstructing a face picture based on the facial information and a base picture associated with the SEI message.
2. The method according to claim 1, wherein the SEI message further comprises a flag indicating whether a syntax element associated with facial information is present in the SEI message, and the method further comprises: in response to the syntax element associated with facial information is present, decoding the facial information based on the syntax element associated with facial information.
3. The method according to claim 2, wherein the SEI message further comprises a syntax element indicating a number of face frames using the face video generative compression scheme, and the method further comprises: decoding the flag indicating whether a syntax element associated with facial information is present in the SEI message for each face frame respectively.
4. The method according to claim 2, wherein the SEI message further comprises a syntax element indicating a number of face frames using the face video generative compression scheme, and the method further comprises: decoding the facial information based on the syntax element associated with facial information for each face frame, respectively.
5. The method according to claim 4, further comprising: in response to a flag indicating that a syntax element associated with facial information is not present for a current face frame, copying corresponding facial information from a previous face frame.
6. The method according to claim 5, wherein the SEI message further comprises a base picture as a reference, and the method further comprises: in response to a flag indicating that a syntax element associated with facial information is not present for a first face frame, copying corresponding facial information from the base picture.
7. The method according to claim 4, wherein the SEI message further comprises a base picture as a reference, and the method further comprises: in response to a flag indicating that a syntax element associated with facial information is not present for a current face frame, copying corresponding facial information from the base picture.
8. The method according to claim 1, wherein the SEI message further comprises a factor indicating a quantization factor to process the facial information, and the method further comprises: decoding the factor; and processing the facial information based on the factor.
9. The method according to claim 8, wherein the SEI message further comprises a syntax element indicating a number of face frames using the face video generative compression scheme, and the method further comprises: decoding the factor for each face frame respectively; and processing the facial information based on the factor for each face frame.
10. A method of encoding a video sequence into a bitstream, the method comprising: receiving a video sequence; encoding one or more pictures of the video sequence; and generating a bitstream; wherein the encoding comprises: encoding an identifying number indicating whether a face video generative compression scheme is used.
11. The method according to claim 10, wherein when the face video generative compression scheme is used, the method further comprises: encoding a flag indicating whether a syntax element associated with facial information is present; and when the flag indicating the syntax element associated with facial information is present, encoding corresponding syntax element associated with facial information.
12. The method according to claim 11, further comprising: encoding a syntax element indicating a number of face frames using the face video generative compression scheme; and encoding the flag indicating whether a syntax element associated with facial information is present for each face frame respectively.
13. The method according to claim 11, further comprising: encoding a syntax element indicating a number of face frames using the face video generative compression scheme; and encoding a factor indicating a quantization factor to process the facial information for each face frame respectively.
14. A method for signaling a bitstream, the method comprising: receiving a video sequence; encoding the video sequence by: encoding an identifying number indicating whether a face video generative compression scheme is used; and signaling a bitstream that is generated based on the encoding.
15. The method according to claim 14, wherein when the face video generative compression scheme is used, the encoding further comprises: encoding a flag indicating whether a syntax element associated with facial information is present; and when the flag indicating the syntax element associated with facial information is present, encoding corresponding syntax element associated with facial information.
16. The method according to claim 15, wherein the encoding further comprises: encoding a syntax element indicating a number of face frames using the face video generative compression scheme; and encoding the flag indicating whether a syntax element associated with facial information is present for each face frame respectively.
17. The method according to claim 15, wherein the encoding further comprises: encoding a syntax element indicating a number of face frames using the face video generative compression scheme; and encoding a factor indicating a quantization factor to process the facial information for each face frame respectively.
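
The per-frame handling recited in the decoding claims (a presence flag per face frame, copying when the facial information is absent, and a per-frame quantization factor) can be sketched as follows. This is an illustrative reading only: `FaceFrame`, `resolve_facial_information`, and all field names are hypothetical, the copy-from-previous and copy-from-base fallbacks are combined into one variant, and dividing by the factor is an assumed form of "processing the facial information based on the factor" that the claims do not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaceFrame:
    info_present: bool                     # hypothetical presence flag
    quantized: Optional[List[int]] = None  # coded facial information, if present
    factor: int = 1                        # hypothetical quantization factor


def resolve_facial_information(
    frames: List[FaceFrame], base_info: List[float]
) -> List[List[float]]:
    """Return facial information for every face frame, filling absent frames."""
    resolved: List[List[float]] = []
    # Before the first face frame, the only available reference is the
    # facial information taken from the base picture.
    previous = list(base_info)
    for frame in frames:
        if frame.info_present:
            if frame.quantized is None:
                raise ValueError("presence flag set but no facial information")
            # Assumed dequantization: scale coded integers by the factor.
            current = [q / frame.factor for q in frame.quantized]
        else:
            # Flag not set: copy from the previous face frame (or from the
            # base picture when this is the first face frame).
            current = list(previous)
        resolved.append(current)
        previous = current
    return resolved
```

In this sketch the number of face frames is simply `len(frames)`; in a bitstream it would be carried by the syntax element that the claims describe, and an encoder would write the same flag, factor, and facial information in the corresponding order.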

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The disclosure claims the benefits of priority to U.S. Provisional Application No. 63/436,626, filed Jan. 1, 2023, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure generally relates to video processing, and more particularly, to methods and apparatuses for using face video generative compression supplemental enhancement information (SEI) messages.

BACKGROUND

A video is a set of static pictures (or “frames”) capturing the visual information. To reduce the storage memory and the transmission bandwidth, a video can be compressed before storage or transmission and decompressed before display. The compression process is usually referred to as encoding and the decompression process is usually referred to as decoding. There are various video coding formats which use standardized video coding technologies, most commonly based on prediction, transform, quantization, entropy coding and in-loop filtering. The video coding standards, such as the High Efficiency Video Coding (HEVC/H.265) standard, the Versatile Video Coding (VVC/H.266) standard, and AVS standards, specifying the specific video coding formats, are developed by standardization organizations. With more and more advanced video coding technologies being adopted in the video standards, the coding efficiency of the new video coding standards gets higher and higher.

SUMMARY OF THE DISCLOSURE

Embodiments of the present disclosure provide a method of decoding a bitstream to output one or more pictures for a video stream. The method includes: receiving a bitstream; and decoding, using coded information of the bitstream, one or more pictures. The decoding includes determining, based on an identifying number, whether a face video generative compression scheme is used; in response to a determination that the face video generative compression scheme is used, decoding a supplemental enhancement information (SEI) message, the SEI message comprising facial information; and reconstructing a face picture based on the facial information and a base picture associated with the SEI message.

Embodiments of the present disclosure provide a method of encoding a video sequence into a bitstream. The method includes receiving a video sequence; encoding one or more pictures of the video sequence; and generating a bitstream. The encoding includes: signaling an identifying number indicating whether a face video generative compression scheme is used.

Embodiments of the present disclosure provide a non-transitory computer readable storage medium storing a bitstream of a video. The bitstream includes: a supplemental enhancement information (SEI) message, the SEI message comprising facial information, wherein the facial information is used for reconstructing a face picture based on a base picture associated with the face picture.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments and various aspects of the present disclosure are illustrated in the following detailed description and the accompanying figures. Various features shown in the figures are not drawn to scale.

FIG. 1 is a schematic diagram illustrating an exemplary system for preprocessing and coding image data, according to some embodiments of the present disclosure.
FIG. 2A is a schematic diagram illustrating an exemplary encoding process of a hybrid video coding system, consistent with embodiments of the disclosure.
FIG. 2B is a schematic diagram illustrating another exemplary encoding process of a hybrid video coding system, consistent with embodiments of the disclosure.
FIG. 3A is a schematic diagram illustrating an exemplary decoding process of a hybrid video coding system, consistent with embodiments of the disclosure.
FIG. 3B is a schematic diagram illustrating another exemplary decoding process of a hybrid video coding system, consistent with embodiments of the disclosure.
FIG. 4 is a block diagram of an exemplary apparatus for preprocessing or coding image data, according to some embodiments of the present disclosure.
FIG. 5 is a schematic diagram illustrating an exemplary deep learning based video generative compression framework, according to some embodiments of the present disclosure.
FIG. 6 is a schematic diagram illustrating an exemplary encoder-decoder coding framework with the 1×4×4 compact feature size for a talking face video, according to some embodiments of the present disclosure.
FIG. 7 is a schematic diagram illustrating a general encoder-decoder generative compression framework of 3DMM-assisted talking face video, according to some embodiments of the present disclosure.
FIG. 8 is a flowchart of an exemplary method for processing video based on face video generative compression supplemental enhancement information (SEI) messages, according to some embodiments of the present disclosure.
FIG. 9 is a flowchart of an exemplary method for processing video based on face video generative compression supplemental enhancement informatio