EP-4740180-A1 - SEMANTIC FACE PARAMETER ENCODING
Abstract
Systems, methods, and instrumentalities are disclosed for encoding semantic face parameters. A device may receive video data from an encoding device. The video data may include an indication of a value of a rig parameter associated with a video frame. The device may determine a three-dimensional mesh associated with a human face in the video frame based on the value of the rig parameter and a local three-dimensional rig. The value of the rig parameter may indicate a semantic facial feature of the human face in the video frame. The device may generate a two-dimensional image of the human face in the video frame based on the three-dimensional mesh of the human face.
Inventors
- COVA REGATEIRO, João Pedro
- GOSSELIN, Philippe Henri
- LE CLERC, François
- GALPIN, Franck
Assignees
- InterDigital CE Patent Holdings, SAS
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2024-07-19
Claims (20)
- 1. A device for video decoding, the device comprising: a processor configured to: receive video data from an encoding device, wherein the video data comprises an indication of a value of a rig parameter associated with a video frame; determine a three-dimensional mesh associated with a human face in the video frame based on the value of the rig parameter and a local three-dimensional rig, wherein the value of the rig parameter indicates a semantic facial feature of the human face in the video frame; and generate a two-dimensional image of the human face in the video frame based on the three-dimensional mesh of the human face.
- 2. The device of claim 1, wherein the processor being configured to generate the two-dimensional image of the human face based on the three-dimensional mesh of the human face comprises the processor being configured to: use the three-dimensional mesh as an input to a neural network; and receive the two-dimensional image of the human face as an output of the neural network.
- 3. The device of claim 1, wherein the processor being configured to generate the two-dimensional image of the human face based on the three-dimensional mesh of the human face comprises the processor being configured to: use the three-dimensional mesh and at least one of: the value of the rig parameter or a driving picture as inputs to a neural network; and receive the two-dimensional image of the human face as an output of the neural network.
- 4. The device of claim 1, wherein the video data is first video data, the video frame is a first video frame, the value of the rig parameter is a first value of the rig parameter, the three-dimensional mesh is a first three-dimensional mesh, the two-dimensional image of the human face is a first two-dimensional image of the human face, and the processor is further configured to: receive second video data from the encoding device, wherein the second video data comprises an indication of a second value of the rig parameter associated with a second video frame; determine a second three-dimensional mesh associated with a human face in the second video frame based on the second value of the rig parameter and the local three-dimensional rig, wherein the second value of the rig parameter indicates a semantic facial feature of the human face in the second video frame; and generate a second two-dimensional image of the human face in the second video frame based on the second three-dimensional mesh of the human face.
- 5. The device of claim 1, wherein the value of the rig parameter indicates a blendshape weight, and the video data further comprises an indication of whether the blendshape weight is an identity blendshape weight or an expression blendshape weight.
- 6. The device of claim 1, wherein the value of the rig parameter indicates a vertex color, texture size, parametric vertex color weight, or parametric texture weight, and the video data further comprises an indication of whether the vertex color, texture size, parametric vertex color weight, or parametric texture weight is associated with reflectance, specularity, roughness, or glossiness.
- 7. The device of claim 1, wherein the value of the rig parameter indicates a degree of a spherical harmonics base, and the processor is further configured to determine a number of weights based on the degree of the spherical harmonics base.
- 8. The device of claim 1, wherein the value of the rig parameter indicates an eye gaze dimensionality associated with the three-dimensional mesh, wherein the eye gaze dimensionality indicates whether an eye gaze direction is determined based on two coordinates or three coordinates.
- 9. The device of claim 1, wherein the video data further comprises an indication of focal parameters, and the value of the rig parameter indicates a camera focal type, wherein the camera focal type indicates whether the focal parameters comprise: a field of view in degrees along a vertical axis, and an aspect ratio; or a horizontal focal length in pixels, a vertical focal length in pixels, and coordinates of a principal point.
- 10. A device for video encoding, the device comprising: a processor configured to: receive a two-dimensional video frame that depicts a human face; determine a value of a rig parameter based on the two-dimensional video frame and a local three-dimensional rig, wherein the value of the rig parameter indicates a semantic facial feature of the human face in the two-dimensional video frame; include, in video data, an indication of the value of the rig parameter; and send the video data to a video decoding device.
- 11. The device of claim 10, wherein the processor being configured to determine the value of the rig parameter based on the two-dimensional video frame and the local three-dimensional rig comprises the processor being configured to: use the two-dimensional video frame and the local three-dimensional rig as an input to a neural network; and receive the value of the rig parameter as an output of the neural network.
- 12. The device of claim 10, wherein the two-dimensional video frame is a first two-dimensional video frame, the value of the rig parameter is a first value of the rig parameter, the video data is first video data, and the processor is further configured to: receive a second two-dimensional video frame that depicts the human face; determine a second value of the rig parameter based on the second two-dimensional video frame and the local three-dimensional rig, wherein the second value of the rig parameter indicates a semantic facial feature of the human face in the second two-dimensional video frame; include, in second video data, an indication of the second value of the rig parameter; and send the second video data to the video decoding device.
- 13. The device of claim 10, wherein the value of the rig parameter indicates a blendshape weight, and the video data further comprises an indication of whether the blendshape weight is an identity blendshape weight or an expression blendshape weight.
- 14. The device of claim 10, wherein the value of the rig parameter indicates a vertex color, texture size, parametric vertex color weight, or parametric texture weight, and the video data further comprises an indication of whether the vertex color, texture size, parametric vertex color weight, or parametric texture weight is associated with reflectance, specularity, roughness, or glossiness.
- 15. The device of claim 10, wherein the value of the rig parameter indicates a degree of a spherical harmonics base, and the processor is further configured to determine a number of weights based on the degree of the spherical harmonics base.
- 16. The device of claim 10, wherein the value of the rig parameter indicates an eye gaze dimensionality, wherein the eye gaze dimensionality indicates whether an eye gaze direction is determined based on two coordinates or three coordinates.
- 17. The device of claim 10, wherein the video data further comprises an indication of focal parameters, and the value of the rig parameter indicates a camera focal type, wherein the camera focal type indicates whether the focal parameters comprise: a field of view in degrees along a vertical axis, and an aspect ratio; or a horizontal focal length in pixels, a vertical focal length in pixels, and coordinates of a principal point.
- 18. The device of claim 10, wherein the processor is further configured to include a driving picture in the video data.
- 19. A method for video decoding, the method comprising: receiving video data from an encoding device, wherein the video data comprises an indication of a value of a rig parameter associated with a video frame; determining a three-dimensional mesh associated with a human face in the video frame based on the value of the rig parameter and a local three-dimensional rig, wherein the value of the rig parameter indicates a semantic facial feature of the human face in the video frame; and generating a two-dimensional image of the human face in the video frame based on the three-dimensional mesh of the human face.
- 20. The method of claim 19, wherein generating the two-dimensional image of the human face based on the three-dimensional mesh of the human face comprises: using the three-dimensional mesh as an input to a neural network; and receiving the two-dimensional image of the human face as an output of the neural network.
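Claims 7 and 15 have the processor determine "a number of weights" from the signalled degree of a spherical harmonics base. The application does not spell out the mapping, but for a complete real spherical-harmonics basis up to degree d the standard count is (d + 1)², since degree l contributes 2l + 1 basis functions. A minimal sketch of that standard relationship (the function name and `channels` parameter are illustrative, not from the claims):

```python
def sh_weight_count(degree: int, channels: int = 1) -> int:
    """Number of weights for a complete real spherical-harmonics basis up to
    and including `degree`: sum over l of (2l + 1) = (degree + 1)**2 basis
    functions, multiplied by the number of colour channels."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    return (degree + 1) ** 2 * channels

# Degree-2 lighting, a common choice for diffuse shading, needs 9 weights
# per channel, i.e. 27 weights for RGB.
print(sh_weight_count(2))     # → 9
print(sh_weight_count(2, 3))  # → 27
```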
Description
SEMANTIC FACE PARAMETER ENCODING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of European Provisional Patent Application No. EP23306267.8, filed July 21, 2023, the contents of which are hereby incorporated by reference herein.

BACKGROUND

[0002] Video coding systems may be used to compress digital video signals, e.g., to reduce the storage and/or transmission bandwidth needed for such signals. Video coding systems may include, for example, block-based, wavelet-based, and/or object-based systems.

SUMMARY

[0003] Systems, methods, and instrumentalities are disclosed for encoding semantic face parameters (e.g., on a generative face video supplemental enhancement information (SEI) message).

[0004] An example device for video decoding may receive video data from an encoding device. The video data may include an indication of a value of a rig parameter associated with a video frame. The device may determine a three-dimensional mesh associated with a human face in the video frame based on the value of the rig parameter and a local three-dimensional rig. The value of the rig parameter may indicate a semantic facial feature of the human face in the video frame. The device may generate a two-dimensional image of the human face in the video frame based on the three-dimensional mesh of the human face.

[0005] The device may generate the two-dimensional image of the human face based on the three-dimensional mesh of the human face by using the three-dimensional mesh as an input to a neural network; and receiving the two-dimensional image of the human face as an output of the neural network.
[0006] The device may generate the two-dimensional image of the human face based on the three-dimensional mesh of the human face by using the three-dimensional mesh and at least one of: the value of the rig parameter or a driving picture as inputs to a neural network; and receiving the two-dimensional image of the human face as an output of the neural network.

[0007] The video data may be first video data. The video frame may be a first video frame. The value of the rig parameter may be a first value of the rig parameter. The three-dimensional mesh may be a first three-dimensional mesh. The two-dimensional image of the human face may be a first two-dimensional image of the human face.

[0008] The device may receive second video data from the encoding device. The second video data may include an indication of a second value of the rig parameter associated with a second video frame. The device may determine a second three-dimensional mesh associated with a human face in the second video frame based on the second value of the rig parameter and the local three-dimensional rig. The second value of the rig parameter may indicate a semantic facial feature of the human face in the second video frame. The device may generate a second two-dimensional image of the human face in the second video frame based on the second three-dimensional mesh of the human face.

[0009] The value of the rig parameter may indicate a blendshape weight. The video data may further include an indication of whether the blendshape weight is an identity blendshape weight or an expression blendshape weight.

[0010] The value of the rig parameter may indicate a vertex color, texture size, parametric vertex color weight, or parametric texture weight. The video data may further include an indication of whether the vertex color, texture size, parametric vertex color weight, or parametric texture weight is associated with reflectance, specularity, roughness, or glossiness.
[0011] The value of the rig parameter may indicate a degree of a spherical harmonics base. The device may determine a number of weights based on the degree of the spherical harmonics base.

[0012] The value of the rig parameter may indicate an eye gaze dimensionality associated with the three-dimensional mesh. The eye gaze dimensionality may indicate whether an eye gaze direction is determined based on two coordinates or three coordinates.

[0013] The video data may further include an indication of focal parameters. The value of the rig parameter may indicate a camera focal type. The camera focal type may indicate whether the focal parameters comprise: a field of view in degrees along a vertical axis, and an aspect ratio; or a horizontal focal length in pixels, a vertical focal length in pixels, and coordinates of a principal point.

[0014] A device for video encoding may receive a two-dimensional video frame that depicts a human face. The device may determine a value of a rig parameter based on the two-dimensional video frame and a local three-dimensional rig. The value of the rig parameter may indicate a semantic facial feature of the human face in the two-dimensional video frame. The device may include, in video data, an indication of the value of the rig parameter. The device may send the video data to a video decoding device.

[0015] The device may determine the value of