
EP-4742184-A1 - IMAGE PROCESSING METHOD AND APPARATUS, METHOD AND APPARATUS FOR TRAINING BODY PART IMAGE PREDICTION MODEL, AND COMPUTER DEVICE, COMPUTER-READABLE STORAGE MEDIUM AND COMPUTER PROGRAM PRODUCT

EP 4742184 A1

Abstract

The present application belongs to the technical field of artificial intelligence. Disclosed are an image processing method and apparatus, a method and apparatus for training a body part image prediction model, and a computer device, a computer-readable storage medium and a computer program product. The method comprises: acquiring a first body image, which is an image in a first angle of view; calling a feature encoding network to perform image encoding on the first body image, so as to obtain fused feature representations of at least two query points on a body object, wherein an nth fused feature representation indicates a feature representation of an image region which is symmetric in terms of a physiological structure; and calling a decoding network to perform feature decoding on the fused feature representations of the at least two query points, so as to obtain decoded features, and performing rendering by means of the decoded features, so as to obtain a second body image, which is an image in a second angle of view.
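The abstract describes an encode-fuse-decode-render pipeline: per-query-point features are fused with features from the physiologically symmetric region, decoded, and rendered into the second view. The sketch below is an illustrative stand-in only, not the patented network: the random "features", the scalar "color" decoder, the grid size, and the function names (`feature_encoding_network`, `decoding_network`, `render`) are all assumptions made for demonstration.

```python
import math
import random

random.seed(0)
FEAT = 16  # width of one region feature (illustrative choice)

def region_feature():
    """Placeholder for a learned region feature; random values here."""
    return [random.gauss(0.0, 1.0) for _ in range(FEAT)]

def feature_encoding_network(query_points):
    """Stand-in encoder: for each query point on the first limb, pair the
    feature of its peripheral region with the feature of the physiologically
    symmetric region on the second limb, and concatenate them into the
    'fused feature representation'."""
    return [region_feature() + region_feature() for _ in query_points]

def decoding_network(fused):
    """Stand-in decoder: collapse each fused representation to one scalar
    'color' (a real decoder would emit texture/radiance values)."""
    return [math.tanh(sum(f) / len(f)) for f in fused]

def render(decoded, query_points, height=8, width=8):
    """Stand-in renderer: scatter decoded values into the second-view image."""
    out = [[0.0] * width for _ in range(height)]
    for (y, x), value in zip(query_points, decoded):
        out[y][x] = value
    return out

query_points = [(1, 2), (3, 4), (5, 6)]   # query points on the first limb
fused = feature_encoding_network(query_points)
second_image = render(decoding_network(fused), query_points)
```

Each fused representation is twice the width of a single region feature, which is the point of the symmetry-based supplementation: occluded regions in the first view still receive feature content from the mirrored limb.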

Inventors

  • HUANG, Xuan
  • LI, Hanhui
  • YANG, Zejun
  • WANG, Zhisheng
  • LIANG, Xiaodan

Assignees

  • Tencent Technology (Shenzhen) Company Limited

Dates

Publication Date
2026-05-13
Application Date
2024-08-23

Claims (20)

  1. A method for image processing, performed by a computer device, the method comprising: obtaining a first limb image, wherein the first limb image is an image of a limb object captured under a first perspective, the limb object comprising a first limb and a second limb having symmetry in a physiological structure; invoking a feature encoding network of a limb part image prediction model to perform image encoding on the first limb image, to obtain a fused feature representation of each of at least two query points on the limb object, wherein an nth query point in the at least two query points is a query point on the first limb, and an nth fused feature representation of the nth query point indicates a feature representation of an image region with symmetry in the physiological structure and is used for supplementing a feature representation of the nth query point based on the symmetry of the limb object using a feature representation on the second limb; invoking a decoding network of the limb part image prediction model to perform feature decoding on fused feature representations of the at least two query points, to obtain decoded features; and performing rendering based on the decoded features to obtain a second limb image, wherein the second limb image is an image of the limb object obtained under a second perspective different from the first perspective, and a generation region corresponding to the nth query point in the second limb image is obtained through decoding after feature supplementation is performed based on the feature representation on the second limb.
  2. The method according to claim 1, wherein the nth fused feature representation of the nth query point comprises a first region feature and a second region feature; and wherein invoking the feature encoding network of the limb part image prediction model to perform image encoding on the first limb image, to obtain the fused feature representation of each of at least two query points on the limb object comprises: invoking the feature encoding network to extract, from the first limb image, the first region feature of a peripheral region of the nth query point on the first limb and the second region feature of a symmetrical region on the second limb, wherein the symmetrical region and the peripheral region of the nth query point have symmetry in the physiological structure; concatenating the first region feature and the second region feature to obtain the nth fused feature representation of the nth query point; and summarizing fused feature representations of all query points to obtain at least two fused feature representations.
  3. The method according to claim 2, wherein concatenating the first region feature and the second region feature to obtain the nth fused feature representation of the nth query point comprises: performing prediction on the first region feature and the second region feature to obtain a first weight corresponding to the first region feature and a second weight corresponding to the second region feature; and concatenating a product of the first weight and the first region feature and a product of the second weight and the second region feature to obtain the nth fused feature representation corresponding to the nth query point.
  4. The method according to claim 2 or 3, wherein the feature encoding network comprises an encoding subnetwork and a fusion subnetwork; and wherein invoking the feature encoding network to extract, from the first limb image, the first region feature of the peripheral region of the nth query point on the first limb and the second region feature of the symmetrical region on the second limb comprises: invoking the encoding subnetwork to perform image encoding on the first limb image, to obtain an image feature representation of the first limb image in latent space; and invoking the fusion subnetwork to extract, from the image feature representation, the first region feature of the peripheral region of the nth query point on the first limb and the second region feature of the symmetrical region on the second limb.
  5. The method according to any one of claims 1 to 4, wherein the limb part image prediction model further comprises a structure reconstruction network, and the method further comprises: invoking the structure reconstruction network to predict a three-dimensional structural grid of the limb object; determining a first grid point, wherein the first grid point is adjacent to the nth query point in the three-dimensional structural grid; determining a second grid point corresponding to the first grid point, wherein a relative location of the first grid point on the first limb is the same as a relative location of the second grid point on the second limb; and determining the peripheral region of the nth query point using the first grid point, and determining the symmetrical region using the second grid point.
  6. The method according to claim 5, wherein determining the peripheral region of the nth query point using the first grid point, and determining the symmetrical region using the second grid point comprises: determining first location information of the first grid point mapped onto the first limb image; determining the peripheral region of the nth query point using the first location information as a center; determining second location information of the second grid point mapped onto the first limb image; and determining the symmetrical region using the second location information as a center.
  7. The method according to any one of claims 1 to 6, wherein the nth fused feature representation comprises a spatial feature, the spatial feature indicating a spatial depth of the nth query point relative to a point of interest of the limb object, and the point of interest being a point, on the limb object, having visual saliency or associated with an activity of the limb object; and wherein the method further comprises: obtaining a location of interest of the point of interest on a three-dimensional structural grid and a query location of the nth query point on the three-dimensional structural grid; invoking the feature encoding network to construct the spatial feature based on the location of interest and the query location; and adding the spatial feature to the fused feature representation.
  8. The method according to any one of claims 1 to 7, wherein the fused feature representation indicates at least one of a texture feature and a geometric structure feature of the limb object.
  9. The method according to any one of claims 1 to 8, wherein when the fused feature representation indicates a texture feature of the limb object, the nth fused feature representation comprises a global texture feature; and wherein the method further comprises: invoking the feature encoding network to perform image encoding on the first limb image, to obtain the global texture feature; and adding the global texture feature to the fused feature representation, wherein the global texture feature is a global feature of the limb object in the first limb image.
  10. The method according to claim 9, wherein the global texture feature comprises a first texture feature of the first limb and a second texture feature of the second limb; and wherein invoking the feature encoding network to perform image encoding on the first limb image, to obtain the global texture feature comprises: invoking, based on a location of the first limb, the feature encoding network to extract the first texture feature from the first limb image; and invoking, based on a location of the second limb, the feature encoding network to extract the second texture feature from the first limb image.
  11. The method according to claim 10, wherein the first texture feature and the second texture feature correspondingly have mutually independent weight information in the fused feature representation; and wherein adding the global texture feature to the fused feature representation comprises: invoking the feature encoding network to predict the weight information corresponding to the first texture feature and the weight information corresponding to the second texture feature; and adding the first texture feature, the second texture feature, the weight information corresponding to the first texture feature, and the weight information corresponding to the second texture feature to the fused feature representation.
  12. A method for training a limb part image prediction model, wherein the method is performed by a computer device, the limb part image prediction model comprises a feature encoding network and a decoding network, and the method comprises: obtaining a sample information pair, wherein the sample information pair comprises a first sample image and a second sample image, the first sample image being an image of a sample object captured under a first perspective, the second sample image being an image of the sample object captured under a second perspective different from the first perspective, and the sample object comprising a first limb and a second limb having symmetry in a physiological structure; invoking the feature encoding network to perform image encoding on the first sample image, to obtain a predicted feature representation of each of at least two sample query points on the sample object, wherein an nth sample query point in the at least two sample query points is a sample query point on the first limb, and an nth predicted feature representation of the nth sample query point indicates a feature representation of a sample region with symmetry in the physiological structure and is used for supplementing a feature representation of the nth sample query point based on the symmetry of the sample object using a feature representation on the second limb; invoking the decoding network to perform feature decoding on predicted feature representations of the at least two sample query points, to obtain sample decoded features; performing rendering based on the sample decoded features to obtain a predicted limb image, wherein a generation region corresponding to the nth sample query point in the predicted limb image is obtained through decoding after feature supplementation is performed based on the feature representation on the second limb, and the predicted limb image is an image of the sample object obtained through prediction under the second perspective; and training the limb part image prediction model using a difference between the predicted limb image and the second sample image to obtain a trained limb part image prediction model.
  13. The method according to claim 12, wherein the nth predicted feature representation of the nth sample query point comprises a first predicted region feature and a second predicted region feature; and wherein invoking the feature encoding network to perform image encoding on the first sample image, to obtain the predicted feature representation of each of at least two sample query points on the sample object comprises: invoking the feature encoding network to extract, from the first sample image, the first predicted region feature of a peripheral region of the nth sample query point on the first limb and the second predicted region feature of a symmetrical region on the second limb, wherein the symmetrical region and the peripheral region of the nth sample query point have symmetry in the physiological structure; concatenating the first predicted region feature and the second predicted region feature to obtain the nth predicted feature representation of the nth sample query point; and summarizing predicted feature representations of all sample query points to obtain at least two predicted feature representations.
  14. The method according to claim 12 or 13, wherein the limb part image prediction model further comprises a discrimination network; and wherein the method further comprises: invoking the discrimination network to perform prediction on the predicted limb image, to obtain first visibility information, wherein the first visibility information is presented from the second perspective, and the first visibility information is a visibility status of the sample object from the first perspective and is predicted in the predicted limb image; obtaining second visibility information of the sample object, wherein the second visibility information is presented from the second perspective and is a visibility status of the sample object in the first sample image from the first perspective; and performing supplementary training on the discrimination network or a prediction submodel of the limb part image prediction model using a difference between the first visibility information and the second visibility information, wherein the prediction submodel comprises the feature encoding network and the decoding network.
  15. A method for training a limb part image prediction model, performed by a computer device, the method comprising: obtaining a first sample image, wherein the first sample image is an image of a sample object captured under a first perspective, the sample object comprising a first limb and a second limb having symmetry in a physiological structure; invoking a feature encoding network of the limb part image prediction model to perform image encoding on the first sample image, to obtain a predicted feature representation of each of at least two sample query points on the sample object, wherein an nth sample query point in the at least two sample query points is a sample query point on the first limb, and an nth predicted feature representation of the nth sample query point indicates a feature representation of a sample region with symmetry in the physiological structure and is used for supplementing a feature representation of the nth sample query point based on the symmetry of the sample object using a feature representation on the second limb; invoking a decoding network of the limb part image prediction model to perform feature decoding on predicted feature representations of the at least two sample query points, to obtain sample decoded features; performing rendering based on the sample decoded features to obtain a predicted limb image, wherein a generation region corresponding to the nth sample query point in the predicted limb image is obtained through decoding after feature supplementation is performed based on the feature representation on the second limb, and the predicted limb image is an image of the sample object obtained through prediction under a second perspective different from the first perspective; invoking a discrimination network of the limb part image prediction model to perform prediction on the predicted limb image, to obtain first visibility information, wherein the first visibility information is presented from the second perspective, and the first visibility information is a visibility status of the sample object from the first perspective and is predicted in the predicted limb image; obtaining second visibility information of the sample object, wherein the second visibility information is presented from the second perspective and is a visibility status of the sample object in the first sample image from the first perspective; and performing adversarial training on the discrimination network or a prediction submodel of the limb part image prediction model using a difference between the first visibility information and the second visibility information to obtain a trained limb part image prediction model, wherein the prediction submodel comprises the feature encoding network and the decoding network.
  16. An apparatus for image processing, comprising: a first obtaining module, configured to obtain a first limb image, wherein the first limb image is an image of a limb object captured under a first perspective, the limb object comprising a first limb and a second limb having symmetry in a physiological structure; a first processing module, configured to invoke a feature encoding network of a limb part image prediction model to perform image encoding on the first limb image, to obtain a fused feature representation of each of at least two query points on the limb object, wherein an nth query point in the at least two query points is a query point on the first limb, and an nth fused feature representation of the nth query point indicates a feature representation of an image region with symmetry in the physiological structure and is used for supplementing a feature representation of the nth query point based on the symmetry of the limb object using a feature representation on the second limb; and a first rendering module, configured to invoke a decoding network of the limb part image prediction model to perform feature decoding on fused feature representations of the at least two query points, to obtain decoded features; and to perform rendering based on the decoded features to obtain a second limb image, wherein the second limb image is an image of the limb object obtained under a second perspective different from the first perspective, and a generation region corresponding to the nth query point in the second limb image is obtained through decoding after feature supplementation is performed based on the feature representation on the second limb.
  17. An apparatus for training a limb part image prediction model, comprising: a second obtaining module, configured to obtain a sample information pair, wherein the sample information pair comprises a first sample image and a second sample image, the first sample image being an image of a sample object captured under a first perspective, the second sample image being an image of the sample object captured under a second perspective different from the first perspective, the sample object comprising a first limb and a second limb having symmetry in a physiological structure; a second processing module, configured to invoke a feature encoding network of the limb part image prediction model to perform image encoding on the first sample image, to obtain a predicted feature representation of each of at least two sample query points on the sample object, wherein an nth sample query point in the at least two sample query points is a sample query point on the first limb, an nth predicted feature representation of the nth sample query point indicates a feature representation of a sample region with symmetry in the physiological structure and is used for supplementing a feature representation of the nth sample query point based on the symmetry of the sample object using a feature representation on the second limb; a second rendering module, configured to invoke a decoding network of the limb part image prediction model to perform feature decoding on predicted feature representations of the at least two sample query points, to obtain sample decoded features; and to perform rendering based on the sample decoded features to obtain a predicted limb image, wherein a generation region corresponding to the nth sample query point in the predicted limb image is obtained through decoding after feature supplementation is performed based on the feature representation on the second limb, and the predicted limb image is an image of the sample object obtained through prediction under
the second perspective; and a first training module, configured to train the limb part image prediction model using a difference between the predicted limb image and the second sample image to obtain a trained limb part image prediction model.
  18. An apparatus for training a limb part image prediction model, comprising: a third obtaining module, configured to obtain a first sample image, wherein the first sample image is an image of a sample object captured under a first perspective, the sample object comprising a first limb and a second limb having symmetry in a physiological structure; a third processing module, configured to invoke a feature encoding network of the limb part image prediction model to perform image encoding on the first sample image, to obtain a predicted feature representation of each of at least two sample query points on the sample object, wherein an nth sample query point in the at least two sample query points is a sample query point on the first limb, an nth predicted feature representation of the nth sample query point indicates a feature representation of a sample region with symmetry in the physiological structure and is used for supplementing a feature representation of the nth sample query point based on the symmetry of the sample object using a feature representation on the second limb; a third rendering module, configured to invoke a decoding network of the limb part image prediction model to perform feature decoding on predicted feature representations of the at least two sample query points, to obtain sample decoded features; and to perform rendering based on the sample decoded features to obtain a predicted limb image, wherein a generation region corresponding to the nth sample query point in the predicted limb image is obtained through decoding after feature supplementation is performed based on the feature representation on the second limb, the predicted limb image is an image of the sample object obtained through prediction under a second perspective different from the first perspective; wherein the third processing module is further configured to invoke a discrimination network of the limb part image prediction model to perform prediction on the predicted
limb image, to obtain first visibility information, wherein the first visibility information is presented from the second perspective, and the first visibility information is a visibility status of the sample object from the first perspective and is predicted in the predicted limb image; the third obtaining module is further configured to obtain second visibility information of the sample object, wherein the second visibility information is presented from the second perspective and is a visibility status of the sample object in the first sample image from the first perspective; and a second training module, configured to perform adversarial training on the discrimination network or a prediction submodel of the limb part image prediction model using a difference between the first visibility information and the second visibility information to obtain a trained limb part image prediction model, wherein the prediction submodel comprises the feature encoding network and the decoding network.
  19. A computer device, comprising a processor and a memory, wherein the memory stores at least one section of executable instructions, and the processor is configured to execute the at least one section of executable instructions in the memory, to implement the method for image processing according to any one of claims 1 to 11 or the method for training the limb part image prediction model according to any one of claims 12 to 15.
  20. A computer-readable storage medium having stored thereon executable instructions that, when loaded and executed by a processor, cause the processor to perform the method for image processing according to any one of claims 1 to 11 or the method for training the limb part image prediction model according to any one of claims 12 to 15.
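Claims 2-3 and 5-6 above describe two concrete mechanisms: weight-then-concatenate fusion of the two region features, and locating the symmetrical region via a grid point at the same relative location on the opposite limb. The following is a minimal sketch under stated assumptions: scalar softmax weights derived from a dot product with a projection vector, and a three-dimensional structural grid that mirrors left/right about its vertical midline. Both choices, and both function names, are illustrative, not taken from the patent.

```python
import math

def fuse_with_predicted_weights(first_region, second_region, w_proj):
    """Claim-3-style fusion (illustrative): predict one scalar weight per
    region feature via a shared projection, normalize the two weights with
    a softmax, scale each feature by its weight, then concatenate."""
    logit1 = sum(a * b for a, b in zip(first_region, w_proj))
    logit2 = sum(a * b for a, b in zip(second_region, w_proj))
    z = math.exp(logit1) + math.exp(logit2)  # softmax normalizer
    w1, w2 = math.exp(logit1) / z, math.exp(logit2) / z
    return [w1 * v for v in first_region] + [w2 * v for v in second_region]

def symmetric_grid_point(point, mesh_width):
    """Claim-5/6-style lookup (illustrative): return the grid point at the
    same relative location on the second limb, assuming the structural grid
    mirrors left/right about the vertical midline."""
    y, x = point
    return (y, mesh_width - 1 - x)
```

For example, with `first_region = [1.0, 0.0]`, `second_region = [0.0, 1.0]`, and `w_proj = [1.0, 1.0]`, both logits equal 1.0, so each region receives weight 0.5 and the fused vector is `[0.5, 0.0, 0.0, 0.5]`; the learned weights let the model favor the visible limb when the mirrored region is a better evidence source.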

Description

RELATED APPLICATION

This application claims priority to Chinese Patent Application No. 202311367748.0, filed on October 20, 2023, which is incorporated herein by reference in its entirety.

FIELD OF THE TECHNOLOGY

The present disclosure relates to the field of artificial intelligence technologies, and relates to, but is not limited to, an image processing method, a method for training a limb part image prediction model, an apparatus, a computer device, a computer-readable storage medium, and a computer program product.

BACKGROUND OF THE DISCLOSURE

In multimedia technologies, there is a need to observe a complex action from different angles. Consider a gesture made with two hands: the hands need to be observed from different angles to check their relative positions, so that the gesture can be imitated. In the related art, a camera-based image capture device can capture an image of a user's two hands from only a single direction, determined by where the user places the device. When an image of the two hands from another direction needs to be displayed, an artificial neural network is invoked to perform image prediction: a global feature is extracted from the captured image and used to predict the two-hand image in the other direction. However, this approach cannot extract sufficient feature information from the image, resulting in a poor generation effect for two-hand images.
SUMMARY

Embodiments of the present disclosure provide an image processing method, a method for training a limb part image prediction model, an apparatus, a computer device, a computer-readable storage medium, and a computer program product, to accurately render a limb image of a limb object from a second perspective based on effectively extracted feature information, improving rendering precision of a second limb image. The embodiments of the present disclosure include the following technical solutions: The present disclosure provides an image processing method, the method being performed by a computer device, and the method including: obtaining a first limb image, the first limb image being an image of a limb object from a first perspective, the limb object including a first limb and a second limb, and the first limb and the second limb having symmetry in a physiological structure; invoking a feature encoding network of a limb part image prediction model to perform image encoding on the first limb image, to obtain a fused feature representation of each of at least two query points on the limb object, an nth query point in the at least two query points being a query point on the first limb, an nth fused feature representation of the nth query point indicating a feature representation of an image region with symmetry in the physiological structure, and the nth fused feature representation being configured for supplementing a feature representation of the nth query point based on the symmetry of the limb object by using a feature representation on the second limb; invoking a decoding network of the limb part image prediction model to perform feature decoding on fused feature representations of the at least two query points, to obtain decoded features; and performing rendering based on the decoded features to obtain a second limb image, the second limb image being an image of the limb object from a second perspective, a generation region corresponding to the nth query
point in the second limb image being obtained through decoding after feature supplementation is performed based on the feature representation on the second limb, and the second perspective being different from the first perspective. An embodiment of the present disclosure provides a method for training a limb part image prediction model, the method being performed by a computer device, the limb part image prediction model including a feature encoding network and a decoding network, and the method including: obtaining a sample information pair, the sample information pair including a first sample image and a second sample image, the first sample image being an image of a sample object from a first perspective, the second sample image being an image of the sample object from a second perspective, the first perspective being different from the second perspective, the sample object including a first limb and a second limb, and the first limb and the second limb having symmetry in a physiological structure; invoking the feature encoding network to perform image encoding on the first sample image, to obtain a predicted feature representation of each of at least two sample query points on the sample object, an nth sample query point in the at least two sample query points being a sample query point on the first limb, an nth predicted feature representation of the nth sample query point indica