CN-121982202-A - Human head image standardized embedded representation method, device, equipment and medium


Abstract

The invention discloses a normalized embedded representation method and apparatus for human head images, a computer device, and a storage medium, relating to artificial intelligence technology as applied in fields such as finance, medical care, insurance, and banking. The method comprises: constructing a three-dimensional normalized space and defining a learnable potential feature grid in it; constructing and training an embedded network, wherein the network and the feature grid are jointly optimized based on a training data set containing a plurality of head images and on ground truth point track pairs generated from that data set; and inputting a target head image to be processed into the trained embedded network to obtain a target coordinate graph, which represents the three-dimensional coordinates, in the normalized space, of each pixel point in the image. By constructing a shared normalized space and learning a dense mapping from pixels to that space, the invention establishes a stable and robust pixel-level correspondence for the whole head, including hair and accessories, and overcomes the incomplete coverage, sensitivity to occlusion, and sparse correspondence of traditional methods.

Inventors

ZHANG XULONG
LU RENJIE

Assignees

Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-19

Claims (10)

1. A normalized embedded representation method for a human head image, comprising: constructing a three-dimensional normalized space, and defining a learnable potential feature grid in the normalized space; constructing and training an embedded network, wherein the training process comprises performing joint optimization training on the embedded network and the potential feature grid based on training data, the training data comprising a training data set containing a plurality of human head images and at least one group of ground truth point track pairs generated from the training data set; and inputting a target head image to be processed into the trained embedded network to obtain a target coordinate graph, wherein the target coordinate graph represents the three-dimensional coordinates, in the normalized space, of each pixel point in the target head image.
2. The human head image normalized embedded representation method according to claim 1, wherein the performing joint optimization training on the embedded network and the potential feature grid based on training data comprises: acquiring a first training image and a second training image from the training data set, and at least one group of ground truth point track pairs between the first training image and the second training image, wherein each group of point track pairs comprises a first pixel point coordinate in the first training image and a second pixel point coordinate in the second training image corresponding to the first pixel point coordinate; respectively inputting the first training image and the second training image into the embedded network to obtain a first coordinate graph corresponding to the first training image and a second coordinate graph corresponding to the second training image; obtaining a first normalized coordinate corresponding to the first pixel point coordinate from the first coordinate graph, and obtaining a second normalized coordinate corresponding to the second pixel point coordinate from the second coordinate graph; obtaining, by an interpolation operation, a first feature vector from the potential feature grid according to the first normalized coordinate, and obtaining, by query, a second feature vector from the potential feature grid according to the second normalized coordinate; and constructing a contrast loss function based on the first feature vector and the second feature vector, and updating the parameters of the embedded network and the values of the potential feature grid by using the contrast loss function, so as to reduce the distance between the first feature vector and the second feature vector in a feature space.
3. The human head image normalized embedded representation method according to claim 2, wherein the performing joint optimization training on the embedded network and the potential feature grid based on training data further comprises: detecting a plurality of first sparse key points from the first training image and a plurality of second sparse key points from the second training image using a pre-trained landmark detector; for each first sparse key point in the first training image, acquiring the normalized coordinates corresponding to the first sparse key point from the first coordinate graph; for each second sparse key point in the second training image, acquiring the normalized coordinates corresponding to the second sparse key point from the second coordinate graph; calculating a landmark anchoring loss, wherein the landmark anchoring loss measures the difference between the normalized coordinates of all the first sparse key points and second sparse key points and fixed target position coordinates that are preset in the normalized space and semantically bound to the respective key points; and combining the landmark anchoring loss with the contrast loss function to jointly update the parameters of the embedded network.
4. The human head image normalized embedded representation method according to claim 3, wherein the performing joint optimization training on the embedded network and the potential feature grid based on training data further comprises: adding a segmentation prediction head to the embedded network; inputting, into the segmentation prediction head, either intermediate features extracted from the first training image by the embedded network or features obtained by querying the potential feature grid with the first coordinate graph, to obtain a predicted segmentation mask; acquiring a head region pseudo-segmentation truth mask preset for the first training image; calculating a segmentation loss between the predicted segmentation mask and the pseudo-segmentation truth mask; and combining the segmentation loss with the contrast loss function and the landmark anchoring loss to update the parameters of the embedded network and the parameters of the segmentation prediction head.
5. The human head image normalized embedded representation method according to claim 2, wherein the obtaining, by an interpolation operation, a first feature vector from the potential feature grid according to the first normalized coordinate, and obtaining, by query, a second feature vector from the potential feature grid according to the second normalized coordinate comprises: determining a first minimum cube unit of the potential feature grid in which the first normalized coordinate is located, and locating the eight first vertices of the first minimum cube unit; acquiring the eight first vertex feature vectors stored in the potential feature grid at the eight first vertices; calculating first spatial-position weights of the first normalized coordinate relative to the eight first vertices of the first minimum cube unit; performing a weighted summation of the eight first vertex feature vectors according to the first spatial-position weights to obtain the first feature vector; determining a second minimum cube unit of the potential feature grid in which the second normalized coordinate is located, and locating the eight second vertices of the second minimum cube unit; acquiring the eight second vertex feature vectors stored in the potential feature grid at the eight second vertices; calculating second spatial-position weights of the second normalized coordinate relative to the eight second vertices of the second minimum cube unit; and performing a weighted summation of the eight second vertex feature vectors according to the second spatial-position weights to obtain the second feature vector.
6. The human head image normalized embedded representation method according to claim 1, wherein the normalized space is a three-dimensional unit cube, and wherein, after defining the learnable potential feature grid in the normalized space, the method further comprises: filtering the feature vector at each grid point position in the potential feature grid by applying a three-dimensional Gaussian convolution kernel, so that the feature vectors of adjacent grid points in the potential feature grid transition smoothly in space.
7. The human head image normalized embedded representation method according to claim 1, further comprising: when the input is a single target head image, using the target coordinate graph to achieve pixel-level dense matching by performing a nearest neighbor search between the coordinate graphs corresponding to different images, or combining the dense correspondence provided by the target coordinate graph with a fitting result of a preset parameterized head model to provide an alignment basis for texture mapping of the parameterized head model; or, when a plurality of target head images of the same head under different viewing angles are input, combining the target coordinate graphs corresponding to the images and reconstructing a three-dimensional point cloud or surface mesh by triangulation or a multi-view geometric constraint method.
8. A human head image normalized embedded representation apparatus comprising means for performing the method of any of claims 1-7.
9. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
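
By way of non-limiting illustration, the eight-vertex weighted lookup of claim 5 and the contrast loss of claim 2 can be sketched in Python with NumPy. The sketch assumes a cubic grid stored as an array of shape (R, R, R, C) over the unit cube. The claims only require a contrast loss that reduces the distance between matched feature vectors, so the InfoNCE-style form used here, the temperature of 0.1, the use of the other point pairs in the batch as negatives, and the names `trilinear_lookup` and `contrast_loss` are all assumptions of this sketch rather than part of the disclosure.

```python
import numpy as np


def trilinear_lookup(grid, coords):
    """Sample features from a (R, R, R, C) grid at coords in [0, 1]^3.

    coords: (N, 3) normalized coordinates; returns (N, C) features.
    Assumes a cubic grid with R >= 2 (an assumption of this sketch).
    """
    res = grid.shape[0]
    # Scale to grid index space and locate the enclosing minimum cube unit.
    pos = np.clip(coords, 0.0, 1.0) * (res - 1)
    lo = np.minimum(np.floor(pos).astype(int), res - 2)  # lower corner index
    frac = pos - lo                                       # offset inside the unit
    out = np.zeros((coords.shape[0], grid.shape[-1]))
    # Weighted summation over the eight vertices of the cube unit.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                offset = np.array([dx, dy, dz])
                # Per-axis weight: frac for the upper vertex, 1 - frac for the lower.
                w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
                idx = lo + offset
                out += w[:, None] * grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    return out


def contrast_loss(feat_a, feat_b, temperature=0.1):
    """InfoNCE-style loss: row i of feat_a should match row i of feat_b.

    Other rows of feat_b act as negatives (an assumption of this sketch).
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + 1e-8)
    logits = a @ b.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


# Hypothetical data: an 8x8x8 grid of 16-dimensional features and five
# point track pairs already mapped to normalized coordinates.
rng = np.random.default_rng(0)
grid = rng.normal(size=(8, 8, 8, 16))
coords_a = rng.uniform(size=(5, 3))
coords_b = rng.uniform(size=(5, 3))
loss = contrast_loss(trilinear_lookup(grid, coords_a),
                     trilinear_lookup(grid, coords_b))
```

In actual training, the same computation would be expressed in an automatic differentiation framework so that gradients of the loss reach both the parameters of the embedded network and the values stored in the potential feature grid.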

Description

Human head image standardized embedded representation method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a method and apparatus for normalized embedded representation of a human head image, a computer device, and a storage medium.
Background
High-fidelity human head modeling has important application value in fields such as finance, medical care, insurance, and banking, which place strict requirements on image presentation and identity verification. Current technical solutions in this field rely mainly on two types of methods: tracking based on sparse key points, and modeling based on a parameterized three-dimensional morphable model (such as 3DMM).
The sparse key point tracking method achieves rough alignment and tracking of the face by detecting and tracking a small number of facial feature points with definite semantics, such as eye corners and mouth corners. However, the number of points this method handles is extremely limited, so it cannot cover the entire head; in particular, it cannot effectively track or represent non-facial regions such as hair, glasses, earrings, hats, and clothing.
The parameterized three-dimensional model method fits a base shape model obtained through statistical learning and can provide geometric information denser than sparse key points. However, the model structure is generally simple and its core modeling object is limited to the facial skin region, so it is difficult to extend to parts beyond the basic facial mesh topology, such as hair and complex headwear, and modeling of the complete head cannot be achieved. There are significant limitations to the prior art.
First, coverage is incomplete: existing methods focus mainly on the face and ignore the non-facial regions that are an important part of the overall head appearance. Second, robustness is insufficient in complex real scenes: when the face is partially occluded by an object (such as a hand or a microphone), or under extreme poses and exaggerated expressions, tracking and fitting methods that depend on facial appearance information fail very easily, so that tracking is lost or significant errors are produced. Finally, the quality of the output correspondence is a bottleneck: sparse key points cannot establish a dense pixel-level correspondence, and the correspondence that a parameterized model provides in the face region is limited by the resolution and coverage of its mesh, which severely restricts downstream applications such as high-precision texture mapping, fine expression transfer, and high-quality geometric reconstruction.
Disclosure of Invention
The embodiments of the invention provide a normalized embedded representation method and apparatus for human head images, a computer device, and a storage medium, which aim to provide a robust, stable, and dense pixel-level correspondence for the complete human head, including hair and accessories.
In a first aspect, an embodiment of the present invention provides a normalized embedded representation method for a human head image, comprising: constructing a three-dimensional normalized space, and defining a learnable potential feature grid in the normalized space; constructing and training an embedded network, wherein the training process comprises performing joint optimization training on the embedded network and the potential feature grid based on training data, the training data comprising a training data set containing a plurality of human head images and at least one group of ground truth point track pairs generated from the training data set; and inputting a target head image to be processed into the trained embedded network to obtain a target coordinate graph, wherein the target coordinate graph represents the three-dimensional coordinates, in the normalized space, of each pixel point in the target head image.
Optionally, the performing joint optimization training on the embedded network and the potential feature grid based on training data comprises: acquiring a first training image and a second training image from the training data set, and at least one group of ground truth point track pairs between the first training image and the second training image, wherein each group of point track pairs comprises a first pixel point coordinate in the first training image and a second pixel point coordinate in the second training image corresponding to the first pixel point coordinate; respectively inputting