JP-7857385-B2 - Image generation model training method, image generation method, apparatus and equipment
Inventors
- 杜 宗財
- 趙 亜飛
- 範 錫睿
- 陳 毅
- 王 志強
- 秦 勤
Assignees
- Beijing Baidu Netcom Science Technology Co., Ltd.
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-12-17
- Priority Date: 2024-02-07
Claims (20)
- A training method for an image generation model, comprising: acquiring sample audio data, a sample reference image, and annotation image data, and extracting reference keypoints of a person from the sample reference image; performing, based on a model to be trained, motion estimation using the sample audio data and the reference keypoints to obtain predicted keypoints that match the sample audio data; performing, based on the model to be trained, parameter estimation using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points; performing, based on the model to be trained, image prediction using the sample reference image and a dense optical flow that includes the optical flow of the predicted keypoints and the optical flow of the non-key pixel points to obtain predicted image data that matches the sample audio data; and obtaining an image generation model by training the model using the predicted image data and the annotation image data. Obtaining the motion parameters of the predicted keypoints and the optical flow of the non-key pixel points includes: obtaining, based on the model to be trained, the optical flow of the predicted keypoints from the coordinates of the predicted keypoints and the coordinates of the reference keypoints; performing parameter estimation using the optical flow of the predicted keypoints to obtain the motion parameters of the predicted keypoints; selecting, for the non-key pixel points, auxiliary keypoints from among the predicted keypoints; and performing prior motion estimation using the optical flow and motion parameters of the auxiliary keypoints to obtain the optical flow of the non-key pixel points.
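The claim obtains the optical flow of each predicted keypoint from the coordinates of the predicted keypoint and the corresponding reference keypoint. A minimal sketch, assuming (the patent does not spell this out) that this flow is simply the per-keypoint coordinate displacement:

```python
# Hypothetical sketch: the optical flow of each predicted keypoint as
# its displacement from the corresponding reference keypoint. The
# function name and the plain coordinate difference are illustrative
# assumptions, not taken verbatim from the patent.
def keypoint_flow(predicted, reference):
    return [(px - rx, py - ry)
            for (px, py), (rx, ry) in zip(predicted, reference)]
```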
- The method according to claim 1, wherein performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of the non-key pixel points includes: determining, from the optical flow of a predicted keypoint, the motion function that the predicted keypoint follows; differentiating the motion function based on a Taylor expansion to obtain the first- and second-order partial derivatives of the predicted keypoint's motion in the horizontal and vertical directions; and performing prior motion estimation using the coordinates of a non-key pixel point, the coordinates of its auxiliary keypoints, and the first- and second-order partial derivatives of the auxiliary keypoints in the horizontal and vertical directions, to obtain the optical flow of the non-key pixel point.
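The prior motion estimation in this claim is a Taylor expansion of the keypoint motion function around an auxiliary keypoint. A minimal second-order sketch; the Jacobian `jac` and the per-component Hessians `hess_x`, `hess_y` stand in for the first- and second-order partial derivatives named in the claim, and all names are illustrative, not from the patent:

```python
# Hypothetical sketch: second-order Taylor prior for the flow of a
# non-key pixel p around an auxiliary keypoint k with known flow flow_k.
def prior_flow(p, k, flow_k, jac, hess_x, hess_y):
    dx, dy = p[0] - k[0], p[1] - k[1]
    # first-order term: Jacobian of the motion function evaluated at k
    fx = flow_k[0] + jac[0][0] * dx + jac[0][1] * dy
    fy = flow_k[1] + jac[1][0] * dx + jac[1][1] * dy
    # second-order correction from the Hessian of each flow component
    fx += 0.5 * (hess_x[0][0] * dx * dx + 2 * hess_x[0][1] * dx * dy
                 + hess_x[1][1] * dy * dy)
    fy += 0.5 * (hess_y[0][0] * dx * dx + 2 * hess_y[0][1] * dx * dy
                 + hess_y[1][1] * dy * dy)
    return (fx, fy)
```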
- The method according to claim 2, further comprising, after obtaining the optical flow of the non-key pixel points: determining, based on a Gaussian distribution, the influence weight that a non-key pixel point receives from an auxiliary keypoint, according to the coordinates of the non-key pixel point, the coordinates of the auxiliary keypoint, and a learnable influence radius; and scaling the optical flow of the non-key pixel point by the influence weight it receives from the auxiliary keypoint to obtain the scaled optical flow of the non-key pixel point.
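A Gaussian influence weight of the kind this claim describes decays with the squared distance between the non-key pixel and the auxiliary keypoint, with the decay rate set by the learnable radius. A minimal sketch; the exact functional form is an assumption consistent with "based on a Gaussian distribution", not taken verbatim from the patent:

```python
import math

# Hypothetical sketch: Gaussian influence weight of auxiliary keypoint k
# on non-key pixel p, controlled by a (learnable) influence radius.
def influence_weight(p, k, radius):
    d2 = (p[0] - k[0]) ** 2 + (p[1] - k[1]) ** 2
    return math.exp(-d2 / (2.0 * radius ** 2))

# Scale a pixel's optical flow by the weight it receives.
def scaled_flow(flow, weight):
    return (flow[0] * weight, flow[1] * weight)
```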
- The method according to claim 2 or 3, further comprising, after obtaining the optical flow of the non-key pixel points: correcting the optical flow of the non-key pixel points with a learnable optical-flow offset to obtain the corrected optical flow of the non-key pixel points.
- The method according to claim 1, wherein extracting the reference keypoints of the person from the sample reference image includes extracting the reference keypoints, a reference person image, and a background image from the sample reference image, and completing the background image to obtain a completed background image; and wherein performing image prediction using the sample reference image and the dense optical flow to obtain predicted image data that matches the sample audio data includes: encoding, based on the model to be trained, the reference person image to obtain reference person image features; decoding, based on the model to be trained, the reference person image features and the dense optical flow to obtain predicted person image data; and fusing the predicted person image data with the completed background image to obtain the predicted image data that matches the sample audio data.
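The dense optical flow in this claim drives the prediction of the person image, which is then fused with the completed background. A minimal sketch on a tiny single-channel grid; the backward warp and binary-mask fusion below are illustrative stand-ins for the encoder/decoder described in the claim, not the patent's actual network:

```python
# Hypothetical sketch: warp a tiny single-channel reference image with a
# dense optical flow, then fuse the result with a completed background
# via a binary person mask.
def warp(image, flow):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # sample the source pixel the flow points back to, clamped
            sx = min(max(int(round(x - flow[y][x][0])), 0), w - 1)
            sy = min(max(int(round(y - flow[y][x][1])), 0), h - 1)
            out[y][x] = image[sy][sx]
    return out

def fuse(person, background, mask):
    h, w = len(person), len(person[0])
    return [[person[y][x] if mask[y][x] else background[y][x]
             for x in range(w)] for y in range(h)]
```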
- The method according to claim 1 or 5, wherein performing image prediction using the sample reference image and the dense optical flow to obtain predicted image data that matches the sample audio data includes: masking, based on the model to be trained, the dense optical flow to obtain a masked dense optical flow; and performing image prediction using the sample reference image and the masked dense optical flow to obtain the predicted image data that matches the sample audio data.
- The method according to claim 1, wherein performing motion estimation based on the model to be trained using the sample audio data and the reference keypoints to obtain predicted keypoints that match the sample audio data includes: encoding, based on the model to be trained, the sample audio data to obtain audio features; and performing motion estimation using the reference keypoints and the audio features to obtain the predicted keypoints that match the sample audio data.
- An image generation method, comprising: acquiring target audio data and a target reference image, and extracting reference keypoints of a person from the target reference image; performing, based on an image generation model, motion estimation using the target audio data and the reference keypoints to obtain predicted keypoints that match the target audio data; performing, based on the image generation model, parameter estimation using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points; and performing, based on the image generation model, image prediction using the target reference image and a dense optical flow that includes the optical flow of the predicted keypoints and the optical flow of the non-key pixel points to obtain predicted image data that matches the target audio data. Obtaining the motion parameters of the predicted keypoints and the optical flow of the non-key pixel points includes: obtaining, based on the image generation model, the optical flow of the predicted keypoints from the coordinates of the predicted keypoints and the coordinates of the reference keypoints; performing parameter estimation using the optical flow of the predicted keypoints to obtain the motion parameters of the predicted keypoints; selecting, for the non-key pixel points, auxiliary keypoints from among the predicted keypoints; and performing prior motion estimation using the optical flow and motion parameters of the auxiliary keypoints to obtain the optical flow of the non-key pixel points.
- The method according to claim 8, wherein performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of the non-key pixel points includes: determining, from the optical flow of a predicted keypoint, the motion function that the predicted keypoint follows; differentiating the motion function based on a Taylor expansion to obtain the first- and second-order partial derivatives of the predicted keypoint's motion in the horizontal and vertical directions; and performing prior motion estimation using the coordinates of a non-key pixel point, the coordinates of its auxiliary keypoints, and the first- and second-order partial derivatives of the auxiliary keypoints in the horizontal and vertical directions, to obtain the optical flow of the non-key pixel point.
- The method according to claim 9, further comprising, after obtaining the optical flow of the non-key pixel points: determining, based on a Gaussian distribution, the influence weight that a non-key pixel point receives from an auxiliary keypoint, according to the coordinates of the non-key pixel point, the coordinates of the auxiliary keypoint, and a learnable influence radius; and scaling the optical flow of the non-key pixel point by the influence weight it receives from the auxiliary keypoint to obtain the scaled optical flow of the non-key pixel point.
- The method according to claim 9 or 10, further comprising, after obtaining the optical flow of the non-key pixel points: correcting the optical flow of the non-key pixel points with a learnable optical-flow offset to obtain the corrected optical flow of the non-key pixel points.
- The method according to claim 8, wherein extracting the reference keypoints of the person from the target reference image includes extracting the reference keypoints and a reference person image from the target reference image; and wherein performing image prediction based on the image generation model using the target reference image and the dense optical flow to obtain predicted image data that matches the target audio data includes: encoding, based on the image generation model, the reference person image to obtain reference person image features; decoding, based on the image generation model, the reference person image features and the dense optical flow to obtain predicted person image data; and fusing the predicted person image data with a target background image to obtain the predicted image data that matches the target audio data.
- The method according to claim 12, further comprising either: extracting a background image from the target reference image, completing the extracted background image, and using the completed background image as the target background image; or obtaining a customized background image and using it as the target background image.
- The method according to claim 8, wherein performing motion estimation based on the image generation model using the target audio data and the reference keypoints to obtain predicted keypoints that match the target audio data includes: encoding, based on the image generation model, the target audio data to obtain audio features; and performing motion estimation using the reference keypoints and the audio features to obtain the predicted keypoints that match the target audio data.
- The method according to claim 8 or 14, further comprising: acquiring keypoints of a customized action; and fusing the keypoints of the customized action with the predicted keypoints that match the target audio data to obtain new predicted keypoints.
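The fusion of customized-action keypoints with predicted keypoints can be sketched minimally. Overriding selected keypoint indices is one plausible fusion rule; the patent does not specify the operation, so the function below is an illustrative assumption:

```python
# Hypothetical sketch: fuse customized-action keypoints with predicted
# keypoints by overriding selected indices.
def fuse_keypoints(predicted, custom):
    # predicted: list of (x, y); custom: {index: (x, y)} overrides
    fused = list(predicted)
    for i, point in custom.items():
        fused[i] = point
    return fused
```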
- A training apparatus for an image generation model, comprising: a reference keypoint module for acquiring sample audio data, a sample reference image, and annotation image data, and for extracting reference keypoints of a person from the sample reference image; a predicted keypoint module for obtaining predicted keypoints that match the sample audio data by performing motion estimation, based on a model to be trained, using the sample audio data and the reference keypoints; an optical flow estimation module for performing, based on the model to be trained, parameter estimation using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and for performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points; an image prediction module for obtaining predicted image data that matches the sample audio data by performing image prediction, based on the model to be trained, using the sample reference image and a dense optical flow that includes the optical flow of the predicted keypoints and the optical flow of the non-key pixel points; and a model training module for obtaining an image generation model by training the model using the predicted image data and the annotation image data. The optical flow estimation module further: obtains, based on the model to be trained, the optical flow of the predicted keypoints from the coordinates of the predicted keypoints and the coordinates of the reference keypoints; performs parameter estimation using the optical flow of the predicted keypoints to obtain the motion parameters of the predicted keypoints; selects, for the non-key pixel points, auxiliary keypoints from among the predicted keypoints; and performs prior motion estimation using the optical flow and motion parameters of the auxiliary keypoints to obtain the optical flow of the non-key pixel points.
- An image generation apparatus, comprising: a reference keypoint module for acquiring target audio data and a target reference image, and for extracting reference keypoints of a person from the target reference image; a predicted keypoint module for obtaining predicted keypoints that match the target audio data by performing motion estimation, based on an image generation model, using the target audio data and the reference keypoints; an optical flow estimation module for performing, based on the image generation model, parameter estimation using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and for performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points; and an image prediction module for obtaining predicted image data that matches the target audio data by performing image prediction, based on the image generation model, using the target reference image and a dense optical flow that includes the optical flow of the predicted keypoints and the optical flow of the non-key pixel points. The optical flow estimation module further: obtains, based on the image generation model, the optical flow of the predicted keypoints from the coordinates of the predicted keypoints and the coordinates of the reference keypoints; performs parameter estimation using the optical flow of the predicted keypoints to obtain the motion parameters of the predicted keypoints; selects, for the non-key pixel points, auxiliary keypoints from among the predicted keypoints; and performs prior motion estimation using the optical flow and motion parameters of the auxiliary keypoints to obtain the optical flow of the non-key pixel points.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 3, 5, 7 to 10, and 12 to 14.
- A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1 to 3, 5, 7 to 10, and 12 to 14.
- A computer program that, when executed by a processor, implements the method according to any one of claims 1 to 3, 5, 7 to 10, and 12 to 14.
Description
This disclosure relates to the field of computers, in particular to artificial intelligence, and specifically to technologies such as augmented reality (AR), virtual reality (VR), computer vision, and deep learning. It is applicable to scenarios such as the metaverse and virtual digital humans, and specifically relates to a training method for an image generation model, an image generation method, an apparatus, and a device.

A virtual human, also called a digital human, is a virtual person with a digitized appearance. It exists on a display device and possesses human features, behaviors (such as speaking and raising a hand), and thoughts. Virtual human generation is a key technology in areas such as the metaverse, intelligent customer service, and e-commerce; its core is the generation of continuous, realistic digital human images, which has broad application and business demand. How to generate digital human images is therefore of great importance.

This disclosure provides a training method for an image generation model, an image generation method, an apparatus, and a device.

According to one aspect of this disclosure, a training method for an image generation model is provided. The method involves acquiring sample audio data, a sample reference image, and annotation image data, and extracting reference keypoints of a person from the sample reference image. Based on a model to be trained, motion estimation is performed using the sample audio data and the reference keypoints to obtain predicted keypoints that match the sample audio data. Based on the model to be trained, parameter estimation is performed using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and prior motion estimation is performed using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points.
Based on the model to be trained, image prediction is performed using the sample reference image and a dense optical flow that includes the optical flow of the predicted keypoints and the optical flow of the non-key pixel points, to obtain predicted image data that matches the sample audio data. An image generation model is then obtained by training the model using the predicted image data and the annotation image data.

According to one aspect of this disclosure, an image generation method is provided. The method involves acquiring target audio data and a target reference image, and extracting reference keypoints of the person from the target reference image. Based on an image generation model, motion estimation is performed using the target audio data and the reference keypoints to obtain predicted keypoints that match the target audio data. Based on the image generation model, parameter estimation is performed using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and prior motion estimation is performed using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points. Based on the image generation model, image prediction is performed using the target reference image and a dense optical flow that includes the optical flow of the predicted keypoints and the optical flow of the non-key pixel points, to obtain predicted image data that matches the target audio data.
According to one aspect of this disclosure, a training apparatus for an image generation model is provided, comprising: a reference keypoint module for acquiring sample audio data, a sample reference image, and annotation image data, and for extracting reference keypoints of a person from the sample reference image; a predicted keypoint module for obtaining predicted keypoints that match the sample audio data by performing motion estimation, based on a model to be trained, using the sample audio data and the reference keypoints; an optical flow estimation module for performing, based on the model to be trained, parameter estimation using the reference keypoints and the predicted keypoints to obtain motion parameters of the predicted keypoints, and for performing prior motion estimation using the motion parameters of the predicted keypoints to obtain the optical flow of non-key pixel points; an image prediction module for obtaining predicted image data that matches the sample audio data by performing image prediction, based on the model to be trained, using the sample reference image and a dense optical flow including the optical flow of the predicted keypoints and the optical flow of the non-key pixel points; and a model training module for obtaining an image generation model by performing model training using the predicted image data and the annotation image data. According to one aspect of this disclosure, a reference keypoint module for acquiring target audio data and target reference