CN-121999403-A - Video data processing method, device, equipment and storage medium
Abstract
The application provides a video data processing method, apparatus, device, and storage medium, applicable to fields such as electronic shopping and video outfit changing. The method comprises: obtaining an image containing a preset garment and original video data containing a video object, wherein the original video data comprises N image frames; performing feature extraction on the preset garment in the image to obtain first feature information of the preset garment; masking regions to be replaced in the N image frames to obtain N garment mask images; fusing each of the N garment mask images with one of N input noise maps to obtain N fused images; predicting, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and denoising the N input noise maps based on the N noise prediction values to obtain N denoised images used to generate updated video data. The method improves the video outfit-changing effect.
Inventors
- JIANG BOYUAN
Assignees
- Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-06
Claims (20)
- 1. A video data processing method, comprising: acquiring an image containing a preset garment and original video data containing a video object, wherein the original video data comprises N image frames, and N is a positive integer; performing feature extraction on the preset garment in the image to obtain first feature information of the preset garment; masking regions to be replaced in the N image frames respectively to obtain N garment mask images, wherein the regions to be replaced are at least part of the garment region of the video object in the N image frames; fusing each of the N garment mask images with one of N input noise maps to obtain N fused images; predicting, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and denoising the N input noise maps based on the N noise prediction values respectively to obtain N denoised images, wherein the denoised images are used to generate updated video data containing the video object.
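The flow of claim 1 can be sketched with a toy numpy example. This is only an illustrative stand-in, not the patent's implementation: `predict_noise` is a hypothetical stub in place of the real noise prediction model, and the mask/fuse/denoise arithmetic is a minimal one-step simplification of a diffusion pipeline.

```python
import numpy as np

def predict_noise(fused, garment_features):
    # Hypothetical stand-in for the noise prediction model: a real system
    # would run a diffusion network conditioned on the garment features.
    return 0.1 * fused + 0.01 * garment_features.mean()

def denoise_frames(frames, masks, garment_features, rng):
    """One denoising pass over N frames, following the structure of claim 1."""
    denoised = []
    for frame, mask in zip(frames, masks):
        garment_masked = frame * (1.0 - mask)         # mask the region to be replaced
        noise_map = rng.standard_normal(frame.shape)  # per-frame input noise map
        fused = garment_masked + noise_map * mask     # fuse mask image with noise
        noise_pred = predict_noise(fused, garment_features)
        denoised.append(noise_map - noise_pred)       # denoise with the prediction
    return denoised

rng = np.random.default_rng(0)
frames = [rng.random((4, 4)) for _ in range(3)]       # N = 3 toy frames
masks = []
for _ in range(3):
    m = np.zeros((4, 4))
    m[1:3, 1:3] = 1.0                                 # toy garment region
    masks.append(m)
out = denoise_frames(frames, masks, rng.random(8), rng)
```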
- 2. The method of claim 1, wherein predicting the noise in the N fused images through the noise prediction model with the first feature information of the preset garment as the control condition of the noise prediction model to obtain the N noise prediction values comprises: converting the N fused images from pixel space to a latent vector space to obtain feature information of the N fused images; determining input information of the noise prediction model based on the feature information of the N fused images; and processing the input information through the noise prediction model with the first feature information of the preset garment as the control condition of the noise prediction model to obtain the N noise prediction values.
- 3. The method of claim 2, wherein determining the input information of the noise prediction model based on the feature information of the N fused images comprises: performing key point detection on each of the N image frames to obtain key point pose information of the video object in each image frame; determining skeleton feature information of the video object in each image frame based on the key point pose information of the video object in that image frame; and determining the input information based on the skeleton feature information of the video object in each image frame and the feature information of each fused image.
- 4. The method of claim 3, wherein determining the input information based on the skeleton feature information of the video object in each image frame and the feature information of each fused image comprises: for each image frame, adding the skeleton feature information of the video object in the image frame and the feature information of the fused image corresponding to the image frame to obtain model input information of the image frame; and determining the model input information of the N image frames as the input information.
- 5. The method of any one of claims 2-4, wherein the noise prediction model comprises M first network modules, M being a positive integer greater than 1, and processing the input information through the noise prediction model with the first feature information of the preset garment as the control condition of the noise prediction model to obtain the N noise prediction values comprises: for the i-th first network module among the M first network modules, obtaining the (i-1)-th feature information corresponding to the N image frames output by the (i-1)-th first network module, wherein i is a positive integer from 1 to M, and when i equals 1 the (i-1)-th feature information is the input information; processing, through the i-th first network module and with the first feature information of the preset garment as the control condition, the (i-1)-th feature information corresponding to the N image frames to obtain the i-th feature information corresponding to the N image frames output by the i-th first network module; and obtaining the N noise prediction values based on the M-th feature information corresponding to the N image frames output by the M-th first network module.
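The stacked structure of claim 5 — each of M first network modules consuming the previous module's features under the garment control condition — can be sketched as follows (a minimal numpy illustration; `first_network_module` is a hypothetical stand-in for the real layers):

```python
import numpy as np

def first_network_module(i, prev_feats, garment_control):
    # Hypothetical i-th first network module: combines the (i-1)-th
    # features with the garment control condition.
    return np.tanh(prev_feats + 0.1 * i * garment_control)

def run_noise_prediction(input_info, garment_control, M=4):
    feats = input_info                       # for i = 1, the input information
    for i in range(1, M + 1):                # i runs from 1 to M
        feats = first_network_module(i, feats, garment_control)
    return feats                             # M-th features -> noise predictions

frame_feats = np.ones((3, 8))                # N = 3 frames, toy feature dim 8
pred = run_noise_prediction(frame_feats, np.full(8, 0.5))
```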
- 6. The method of claim 5, wherein the first feature information of the preset garment is extracted by a garment feature extraction model, the garment feature extraction model comprises M second network modules, and the M second network modules are in one-to-one correspondence with the M first network modules; performing feature extraction on the preset garment in the image containing the preset garment to obtain the first feature information of the preset garment comprises: extracting detail features and spatial features of the preset garment in the image containing the preset garment through the M second network modules to obtain first feature information of the preset garment corresponding to each of the M second network modules; and processing the (i-1)-th feature information corresponding to the N image frames through the i-th first network module with the first feature information of the preset garment as the control condition to obtain the i-th feature information corresponding to the N image frames comprises: processing the (i-1)-th feature information corresponding to the N image frames through the i-th first network module with the i-th first feature information of the preset garment, corresponding to the i-th second network module, as the control condition, to obtain the i-th feature information corresponding to the N image frames output by the i-th first network module.
- 7. The method of claim 6, wherein the i-th first network module comprises a first attention layer, the i-th second network module comprises a second attention layer, and the i-th first feature information of the preset garment is the input information of the second attention layer comprised in the i-th second network module; and processing the (i-1)-th feature information corresponding to the N image frames through the i-th first network module with the i-th first feature information of the preset garment as the control condition to obtain the i-th feature information corresponding to the N image frames comprises: obtaining input information of the first attention layer comprised in the i-th first network module based on the (i-1)-th feature information corresponding to the N image frames; fusing the input information of the first attention layer and the input information of the second attention layer to obtain fused input information; performing self-attention processing on the fused input information corresponding to each image frame through the first attention layer to obtain first attention processing results corresponding to the N image frames output by the i-th first network module; and determining the i-th feature information corresponding to the N image frames based on the first attention processing results corresponding to the N image frames.
- 8. The method of claim 7, wherein obtaining the input information of the first attention layer based on the (i-1)-th feature information corresponding to the N image frames comprises: performing inter-frame three-dimensional convolution processing on the (i-1)-th feature information corresponding to the N image frames to obtain a convolution result; and determining the input information of the first attention layer based on the convolution result.
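The inter-frame convolution of claim 8 mixes features across the N-frame axis before attention. The sketch below is a hedged 1-D stand-in (pure numpy, a hypothetical 3-tap temporal kernel with edge padding) for the three-dimensional convolution the claim describes:

```python
import numpy as np

def interframe_conv(feats, kernel):
    """Temporal (inter-frame) convolution over the N-frame axis; a 1-D
    stand-in for the 3-D convolution named in claim 8."""
    N = feats.shape[0]
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")  # replicate boundary frames
    return np.stack([sum(kernel[k] * padded[n + k] for k in range(3))
                     for n in range(N)])

feats = np.arange(12, dtype=float).reshape(4, 3)  # N = 4 frames, 3 features each
out = interframe_conv(feats, np.array([0.25, 0.5, 0.25]))
```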
- 9. The method of claim 8, wherein fusing the input information of the first attention layer and the input information of the second attention layer to obtain the fused input information comprises: fusing the key in the input information of the first attention layer with the key in the input information of the second attention layer to obtain a new key, and fusing the value in the input information of the first attention layer with the value in the input information of the second attention layer to obtain a new value; and determining the new key, the query of the first attention layer, and the new value as the fused input information.
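The key/value fusion of claim 9 matches the common pattern of concatenating reference keys and values into a self-attention layer while keeping the original query. A minimal numpy sketch, with toy shapes assumed:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention(q, k_frame, v_frame, k_garment, v_garment):
    """Attention per claim 9: keys and values from the first (frame) and
    second (garment) attention layers are concatenated into a new key and
    a new value; the query comes from the first layer alone."""
    k = np.concatenate([k_frame, k_garment], axis=0)  # new key
    v = np.concatenate([v_frame, v_garment], axis=0)  # new value
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(1)
q = rng.random((5, 8))                                # 5 frame tokens, dim 8
out = fused_attention(q, rng.random((5, 8)), rng.random((5, 8)),
                      rng.random((7, 8)), rng.random((7, 8)))
```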
- 10. The method of any one of claims 7-9, wherein determining the i-th feature information corresponding to the N image frames based on the first attention processing results corresponding to the N image frames comprises: performing inter-frame cross-attention processing on the first attention processing results corresponding to the N image frames to obtain second attention processing results corresponding to the N image frames; and obtaining the i-th feature information corresponding to the N image frames based on the first attention processing results and the second attention processing results corresponding to the N image frames.
- 11. The method of any one of claims 1-10, further comprising: extracting global features of the preset garment in the image containing the preset garment to obtain second feature information of the preset garment; wherein predicting the noise in the N fused images with the first feature information of the preset garment as the control condition of the noise prediction model to obtain the N noise prediction values comprises: predicting the noise in the N fused images with the first feature information and the second feature information of the preset garment as control conditions of the noise prediction model to obtain the N noise prediction values.
- 12. The method of any one of claims 1-11, wherein masking the regions to be replaced in the N image frames respectively to obtain the N garment mask images comprises: for the j-th image frame among the N image frames, performing key point detection on the j-th image frame to obtain key point pose information of the video object in the j-th image frame, wherein j is a positive integer less than or equal to N; segmenting different regions of the video object in the j-th image frame to obtain a region segmentation map of the j-th image frame; and masking the region to be replaced in the j-th image frame based on the key point pose information of the video object in the j-th image frame and the region segmentation map of the j-th image frame to obtain the garment mask image corresponding to the j-th image frame.
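Claim 12 builds each garment mask from a region segmentation map plus key point pose information. A toy numpy sketch, with hypothetical label IDs and a keypoint-derived bounding box standing in for the real pose cues:

```python
import numpy as np

def clothing_mask(seg_map, garment_labels, keypoint_box):
    """Build a to-be-replaced mask per claim 12: the union of the garment
    classes in the region segmentation map and a bounding box derived from
    body key points (both inputs are hypothetical stand-ins)."""
    mask = np.isin(seg_map, garment_labels).astype(float)  # garment pixels
    y0, y1, x0, x1 = keypoint_box
    box = np.zeros_like(mask)
    box[y0:y1, x0:x1] = 1.0                                # keypoint-derived box
    return np.maximum(mask, box)                           # union of both cues

seg = np.zeros((6, 6), dtype=int)
seg[2:4, 2:4] = 5                                          # label 5 = upper garment (toy)
m = clothing_mask(seg, [5], (1, 5, 1, 5))
```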
- 13. A model training method, comprising: acquiring a training sample, wherein the training sample comprises a training image and training video data, the training image is an image containing a preset garment, a video object in the training video data wears the preset garment, the training video data comprises N image frames, and N is a positive integer; performing feature extraction on the preset garment in the training image to obtain first feature information of the preset garment; masking regions to be replaced in the N image frames comprised in the training video data respectively to obtain N garment mask images; fusing each of the N garment mask images with one of N input noise maps to obtain N fused images; predicting, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and determining a loss of the noise prediction model based on the N noise prediction values and the noise values corresponding to the N input noise maps, and training the noise prediction model based on the loss to obtain a trained noise prediction model.
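The training objective of claim 13 compares the N predicted noise values against the noise actually placed in the N input noise maps. The claim does not fix the loss form; mean squared error, shown below as a numpy sketch, is the usual choice for diffusion noise prediction:

```python
import numpy as np

def diffusion_loss(noise_preds, true_noises):
    """Per claim 13: compare the N noise prediction values with the noise
    values of the N input noise maps (MSE is a common, assumed choice)."""
    return np.mean([(p - t) ** 2 for p, t in zip(noise_preds, true_noises)])

rng = np.random.default_rng(2)
true = [rng.standard_normal((4, 4)) for _ in range(3)]  # N = 3 noise maps
preds = [t + 0.1 for t in true]                         # near-perfect predictions
loss = diffusion_loss(preds, true)
```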
- 14. The method of claim 13, wherein predicting the noise in the N fused images through the noise prediction model with the first feature information of the preset garment as the control condition of the noise prediction model to obtain the N noise prediction values comprises: converting the N fused images from pixel space to a latent vector space to obtain feature information of the N fused images; determining input information of the noise prediction model based on the feature information of the N fused images; and processing the input information through the noise prediction model with the first feature information of the preset garment as the control condition of the noise prediction model to obtain the N noise prediction values.
- 15. The method of claim 13 or 14, wherein determining the input information of the noise prediction model based on the feature information of the N fused images comprises: performing key point detection on each of the N image frames to obtain key point pose information of the video object in each image frame; encoding the key point pose information of the video object in each image frame through a pose encoding model to obtain skeleton feature information of the video object in each image frame; and determining the input information based on the skeleton feature information of the video object in each image frame and the feature information of each fused image; and training the noise prediction model based on the loss to obtain the trained noise prediction model comprises: jointly training the noise prediction model and the pose encoding model based on the loss to obtain the trained noise prediction model and a trained pose encoding model.
- 16. The method of any one of claims 13-15, wherein performing feature extraction on the preset garment in the training image to obtain the first feature information of the preset garment comprises: performing feature extraction on the preset garment in the training image through a garment feature extraction model to obtain the first feature information of the preset garment; and training the noise prediction model based on the loss to obtain the trained noise prediction model comprises: jointly training the noise prediction model and the garment feature extraction model based on the loss to obtain the trained noise prediction model and a trained garment feature extraction model.
- 17. A video data processing apparatus, comprising: an acquisition unit, configured to acquire an image containing a preset garment and original video data containing a video object, wherein the original video data comprises N image frames, and N is a positive integer; an extraction unit, configured to perform feature extraction on the preset garment in the image containing the preset garment to obtain first feature information of the preset garment; a mask unit, configured to mask regions to be replaced in the N image frames to obtain N garment mask images, wherein the regions to be replaced are at least part of the garment region of the video object in the N image frames; a fusion unit, configured to fuse each of the N garment mask images with one of N input noise maps to obtain N fused images; a prediction unit, configured to predict, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and a denoising unit, configured to denoise the N input noise maps based on the N noise prediction values respectively to obtain N denoised images, wherein the regions corresponding to the regions to be replaced in the N denoised images present the preset garment, and the denoised images are used to obtain the video data after the garment of the video object has been replaced.
- 18. A model training apparatus, comprising: an acquisition unit, configured to acquire a training sample, wherein the training sample comprises a training image and training video data, the training image is an image containing a preset garment, a video object in the training video data wears the preset garment, the training video data comprises N image frames, and N is a positive integer; an extraction unit, configured to perform feature extraction on the preset garment in the training image to obtain first feature information of the preset garment; a mask unit, configured to mask regions to be replaced in the N image frames to obtain N garment mask images, wherein the regions to be replaced are at least part of the garment region of the video object in the N image frames; a fusion unit, configured to fuse each of the N garment mask images with one of N input noise maps to obtain N fused images; a prediction unit, configured to predict, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and a training unit, configured to determine a loss of the noise prediction model based on the N noise prediction values and the noise values corresponding to the N input noise maps, and to train the noise prediction model based on the loss to obtain a trained noise prediction model.
- 19. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the method of any one of claims 1 to 12 or 13 to 16.
- 20. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to perform the method of any one of claims 1 to 12 or 13 to 16.
Description
Video data processing method, device, equipment and storage medium
Technical Field
The embodiments of the application relate to the field of computer technology, and in particular to a video data processing method, apparatus, device, and storage medium.
Background
With the development of fields such as electronic shopping, virtual try-on technology has emerged. The main goal of virtual try-on is to generate an image of an object wearing a preset garment, ensuring that the fine details of the garment are preserved and that the garment blends with the surrounding environment. In current virtual try-on technology, an image containing a preset garment and an image containing an object are combined to generate a picture of the object wearing the preset garment. However, the outfit-changed picture generated by current virtual try-on technology does not meet the expected quality.
Disclosure of Invention
The application provides a video data processing method, apparatus, device, and storage medium, which can realize video outfit changing and further improve the virtual try-on effect.
In a first aspect, the present application provides a video data processing method, including: acquiring an image containing a preset garment and original video data containing a video object, wherein the original video data comprises N image frames; performing feature extraction on the preset garment in the image containing the preset garment to obtain first feature information of the preset garment; masking regions to be replaced in the N image frames respectively to obtain N garment mask images, wherein the regions to be replaced are at least part of the garment region of the video object in the N image frames; fusing each of the N garment mask images with one of N input noise maps to obtain N fused images; predicting, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and denoising the N input noise maps based on the N noise prediction values respectively to obtain N denoised images, wherein the denoised images are used to generate updated video data containing the video object.
In a second aspect, the present application provides a model training method, including: acquiring a training sample, wherein the training sample comprises a training image and training video data, the training image is an image containing a preset garment, a video object in the training video data wears the preset garment, the training video data comprises N image frames, and N is a positive integer; performing feature extraction on the preset garment in the training image to obtain first feature information of the preset garment; masking regions to be replaced in the N image frames respectively to obtain N garment mask images, wherein the regions to be replaced are at least part of the garment region of the video object in the N image frames; fusing each of the N garment mask images with one of N input noise maps to obtain N fused images; predicting, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and determining a loss of the noise prediction model based on the N noise prediction values and the noise values corresponding to the N input noise maps, and training the noise prediction model based on the loss to obtain a trained noise prediction model.
In a third aspect, the present application provides a video data processing apparatus, comprising: an acquisition unit, configured to acquire an image containing a preset garment and original video data containing a video object, wherein the original video data comprises N image frames, and N is a positive integer; an extraction unit, configured to perform feature extraction on the preset garment in the image containing the preset garment to obtain first feature information of the preset garment; a mask unit, configured to mask regions to be replaced in the N image frames to obtain N garment mask images, wherein the regions to be replaced are at least part of the garment region of the video object in the N image frames; a fusion unit, configured to fuse each of the N garment mask images with one of N input noise maps to obtain N fused images; a prediction unit, configured to predict, through a noise prediction model and with the first feature information of the preset garment as a control condition of the noise prediction model, the noise in the N fused images to obtain N noise prediction values; and a denoising unit, configured to denoise the N input noise maps based on the N noise prediction values respectively to obtain N denoised images, wherein the denoised images are used to generate updated video data containing the video object.