CN-121998815-A - Video processing method, apparatus, electronic device, computer readable storage medium, and computer program product
Abstract
The application provides a video processing method, apparatus, electronic device, computer-readable storage medium, and computer program product. The method comprises: acquiring a video to be processed that includes at least one face; generating, for the face, a face video based on the image frames of the video to be processed that include the face; enlarging the face video to obtain an enlarged face video; adjusting the face in the enlarged face video to obtain an adjusted face video; and fusing the adjusted face video with the video to be processed to obtain a target video. The application can improve the accuracy of adjusting the face video and thereby improve the display quality of the target video.
Inventors
- WANG HONGMEI
- ZHAO WENZHE
- TIAN QI
- LIU MENGYANG
- WANG HONGFA
- LIN QIN
- LU QINGLIN
Assignees
- Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-01
Claims (17)
- 1. A method of video processing, the method comprising: acquiring a video to be processed comprising at least one face; generating, for each face in the video to be processed, a face video corresponding to the face based on the image frames that include the face; enlarging the face video to obtain an enlarged face video; adjusting the face in the enlarged face video to obtain an adjusted face video; and fusing the adjusted face video with the video to be processed to obtain a target video.
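The enlargement step of claim 1 can be illustrated with a minimal sketch, assuming face-video frames are NumPy arrays. The helper name `upscale_frame` and the nearest-neighbour method are illustrative only; the patent does not specify the enlargement technique, and a production system would more likely apply a learned super-resolution model.

```python
import numpy as np

def upscale_frame(frame, factor):
    """Nearest-neighbour enlargement of one face-video frame:
    each pixel is repeated `factor` times along both axes."""
    return np.repeat(np.repeat(frame, factor, axis=0), factor, axis=1)

# Enlarging a 2x2 frame by a factor of 2 yields a 4x4 frame.
small = np.arange(4).reshape(2, 2)
big = upscale_frame(small, 2)
```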
- 2. The method of claim 1, wherein the number of image frames is a plurality, and generating the face video corresponding to the face based on the image frames including the face in the video to be processed comprises: for each image frame including the face in the video to be processed, determining a recognition frame containing the face in the image frame; obtaining the union of the plurality of recognition frames of the face to obtain a target recognition frame; for each image frame of the face, cropping the region corresponding to the target recognition frame from the image frame to obtain a sub-image frame; and generating the face video corresponding to the face based on the plurality of sub-image frames of the face.
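The union-box construction of claim 2 keeps the crop region fixed across frames, so the face clip does not jitter as the per-frame detections shift. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples and frames are NumPy arrays; `union_box` and `crop_frames` are illustrative names, not from the patent:

```python
import numpy as np

def union_box(boxes):
    """Union of the per-frame face recognition boxes (x1, y1, x2, y2):
    the smallest box enclosing every detection, used as the shared
    target recognition frame for all crops."""
    b = np.asarray(boxes)
    return (b[:, 0].min(), b[:, 1].min(), b[:, 2].max(), b[:, 3].max())

def crop_frames(frames, box):
    """Cut the same target region out of every frame containing the face."""
    x1, y1, x2, y2 = box
    return [f[y1:y2, x1:x2] for f in frames]
```

The resulting list of equally sized sub-image frames can then be encoded into the face video.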
- 3. The method of claim 2, further comprising, after obtaining the union of the plurality of recognition frames of the face to obtain the target recognition frame: expanding the size of the target recognition frame based on the size of the image frame to obtain an expanded target recognition frame; wherein cropping, for each image frame of the face, the region corresponding to the target recognition frame from the image frame to obtain a sub-image frame comprises: for each image frame of the face, cropping the region corresponding to the expanded target recognition frame from the image frame to obtain a sub-image frame.
- 4. The method of claim 3, wherein expanding the size of the target recognition frame based on the size of the image frame to obtain an expanded target recognition frame comprises: acquiring the expansion ratio corresponding to the face; and based on the expansion ratio and the size of the image frame, expanding the target recognition frame in the image frame outward about the center point of the target recognition frame to obtain an expanded target recognition frame; wherein the size of the expanded target recognition frame is smaller than or equal to the size of the image frame.
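The center-anchored outward expansion of claim 4, clamped so the expanded box never exceeds the image, might look like the sketch below; the `expand_ratio` handling and the integer rounding are assumptions, since the patent only requires that the result stay within the frame:

```python
def expand_box(box, expand_ratio, img_w, img_h):
    """Grow a box about its center point by expand_ratio, then clamp
    to the image so the expanded target recognition frame is never
    larger than the image frame (claim 4)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) / 2 * expand_ratio
    half_h = (y2 - y1) / 2 * expand_ratio
    nx1 = max(0, int(cx - half_w))
    ny1 = max(0, int(cy - half_h))
    nx2 = min(img_w, int(cx + half_w))
    ny2 = min(img_h, int(cy + half_h))
    return nx1, ny1, nx2, ny2
```

The extra margin gives the downstream adjustment model context around the face (hair, chin, neck) rather than a tight crop.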
- 5. The method of claim 2, wherein generating the face video corresponding to the face based on the plurality of sub-image frames of the face comprises: determining the scenes in which the face appears in the video to be processed; ordering the plurality of sub-image frames of the face based on those scenes to obtain a sub-image frame sequence; and video-encoding the sub-image frame sequence to obtain the face video corresponding to the face.
- 6. The method of claim 1, wherein adjusting the face in the enlarged face video to obtain an adjusted face video comprises: extracting video features of the enlarged face video; iteratively adding noise to the video features to obtain a first noised feature; acquiring a target text corresponding to the face, and iteratively denoising the first noised feature based on the target text to obtain a second feature; and decoding the second feature to obtain the adjusted face video.
- 7. The method of claim 6, wherein acquiring the target text corresponding to the face comprises: extracting a first image frame from the enlarged face video; and generating the target text corresponding to the face based on the first image frame.
- 8. The method of claim 7, wherein generating the target text corresponding to the face based on the first image frame comprises: extracting image features of the first image frame through an image description model; based on the image features of the first image frame, performing image description prediction on the first image frame through the image description model to obtain a description text describing the first image frame; and extracting the target text corresponding to the face from the description text.
- 9. The method of claim 6, wherein acquiring the video to be processed including at least one face comprises: acquiring a prompt text for video generation; and inputting the prompt text into a video generation model to obtain a video, generated by the video generation model, that includes at least one face, the generated video serving as the video to be processed; and wherein acquiring the target text corresponding to the face comprises: extracting, from the prompt text, the prompt text corresponding to the face as the target text.
- 10. The method of claim 6, wherein iteratively adding noise to the video features to obtain the first noised feature comprises: repeating the following steps until the number of repetitions reaches a preset count, and taking the resulting noised feature as the first noised feature: acquiring a feature to be processed, the feature to be processed being the video features when the current repetition count is 0 and the feature noised in the previous repetition otherwise; multiplying the feature to be processed by a first attenuation factor to obtain a first product; acquiring the noise corresponding to the current repetition count, and multiplying that noise by a second attenuation factor to obtain a second product; and taking the sum of the first product and the second product as the noised feature.
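The iterative noising of claim 10 is essentially a forward diffusion recipe: at each repetition the current feature is scaled by a first attenuation factor, and freshly drawn noise scaled by a second attenuation factor is added. A sketch with illustrative attenuation values; the patent does not fix them, and a DDPM-style implementation would derive both factors from a noise schedule:

```python
import numpy as np

def iterative_noising(video_feature, steps, a1=0.98, a2=0.2, seed=0):
    """Claim 10 noising loop: repeat `steps` times, each time taking
    x <- a1 * x + a2 * noise, where a1 is the first attenuation factor,
    a2 the second, and noise is drawn per repetition. The values of
    a1 and a2 here are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    x = np.asarray(video_feature, dtype=float)
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        x = a1 * x + a2 * noise
    return x
```

The output is the "first noised feature" that claim 6 then denoises, conditioned on the target text.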
- 11. The method of claim 1, wherein the adjusted face video comprises a plurality of second image frames, the second image frames corresponding to the image frames of the video to be processed that include the face; and fusing the adjusted face video with the video to be processed to obtain a target video comprises: creating, for each second image frame in the adjusted face video, a gradient mask whose pixel values decrease from center to edge; based on the gradient mask, fusing the second image frame with its corresponding image frame that includes the face to obtain a first fused image frame; and replacing the image frames including the face in the video to be processed with the corresponding first fused image frames to obtain the target video.
- 12. The method of claim 11, wherein first pixels of the gradient mask correspond one-to-one with second pixels of the second image frame, and third pixels in the image frame including the face correspond to the first pixels; and fusing the second image frame with its corresponding image frame including the face based on the gradient mask to obtain a first fused image frame comprises: for each first pixel in the gradient mask, multiplying the first pixel by the second pixel corresponding to the first pixel to obtain a first product; multiplying a value determined from the first pixel by the third pixel corresponding to the first pixel to obtain a second product; taking the sum of the first product and the second product as a fourth pixel; and replacing each third pixel in the image frame corresponding to the second image frame and including the face with the fourth pixel corresponding to that third pixel to obtain the first fused image frame.
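Claims 11 and 12 blend each adjusted frame back using a mask that fades from 1 at the center to 0 at the edges, so the adjusted face transitions seamlessly into the original frame. The sketch below reads claim 12's "value determined from the first pixel" as `1 - mask`, and the square-falloff mask is only one possible gradient; both are assumptions:

```python
import numpy as np

def gradient_mask(h, w):
    """Mask whose values decrease from 1 at the center toward 0 at
    the edges (here via distance to the nearest border, normalized)."""
    ys = np.abs(np.linspace(-1, 1, h))[:, None]
    xs = np.abs(np.linspace(-1, 1, w))[None, :]
    return 1.0 - np.maximum(ys, xs)

def fuse(adjusted, original):
    """Per-pixel blend of claim 12: mask * adjusted + (1 - mask) * original."""
    m = gradient_mask(*adjusted.shape[:2])[..., None]
    return m * adjusted + (1.0 - m) * original
```

At the center the adjusted pixel dominates; at the border the original pixel is kept, hiding the crop seam.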
- 13. The method of claim 1, wherein the adjusted face video comprises a plurality of second image frames, the second image frames corresponding to the image frames of the video to be processed that include the face; and fusing the adjusted face video with the video to be processed to obtain a target video comprises: extracting a face image from each second image frame of the adjusted face video, and shrinking the face image to obtain a reduced face image; determining a target area in the image frame corresponding to the second image frame and including the face; fusing the reduced face image with the target area to obtain a second fused image frame; and replacing the image frames including the face in the video to be processed with the corresponding second fused image frames to obtain the target video.
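The alternative fusion of claim 13 shrinks the adjusted face and writes it into a target area of the original frame. A minimal sketch with nearest-neighbour downscaling; `shrink` and `paste` are illustrative helpers, and a real pipeline would use a proper resampler and blend rather than overwrite:

```python
import numpy as np

def shrink(img, scale):
    """Nearest-neighbour downscale: pick evenly spaced source rows
    and columns (stand-in for a proper resampling filter)."""
    h, w = img.shape[:2]
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    return img[ys][:, xs]

def paste(frame, face, top, left):
    """Write the reduced face image into the target area of the frame."""
    out = frame.copy()
    h, w = face.shape[:2]
    out[top:top + h, left:left + w] = face
    return out
```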
- 14. A video processing apparatus, the apparatus comprising: an acquisition module configured to acquire a video to be processed comprising at least one face; a generation module configured to generate, for each face in the video to be processed, a face video corresponding to the face based on the image frames including the face; an enlargement module configured to enlarge the face video to obtain an enlarged face video; an adjustment module configured to adjust the face in the enlarged face video to obtain an adjusted face video; and a fusion module configured to fuse the adjusted face video with the video to be processed to obtain a target video.
- 15. An electronic device, comprising: a memory for storing computer executable instructions or a computer program; and a processor for implementing the video processing method of any one of claims 1 to 13 when executing the computer executable instructions or computer program stored in the memory.
- 16. A computer readable storage medium storing computer executable instructions or a computer program which, when executed by a processor, implement the video processing method of any one of claims 1 to 13.
- 17. A computer program product comprising computer executable instructions or a computer program which, when executed by a processor, implements the video processing method of any one of claims 1 to 13.
Description
Video processing method, apparatus, electronic device, computer readable storage medium, and computer program product
Technical Field
The present application relates to the field of computer technology, and in particular to a video processing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
In the related art, a video may be adjusted to improve its visual effect and the viewing experience of the user; for example, image restoration may be performed on each image frame of the video to achieve video restoration. However, restoring image frames individually cannot maintain consistency between adjacent frames, so the display quality of the processed video is poor.
Disclosure of Invention
The embodiments of the application provide a video processing method, a video processing apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can improve the accuracy of adjusting a face video and thereby the display quality of the target video. The technical scheme of the embodiments is realized as follows. An embodiment of the application provides a video processing method comprising: acquiring a video to be processed comprising at least one face; generating, for each face in the video to be processed, a face video corresponding to the face based on the image frames that include the face; enlarging the face video to obtain an enlarged face video; adjusting the face in the enlarged face video to obtain an adjusted face video; and fusing the adjusted face video with the video to be processed to obtain a target video.
An embodiment of the application also provides a video processing apparatus comprising: an acquisition module configured to acquire a video to be processed comprising at least one face; a generation module configured to generate, for each face in the video to be processed, a face video corresponding to the face based on the image frames including the face; an enlargement module configured to enlarge the face video to obtain an enlarged face video; an adjustment module configured to adjust the face in the enlarged face video to obtain an adjusted face video; and a fusion module configured to fuse the adjusted face video with the video to be processed to obtain a target video. In the above scheme, the generation module is further configured to determine, for each image frame including the face, a recognition frame containing the face; obtain the union of the plurality of recognition frames of the face as a target recognition frame; crop, from each image frame of the face, the region corresponding to the target recognition frame to obtain a sub-image frame; and generate the face video corresponding to the face based on the plurality of sub-image frames of the face. In the above scheme, the generation module is further configured to, after obtaining the union of the plurality of recognition frames of the face as the target recognition frame, expand the size of the target recognition frame based on the size of the image frame to obtain an expanded target recognition frame, and crop, from each image frame of the face, the region corresponding to the expanded target recognition frame to obtain a sub-image frame.
In the above scheme, the generation module is further configured to acquire the expansion ratio corresponding to the face and, based on the expansion ratio and the size of the image frame, expand the target recognition frame outward about its center point to obtain an expanded target recognition frame, the size of which is smaller than or equal to that of the image frame. In the above scheme, the generation module is further configured to determine the scenes in which the face appears in the video to be processed, order the plurality of sub-image frames of the face based on those scenes to obtain a sub-image frame sequence, and video-encode the sub-image frame sequence to obtain the face video corresponding to the face. In the above scheme, the adjustment module is further configured to extract video features of the enlarged face video, iteratively add noise to the video features to obtain a first noised feature, acquire a target text corresponding to the face, iteratively denoise the first noised feature based on the target text to obtain a second feature, and decode the second feature to obtain the adjusted face video. In the above scheme, the adjustment module is further configured to extract a first image frame from the enlarged face video and generate the target text corresponding to the face based on the first image frame. In the above scheme, the adjustment module is further configured to extract image features of the first image frame through an image description model, perform image description prediction on the first image frame based on those image features to obtain a description text describing the first image frame, and extract the target text corresponding to the face from the description text.