CN-121986472-A - Video processing method, apparatus, electronic device, storage medium, and program product
Abstract
The present disclosure provides a video processing method, apparatus, electronic device, storage medium, and program product, and relates to the technical fields of video processing and artificial intelligence. The method comprises: determining a target frame from a two-dimensional video frame sequence to be processed, wherein the target frame comprises a foreground text; performing depth estimation on the two-dimensional video frame sequence to be processed to obtain a target depth map corresponding to each video frame, wherein the depth information of the target depth map represents a fusion result of the depth information of the current video frame and the depth information of adjacent video frames; processing the target depth map and a mask image corresponding to each adjacent video frame of the target frame in the two-dimensional video frame sequence to be processed to generate a target image; processing the target depth map and the target image corresponding to each adjacent video frame based on the position and a preset parallax relation to generate a target viewpoint image comprising the foreground text; and outputting a three-dimensional video based on the target viewpoint image.
Inventors
- CUI TENGHE
- WANG ZHIXIN
- JIANG XIAOTIAN
- SUN XUAN
Assignees
- BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-08-14
Claims (20)
- A video processing method, comprising: determining a target frame from a two-dimensional video frame sequence to be processed, wherein the target frame comprises a foreground text; performing depth estimation on the two-dimensional video frame sequence to be processed to obtain a target depth map corresponding to each video frame, wherein the depth information of the target depth map represents a fusion result of the depth information of the current video frame and the depth information of adjacent video frames; processing a target depth map and a mask image corresponding to each adjacent video frame of the target frame in the two-dimensional video frame sequence to be processed to generate a target image, wherein the target image comprises complete image information of the target frame, and the mask image is obtained by masking the target frame based on the position of the foreground text in the target frame; and processing the target depth map and the target image corresponding to each adjacent video frame of the target frame based on the position and a preset parallax relation, generating a target viewpoint image comprising the foreground text, and outputting a three-dimensional video based on the target viewpoint image.
- The method of claim 1, wherein the two-dimensional video frame sequence to be processed comprises I video frames, I being an integer greater than 1, and wherein determining the target frame from the two-dimensional video frame sequence to be processed comprises: performing foreground text recognition on the two-dimensional video frame sequence to be processed and determining a foreground text region where the foreground text is located; and determining the i-th video frame as the target frame in response to a difference between the text content of a target region in an adjacent video frame of the i-th video frame and the text content of the foreground text region in the i-th video frame being smaller than a first preset threshold, wherein the position of the target region in the adjacent video frame is the same as the position of the foreground text region in the i-th video frame, and i is an integer greater than 1 and less than or equal to I.
- The method of claim 2, further comprising: acquiring the positions of a plurality of characters in the foreground text region in the i-th video frame; determining a text direction of the plurality of characters by fitting the positions of the plurality of characters; and determining the i-th video frame as the target frame in response to the text direction being a preset direction.
- The method of claim 2 or 3, further comprising: detecting character attributes of a plurality of characters in the foreground text region in the i-th video frame; and determining the i-th video frame as the target frame in response to the character attributes being preset attributes.
- The method of any of claims 2-4, further comprising: determining the i-th video frame as the target frame in response to the position of the foreground text region in the i-th video frame falling within a preset position range.
- The method of any of claims 2-5, further comprising: performing speech recognition on the audio corresponding to the i-th video frame to obtain a speech recognition result; and determining the i-th video frame as the target frame in response to a difference between the speech recognition result and the text content in the foreground text region in the i-th video frame being smaller than the first preset threshold.
- The method of any one of claims 2-6, further comprising: binarizing the foreground text region based on the target depth map corresponding to the i-th video frame to obtain first depth information corresponding to the text and second depth information corresponding to the background; and determining the i-th video frame as the target frame in response to a difference between the first depth information and the second depth information being greater than a second preset threshold.
- The method of any of claims 3-7, further comprising: determining weights of a plurality of target parameters; and determining the target frame by weighting, according to the weights, the judgment results corresponding to the target parameters, wherein the target parameters comprise at least two of a text direction, a character attribute, a position of the foreground text region, a text content difference, and a depth difference between the text and the background.
- The method of any one of claims 1-8, wherein performing depth estimation on the two-dimensional video frame sequence to be processed to generate the target depth map corresponding to each video frame comprises: performing depth estimation on each video frame to obtain an initial depth map corresponding to each video frame; and processing, based on an attention mechanism, the initial depth map corresponding to the i-th video frame, the initial depth map corresponding to the (i-1)-th video frame, and the initial depth map corresponding to the (i+1)-th video frame to generate the target depth map corresponding to the i-th video frame, wherein i is an integer greater than 1 and less than or equal to I.
- The method of claim 9, wherein performing depth estimation on each video frame to obtain the initial depth map corresponding to each video frame comprises: compressing each video frame to obtain an image of a preset resolution; and processing image features of the image of the preset resolution with a target network to generate the initial depth map, wherein the target network is trained using the depth information of each pixel in a sample image of the preset resolution as a label.
- The method of claim 9 or 10, wherein processing, based on the attention mechanism, the initial depth map corresponding to the i-th video frame, the initial depth map corresponding to the (i-1)-th video frame, and the initial depth map corresponding to the (i+1)-th video frame to generate the target depth map corresponding to the i-th video frame comprises: respectively performing feature extraction on the initial depth map corresponding to the i-th video frame, the initial depth map corresponding to the (i-1)-th video frame, and the initial depth map corresponding to the (i+1)-th video frame to generate a first intermediate feature map corresponding to each initial depth map; and fusing the respective first intermediate feature maps based on the attention mechanism to generate the target depth map corresponding to the i-th video frame.
- The method of any one of claims 1-11, wherein processing the target depth map and the mask image corresponding to each adjacent video frame of the target frame in the two-dimensional video frame sequence to be processed to generate the target image comprises: respectively performing feature extraction on the target depth map and the mask image corresponding to each adjacent video frame to generate second intermediate feature maps corresponding to each adjacent video frame; concatenating the respective second intermediate feature maps to obtain a third intermediate feature map; processing the third intermediate feature map based on an attention mechanism to generate first difference information between the target image and the mask image; and processing the mask image based on the first difference information to generate the target image.
- The method of claim 12, wherein processing the third intermediate feature map based on the attention mechanism to generate the first difference information between the target image and the mask image comprises: dividing the third intermediate feature map to obtain a plurality of feature blocks; processing the plurality of feature blocks based on the attention mechanism to generate a plurality of local features corresponding to the plurality of feature blocks; concatenating the local features to obtain a global fusion feature; and processing the global fusion feature to generate the first difference information.
- The method of any of claims 1-13, wherein processing the target depth map and the target image corresponding to each adjacent video frame based on the position and the preset parallax relation to generate the target viewpoint image comprising the foreground text comprises: processing the target depth map and the target image corresponding to each adjacent video frame to generate an initial viewpoint image, wherein the initial viewpoint image and the target viewpoint image have the same viewpoint direction; and filling the foreground text into the initial viewpoint image based on the position and the preset parallax relation to generate the target viewpoint image.
- The method of claim 14, wherein processing the target depth map and the target image corresponding to each adjacent video frame to generate the initial viewpoint image comprises: respectively performing feature extraction on the target depth map and the target image corresponding to each adjacent video frame to generate fourth intermediate feature maps corresponding to each adjacent video frame; concatenating the respective fourth intermediate feature maps to obtain a fifth intermediate feature map; processing the fifth intermediate feature map based on an attention mechanism to generate second difference information between the initial viewpoint image and the target image; and processing the target image based on the second difference information to generate the initial viewpoint image.
- The method of claim 14 or 15, wherein filling the foreground text into the initial viewpoint image based on the position and the preset parallax relation to generate the target viewpoint image comprises: determining depth information of the foreground text from the target depth map corresponding to the target frame based on the position; and filling the foreground text into the initial viewpoint image according to the depth information of the foreground text, the position, and the preset parallax relation to generate the target viewpoint image.
- The method of claim 16, wherein filling the foreground text into the initial viewpoint image according to the depth information of the foreground text, the position, and the preset parallax relation to generate the target viewpoint image comprises: determining a target position of the foreground text in the target viewpoint image based on the depth information of the foreground text, the position, and the preset parallax relation; and filling the foreground text into the initial viewpoint image based on the target position to generate the target viewpoint image.
- A video processing apparatus, comprising: a determining module configured to determine a target frame from a two-dimensional video frame sequence to be processed, wherein the target frame comprises a foreground text; a depth estimation module configured to perform depth estimation on the two-dimensional video frame sequence to be processed to obtain a target depth map corresponding to each video frame, wherein the depth information of the target depth map represents a fusion result of the depth information of the current video frame and the depth information of adjacent video frames; a processing module configured to process a target depth map and a mask image corresponding to each adjacent video frame of the target frame in the two-dimensional video frame sequence to be processed to generate a target image, wherein the target image comprises complete image information of the target frame, and the mask image is obtained by masking the target frame based on the position of the foreground text in the target frame; and a generation module configured to process the target depth map and the target image corresponding to each adjacent video frame based on the position and a preset parallax relation, generate a target viewpoint image comprising the foreground text, and output a three-dimensional video based on the target viewpoint image.
- An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-17.
- A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-17.
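For orientation, the parallax-based viewpoint synthesis that claims 1 and 14-17 describe amounts to shifting pixels horizontally by a disparity derived from depth. The sketch below is a hypothetical minimal illustration of that idea, not the patented implementation: the `max_disparity` parameter, the nearest-pixel warp, and the normalized depth convention (0 = far, 1 = near) are all assumptions.

```python
import numpy as np

def shift_by_disparity(image, depth, max_disparity=4):
    """Warp an image toward a new viewpoint: each pixel moves
    horizontally by a disparity proportional to its nearness
    (depth in [0, 1], 0 = far, 1 = near). Holes left behind stay
    zero; in the method they would be filled by the inpainted
    target image of claim 12."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    disparity = np.round(max_disparity * depth).astype(int)  # near pixels shift more
    for y in range(h):
        for x in range(w):
            nx = x + disparity[y, x]
            if 0 <= nx < w:
                out[y, nx] = image[y, x]
    return out

# A far plane (depth 0) is unchanged; a near plane (depth 1) shifts right.
img = np.arange(16.0).reshape(4, 4)
far = shift_by_disparity(img, np.zeros((4, 4)), max_disparity=2)
near = shift_by_disparity(img, np.ones((4, 4)), max_disparity=2)
```

Because the foreground text is the nearest layer, filling it back in at a position offset by its own disparity (claims 16-17) keeps it floating in front of the re-rendered background.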
Description
Video processing method, apparatus, electronic device, storage medium, and program product

Technical Field

The present disclosure relates to the field of video processing and artificial intelligence, in particular to the fields of computer vision and deep learning, and specifically to a video processing method, apparatus, electronic device, storage medium, and program product.

Background

A 3D (three-dimensional) video is synthesized by three-dimensional software: a scene and a model are constructed in a virtual three-dimensional world in a computer according to the real size of the object to be represented, vivid materials and lighting are created, and camera motion is configured. A 3D video can bring the user a more immersive visual experience than a 2D (two-dimensional) video. It should be noted that the information disclosed in this Background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure provides a video processing method, apparatus, electronic device, storage medium, and program product.
According to a first aspect, the present disclosure provides a video processing method, comprising: determining a target frame from a two-dimensional video frame sequence to be processed, wherein the target frame comprises a foreground text; performing depth estimation on the two-dimensional video frame sequence to be processed to obtain a target depth map corresponding to each video frame, wherein the depth information of the target depth map represents a fusion result of the depth information of the current video frame and the depth information of adjacent video frames; processing the target depth map and a mask image corresponding to each adjacent video frame of the target frame in the two-dimensional video frame sequence to be processed to generate a target image, wherein the target image comprises complete image information of the target frame, and the mask image is obtained by masking the target frame based on the position of the foreground text in the target frame; and processing the target depth map and the target image corresponding to each adjacent video frame based on the position and a preset parallax relation, generating a target viewpoint image comprising the foreground text, and outputting a three-dimensional video based on the target viewpoint image. According to a second aspect, the present disclosure provides a video processing apparatus including a determining module, a depth estimation module, a processing module, and a generation module. The determining module is configured to determine a target frame from the two-dimensional video frame sequence to be processed, wherein the target frame comprises a foreground text.
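The depth fusion in the first aspect combines each frame's depth with that of its temporal neighbors to stabilize the estimate. The disclosure performs this fusion with an attention mechanism; as a hypothetical simplification, the same idea can be shown with fixed averaging weights (the `w_neighbor` weight is an assumption for illustration only):

```python
import numpy as np

def fuse_depth(prev_d, cur_d, next_d, w_neighbor=0.25):
    """Fuse a frame's initial depth map with those of its two
    temporal neighbors. Fixed weights stand in for the learned
    attention-based fusion described in the disclosure."""
    return (1.0 - 2.0 * w_neighbor) * cur_d + w_neighbor * (prev_d + next_d)

# A depth spike in one frame is damped toward its neighbors,
# suppressing frame-to-frame flicker in the 3D output.
prev_d = np.full((2, 2), 0.4)
cur_d = np.full((2, 2), 0.8)   # outlier estimate in the current frame
next_d = np.full((2, 2), 0.4)
fused = fuse_depth(prev_d, cur_d, next_d)
```

With attention in place of fixed weights, the contribution of each neighbor would instead be computed per pixel from feature similarity, as claims 9 and 11 describe.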
The depth estimation module is configured to perform depth estimation on the two-dimensional video frame sequence to be processed to obtain a target depth map corresponding to each video frame, wherein the depth information of the target depth map represents a fusion result of the depth information of the current video frame and the depth information of adjacent video frames. The processing module is configured to process the target depth map and the mask image corresponding to each adjacent video frame of the target frame in the two-dimensional video frame sequence to be processed to generate a target image, wherein the target image comprises complete image information of the target frame, and the mask image is obtained by masking the target frame based on the position of the foreground text in the target frame. The generation module is configured to process the target depth map and the target image corresponding to each adjacent video frame based on the position and a preset parallax relation, generate a target viewpoint image comprising the foreground text, and output a three-dimensional video based on the target viewpoint image. According to a third aspect, the present disclosure provides an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above. According to a fourth aspect, the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described above.
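The mask image consumed by the processing module blanks out the foreground text region of the target frame so that the occluded background can be inpainted. A minimal sketch of that masking step follows; representing the text position as a single rectangular box `(x0, y0, x1, y1)` is an assumption made for illustration:

```python
import numpy as np

def make_mask_image(frame, text_box):
    """Mask the target frame at the foreground text position.
    text_box = (x0, y0, x1, y1) bounds of the detected text region.
    The zeroed region is what the attention-based inpainting of
    claim 12 later restores from neighboring frames."""
    x0, y0, x1, y1 = text_box
    masked = frame.copy()
    masked[y0:y1, x0:x1] = 0
    return masked

frame = np.ones((6, 6))
masked = make_mask_image(frame, (1, 2, 4, 5))
```

Separating text removal (masking plus inpainting) from text re-insertion at a disparity-corrected position is what lets the method render the text at its own depth layer in the final viewpoint image.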
According to a fifth aspect, the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the method described above. It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure.