CN-121864926-B - Subtitle eliminating method and device, electronic equipment and storage medium

CN121864926B

Abstract

The invention provides a subtitle elimination method and device, an electronic apparatus, and a storage medium in the field of video processing. The method determines the subtitle region in each video image frame, extracts features from the subtitle-region image to obtain subtitle-region features, and determines a shot-switch identifier for each frame from the image difference between adjacent subtitle-region images. It then dynamically determines a fusion weight for each video image frame according to the shot-switch identifier, fuses the subtitle-region features with motion-compensation features extracted from the video image frames according to this weight to obtain fused features, and applies an attention mechanism across the fused features of adjacent frames to obtain temporal fusion features. Finally, the temporal fusion features are decoded to obtain initial repair features, the subtitle region in the initial repair features is corrected using the subtitle-region features, and the corrected initial repair features are used to generate repaired video image frames with the subtitles removed, improving subtitle-removal quality.

Inventors

  • YU YI
  • GAN WEIHAO

Assignees

  • Malanshan Audio & Video Laboratory (马栏山音视频实验室)

Dates

Publication Date
2026-05-12
Application Date
2026-03-19

Claims (11)

  1. A subtitle elimination method, comprising: acquiring video image frames and corresponding subtitle masks; performing feature extraction, optical flow estimation, and motion compensation on the video image frames and subtitle masks to obtain motion-compensation features; determining a subtitle region in the video image frame using the subtitle mask, extracting features from the subtitle-region image to obtain subtitle-region features, and determining a shot-switch identifier for the video image frame from the image difference between adjacent subtitle-region images, wherein the shot-switch identifier indicates the presence or absence of a shot switch; determining a fusion weight for the video image frame according to the shot-switch identifier, fusing the motion-compensation features and the subtitle-region features according to the fusion weight to obtain fused features, and applying an attention mechanism to the fused features of adjacent video image frames to obtain temporal fusion features; and decoding the temporal fusion features to obtain initial repair features, correcting the subtitle region in the initial repair features using the subtitle-region features, and generating a repaired video image frame with the subtitles removed from the corrected initial repair features.
  2. The subtitle elimination method of claim 1, wherein determining the subtitle region in the video image frame using the subtitle mask comprises: determining the minimum bounding rectangle of the subtitle mask and its position coordinates in the video image frame, wherein the position coordinates comprise a center coordinate, an upper-left corner coordinate, and an upper-right corner coordinate; computing the standard deviation of the center coordinates over a preset number of consecutive video image frames and, if the standard deviation is below a preset value, determining the mean center coordinate, mean width, and mean height over the consecutive frames from their position coordinates; and determining the subtitle region from the mean center coordinate, mean width, and mean height.
  3. The subtitle elimination method of claim 1, wherein extracting features from the subtitle-region image to obtain the subtitle-region features comprises: extracting features from the subtitle-region image with a convolutional layer to obtain a global feature vector; and determining the mean value of each color channel in the subtitle-region image and concatenating these color means with the global feature vector to obtain the subtitle-region features.
  4. The subtitle elimination method of claim 1, wherein determining the shot-switch identifier for the video image frame from the image difference between adjacent subtitle-region images comprises: determining a structural similarity value and a color-histogram difference value between adjacent subtitle-region images, and determining the mean optical-flow magnitude within the subtitle region from the motion-compensation features; judging whether the structural similarity value, the color-histogram difference value, and the mean optical-flow magnitude satisfy a preset shot-cut condition, wherein the preset shot-cut condition requires that the structural similarity value is below a first preset value, the color-histogram difference value is above a second preset value, and the mean optical-flow magnitude is below a third preset value; if so, setting for the current video image frame a shot-switch identifier indicating that a shot switch is present; and if not, setting for the current video image frame a shot-switch identifier indicating that no shot switch is present.
  5. The subtitle elimination method of claim 1, wherein determining the fusion weight of the video image frame according to the shot-switch identifier comprises: acquiring the shot-switch identifier and frame number of the current video frame, determining from the shot-switch identifiers of the video image frames the frame number of the last shot switch before the current frame, and determining the interval between the current frame number and that historical frame number; if the current shot-switch identifier indicates no shot switch and the interval is outside a preset interval range, setting the fusion weight to a first weight value, the first weight value being preset; if the current shot-switch identifier indicates a shot switch, setting the fusion weight to a second weight value, the second weight value being preset and smaller than the first weight value; and if the current shot-switch identifier indicates no shot switch and the interval is within the preset interval range, setting the fusion weight, using a preset linear function of the interval, to a third weight value between the second weight value and the first weight value.
  6. The subtitle elimination method of claim 1, wherein fusing the motion-compensation features and the subtitle-region features according to the fusion weight to obtain the fused features comprises: determining the average subtitle-region features over a preset number of consecutive video image frames; fusing, according to the fusion weight, the average subtitle-region features into the subtitle region of the motion-compensation features to obtain the fused features; and applying a gain coefficient to the subtitle region of the fused features.
  7. The subtitle elimination method of claim 1, wherein correcting the subtitle region in the initial repair features using the subtitle-region features comprises: determining the average subtitle-region features over a preset number of consecutive video image frames, and extracting the features to be compared, namely the features located in the subtitle region of the initial repair features; determining the feature difference between the average subtitle-region features and the features to be compared, and judging whether the feature difference exceeds a preset difference threshold; if so, weighting and fusing the average subtitle-region features into the features to be compared to correct the subtitle region in the initial repair features; and if not, retaining the features to be compared unchanged.
  8. The subtitle elimination method of claim 1, wherein performing feature extraction, optical flow estimation, and motion compensation on the video image frames and subtitle masks to obtain the motion-compensation features comprises: extracting features from the video image frames and subtitle masks to obtain feature maps; performing optical flow estimation on the feature maps of the current, previous, and next video image frames to obtain a bidirectional optical-flow field; and spatially aligning the feature maps of the previous and next video image frames using the bidirectional optical-flow field to obtain the motion-compensation features.
  9. A subtitle elimination device, comprising: an acquisition module for acquiring video image frames and corresponding subtitle masks; a motion-compensation feature-extraction module for performing feature extraction, optical flow estimation, and motion compensation on the video image frames and subtitle masks to obtain motion-compensation features; a fixed-region temporal-enhancement module for determining a subtitle region in the video image frame using the subtitle mask, extracting features from the subtitle-region image to obtain subtitle-region features, and determining a shot-switch identifier for the video image frame from the image difference between adjacent subtitle-region images, wherein the shot-switch identifier indicates the presence or absence of a shot switch; a feature-propagation module for determining the fusion weight of the video image frame according to the shot-switch identifier, fusing the motion-compensation features and the subtitle-region features according to the fusion weight to obtain fused features, and applying an attention mechanism to the fused features of adjacent video image frames to obtain temporal fusion features; and a repair-generation module for decoding the temporal fusion features to obtain initial repair features, correcting the subtitle region in the initial repair features using the subtitle-region features, and generating repaired video image frames with the subtitles removed from the corrected initial repair features.
  10. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the subtitle elimination method of any one of claims 1 to 8 when executing the computer program.
  11. A computer-readable storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the subtitle elimination method of any one of claims 1 to 8.
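The shot-adaptive fusion weight of claim 5 can be sketched as a simple piecewise-linear schedule. This is an illustrative reading of the claim, not the patent's implementation: the concrete weight values, ramp length, and the function name `fusion_weight` are all assumptions.

```python
def fusion_weight(shot_switch: bool, frames_since_switch: int,
                  w_high: float = 0.8, w_low: float = 0.2,
                  ramp_len: int = 10) -> float:
    """Hypothetical shot-adaptive fusion weight, after claim 5.

    - At a detected shot switch the weight drops to w_low (the second
      weight value), so stale subtitle-region features contribute little.
    - For the first ramp_len frames after a switch (the preset interval
      range), the weight ramps linearly from w_low back to w_high
      (the preset linear function of the interval value).
    - Otherwise the weight stays at w_high (the first weight value).
    """
    if shot_switch:
        return w_low
    if 0 <= frames_since_switch < ramp_len:
        return w_low + (w_high - w_low) * frames_since_switch / ramp_len
    return w_high
```

With the illustrative defaults, a frame five frames after a cut gets a weight of 0.5, halfway between the two preset values.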

Description

Subtitle eliminating method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of video processing, and in particular to a subtitle elimination method, apparatus, electronic device, and storage medium.

Background

With the growth of video content creation, high-quality subtitle erasure has become a requirement in industries such as film and television post-production, online education, and short-video production. In the related art, however, ghosting and flicker readily appear after subtitles are erased, and these problems are particularly pronounced at shot cuts.

Disclosure of Invention

The invention aims to provide a subtitle elimination method and device, an electronic apparatus, and a storage medium that reduce the ghosting and flicker caused by shot switches when removing subtitles from video images, improving the subtitle-removal result.
To solve the above technical problems, the present invention provides a subtitle elimination method, comprising: acquiring video image frames and corresponding subtitle masks; performing feature extraction, optical flow estimation, and motion compensation on the video image frames and subtitle masks to obtain motion-compensation features; determining a subtitle region in the video image frame using the subtitle mask, extracting features from the subtitle-region image to obtain subtitle-region features, and determining a shot-switch identifier for the video image frame from the image difference between adjacent subtitle-region images, wherein the shot-switch identifier indicates the presence or absence of a shot switch; determining a fusion weight for the video image frame according to the shot-switch identifier, fusing the motion-compensation features and the subtitle-region features according to the fusion weight to obtain fused features, and applying an attention mechanism to the fused features of adjacent video image frames to obtain temporal fusion features; and decoding the temporal fusion features to obtain initial repair features, correcting the subtitle region in the initial repair features using the subtitle-region features, and generating a repaired video image frame with the subtitles removed from the corrected initial repair features.
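The weighted fusion step described above (blending averaged subtitle-region features into the masked region of the motion-compensated features, then applying a gain) can be sketched with NumPy. The tensor shapes, the function name `fuse_region_features`, and the default gain are illustrative assumptions, not details from the patent.

```python
import numpy as np

def fuse_region_features(motion_feat, region_feat_avg, mask, weight, gain=1.1):
    """Hypothetical sketch of the fusion step.

    motion_feat:     motion-compensation features, shape (C, H, W)
    region_feat_avg: averaged subtitle-region feature vector, shape (C,)
    mask:            boolean subtitle-region mask, shape (H, W)
    weight:          shot-adaptive fusion weight in [0, 1]
    gain:            gain coefficient applied to the subtitle region
    """
    fused = motion_feat.copy()
    # Weighted blend inside the subtitle region only; pixels outside the
    # mask keep the motion-compensated values unchanged.
    fused[:, mask] = ((1.0 - weight) * motion_feat[:, mask]
                      + weight * region_feat_avg[:, None])
    # Gain processing on the subtitle region of the fused features.
    fused[:, mask] *= gain
    return fused
```

With `weight=0` the region is pure motion-compensated content (useful right after a shot cut); with `weight=1` it is the temporal average of the subtitle region.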
Optionally, determining the subtitle region in the video image frame using the subtitle mask includes: determining the minimum bounding rectangle of the subtitle mask and its position coordinates in the video image frame, wherein the position coordinates comprise a center coordinate, an upper-left corner coordinate, and an upper-right corner coordinate; computing the standard deviation of the center coordinates over a preset number of consecutive video image frames and, if the standard deviation is below a preset value, determining the mean center coordinate, mean width, and mean height over the consecutive frames from their position coordinates; and determining the subtitle region from the mean center coordinate, mean width, and mean height. Optionally, extracting features from the subtitle-region image to obtain the subtitle-region features includes: extracting features from the subtitle-region image with a convolutional layer to obtain a global feature vector; and determining the mean value of each color channel in the subtitle-region image and concatenating these color means with the global feature vector to obtain the subtitle-region features.
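The center-stability check above can be sketched in pure Python. The `(cx, cy, w, h)` box format, the window contents, the threshold value, and the function name are illustrative assumptions; the patent only specifies the logic (stable centers over a window, then averaging).

```python
import statistics

def stable_subtitle_region(boxes, std_thresh=2.0):
    """Hypothetical sketch of the fixed-region estimate.

    boxes: per-frame minimum bounding rectangles of the subtitle mask,
           each as (cx, cy, w, h) for one of the consecutive frames.
    Returns the averaged (cx, cy, w, h) region if the center coordinates
    are stable (std-dev below std_thresh), else None.
    """
    cxs = [b[0] for b in boxes]
    cys = [b[1] for b in boxes]
    # Population std-dev of the centers over the window.
    if statistics.pstdev(cxs) >= std_thresh or statistics.pstdev(cys) >= std_thresh:
        return None
    return (statistics.fmean(cxs),
            statistics.fmean(cys),
            statistics.fmean(b[2] for b in boxes),
            statistics.fmean(b[3] for b in boxes))
```

Fixing the region this way lets the later stages treat the subtitle band as a single static area rather than re-estimating it per frame.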
Optionally, determining the shot-switch identifier for the video image frame from the image difference between adjacent subtitle-region images includes: determining a structural similarity value and a color-histogram difference value between adjacent subtitle-region images, and determining the mean optical-flow magnitude within the subtitle region from the motion-compensation features; judging whether the structural similarity value, the color-histogram difference value, and the mean optical-flow magnitude satisfy a preset shot-cut condition, wherein the preset shot-cut condition requires that the structural similarity value is below a first preset value, the color-histogram difference value is above a second preset value, and the mean optical-flow magnitude is below a third preset value; if so, setting for the current video image frame a shot-switch identifier indicating that a shot switch is present; and if not, setting for the current video image frame a shot-switch identifier indicating that no shot switch is present.
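The three-condition shot-switch test can be sketched as follows. This is an illustrative reading, not the patent's implementation: the similarity term is a crude global luminance/contrast stand-in for full windowed SSIM, and the thresholds, bin count, and function name are assumptions.

```python
import numpy as np

def detect_shot_switch(prev_region, cur_region, flow_mag,
                       sim_thresh=0.5, hist_thresh=0.3, flow_thresh=1.0):
    """Hypothetical shot-switch test over the subtitle region.

    prev_region, cur_region: grayscale crops as float arrays in [0, 1].
    flow_mag: per-pixel optical-flow magnitude within the region.
    """
    # Global similarity term (a simplified stand-in for SSIM).
    mu1, mu2 = prev_region.mean(), cur_region.mean()
    v1, v2 = prev_region.var(), cur_region.var()
    cov = ((prev_region - mu1) * (cur_region - mu2)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    sim = ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
          ((mu1 ** 2 + mu2 ** 2 + c1) * (v1 + v2 + c2))
    # Normalised 16-bin histogram L1 difference in [0, 1].
    h1, _ = np.histogram(prev_region, bins=16, range=(0, 1))
    h2, _ = np.histogram(cur_region, bins=16, range=(0, 1))
    hist_diff = np.abs(h1 / h1.sum() - h2 / h2.sum()).sum() / 2.0
    # All three preset conditions must hold: low similarity, large
    # histogram change, and little motion inside the region.
    return bool(
        sim < sim_thresh
        and hist_diff > hist_thresh
        and flow_mag.mean() < flow_thresh
    )
```

The low-flow condition distinguishes a hard cut (content changes abruptly without coherent motion) from fast camera or object motion, which also lowers similarity but produces large flow magnitudes.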