CN-116634223-B - Subtitle extraction method based on video text merging, filtering and classifying

CN116634223B

Abstract

The invention discloses a subtitle extraction method based on merging, filtering and classifying video text. The method comprises: extracting frames from a video; recognizing the text in every video frame with optical character recognition to obtain a set of text boxes; merging and filtering the text box set according to text content, text box coordinates, text appearance time and the like; predicting whether each remaining text box is a subtitle with a machine-learning subtitle classification model; and storing the text and position information of the boxes judged to be subtitles as the subtitle information of the video. By merging and filtering the text boxes, the method preliminarily discards most text that does not belong to the subtitle category, and the machine-learning subtitle classification model then determines the category of each remaining text box. The method requires no predefined subtitle region and therefore handles the highly variable subtitle positions of existing Internet video.
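As a rough illustration only (not part of the patent text), the merge-filter-classify pipeline described above might be organized as follows in Python; the `TextBox` structure and the `merge`, `keep` and `is_subtitle` callables are all hypothetical placeholders for the patent's merging rules, filtering thresholds and trained classifier:

```python
from dataclasses import dataclass, field

@dataclass
class TextBox:
    """A detected text box (hypothetical structure for illustration)."""
    text: str
    vertices: list                                   # four (x, y) vertex coordinates
    timestamps: list = field(default_factory=list)   # frame times where the box appears

def extract_subtitles(boxes, merge, keep, is_subtitle):
    """Merge, then filter, then classify text boxes; return those judged subtitles."""
    merged = merge(boxes)                            # merge within and across frames
    filtered = [b for b in merged if keep(b)]        # rule-based pre-filtering
    return [b for b in filtered if is_subtitle(b)]   # ML classifier decides the rest
```

With stub callables this simply chains the three stages; in the patent's method, `merge` would implement the coordinate-based judging rules, `keep` the duration/offset/inclination/character-count filters, and `is_subtitle` the trained classification model.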

Inventors

  • Jia Biwei
  • Fang Pengzhan

Assignees

  • 焦点科技股份有限公司 (Focus Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2023-05-22

Claims (4)

  1. A subtitle extraction method based on video text merging, filtering and classifying, characterized by comprising the following steps:

     Step 1: extract frames from the video at preset time intervals and perform text detection and recognition on the frame images: detect the text in every frame image with optical character recognition to form text boxes, and compute multi-dimensional features from the text box information and the video time axis, the features comprising the text content, vertex coordinates and appearance time of each text box, so as to obtain a first text box set.

     Step 2: for the text boxes in the first text box set, merge text using the text box information within each single frame image and across a plurality of consecutive frame images to obtain a second text box set; update the vertex coordinates of the merged text boxes and generate a duration list from their occurrence times. For the text boxes of all frames in a single video, compute the text similarity and the text box merge ratio between adjacent frames, and merge the text boxes that meet a preset merge-ratio threshold.

     Step 3: filter the text boxes in the second text box set according to preset conditions using their multi-dimensional features, the filtering comprising: deleting from the second text box set, according to the duration list, the text boxes whose duration exceeds a preset duration threshold; deleting the text boxes whose maximum vertex-coordinate offset exceeds a preset offset threshold; computing the inclination angle of each text box from its vertex coordinates and deleting the text boxes whose inclination angle exceeds a preset inclination threshold; and, given a preset character-count threshold, deleting the text boxes that do not meet it.

     Step 4: for the multi-dimensional features of the text boxes in the filtered second text box set, extract a feature vector for each text box from its time-domain and space-domain information for training a machine learning algorithm. The classification features of each text box are determined as follows: compute the median of the per-character durations over all text boxes in the second text box set, and take the absolute difference between each box's per-character duration and that median as feature I; compute the median of the character heights of all text boxes in the second text box set, normalize it by the video pixel height, and take the absolute difference between each box's character height and that median as feature II; compute a heat map of the text regions from the coordinate positions and durations of the text boxes, obtain from it the mean heat of the region covered by each text box, compute the median of these means over all text boxes in the second text box set, and take the absolute difference between each box's mean heat and that whole-video median as feature III; combine features I, II and III into the feature vector of each text box.

     Step 5: input the feature vectors of the second text box set into a machine-learning subtitle classification model for training and for judging whether each text box is a subtitle.

     Step 6: take the text boxes judged to be subtitles as a third text box set, and store the third text box set as the subtitle information of the video.
  2. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 1, wherein in step 1 the vertex coordinates comprise x-axis and y-axis coordinates, and in step 2 obtaining the second text box set comprises: for more than one text box with the same appearance time, processing the text boxes from top to bottom in order of their y-axis coordinates; computing the midpoint coordinates of each text box's height and width from its vertex coordinates; and merging the different text boxes that conform to a preset judging rule based on these midpoint coordinates, the judging rule being that: the absolute difference between the heights of the current text box and the text box to be merged (first difference) is smaller than a preset first pixel value; the absolute difference between the maximum y-axis coordinate of the current text box and the minimum y-axis coordinate of the text box to be merged (second difference) is smaller than a preset second pixel value; and the absolute difference between the width midpoints of the current text box and the text box to be merged (third difference) is smaller than a preset third pixel value. After merging, the text content and vertex coordinates of the new text box are updated, and the height of the original text box is kept as the character height of the merged text.
  3. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 2, wherein in step 2 the first pixel value is 6 pixels, the second pixel value is 20 pixels, and the third pixel value is 80 pixels.
  4. The method for extracting subtitles based on video text merging, filtering and classifying according to claim 3, wherein in step 5 the subtitle classification model is constructed as follows: collect a plurality of videos for training, and extract classification features from the text boxes in the second text box set through the processing of steps 1-4, taking the three-dimensional feature vector of each text box as a training sample; label whether each training sample is a subtitle; and input the training samples, with the labels as expected outputs, to train a subtitle classification model that judges the subtitles in a video.
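The judging rule of claims 2-3 can be sketched as a predicate over two boxes; this is an illustrative reading only, assuming each box is an axis-aligned `(x_min, y_min, x_max, y_max)` tuple in pixels with the current box above the candidate:

```python
def can_merge(cur, cand,
              max_height_diff=6,    # first pixel value (claim 3)
              max_y_gap=20,         # second pixel value (claim 3)
              max_xmid_diff=80):    # third pixel value (claim 3)
    """Judging rule of claims 2-3 (sketch; the box tuple format is an assumption)."""
    cur_h = cur[3] - cur[1]
    cand_h = cand[3] - cand[1]
    # first difference: the two character heights must be close
    if abs(cur_h - cand_h) >= max_height_diff:
        return False
    # second difference: the gap between cur's bottom edge and cand's top edge must be small
    if abs(cur[3] - cand[1]) >= max_y_gap:
        return False
    # third difference: the horizontal midpoints of the two boxes must be close
    cur_mid = (cur[0] + cur[2]) / 2
    cand_mid = (cand[0] + cand[2]) / 2
    return abs(cur_mid - cand_mid) < max_xmid_diff
```

For example, two 30-pixel-high boxes stacked 5 pixels apart with near-identical horizontal centers satisfy all three thresholds and would be merged, while a box far below or far to the side would not.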

Description

Subtitle extraction method based on video text merging, filtering and classifying

Technical Field

The invention relates to the technical field of computers, in particular to computer vision and machine learning, and more particularly to a subtitle extraction method based on video text merging, filtering and classifying.

Background

With the growth of self-media and electronic commerce, short video creation has increased dramatically. Extracting the subtitle information in a video makes it usable together with large semantic-understanding models, adding a text dimension that assists video understanding on top of image and video picture understanding. The subtitle information also serves many other scenarios, such as multi-language translation for transnational distribution, or identifying subtitle-free videos for subtitle generation. Subtitles come in different kinds, such as dominant subtitles, captions and dialogue subtitles. In existing subtitle extraction technology, the common approach is to first set a subtitle region and then extract the characters in that specific region with optical character recognition; this approach suits videos whose subtitle positions follow certain rules, such as film and television programs. There are also multi-modal subtitle extraction models based on speech recognition and optical character recognition; these mainly extract dialogue subtitles, and although they can effectively improve the accuracy of the subtitle text, they are hard to apply to subtitle types without accompanying speech.
Subtitle positions in short videos are highly variable, and no general template can extract the subtitle information, so how to extract multiple types of subtitles with a general method is a problem to be solved.

Disclosure of the Invention

The invention aims to overcome the defects of the prior art, to solve the specificity and generality problems of a general subtitle extraction method in the short video field, and to provide a subtitle extraction method based on video text merging, filtering and classifying. To solve these technical problems, the invention provides a subtitle extraction method based on video text merging, filtering and classifying, comprising the following steps:

Step 1: extract frames from the video at preset time intervals and perform text detection and recognition on the frame images: detect the text in every frame image with Optical Character Recognition (OCR) to form text boxes, and compute multi-dimensional features from the text box information and the video time axis, the features comprising the text content, vertex coordinates and appearance time of each text box, so as to obtain a first text box set.

Step 2: for the text boxes in the first text box set, merge text using the text box information within each single frame image and across a plurality of consecutive frame images to obtain a second text box set; update the vertex coordinates of the merged text boxes and generate a duration list from their occurrence times.

Step 3: filter the text boxes in the second text box set according to preset conditions using their multi-dimensional features.

Step 4: for the multi-dimensional features of the text boxes in the filtered second text box set, extract a feature vector for each text box from its time-domain and space-domain information for training a machine learning algorithm.

Step 5: input the feature vectors of the second text box set into a machine-learning subtitle classification model for training and for judging whether each text box is a subtitle.

Step 6: take the text boxes judged to be subtitles as a third text box set, and store the third text box set as the subtitle information of the video.

In step 1, the vertex coordinates comprise x-axis and y-axis coordinates, and in step 2, obtaining the second text box set comprises: Step 2-1, for more than one text box with the same appearance time, process the text boxes from top to bottom in order of their y-axis coordinates, compute the midpoint coordinates of each text box's height and width from its vertex coordinates, merge the different text boxes that conform to a preset judging rule based on these midpoint coordinates, update the text content and vertex coordinates of the merged new text box, and keep the height of the original text box as the character height of the merged text.
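The three classification features of Step 4 could be computed along these lines. This is a minimal Python sketch, not the patent's implementation: the dict field names are assumptions, the per-box heat value is assumed to be precomputed from the heat map, and the normalization of character height by video pixel height is omitted for brevity:

```python
from statistics import median

def classification_features(boxes):
    """
    Per-box feature vector as described in Step 4 (sketch; field names assumed).
    Each box is a dict with 'char_duration' (seconds per character),
    'char_height' (pixels) and 'heat' (mean heat-map value under the box).
    """
    dur_med = median(b["char_duration"] for b in boxes)
    h_med = median(b["char_height"] for b in boxes)
    heat_med = median(b["heat"] for b in boxes)
    return [
        (abs(b["char_duration"] - dur_med),   # feature I: deviation from median duration
         abs(b["char_height"] - h_med),       # feature II: deviation from median height
         abs(b["heat"] - heat_med))           # feature III: deviation from median heat
        for b in boxes
    ]
```

The intuition matches the patent's description: subtitles tend to have typical per-character durations, character heights and screen positions, so a box whose three deviations are all small is a likely subtitle, and the resulting three-dimensional vectors are what the classification model of Step 5 is trained on.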