CN-118366143-B - High-precision video text tracking method and device based on topological structure feature association
Abstract
The invention discloses a video text tracking method based on topological-structure feature association. A text detector first generates high-recall text detection boxes for each video frame; feature matching between the text instances of consecutive frames is then performed with a three-stage association strategy; finally, trajectories containing the position and identity information of every text target in the video are produced. The detection results are divided into high-score and low-score boxes, which are matched against unpaired tracks in the first and second association stages, respectively. For tracks interrupted after the second stage, a local-search tracker uses the text features of the historical track to search locally for the missing text box at the breakpoint, and the locally searched texts are matched against the unpaired tracks in the third stage. The method is accurate and efficient, achieving state-of-the-art tracking precision and competitive efficiency on several mainstream video text tracking benchmarks. The invention also provides a corresponding video text tracking device based on topological-structure feature association.
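The three-stage association loop summarized in the abstract can be sketched as follows. This is an illustrative, simplified stand-in: the matching cost here is a plain center-distance with greedy assignment, whereas the patented method uses topology/appearance networks and the Hungarian algorithm. All names (`split_detections`, `greedy_match`, the dict keys) are hypothetical.

```python
# Hypothetical sketch of the three-stage association strategy: detections are
# split by confidence, then matched to tracks stage by stage.

def split_detections(dets, t_high=0.6, t_low=0.1):
    """Split detector output into high- and low-score boxes by confidence."""
    high = [d for d in dets if d["score"] > t_high]
    low = [d for d in dets if t_low < d["score"] <= t_high]
    return high, low

def center_dist(track, det):
    """Euclidean distance between a track's predicted center and a detection."""
    (tx, ty), (dx, dy) = track["center"], det["center"]
    return ((tx - dx) ** 2 + (ty - dy) ** 2) ** 0.5

def greedy_match(tracks, dets, max_cost=50.0):
    """One-to-one greedy assignment (stand-in for the Hungarian algorithm)."""
    pairs, used_t, used_d = [], set(), set()
    cands = sorted(
        (center_dist(t, d), i, j)
        for i, t in enumerate(tracks) for j, d in enumerate(dets)
    )
    for cost, i, j in cands:
        if cost > max_cost or i in used_t or j in used_d:
            continue
        used_t.add(i)
        used_d.add(j)
        pairs.append((tracks[i], dets[j]))
    rest_t = [t for i, t in enumerate(tracks) if i not in used_t]
    rest_d = [d for j, d in enumerate(dets) if j not in used_d]
    return pairs, rest_t, rest_d
```

In this sketch, stage one would match the tracks against `high`, stage two would match the leftover tracks against `low`, and stage three would match the still-unpaired tracks against boxes regressed by a local-search tracker.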
Inventors
- ZHOU XINGSHENG
- WANG XINGGANG
- LIU WENYU
Assignees
- HUAZHONG UNIVERSITY OF SCIENCE AND TECHNOLOGY (华中科技大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-04-23
Claims (3)
- 1. A video text tracking method based on topological-structure feature association, characterized by comprising the following steps:
(1) Training a text topological-structure representation network and a mutual-convolution matching network on a video text tracking training set, comprising the sub-steps:
(1.1) extracting the features of all text image blocks in pairs of preceding and following training frames using the backbone network of a text detector and a candidate-region alignment operator;
(1.2) training the text topological-structure representation network with the text image-block features generated in step (1.1), comprising the sub-steps:
(1.2.1) extracting the appearance feature embeddings corresponding to the text image-block features of the preceding and following frames with the text topological-structure representation network;
(1.2.2) for each text track of the preceding and following frames, determining an anchor text instance in the track, a positive text instance in the same track, and a negative text instance outside the track, whose track identities are id_a, id_p, id_n and whose appearance feature embeddings from step (1.2.1) are f_a, f_p, f_n, respectively;
(1.2.3) optimizing the parameters of the text topological-structure representation network with a triplet loss L_tri to learn text similarity, wherein the margin m is set to 1; the specific formula is:
L_tri = max( ||f_a - f_p||_2 - ||f_a - f_n||_2 + m, 0 );
(1.3) for the text annotation boxes of all training images, randomly sampling texts from adjacent frames to generate text annotation pairs;
(1.4) training the mutual-convolution matching network with the text annotation pairs generated in step (1.3), comprising the sub-steps:
(1.4.1) enlarging the text box areas in each annotation pair by factors of 5 and 2.5, respectively, to generate a background-expanded track text image and a detected text image;
(1.4.2) predicting the similarity between the track text image and the detected text image, and optimizing the parameters of the mutual-convolution matching network with a binary cross-entropy loss L_bce, wherein N denotes the number of input text image pairs, and y_i and p_i are respectively the label and the predicted similarity of the i-th pair; the specific formula is:
L_bce = -(1/N) Σ_{i=1}^{N} [ y_i log p_i + (1 - y_i) log(1 - p_i) ];
(2) Dividing the high-recall text detection boxes generated by a text detection model by confidence, and performing first-stage data-association matching between the high-score detection boxes and the historical tracks based on the text topological-structure representation network and the mutual-convolution matching network of step (1), comprising the sub-steps:
(2.1) generating a large number of text detection boxes for the current frame with the text detection model, and setting a higher threshold t_high and a lower threshold t_low, such that boxes with confidence greater than t_high are high-score detection boxes and boxes with confidence between t_low and t_high are low-score detection boxes;
(2.2) predicting the position of each historical track in the current frame with an exponential moving average algorithm, and computing the position-distance cost matrix between the historical tracks and the high-score detection boxes of step (2.1) based on intersection-over-union;
(2.3) computing the appearance-distance cost matrix between the high-score detection boxes of step (2.1) and the historical tracks with the text topological-structure representation network trained in step (1.2), and then combining it with the position-distance cost matrix of step (2.2) to generate an overall distance cost matrix;
(2.4) performing background-expanded similarity learning on the confusable texts of the historical tracks with the mutual-convolution matching network trained in step (1.4), and directly matching the text with the highest similarity;
(2.5) solving the screened overall distance cost matrix of step (2.3) with the Hungarian algorithm, establishing the optimal matching between all historical tracks and the high-score detection boxes;
(2.6) updating the historical tracks with the texts paired in step (2.5), and opening new tracks from the unpaired high-score detection boxes;
(3) Performing second-stage data-association matching between the low-score detection boxes and the historical tracks left unpaired in step (2.4), comprising the sub-steps:
(3.1) computing the overall distance cost matrix between the unpaired historical tracks of step (2.4) and the low-score detection boxes, following the algorithms of steps (2.2) and (2.3);
(3.2) solving the screened overall distance cost matrix of step (3.1) with the Hungarian algorithm, establishing the optimal matching between the unpaired historical tracks of step (2.4) and the low-score detection texts;
(3.3) updating the historical tracks with the texts paired in step (3.2), and discarding the unpaired low-score detection texts;
(4) Performing third-stage data-association matching between locally searched texts and the historical tracks left unpaired in step (3.2), comprising the sub-steps:
(4.1) predicting the position in the current frame of each historical track left unpaired in step (3.2) with an exponential moving average algorithm, and generating a missing-text search area;
(4.2) regressing a local-search text box inside the missing-text search area generated in step (4.1) with a local-search tracker;
(4.3) establishing the optimal matching between the unpaired historical tracks of step (3.2) and the local-search texts, following the algorithms of steps (3.1) and (3.2);
(4.4) updating or deleting the historical tracks according to the matching result of step (4.3), discarding the unpaired local-search texts, and continuing the inference flow of the next frame from step (2.1).
- 2. The video text tracking method based on topological-structure feature association according to claim 1, characterized in that in step (1.3), the random frame interval used when sampling adjacent frames of the training images is at most 10.
- 3. A video text tracking device based on topological-structure feature association, comprising at least one processor and a memory connected by a data bus, the memory storing instructions executable by the at least one processor which, when executed by the processor, perform the video text tracking method based on topological-structure feature association of any one of claims 1-2.
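The two losses of claim 1 (the triplet loss of step (1.2.3) with margin m = 1, and the binary cross-entropy of step (1.4.2)) can be written out directly. A minimal NumPy sketch, assuming Euclidean distance for the embedding comparison and probabilities already in (0, 1):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=1.0):
    """Triplet loss over anchor/positive/negative embeddings with margin 1,
    as in step (1.2.3): max(||f_a - f_p|| - ||f_a - f_n|| + m, 0)."""
    d_pos = np.linalg.norm(f_a - f_p)
    d_neg = np.linalg.norm(f_a - f_n)
    return max(d_pos - d_neg + margin, 0.0)

def bce_loss(labels, probs, eps=1e-7):
    """Binary cross-entropy averaged over N text pairs, as in step (1.4.2):
    -(1/N) sum [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = np.clip(probs, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))
```

Note how the margin of 1 makes the triplet loss vanish once the negative is farther from the anchor than the positive by at least one unit of embedding distance.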
Description
High-precision video text tracking method and device based on topological structure feature association

Technical Field

The invention belongs to the technical field of deep learning and computer vision, and particularly relates to a high-precision video text tracking method and device based on topological-structure feature association.

Background

Multi-target tracking is a classical research topic in computer vision, and video text tracking is an important branch of the multi-target tracking task. Video text tracking requires predicting the trajectories of all text objects in a video, including IDs representing the identity of each text object and quadrilateral detection boxes representing its location. The current mainstream multi-stage paradigm divides video text tracking into two parts: first, detecting all text instances of the current frame; second, performing data association between the historical tracks and the text instances of the current frame. Video text tracking is widely used in composite tasks such as landmark understanding, video retrieval, video content review, and smart cities. The main difficulties of the task are text blurring, deformation, and occlusion caused by various factors. These problems not only cause confusion between tracks at the data-association level, but also cause missed text detections and interrupted tracks at the detection level. Data association between consecutive frames relies mainly on the appearance and position of texts; however, in scenes with densely distributed text, such as supermarkets and warehouses, there may exist texts that are indistinguishable at the position and appearance level, which seriously interferes with data association in video text tracking.
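The position side of the data association described above amounts to comparing each track's predicted box against the current frame's detections by overlap. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form and a cost of 1 - IoU (the quadrilateral boxes of the actual method would need a polygon intersection instead):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def position_cost_matrix(track_boxes, det_boxes):
    """Cost = 1 - IoU between predicted track positions and detections;
    rows are tracks, columns are detections."""
    return np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
```

A matrix like this (combined with an appearance cost) is what an assignment solver such as the Hungarian algorithm then operates on.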
One key challenge of video text tracking is how to adequately characterize the structural features of a text image. Confusion caused by appearance during text association stems essentially from insufficiently accurate representations of the text image. The most salient visual feature of a text image is the left-to-right arrangement structure of language characters: a specific character ordering expresses a specific semantics, so current video text tracking methods mainly use RNN structures to characterize the semantics of the text image in order to improve association. However, feature differences of the same character across different text images limit such semantics-based characterization. The present method instead treats the internal structure of a text as a unidirectional topology and introduces a graph convolutional neural network to model and characterize this structure, avoiding the learning of high-dimensional semantic information and achieving a more accurate structural feature expression of the text image. In addition, video text tracking in special scenes faces the problem of distinguishing confusable texts at a level beyond appearance and position. In scenes with densely placed articles, such as supermarkets and warehouses, there exist texts with similar appearance and position, and such texts must be distinguished at some other level. At present, no video text tracking method addresses such confusable texts. By performing background expansion on the text, i.e., exploiting the differences between the background images surrounding the texts, texts with similar appearance and position can be distinguished.
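The background-expansion idea above (and the 5x / 2.5x area enlargement of claim 1, step (1.4.1)) reduces to growing a box about its center so the crop includes surrounding context. A small geometric helper, sketched under the assumption that "increasing the area by a factor k" scales each side by sqrt(k); the exact cropping rule in the patented method may differ:

```python
import math

def expand_box(box, area_scale, img_w=None, img_h=None):
    """Expand a box (x1, y1, x2, y2) about its center so that its area grows
    by `area_scale` (each side scales by sqrt(area_scale)); optionally clip
    the result to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    s = math.sqrt(area_scale)
    half_w, half_h = (x2 - x1) * s / 2.0, (y2 - y1) * s / 2.0
    nx1, ny1, nx2, ny2 = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if img_w is not None:
        nx1, nx2 = max(0.0, nx1), min(float(img_w), nx2)
    if img_h is not None:
        ny1, ny2 = max(0.0, ny1), min(float(img_h), ny2)
    return (nx1, ny1, nx2, ny2)
```

Cropping the image with the expanded box yields a patch whose surrounding background can then disambiguate texts that look identical inside their own boxes.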
In summary, many issues in video text tracking remain unresolved, such as how to fuse features between consecutive frames more effectively and how to combine a more robust and efficient text characterization branch with detector design. With continued and deepening research in these areas, these problems can be expected to be solved, ultimately promoting the development of the video text tracking field.

Disclosure of Invention

Aiming at the deficiencies of the prior art and the demand for improvement, the invention designs a robust video text tracking method based on topological-structure feature association. In the tracking flow of the current frame, the method generates text candidate boxes with high recall using a text detector and a local-search tracker, and then performs feature matching between the historical tracks and the candidate texts based on a multi-stage, hierarchical association strategy, so that the text tracks are continuously updated. The method uses the high-score text boxes and low-score text boxes output by the text detector, together with the text boxes recalled by the local-search tracker, for data-association matching in three successive stages. This data-association strategy effectively distinguishes the background and the low-confidence