CN-117095027-B - Robust visual tracking reciprocal interlayer-time discrimination target model
Abstract
The invention discloses a reciprocal interlayer-time discrimination target model for robust visual tracking, belonging to the technical field of target tracking. It addresses the problem that conventional Siamese tracking algorithms rarely consider information interaction between the template and the search area, so that accumulated errors from same-category target interference degrade the tracking result. The invention first builds an interlayer target perception enhancement model that realizes interlayer feature information interaction by establishing a pixel-by-pixel correlation between the template and the search area during feature extraction; this reduces the accumulated error caused by the target being invisible to the search area and enhances perception of the target. It further designs a time interference assessment strategy to weaken the influence of interference: an inter-frame candidate propagation module builds similarity relations among multiple candidate positions in adjacent frames, and similar interference is eliminated according to the resulting similarity scores, yielding a more reliable target position and realizing robust tracking.
Inventors
- ZHAO YANCHUN
- ZHANG HUANLONG
- MA ZONGHAO
- JIANG BIN
- TIAN YANGYANG
- ZHI PENGPENG
- SHEN FENGLI
- WAN YOU
- DUAN YULONG
Assignees
- Yangtze Delta Region Institute (Huzhou) of University of Electronic Science and Technology of China
Dates
- Publication Date
- 20260512
- Application Date
- 20230828
Claims (5)
- 1. A robust visual tracking reciprocal interlayer-time discrimination target model, comprising the steps of: step one, extracting the features of a template and a search area respectively with a pre-trained network, and obtaining a response map M1 through a matching module; step two, establishing a pixel-by-pixel correlation between the template and the search area during feature extraction through an interlayer target perception enhancement model, performing information interaction between the first and fourth stages of the network, and obtaining a response map M2 through the matching module; step three, weighting and fusing M1 and M2 to obtain a response map M that highlights the target more strongly; step four, retaining several candidates of the response map M through a time interference assessment strategy; step five, establishing connections among the candidates of adjacent frames through an inter-frame candidate propagation module to obtain similarity scores; step six, eliminating similar interference candidates according to the obtained similarity scores, thereby obtaining a more reliable target position; step seven, carrying out the subsequent tracking process according to the above steps until the video ends. In step two, the information interaction between the first and fourth stages is established as follows: the input feature is first reshaped by 1×1 convolution and max downsampling; a pixel-by-pixel correlation module then computes the similarity of each pixel between the template and the search-area features, yielding a similarity feature; this feature is aggregated by 1×1 convolution and upsampling to finally obtain a more discriminative feature. In step four, the time interference assessment strategy eliminates the influence of similar interference candidates in the response map to obtain a more reliable target position, wherein the candidates are selected as follows: Ct = {(si, pi, fi)}, i = 1, ..., 5, denotes the candidate set, where si denotes the corresponding score, pi denotes the position corresponding to that score, fi denotes the feature extracted at that position, and i is an index indicating that the set contains 5 elements. The set is obtained as follows: the positions with the highest scores in the response map are taken as elements of the candidate set Ct; the index i takes values from 1 to 5, meaning the 5 positions with the top five scores are selected. The feature fi is expressed as fi = φ(pi), where φ denotes the feature extraction network: based on the score si and the position pi, the feature of the corresponding position is obtained and stored in the candidate set Ct.
- 2. The robust visual tracking reciprocal interlayer-time discrimination target model of claim 1, wherein the network in step one is an FBNet network.
- 3. The robust visual tracking reciprocal interlayer-time discrimination target model of claim 1, wherein the response map M of step three is expressed as follows: M = (1 - λ)M1 + λM2, where M is the weighted-fused response map that highlights the target more strongly, λ is a weight factor for adjusting the proportion of M1 and M2, M1 is the original response map without information interaction, and M2 is the response map obtained after information interaction between the template and the search area.
- 4. The robust visual tracking reciprocal interlayer-time discrimination target model of claim 1, wherein the candidates in step four include the score, position and features of the target.
- 5. The robust visual tracking reciprocal interlayer-time discrimination target model of claim 1, wherein the final result is expressed in the form p* = argmax_i (si · ρi), where p* denotes the position of the final result, ranking the current-frame scores si determines the result position, with p1 being the position where the current-frame score is highest, and ρi denotes the similarity score between the current-frame predicted position and the previous-frame predicted position as determined by the inter-frame candidate propagation module.
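Steps four to six of claim 1 (retaining top-scoring candidates and eliminating similar interference by inter-frame similarity) can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the function names, the cosine-similarity measure, and the score-times-similarity re-ranking rule are assumptions standing in for the inter-frame candidate propagation module.

```python
import numpy as np

def top_k_candidates(response, k=5):
    """Keep the k highest-scoring positions of a response map as
    (score, position) candidates, mirroring step four of claim 1."""
    flat = response.ravel()
    idx = np.argsort(flat)[::-1][:k]
    return [(flat[i], np.unravel_index(i, response.shape)) for i in idx]

def rerank(candidates, prev_feat, feats):
    """Re-rank candidates by combining the response score with the
    cosine similarity of each candidate's feature to the previous
    frame's target feature; the product rule here is an assumption."""
    best_pos, best_val = None, -np.inf
    for (score, pos), f in zip(candidates, feats):
        denom = np.linalg.norm(f) * np.linalg.norm(prev_feat) + 1e-8
        sim = float(f @ prev_feat / denom)
        val = score * sim
        if val > best_val:
            best_pos, best_val = pos, val
    return best_pos
```

A distractor that scores high in the current frame but does not resemble the previous frame's target receives a low similarity score and is discarded, which is the intent of step six.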
Description
Robust visual tracking reciprocal interlayer-time discrimination target model

Technical Field
The invention relates to the technical field of target tracking, and in particular to a reciprocal interlayer-time discrimination target model for robust visual tracking.

Background
Visual tracking is one of the important research subjects in the field of computer vision and is widely applied in video surveillance, autonomous driving, road traffic monitoring and other fields. It refers to identifying a target in the first frame of a given video sequence and then tracking that target in subsequent frames. In recent years, Siamese tracking algorithms have been favored by many researchers for their excellent tracking accuracy and speed, and visual tracking technology has made great progress. However, in the face of challenges such as similar-target interference and target deformation that arise in practical tracking, the problem of achieving more robust tracking remains to be solved. A common Siamese tracking algorithm determines the location of the target by computing the similarity between the template and the search area. During feature extraction, however, the target is invisible to the search area, so accumulated errors from same-category target interference can adversely affect the final features and thus the tracking result.
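The template-to-search-area similarity computation described above can be illustrated with a minimal sketch: a naive sliding-window correlation in plain NumPy under assumed tensor shapes. The function name is hypothetical, and practical Siamese trackers compute this over learned deep features with optimized convolution routines rather than explicit loops.

```python
import numpy as np

def cross_correlation(template, search):
    """Slide the template feature over the search-area feature and
    record one similarity score per spatial offset.
    template: (C, th, tw); search: (C, sh, sw);
    returns a response map of shape (sh - th + 1, sw - tw + 1)."""
    C, th, tw = template.shape
    _, sh, sw = search.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = search[:, y:y + th, x:x + tw]
            out[y, x] = np.sum(patch * template)
    return out
```

The peak of the returned response map marks the offset at which the template matches the search area best; it is exactly this map that the interference described above can corrupt when similar objects also produce high responses.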
To solve these problems, some algorithms hold that tracking the most salient target in the search area yields better performance, and therefore focus on designing better feature extraction schemes; some consider that enhancing the representation of the target is equally important, and introduce attention mechanisms to concentrate further on target features; others perform information interaction directly between the two branches during feature extraction, establishing a connection between the template and the search area to improve tracking accuracy. While these algorithms, which improve tracking performance from different angles, have greatly promoted the development of visual tracking, the interference problem is not well addressed by a tracking framework based on appearance modeling alone. To improve the discrimination capability of trackers, some algorithms attempt to distinguish targets from background more effectively by building interference models. Among them, some use a learned interference-aware module to capture appearance changes of the target during tracking; others propose to track the interference actively, designing a correlation network that lets the target and the interference propagate between frames and using the interference to further infer the target. It should be noted that background-aware tracking algorithms tend to neglect further mining of the target itself, so interference in the background can still adversely affect the tracking result. In view of the above, it is necessary to design a tracking algorithm that suppresses background interference while enhancing target perception, further mining target features while exploiting the discrimination capability of the background.
Disclosure of Invention
To address the above shortcomings, the invention provides a reciprocal interlayer-time discrimination target model for robust visual tracking. The method enhances target perception by establishing information interaction between the template and the search area during feature extraction, and eliminates similar interference by exploiting the similarity relations among candidates in adjacent frames, thereby realizing reciprocity between the template and the search area and improving tracking robustness. To achieve this technical purpose, the technical scheme of the invention is as follows: a robust visual tracking reciprocal interlayer-time discrimination target model comprising the steps of: step one, extracting the features of the template and the search area respectively with a pre-trained network, and then obtaining a response map M1 through a matching module; step two, establishing a pixel-by-pixel correlation between the template and the search area during feature extraction through an interlayer target perception enhancement model, performing information interaction between the first and fourth stages of the network, and obtaining a response map M2 through the matching module; step three, weighting and fusing M1 and M2 to obtain a response map M that highlights the target more strongly; step four, retaining several candidates of the response map M through a time interference evaluation strategy; step five, establishing connections among the candidates of adjacent frames through an inter-frame candidate propagation module to obtain similarity scores; step six, eliminating similar interference candidates according to the obtained similarity scores, so as to obtain a more reliable target position.
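Step three above, the weighted fusion of M1 and M2, can be sketched as a one-line combination. The function name and the default weight value are illustrative assumptions; the patent does not fix a particular value of the weight factor λ.

```python
import numpy as np

def fuse_response(m1, m2, lam=0.5):
    """Weighted fusion of the baseline response map m1 and the
    interaction-enhanced response map m2 (step three). lam plays the
    role of the weight factor adjusting their proportion; its default
    here is an assumption for illustration."""
    return (1.0 - lam) * m1 + lam * m2
```

Setting lam close to 1 trusts the interaction-enhanced map M2; setting it close to 0 falls back to the original response M1.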