CN-119941790-B - Multi-level semantic interactive cross-modal tracking method based on the Unreal Engine
Abstract
The invention provides a multi-level semantic interactive cross-modal tracking method based on the Unreal Engine. The method comprises the following steps: constructing a virtual simulation world with Unreal Engine 5; generating virtual multi-target tracking data of pedestrians and vehicles; constructing text-track matching pairs to produce multi-modal tracking data; constructing a multi-target tracking model that fuses multi-modal semantic features layer by layer; enhancing perception query features with text features; mapping the decoded perception features into a semantic space with linear layers and computing their similarity with the encoded text features; and updating target track information with the perception query results.
Inventors
- ZHAO ZHICHENG
- SU FEI
- SONG YANG
- MA ZELIANG
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-01-02
Claims (10)
- 1. A multi-level semantic interactive cross-modal tracking method based on the Unreal Engine, characterized by comprising the following steps: Step 1, building a virtual simulation world with Unreal Engine 5, collecting the tracks of pedestrians and vehicles in that world, and generating multi-target tracking data; Step 2, performing cleaning and filtering, description annotation, and quality evaluation on the target tracks to ensure the accuracy and usability of the data; Step 3, combining the annotated appearance and motion descriptions according to fixed rules with the help of a large language model and associating them with target track segments, finally obtaining a high-quality, low-cost semantic tracking dataset; Step 4, constructing a multi-target tracking model that fuses multi-modal semantic features layer by layer; Step 5, cutting the video into frames and feeding them into the model, extracting image and text features with two separate backbone networks, and fusing the features with a cross-modal fusion encoder; Step 6, initializing a group of detection queries, merging them with the tracking queries output for the previous frame to obtain perception queries, guiding the perception queries with text features, and feeding the perception queries together with the fused features into a decoder to capture the semantic targets in the video; Step 7, mapping the decoded query features into a semantic space with a linear layer, comparing them with the encoded text features by similarity, and screening out the query results with higher similarity; Step 8, performing category classification and coordinate regression on the screened query results with a multi-layer perceptron; and Step 9, updating the track tracking result, turning the perception queries into the tracking queries for the next frame, and repeating Steps 5 to 9 until all image sequences of the video have been tracked.
- 2. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine as claimed in claim 1, wherein in Step 1 a virtual simulation world is built with Unreal Engine 5, pedestrian and vehicle tracks in that world are collected, and multi-target tracking data are generated; specifically, vehicle and crowd instances are generated in the system by using publicly available assets in Unreal Engine 5 and varying the appearance of the 3D models, such as their colors and sizes; the instances are then placed into the virtual world, and an urban traffic system is simulated with built-in traffic and pedestrian-flow movement rules; finally, cameras are placed in the world and video footage is recorded, from which the coordinate and appearance information of the objects in each frame can be obtained directly; multi-target tracking data are generated in batches with this method, namely by continuously varying the 3D models of the objects and the camera positions.
- 3. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine according to claim 1, wherein in Step 2 the target tracks are cleaned, filtered, described, and quality-evaluated to ensure the accuracy and usability of the data; specifically, first, tracks collected in Step 1 that are severely occluded by buildings, together with scenes containing few targets, are filtered out; second, the motion of each target track segment is described and recorded; and finally, the processed label data are quality-evaluated to retain descriptions with longer duration and clearly identifiable behaviors.
- 4. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine according to claim 1, wherein in Step 3 the motion descriptions annotated in Step 2 are combined with the appearance descriptions generated automatically from the 3D models; multiple groups of paraphrases of the combined phrases are derived with a large language model; common behavior description phrases are screened out manually; and finally the corresponding description phrases are associated with track segments, the start and stop times of the tracks are recorded, and text-track matching pairs are generated.
- 5. The method of claim 1, wherein in Step 4 a multi-target tracking model that fuses multi-modal semantic features layer by layer is constructed, specifically comprising an image encoder, a frozen text encoder, a fusion encoder, a semantic guidance module, a tracking decoder, and a semantic correlation branch prediction module; RoBERTa and CLIP are selected as the text encoders; the fusion encoder and the fusion decoder each adopt a DETR-style structure composed of multi-layer Transformers; and the input of the semantic guidance module is provided with learnable queries for perceiving different semantic features, of size $N \times d$, where $N$ is the number of learnable queries and $d$ is the feature dimension.
- 6. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine as claimed in claim 1, wherein in Step 5, first, the video generated in Step 3 is cut into frames to obtain an image sequence, and a CNN-based backbone network extracts features from the image at time $t$ to obtain image pyramid features $F_t$; next, the text instruction is encoded into a vector $E$ with the frozen text encoder; then, the cross-modal fusion encoder fuses $E$ with each layer of $F_t$ to generate the cross-modal features $F_{fuse}$.
- 7. The Unreal Engine-based multi-level semantic interactive cross-modal tracking method of claim 1, wherein in Step 6 a set of detection queries is initialized with learnable variables, denoted $Q_{det}$, and combined with the tracking queries $Q_{track}$ output for the previous frame to generate the perception queries, denoted $Q_{per}$; the semantic guidance module makes the perception queries interact with the semantic instruction: the perception queries are added to their positional encoding $P$ and sent to a self-attention module for integration, and the text feature $E$ is linearly mapped to the same dimension, with the calculation formulas $Q' = \mathrm{SelfAttn}(Q_{per} + P)$ and $E' = \mathrm{Linear}(E)$; cross-attention is then used for text guidance, generating perception queries carrying the text prompt, with the calculation formula $Q'' = \mathrm{CrossAttn}(Q', E')$; finally, the fusion decoder lets the perception queries capture the high-dimensional information of the targets in the fused features, and the output is denoted $\hat{Q}$ (a minimal sketch of this module follows the claims).
- 8. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine according to claim 1, wherein in Step 7 the output $\hat{Q}$ of Step 6 is mapped into the semantic feature space through a linear layer, with the calculation formula $S_q = \mathrm{Linear}(\hat{Q})$; the text instruction is re-encoded with the CLIP text encoder, with the calculation formula $S_t = \mathrm{CLIP_{text}}(T)$; the similarity between $S_q$ and $S_t$ is computed by means of CLIP's strong cross-modal alignment capability, and the degree of correlation between each query and the text instruction is denoted $s$, with the calculation formula $s = \dfrac{S_q \cdot S_t}{\lVert S_q \rVert \, \lVert S_t \rVert}$; the query results with similarity greater than 0.5 are then screened out (see the filtering sketch after the claims).
- 9. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine according to claim 1, wherein in Step 8 category classification and coordinate regression are performed on the screened query results with a multi-layer perceptron to obtain the category and coordinate information of the semantic targets in the current frame (see the prediction-head sketch after the claims).
- 10. The multi-level semantic interactive cross-modal tracking method based on the Unreal Engine according to claim 1, wherein in Step 9 the track tracking result is updated, the perception queries are kept and turned into the tracking queries for the next frame, and Steps 5 to 9 are repeated until all image sequences of the video have been tracked; specifically, the track queue is updated with the query results: if a target is captured by a detection query from $Q_{det}$, a new track is created; if it is captured by a tracking query from $Q_{track}$, the corresponding old track is updated; finally, the perception query output $\hat{Q}$ is recorded and used to update the tracking queries $Q_{track}$ for the next frame (see the track-update sketch after the claims).
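As a reading aid for claim 7, here is a minimal PyTorch sketch of the semantic guidance module: self-attention first integrates the perception queries, then cross-attention injects the encoded text instruction. The class name, dimensions (d_model=256, a RoBERTa-sized d_text=768), and head count are assumptions for illustration; the patent does not specify them.

```python
import torch
import torch.nn as nn

class SemanticGuidance(nn.Module):
    """Sketch of claim 7: self-attention over perception queries,
    then cross-attention that injects the encoded text instruction.
    Hyperparameters (d_model, d_text, n_heads) are assumptions."""

    def __init__(self, d_model: int = 256, d_text: int = 768, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_proj = nn.Linear(d_text, d_model)   # E' = Linear(E)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, q_per: torch.Tensor, pos: torch.Tensor,
                text_feat: torch.Tensor) -> torch.Tensor:
        # Q' = SelfAttn(Q_per + P): integrate detection and tracking queries.
        q = q_per + pos
        q, _ = self.self_attn(q, q, q)
        # Q'' = CrossAttn(Q', E'): guide the queries with the text prompt.
        e = self.text_proj(text_feat)
        q, _ = self.cross_attn(q, e, e)
        return q

# Toy usage: 30 perception queries, a 12-token text instruction.
guide = SemanticGuidance()
q = guide(torch.randn(1, 30, 256), torch.randn(1, 30, 256), torch.randn(1, 12, 768))
print(q.shape)  # torch.Size([1, 30, 256])
```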
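Likewise for claims 7-8, a hedged sketch of the semantic-space similarity filter: the decoded queries are projected by a linear layer, compared with the CLIP text embedding by cosine similarity, and kept when the score exceeds the 0.5 threshold stated in claim 8. The function name, the 512-dimensional CLIP space, and plain cosine similarity are assumptions.

```python
import torch
import torch.nn.functional as F

def filter_queries(decoded: torch.Tensor, proj: torch.nn.Linear,
                   text_embed: torch.Tensor, thresh: float = 0.5):
    """decoded: (N, d_model) query features from the fusion decoder.
    proj: linear layer mapping d_model -> CLIP embedding size.
    text_embed: (clip_dim,) CLIP text embedding of the instruction."""
    s_q = F.normalize(proj(decoded), dim=-1)   # S_q, unit length
    s_t = F.normalize(text_embed, dim=-1)      # S_t, unit length
    sim = s_q @ s_t                            # cosine similarity per query
    keep = sim > thresh                        # claim 8: keep sim > 0.5
    return decoded[keep], sim[keep]

# Toy usage with random stand-ins for the real features.
proj = torch.nn.Linear(256, 512)
kept, scores = filter_queries(torch.randn(30, 256), proj, torch.randn(512))
print(kept.shape, scores.shape)
```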
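For claim 9, a sketch of the prediction heads: one multi-layer perceptron classifies the category and another regresses the box coordinates. The two-class setup (pedestrian, vehicle) and the normalized center-size box format are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Sketch of claim 9: MLP heads for class logits and box regression.
    num_classes=2 (pedestrian, vehicle) is an assumption."""

    def __init__(self, d_model: int = 256, num_classes: int = 2):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, num_classes))
        self.box = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 4), nn.Sigmoid())  # cx, cy, w, h in [0, 1]

    def forward(self, q: torch.Tensor):
        return self.cls(q), self.box(q)

# Toy usage: 5 screened queries.
heads = PredictionHeads()
logits, boxes = heads(torch.randn(5, 256))
print(logits.shape, boxes.shape)  # torch.Size([5, 2]) torch.Size([5, 4])
```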
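For claim 10, a sketch of the track-queue update: queries originating from detection queries spawn new tracks, while queries originating from tracking queries extend existing ones; the surviving queries would then serve as tracking queries for the next frame. The Track fields and ID scheme are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)   # per-frame (x, y, w, h)

def update_tracks(tracks: dict, results: list, next_id: int) -> int:
    """results: kept queries as dicts with 'from_detection' (bool),
    'track_id' (int or None), and 'box'. Returns the updated next_id."""
    for r in results:
        if r["from_detection"]:
            # Captured by a detection query: create a new track.
            tracks[next_id] = Track(next_id, [r["box"]])
            r["track_id"] = next_id
            next_id += 1
        else:
            # Captured by a tracking query: update the old track.
            tracks[r["track_id"]].boxes.append(r["box"])
    return next_id

# Toy usage: one new target, one continuing target.
tracks, nid = {0: Track(0, [(5, 5, 2, 2)])}, 1
nid = update_tracks(tracks, [
    {"from_detection": True, "track_id": None, "box": (1, 2, 3, 4)},
    {"from_detection": False, "track_id": 0, "box": (6, 5, 2, 2)},
], nid)
print(len(tracks), nid)  # 2 2
```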
Description
Multi-level semantic interactive cross-modal tracking method based on the Unreal Engine
Technical Field
The invention relates to the fields of signal processing and deep learning, and in particular to a multi-level semantic interactive cross-modal tracking method based on the Unreal Engine.
Background
In recent years, visual tasks have been converging with natural language description, giving rise to a series of innovative techniques. Semantic target tracking is one such task. It aims to open a new chapter in human-machine interaction: the user communicates directly with the tracking system through natural language instructions, and the system accurately identifies and tracks the target object specified in an image or video. The task poses a double challenge: on the one hand, the system must be able to understand complex natural language in depth; on the other, it must also implement efficient image processing. Visual target tracking methods that can adequately understand semantic instructions are therefore in great demand. Despite the innovation in this area, the availability of relevant datasets remains limited. Existing benchmark datasets are typically re-annotated on top of published multi-target tracking benchmarks. As a result, the accuracy of the new benchmark is doubly affected, by the inherent limitations of the original benchmark and by subjective bias in the manual annotation process, which in turn degrades the performance of downstream tasks. To solve this problem, the invention provides a multi-level semantic interactive cross-modal tracking method based on the Unreal Engine. The method overcomes the shortcomings of existing datasets by constructing a high-precision, low-cost benchmark dataset. In particular, by using the Unreal Engine to generate a highly realistic virtual simulation environment, the dependence on manual annotation is eliminated and mass production of the dataset becomes possible. In addition, to address the recall problem caused by insufficient fusion of semantics and vision in the prior art, the invention introduces an end-to-end multi-level guidance framework. The framework ensures that semantic information is effectively integrated from the encoder stage through every level up to the prediction head, significantly improving the performance of the model in practical applications.
Disclosure of Invention
The invention provides a multi-level semantic interactive cross-modal tracking method based on the Unreal Engine, characterized by comprising the following steps: Step 1, building a virtual simulation world with Unreal Engine 5, collecting the tracks of pedestrians and vehicles in that world, and generating multi-target tracking data; Step 2, performing cleaning and filtering, description annotation, quality evaluation, and similar processing on the target tracks to ensure the accuracy and usability of the data; Step 3, combining the annotated appearance and motion descriptions according to fixed rules with the help of a large language model and associating them with target track segments, finally obtaining a high-quality, low-cost semantic tracking dataset; Step 4, constructing a multi-target tracking model that fuses multi-modal semantic features layer by layer; Step 5, cutting the video into frames and feeding them into the model, extracting image and text features with two separate backbone networks, and fusing the features with a cross-modal fusion encoder; Step 6, initializing a group of detection queries, combining them with the tracking queries output for the previous frame to form perception queries, guiding the perception queries with text features, and feeding the perception queries together with the fused features into a decoder to capture the semantic targets in the video; Step 7, mapping the decoded query features into a semantic space with a linear layer, comparing them with the encoded text features by similarity, and screening out the query results with higher similarity; Step 8, performing category classification and coordinate regression on the screened query results with a multi-layer perceptron; Step 9, updating the track tracking result, turning the perception queries into the tracking queries for the next frame, and repeating Steps 5 to 9 until all image sequences of the video have been tracked. Specifically, in Step 1, vehicle and crowd instances are generated in the system by using publicly available assets in Unreal Engine 5 and varying the appearance (color and size) of the 3D models. They are then placed into the virtual world, and the urban traffic system is simulated with built-in traffic and pedestrian-flow movement rules. Finally, a camera is placed in the world, video footage is recorded, and
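To make the text-track matching pairs of Steps 2-3 concrete, here is a hypothetical sketch of how one record might be serialized; every field name is an assumption, since the description only states that description phrases are associated with track segments together with their start and stop times.

```python
import json

# Hypothetical record for one text-track matching pair (Step 3).
pair = {
    "track_id": 17,
    "category": "vehicle",
    "start_frame": 120,              # start/stop time of the track segment
    "end_frame": 310,
    "appearance": "a red sedan",     # auto-generated from the 3D model
    "motion": "turning left at the intersection",  # annotated in Step 2
    "description": "a red sedan turning left at the intersection",
    "boxes": [[412, 230, 96, 54]],   # per-frame (x, y, w, h), truncated here
}
print(json.dumps(pair, indent=2))
```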