CN-122015857-A - Semantic compass-style dynamic traffic trajectory anchoring method

Abstract

A semantic compass-style dynamic traffic trajectory anchoring method comprises the following steps: S1, multi-modal perception feature acquisition and fusion; S2, semantic directional prior fusion based on a vision-language model; S3, interactive directional prior fusion; S4, trajectory anchor initialization; and S5, conditional diffusion trajectory generation and optimization. First, a unified mechanism for acquiring and fusing multi-modal perception features is constructed, which mitigates the missing information and uncertainty of single-modality perception. Second, a semantic compass guidance mechanism is introduced that integrates high-level semantic understanding deeply into trajectory generation, improving the interpretability and rule consistency of the planning results. Third, interactive directional prior fusion and the trajectory anchoring mechanism improve generation stability in dynamic multi-agent interaction scenes. Fourth, the diffusion-model-based trajectory generation accounts for uncertainty modeling and directional constraints at the same time, improving robustness in complex traffic environments. Fifth, the end-to-end, modular design of the overall framework makes the method practical to implement in engineering.

Inventors

MIAO YANZI
ZHANG JUNJIE
WANG YU
BU RAN

Assignees

China University of Mining and Technology (中国矿业大学)

Dates

Publication Date: 20260512
Application Date: 20260209

Claims (6)

1. A semantic compass-style dynamic traffic trajectory anchoring method, characterized by comprising the following steps: S1, multi-modal perception feature acquisition and fusion: multi-modal information of a traffic scene is acquired by a perception system, and the acquired information is preliminarily processed and fused to obtain a unified fusion feature map, a trajectory query vector, and an agent state query vector; S2, semantic directional prior based on a vision-language model: the images acquired in step S1 are encoded with a pre-trained vision-language model, and a semantic direction vector is obtained after layer normalization and a multi-layer perceptron; S3, interactive directional prior fusion: an interaction feature vector is generated from the interaction relationships among multiple agents and is used as prior information to modulate the diffusion modeling process, explicitly guiding the direction and strength of influence between agents so as to strengthen the modeling of key interaction behaviors; S4, trajectory anchor initialization: an agent-aware trajectory generator module is constructed, which takes as input the unified fusion feature map and the labels and states of the traffic agents; local semantic representations are extracted for K initial trajectory points by bilinear interpolation; with the local semantic representations used as keys and values, an agent-aware trajectory prior is obtained; a temporal embedding is added to each trajectory point through a nonlinear projection and is input, together with the trajectory prior, to a trajectory initialization network that outputs the initial trajectory anchors; S5, conditional diffusion trajectory generation and optimization: a Transformer-based multi-layer conditional diffusion decoder is adopted, which progressively extracts structured, goal-oriented trajectories from high-dimensional noise through hierarchical interaction and cross-modal fusion; finally, the system selects the trajectory with the highest confidence from the generated multi-modal candidate trajectories as the final output trajectory.
2. The semantic compass-style dynamic traffic trajectory anchoring method according to claim 1, wherein step S1 is specifically as follows: S11, consecutive frames of multi-view RGB images are acquired by a multi-view camera device, a single-frame environment point cloud is acquired by a lidar device, and the agent state is acquired synchronously by a vehicle state acquisition unit; S12, after the consecutive-frame multi-view RGB images, the single-frame environment point cloud, and the agent state are aligned in the time dimension, the multi-view RGB images and the single-frame environment point cloud are processed by a BEV encoder, and the agent state is preliminarily processed by a state encoder to obtain agent state features; S13, the preliminarily processed multi-view RGB images and single-frame environment point cloud are mapped uniformly into a bird's-eye-view coordinate system to obtain an initial bird's-eye-view feature map; S14, with the agent state features serving as a structured state embedding, the agent state features and the initial bird's-eye-view feature map are jointly input to a Transformer-based multi-modal fusion network to obtain the unified fusion feature map, the trajectory query vector, and the agent state query vector.
3. The semantic compass-style dynamic traffic trajectory anchoring method according to claim 2, wherein step S2 is specifically as follows: S21, a pre-trained vision-language model applies a structured prompt to the multi-view RGB images acquired in step S11 to generate a directional language instruction related to the driving scene; S22, the directional language instruction from step S21 is encoded by a pre-trained text encoder, and the semantic direction vector is obtained after layer normalization and a multi-layer perceptron; the semantic direction vector converts the high-level semantic information in the multi-view RGB images into an explicit driving-intent guidance signal, which serves as the semantic directional prior in the subsequent diffusion trajectory generation process.
4. The semantic compass-style dynamic traffic trajectory anchoring method according to claim 2, wherein step S3 is specifically as follows: S31, interactive directional prior extraction: the agent state query vector generated in step S1 is passed through a prediction head to obtain the labels and states of the traffic agents detected in the scene; the agents are filtered by label to obtain the valid agents, and the relative position and size of each valid agent are extracted to build a target set; according to the relative direction of each valid agent in the ego-centric frame, the positions of the valid agents are divided into three spatial regions, namely front, left, and right, and a direction interaction weight vector is computed from distance- and size-weighted statistics according to formula (1), where the direction indicator function assigns each valid agent to one of the three spatial regions and a numerical stability constant is included; a nonlinear mapping function then projects the direction interaction weight vector into the interaction feature vector; S32, direction-aware interaction fusion module: the semantic direction vector generated in step S2 is fused with the interaction feature vector generated in step S31; first, the cosine similarity between the semantic direction vector and the interaction feature vector in a shared latent space is computed according to formula (2); a piecewise fusion strategy conditioned on the cosine similarity then generates the final direction guidance vector according to formula (3), where an adaptive weighting coefficient governs the linear interpolation between the two vectors; the final direction guidance vector is projected onto the trajectory query to obtain the diffusion-optimized trajectory query vector according to formula (4), where an adjustable hyperparameter controls the strength with which the direction guidance is fused with the representation.
5. The semantic compass-style dynamic traffic trajectory anchoring method according to claim 4, wherein step S4 is specifically as follows: S41, an agent-aware trajectory generator module is constructed, which takes as input the unified fusion feature map generated in step S1 and the labels and states of the traffic agents obtained in step S31; to capture fine-grained local interactions, a region-of-interest-based feature extraction method is adopted: for each detected agent with its state, a rotation-aligned sampling grid is built on the unified fusion feature map, and local semantic representations are extracted for K initial trajectory points by bilinear interpolation according to formula (5); S42, to enlarge the perceivable spatial range, a multi-head cross-attention mechanism introduces a set of learnable anchor queries and uses the local semantic representations as keys and values to selectively aggregate obstacle-aware features according to formula (6), where the key vectors and value vectors are obtained by linear projection of the local semantic representations, and the output encodes the agent-aware trajectory prior; S43, to impose temporal semantics and ensure continuity between trajectory points, a temporal embedding scheme is also introduced: a temporal embedding is added to each trajectory point through a nonlinear projection and is input, together with the trajectory prior, to a trajectory initialization network that outputs the initial trajectory anchors.
6. The semantic compass-style dynamic traffic trajectory anchoring method according to claim 5, wherein step S5 is specifically as follows: a Transformer-based multi-layer conditional diffusion decoder is adopted, which takes as conditional inputs the agent state query vector output in step S1, the initial bird's-eye-view feature map, the diffusion-optimized trajectory query vector output in step S3, and the initial trajectory anchors output in step S4; trajectory generation is carried out by the multi-layer Transformer decoder to produce the initial multi-modal candidate trajectories; each layer contains three attention modules, namely a BEV attention module, an agent attention module, and a self-attention module, and additionally integrates a feed-forward network and a time-step-based embedding; through hierarchical interaction and cross-modal fusion, the decoder progressively extracts structured, goal-oriented trajectories from high-dimensional noise; finally, the system selects the trajectory with the highest confidence from the generated multi-modal candidate trajectories as the final output trajectory.
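
As an illustration of steps S31 and S32 in claim 4, the following Python sketch shows one possible form of the direction interaction weights and of the similarity-conditioned fusion. Formulas (1) and (3) are not reproduced in this text, so the region boundary (`side_angle_deg`), the size-over-distance weighting, the thresholds `lo` and `hi`, the interpolation schedule, and all function names are assumptions made for illustration only. The cosine similarity is the standard definition.

```python
import numpy as np

def region_of(rel_xy, side_angle_deg=45.0):
    """Hypothetical direction indicator: 0 = front, 1 = left, 2 = right.

    Assumes an ego frame with x pointing forward and y pointing left.
    """
    x, y = rel_xy
    bearing = np.degrees(np.arctan2(y, x))  # 0 deg = straight ahead
    if abs(bearing) <= side_angle_deg:
        return 0
    return 1 if bearing > 0 else 2

def direction_weights(rel_pos, sizes, eps=1e-6):
    """Assumed form of formula (1): size-over-distance mass per region, normalised."""
    w = np.zeros(3)
    for p, s in zip(np.asarray(rel_pos, dtype=float), np.asarray(sizes, dtype=float)):
        w[region_of(p)] += s / (np.linalg.norm(p) + eps)  # eps guards against zero distance
    return w / (w.sum() + eps)

def cosine_similarity(a, b, eps=1e-6):
    """Standard cosine similarity with a small stabilising constant."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def fuse_direction(g_sem, g_int, lo=0.0, hi=0.5):
    """Assumed piecewise fusion: blend equally when the priors agree,
    keep only the semantic prior when they conflict, ramp linearly in between."""
    g_sem = np.asarray(g_sem, dtype=float)
    g_int = np.asarray(g_int, dtype=float)
    sim = cosine_similarity(g_sem, g_int)
    if sim >= hi:
        alpha = 0.5
    elif sim <= lo:
        alpha = 1.0
    else:
        alpha = 1.0 - 0.5 * (sim - lo) / (hi - lo)
    return alpha * g_sem + (1.0 - alpha) * g_int, sim, alpha
```

For example, `direction_weights([[10.0, 0.0], [0.0, 5.0]], [2.0, 2.0])` places the nearer agent on the left and gives that region the larger weight. Which prior should dominate under conflict is a design choice that the claim text alone does not settle.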

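Step S41 of claim 5 samples the unified fusion feature map on a rotation-aligned grid using bilinear interpolation. The sketch below shows what such sampling could look like with NumPy. Formula (5) is not reproduced in this text, so the grid layout (K points spaced along the agent's heading), the pixel-coordinate convention, and the function names `rotated_grid` and `bilinear_sample` are assumptions rather than the patented formulation.

```python
import numpy as np

def rotated_grid(center_xy, yaw, length, k):
    """Hypothetical grid: k points spaced evenly along the agent heading (pixel coords)."""
    offsets = np.linspace(-length / 2.0, length / 2.0, k)
    cx, cy = center_xy
    return cx + offsets * np.cos(yaw), cy + offsets * np.sin(yaw)

def bilinear_sample(feat, xs, ys):
    """Bilinearly interpolate feat of shape (C, H, W) at fractional pixel coordinates.

    xs and ys have shape (K,); the result has shape (K, C).
    Coordinates outside the map are clamped to the border.
    """
    _, h, w = feat.shape
    xs = np.clip(np.asarray(xs, dtype=float), 0, w - 1)
    ys = np.clip(np.asarray(ys, dtype=float), 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = xs - x0  # horizontal interpolation weight
    wy = ys - y0  # vertical interpolation weight
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T

# Usage: a toy 1-channel 4x4 map whose value at (y, x) is 4*y + x.
feat = np.arange(16, dtype=float).reshape(1, 4, 4)
xs, ys = rotated_grid(center_xy=(2.0, 2.0), yaw=0.0, length=2.0, k=3)
local_repr = bilinear_sample(feat, xs, ys)  # shape (3, 1)
```

In a trained model the same operation would typically be carried out by a differentiable sampling operator on GPU tensors; the NumPy version here only makes the interpolation arithmetic explicit.
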
Description

Semantic compass-style dynamic traffic trajectory anchoring method

Technical Field

The invention relates to an intelligent traffic trajectory planning method, in particular to a semantic compass-style dynamic traffic trajectory anchoring method, and belongs to the technical field of intelligent transportation and autonomous driving.

Background

With the rapid development of autonomous driving technology, intelligent transportation systems, and multi-modal perception algorithms, safe decision-making and trajectory planning for vehicles in complex dynamic traffic environments have received wide attention. In a real traffic scene, a vehicle must simultaneously handle interactions among multiple traffic participants, traffic rule constraints, and changing scene semantics, so relying only on geometric information or a local motion model cannot meet the requirements of high safety and strong generalization. How to generate reasonable, interpretable, and robust driving trajectories in dynamic traffic flow has therefore become a core research problem in autonomous driving.

Existing trajectory planning methods are mainly rule-driven, optimization-based, or data-driven. Rule-driven and optimization-based methods generally depend on manually specified behavior models and constraints. They are reasonably stable in structured road scenes, but they struggle to cover diverse traffic behaviors in complex interaction scenes and cannot describe uncertainty or multi-modal futures. In recent years, data-driven methods based on deep learning have emerged that generate future trajectories directly from perception results with an end-to-end model, which improves planning flexibility to some extent. However, such methods focus on modeling low-level geometric or motion features and have an insufficient understanding of high-level semantic information and interaction intent in traffic scenes, so their planning results in complex scenes tend to be unreasonable or hard to interpret.

Meanwhile, diffusion models have been introduced into trajectory generation and intent prediction because they can generate diverse samples and characterize uncertainty. Although diffusion-based trajectory planning can generate reasonably distributed multi-modal trajectories, existing methods have the following shortcomings. First, the diffusion process lacks high-level semantic guidance: the generated results are driven mainly by low-level features and rarely reflect semantic constraints such as traffic rules and scene intent. Second, the interaction relationships among multiple traffic participants are modeled insufficiently and an explicit interactive directional prior is missing, so unstable or conflicting trajectories are easily generated in dense traffic flow. Third, trajectory generation mostly starts from random noise or a weakly constrained initial state and lacks effective anchoring of a reasonable starting point and directional intent, which limits convergence efficiency and trajectory controllability.

In summary, trajectory planning and intent prediction for autonomous driving still face multiple technical challenges in complex dynamic traffic environments. On the one hand, most existing trajectory planning methods rely on rule-based or optimization-based modeling, which has limited capacity to express the behavior of traffic participants and adapts poorly to the multi-agent interaction, evolving uncertainty, and long-horizon behavior changes found on real roads. In high-density traffic flow or complex interaction scenes in particular, the planning results tend to be overly conservative, conflicting, or unstable. On the other hand, although data-driven methods based on deep learning can learn trajectory distributions from multi-modal perception information, they often focus on modeling low-level geometric or motion features and do not make systematic use of the high-level semantic information in traffic scenes, so the generated trajectories fall short in interpretability, rule consistency, and generalization.

With the development of vision-language models, researchers have begun to use large-scale pre-trained models to understand traffic scenes semantically and to introduce language-level constraints into the planning process. However, most existing work treats semantic information as an additional feature or as a post-hoc filtering condition. No unified framework yet integrates semantic understanding deeply into the whole trajectory generation process, the association between semantics and trajectories remains loose, and continuous guidance in a dynamic traffic environment is difficult to achieve.

Disclosure of Invention

Aiming at the technic