
CN-122019846-A - Strong interactive driving scene coding and unified searching method

CN122019846A

Abstract

The invention relates to the technical fields of intelligent transportation, autonomous-driving data processing, and scene retrieval, and in particular to a strong-interaction driving scene encoding and unified retrieval method. The method first constructs frame-by-frame dual feature vectors, applies multi-head temporal attention to weight-pool the per-frame features, extracts temporal-pattern features that characterize the evolution of the interaction, and encodes a traffic-rule prior vector; after block-wise normalization and weighted concatenation, a scene embedding vector suitable for similarity computation is obtained. On top of the resulting index library and scene cluster labels, a unified submodular retrieval objective function is proposed, so that similar-scene retrieval and diversity-covering retrieval are realized within the same greedy framework, outputting a scene set that satisfies both relevance and coverage requirements. The method can be used for hard-case mining, training-data completion, and curriculum construction, reducing redundant retrieval and improving coverage of scene types. The invention offers good generality, extensibility, and interpretability.

Inventors

  • HANG PENG
  • DONG XINWEI
  • XIE JIABIN
  • XU CHENGKAI
  • FANG SHIYU

Assignees

  • 同济大学 (Tongji University)

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. A strong-interaction driving scene encoding and unified retrieval method, characterized by comprising the following steps: S1, acquiring strong-interaction driving scene data: acquiring or reading the state sequences of traffic participants over a plurality of time frames, determining a focus interaction subject pair, and simultaneously acquiring scene meta-information related to the interaction subject pair; S2, constructing dual features frame by frame: for the selected interaction subject pair, computing a normalized dual feature vector for each frame in the set of common visible frames and composing these vectors into a dual feature vector sequence; S3, data-driven multi-head temporal attention encoding: constructing multi-head temporal attention weights based on the dual feature vector sequence, weight-pooling the sequence per head, and concatenating the pooled results into an attention feature block; S4, fusing temporal patterns with traffic-rule priors: extracting a temporal-pattern feature block from the dual feature vector sequence, constructing a traffic-rule prior feature block from the scene meta-information, and block-wise normalizing and weight-concatenating the feature blocks to obtain the scene embedding vector; S5, performing steps S1 to S4 on a plurality of strong-interaction driving scenes in a scene library to obtain the corresponding set of scene embedding vectors, clustering this set to obtain cluster labels, and computing a uniqueness score for each scene; S6, unified submodular greedy retrieval: when a query scene is received, computing the similarity between the query scene's embedding vector and the embedding vectors in the index library, constructing a marginal gain function that combines this similarity with a cluster-coverage gain, the uniqueness scores, and a redundancy penalty term, and greedily and iteratively selecting a preset number of scenes to form the retrieval result set S, each selected element being a candidate scene. (Illustrative sketches of steps S2 to S6 follow the claims.)
  2. The method according to claim 1, wherein in step S1 the strong-interaction driving scene is organized around the focus interaction subject pair, whose two elements denote the first and the second interacting subject; the focus interaction subject pair is either extracted from the multi-subject interaction relations in the scene or specified externally; the scene meta-information includes at least one of: post-encroachment time (PET), path relation (PathRelation), path category (PathCategory), turn label (TurnLabel), priority label (PriorityLabel), and speed limit (SpeedLimit).
  3. The method according to claim 2, wherein the strong-interaction scene segments used in step S1 are extracted from a data set in advance, and each extraction record is located and aligned via its original scene identifier, interaction subject identifiers, start frame, end frame, and metadata; the subject-identification field records the identities of the two interacting parties and is parsed to obtain the focus interaction subject pair; the state of each subject in a time frame includes at least a position vector, a velocity vector, and a heading angle, and may further include an intent label, a visibility flag, the post-encroachment time, and other meta-information.
  4. The method according to claim 1, wherein in step S2 the set of common visible frames refers to the set of time frames in which both subjects of the pair are effectively observed, its cardinality being the number of common visible frames T; the dual feature vector sequence is the sequence of the dual feature vectors arranged in the temporal order of the common visible frames.
  5. The method of claim 1, wherein in step S2 the dual feature vector comprises a geometry-corrected inter-subject distance, an approach speed, the deceleration required for collision avoidance, an intent-conflict score, the speeds of the two subjects, a lateral offset, and a heading-consistency term (see the dual-feature sketch after the claims).
  6. The method according to claim 1, wherein in step S3 a plurality of attention heads are constructed for the dual feature vector sequence f_1, ..., f_T, each attention head generating a query signal from a different conflict signal and obtaining its attention weights through the softmax function:

     α_t^(h) = exp(q_t^(h) / τ) / Σ_{t'=1..T} exp(q_{t'}^(h) / τ),

     where q_t^(h) is the query signal of the h-th attention head at time t, α_t^(h) is the corresponding attention weight, τ is a temperature parameter controlling the sharpness of the attention distribution, H is the total number of attention heads, and h = 1, ..., H; after the weights of each head are obtained, the dual feature vector sequence is weight-pooled into each head's aggregate vector:

     z^(h) = Σ_{t=1..T} α_t^(h) · f_t,

     where z^(h) is the aggregate vector of the h-th attention head and T is the number of common visible frames; the aggregate vectors of all attention heads are concatenated to obtain the attention feature block (see the attention-pooling sketch after the claims).
  7. The method of claim 6, wherein five attention heads are provided, each keyed to a different conflict signal, including a distance attention head, an intent-conflict attention head, and a required-deceleration attention head; the query vector of each head is computed from the normalized value of its conflict signal at each time instant, e.g. the normalized inter-subject distance for the distance head and the intent-conflict score at time t for the intent-conflict head.
  8. The method according to claim 6, wherein in step S4 the temporal-pattern feature block extracted from the dual feature vector sequence characterizes the evolution trend of the interaction process and is a 12-dimensional vector comprising a minimum-value statistic, the conflict-peak occurrence-time ratio, a trend term, the proportion of continuously dangerous frames, a peak value, the mean intent-conflict score, a further score term, the likelihood of path intersection, an intersection-type path-category indicator, a merge-type path-category indicator, and the common-visible-frame coverage; the traffic-rule prior feature block comprises a continuous rule-feature sub-vector and categorical one-hot rule encodings, and participates, together with the temporal-pattern feature block, in the construction of the scene embedding vector; the continuous sub-vector encodes traffic-rule information including priority, yield, speed limit, turning, and path risk; the categorical one-hot encodings cover the path-relation category and the symmetric turn-combination category, and their contributions to similarity computation are amplified by preset scaling coefficients; the one-hot sub-vector for the path-relation category is

     e_rel = γ_rel · ê_{idx(PathRelation)},

     where γ_rel is the scaling coefficient of the path-relation one-hot code, idx(PathRelation) is the index of the path-relation category in a preset vocabulary, and ê_k denotes the standard basis vector for index k; the one-hot sub-vector for the turn-symmetric combination category is

     e_turn = γ_turn · ê_{idx(TurnCombo)},

     where γ_turn is the scaling coefficient of the turn-symmetric combination code and idx(TurnCombo) is the category index corresponding to the turn-label combination; the attention feature block F_att, the temporal-pattern feature block F_pat, and the traffic-rule prior feature block F_rule are each block-wise L2-normalized, then weight-concatenated and globally L2-normalized to obtain the final scene embedding vector:

     E = L2([w_att · L2(F_att); w_pat · L2(F_pat); w_rule · L2(F_rule)]),

     where L2(·) denotes L2 normalization (block-wise inside the brackets, global outside) and w_att, w_pat, w_rule are the weight coefficients of the attention, temporal-pattern, and traffic-rule prior feature blocks, respectively (see the embedding-assembly sketch after the claims).
  9. The method of claim 1, wherein in the offline stage of step S5 a retrieval index is constructed from the scene embedding vector of each scene in the scene library; the scene library consists of a plurality of strong-interaction scene segments extracted offline from a traffic scene data set by an interaction-event extraction module and organized by scene identifier, interaction subject identifiers, and start-frame alignment; the pre-computable inter-scene similarity matrix S and distance matrix D are

     S_ij = (E_i · E_j) / (||E_i|| ||E_j||),   D_ij = 1 − S_ij,

     where i and j are scene indices, S_ij is the cosine similarity of scene i and scene j, and D_ij is the cosine distance; clustering the scene embedding vector set yields cluster labels c_i, where c_i denotes the interaction-type cluster number of scene i; two scenes with the same label belong to the same cluster in the embedding space, and c_i < 0 marks an outlier; the uniqueness score u_i combines the distance from a scene to its nearest-neighbor scene with its average distance to all other scenes:

     u_i = d_i^nn / max_k d_k^nn + d̄_i / max_k d̄_k + β · 1[c_i < 0],

     where d_i^nn is the distance from scene i to its nearest-neighbor scene, d̄_i is its average distance to all the other scenes, N is the scene-library size, max_k d_k^nn and max_k d̄_k are the library-wide maxima of the nearest-neighbor and average distances used for normalization, and 1[c_i < 0] is the outlier indicator function taking 1 when scene i is an outlier and 0 otherwise, the outlier reward term raising the exploration priority of such scenes; when no query scene is provided, an exploration sequence covering the scene space is generated by farthest-point sampling (FPS):

     i_1 = argmax_i u_i,   i_s = argmax_{i ∉ selected} min_{j ∈ selected} D_ij,

     where i_s is the scene index selected at step s of the farthest-point sampling, u_i is the uniqueness score of scene i, D_ij is the distance between candidate scene i and selected scene j, and min_{j ∈ selected} D_ij is the nearest distance from the candidate to the selected set; each sampling step prefers the scene farthest from the selected set, gradually expanding the coverage of the scene space (see the index/uniqueness/FPS sketch after the claims).
  10. The method according to claim 1, wherein in step S6 cosine similarity is computed from the scene embedding vectors and the marginal gain function of the unified submodular retrieval is constructed; the cosine similarity is

     sim(q, c) = E_q · E_c,

     where E_q is the scene embedding vector of the query scene q and E_c is that of candidate scene c; the plain dot product suffices because the embedding vectors have already been L2-normalized; the marginal gain function is

     Δ(c | S) = λ_rel · sim(q, c) + λ_cov · 1[cluster(c) ∉ clusters(S)] + λ_uni · u_c − λ_red · max_{s ∈ S} sim(c, s),

     where λ_rel, λ_cov, λ_uni, and λ_red are the relevance weight, the coverage weight, the uniqueness-reward weight, and the redundancy-penalty weight, respectively; 1[·] is the coverage-gain indicator function, taking 1 when the cluster label of candidate scene c is not yet covered by the selected set S and 0 otherwise; u_c is the uniqueness score of candidate scene c; and the last term, the similarity between the candidate and the most similar scene already selected, suppresses redundancy (see the greedy-retrieval sketch after the claims).
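
Illustrative Sketches

Claims 2, 4, and 5 name the inputs and the dual-feature components, but the published text carries no closed-form definitions (the formula images were lost). The following is a minimal sketch under plain kinematic assumptions: the v²/(2d) proxy for the required collision-avoidance deceleration, the feature ordering, and the per-frame normalization are illustrative choices, not the patent's definitions.

```python
import numpy as np

def common_visible_frames(vis_i: np.ndarray, vis_j: np.ndarray) -> np.ndarray:
    """Indices of the frames in which both subjects are effectively observed."""
    return np.flatnonzero(vis_i & vis_j)

def dual_feature(pos_i, vel_i, hdg_i, pos_j, vel_j, hdg_j,
                 intent_conflict: float = 0.0) -> np.ndarray:
    """One normalized dual feature vector for a single common visible frame."""
    rel = pos_j - pos_i
    dist = np.linalg.norm(rel) + 1e-9              # inter-subject distance
    closing = -np.dot(vel_j - vel_i, rel) / dist   # approach (closing) speed
    # Assumed proxy for the deceleration needed to avoid collision:
    # dissipate the closing speed over the current gap, v^2 / (2 d).
    decel_req = max(closing, 0.0) ** 2 / (2.0 * dist)
    speed_i = np.linalg.norm(vel_i)                # speed of the first subject
    speed_j = np.linalg.norm(vel_j)                # speed of the second subject
    # Lateral offset of subject j expressed in subject i's heading frame.
    lateral = -np.sin(hdg_i) * rel[0] + np.cos(hdg_i) * rel[1]
    heading_consistency = np.cos(hdg_i - hdg_j)    # 1 = parallel headings
    f = np.array([dist, closing, decel_req, intent_conflict,
                  speed_i, speed_j, lateral, heading_consistency])
    return f / (np.linalg.norm(f) + 1e-9)          # per-frame normalization

def dual_feature_sequence(traj_i: dict, traj_j: dict) -> np.ndarray:
    """Stack the dual features over the common visible frames (claim 4)."""
    frames = common_visible_frames(traj_i["visible"], traj_j["visible"])
    return np.stack([
        dual_feature(traj_i["pos"][t], traj_i["vel"][t], traj_i["hdg"][t],
                     traj_j["pos"][t], traj_j["vel"][t], traj_j["hdg"][t])
        for t in frames
    ])
```

Normalizing each frame's vector keeps the subsequent attention pooling scale-free across scenes of different sizes and speeds.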
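Claims 6 and 7 fully determine the pooling structure: a temperature-controlled softmax per head over the time axis, followed by weighted pooling and concatenation. The sketch below assumes the feature layout of the previous sketch; only three of the five head names in claim 7 survived extraction, so the "uniform" head here is a stand-in, not the patent's fifth head.

```python
import numpy as np

def softmax(x: np.ndarray, tau: float = 0.25) -> np.ndarray:
    """Temperature-controlled softmax over the time axis (claim 6)."""
    z = (x - x.max()) / tau          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_feature_block(F: np.ndarray, tau: float = 0.25) -> np.ndarray:
    """F has shape (T, d): one dual feature vector per common visible frame.
    Returns the concatenation of the per-head pooled vectors."""
    dist, closing, decel, intent = F[:, 0], F[:, 1], F[:, 2], F[:, 3]
    queries = {                      # one query signal per attention head
        "proximity": -dist,          # near frames score high
        "closing_speed": closing,
        "required_decel": decel,
        "intent_conflict": intent,
        "uniform": np.zeros(len(F)), # plain average; stand-in fifth head
    }
    pooled = []
    for q in queries.values():
        alpha = softmax(q, tau)      # attention weights alpha_t, shape (T,)
        pooled.append(alpha @ F)     # weighted pooling -> aggregate vector z
    return np.concatenate(pooled)    # attention feature block
```

A small τ concentrates each head on its most conflict-laden frames; a large τ degrades every head toward plain temporal averaging.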
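Claim 8 specifies block-wise L2 normalization, weighted concatenation, and a final global L2 normalization. The sketch below follows that structure; the path-relation and turn-combination vocabularies, the scaling coefficients gamma, and the block weights w are assumed values, since the patent leaves them as presets.

```python
import numpy as np

PATH_RELATIONS = ["cross", "merge", "diverge", "follow", "oncoming"]  # assumed vocabulary
TURN_COMBOS = ["S-S", "S-L", "S-R", "L-L", "L-R", "R-R"]              # assumed symmetric turn pairs

def l2(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x) + 1e-9)

def one_hot(index: int, size: int, scale: float = 1.0) -> np.ndarray:
    v = np.zeros(size)
    v[index] = scale                 # scaled standard basis vector
    return v

def scene_embedding(att_block, pattern_block, rule_cont, path_rel, turn_combo,
                    w_att=1.0, w_pat=1.0, w_rule=0.5,
                    gamma_path=1.5, gamma_turn=1.5) -> np.ndarray:
    """att_block: attention feature block (claim 6); pattern_block: the
    12-dim temporal-pattern statistics (claim 8); rule_cont: continuous rule
    features such as priority / yield / speed-limit / turn / path-risk terms."""
    rule_block = np.concatenate([
        np.asarray(rule_cont, dtype=float),
        one_hot(PATH_RELATIONS.index(path_rel), len(PATH_RELATIONS), gamma_path),
        one_hot(TURN_COMBOS.index(turn_combo), len(TURN_COMBOS), gamma_turn),
    ])
    e = np.concatenate([w_att * l2(np.asarray(att_block, dtype=float)),
                        w_pat * l2(np.asarray(pattern_block, dtype=float)),
                        w_rule * l2(rule_block)])
    return l2(e)                     # global L2 norm -> unit-length embedding
```

Scaling the one-hot sub-vectors (gamma > 1) is what lets topology categories dominate cosine similarity when two scenes differ in path relation or turn combination.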
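Claim 9 describes the offline index statistics: cosine similarity/distance matrices, a uniqueness score mixing nearest-neighbor and mean distances with an outlier reward, and farthest-point sampling for a query-free exploration order. The equal weighting of the two normalized distance terms and the size of the outlier bonus are assumptions of this sketch.

```python
import numpy as np

def distance_matrix(E: np.ndarray) -> np.ndarray:
    """E: (N, d) matrix of L2-normalized scene embeddings."""
    S = E @ E.T                    # cosine similarity S_ij (unit-norm rows)
    return 1.0 - S                 # cosine distance D_ij = 1 - S_ij

def uniqueness_scores(D: np.ndarray, labels: np.ndarray,
                      outlier_bonus: float = 0.5) -> np.ndarray:
    N = D.shape[0]
    off = D.copy()
    np.fill_diagonal(off, np.inf)  # ignore self-distances
    d_nn = off.min(axis=1)                         # nearest-neighbour distance
    d_avg = D.sum(axis=1) / (N - 1)                # mean distance to the rest
    u = d_nn / d_nn.max() + d_avg / d_avg.max()    # normalized combination
    return u + outlier_bonus * (labels < 0)        # outlier reward (label < 0)

def farthest_point_order(D: np.ndarray, u: np.ndarray) -> list:
    """Query-free exploration order: seed with the most unique scene, then
    repeatedly take the scene farthest from everything selected so far."""
    order = [int(np.argmax(u))]
    rest = set(range(D.shape[0])) - set(order)
    while rest:
        nxt = max(rest, key=lambda i: min(D[i, j] for j in order))
        order.append(nxt)
        rest.remove(nxt)
    return order
```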
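Claim 10 enumerates the four terms of the marginal gain; the sketch below greedily maximizes it, scanning the candidates each round and keeping the best scene. The four lambda weights are assumed values.

```python
import numpy as np

def greedy_retrieve(e_q, E, labels, u, K=10,
                    lam_rel=1.0, lam_cov=0.5, lam_uni=0.2, lam_red=0.8):
    """e_q: unit-norm query embedding; E: (N, d) unit-norm index embeddings;
    labels: cluster labels (claim 9); u: uniqueness scores. Returns the
    indices of the K greedily selected scenes."""
    sim_q = E @ e_q                          # cosine similarity to the query
    selected, covered = [], set()
    for _ in range(min(K, len(E))):
        best, best_gain = -1, -np.inf
        for c in range(len(E)):
            if c in selected:
                continue
            cover = 1.0 if labels[c] not in covered else 0.0   # coverage gain
            red = max((float(E[c] @ E[s]) for s in selected), default=0.0)
            gain = (lam_rel * sim_q[c] + lam_cov * cover
                    + lam_uni * u[c] - lam_red * red)          # marginal gain
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
        covered.add(labels[best])
    return selected
```

A fixed candidate's gain never increases as the selected set grows (the coverage term can only drop and the redundancy penalty only rise), so the loop exhibits the diminishing returns of submodular maximization; re-weighting the lambda terms switches the same framework between similar-scene and diversity-covering retrieval.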

Description

Strong interactive driving scene coding and unified searching method

Technical Field

The invention relates to the technical fields of intelligent transportation, autonomous-driving data processing, and scene retrieval, and in particular to a strong-interaction driving scene encoding and unified retrieval method.

Background

In tasks such as autonomous-driving motion prediction, behavior decision-making, and safety evaluation, training data must cover diverse road topologies and multi-subject interaction behaviors. However, strong-interaction scenarios (e.g., yielding when meeting oncoming traffic, crossing conflicts, lane merging) are relatively rare in natural driving data, and random sampling struggles to cover such rare but critical interaction types. The prior art generally suffers from: insufficient generalization; inefficient hard-case mining, since screening by a single indicator such as distance or time-to-collision cannot stably reflect how difficult a scene is for a downstream prediction model; redundant search results, since ranking by cosine similarity alone easily returns large numbers of near-duplicate scenes; insufficient coverage when completing a training set; and a fragmented retrieval framework, in which similar-scene search and diversified search are implemented by different algorithms, making fair comparison and on-demand switching under a single objective function difficult. A unified retrieval method is therefore needed that provides interpretable encoding of strong-interaction scenes while accounting for both relevance and coverage within the same framework.

Disclosure of Invention

The invention aims to provide a strong-interaction driving scene encoding and unified retrieval method based on traffic-rule priors and data-driven attention, so as to solve the problems caused by the rarity of strong-interaction scenes: insufficient coverage, redundancy in similarity retrieval, and the mismatch between heuristic screening and model difficulty.
The aim of the invention is achieved by the following technical scheme. A strong-interaction driving scene encoding and unified retrieval method comprises the following steps: S1, acquiring strong-interaction driving scene data: acquiring or reading the state sequences of traffic participants over a plurality of time frames, determining a focus interaction subject pair, and simultaneously acquiring scene meta-information related to the interaction subject pair; S2, constructing dual features frame by frame: for the selected interaction subject pair, computing a normalized dual feature vector for each common visible frame and composing the dual feature vector sequence; S3, data-driven multi-head temporal attention encoding: constructing multi-head temporal attention weights from the dual feature vector sequence, weight-pooling the sequence per head, and concatenating the results into the attention feature block; S4, fusing temporal patterns with traffic-rule priors: extracting the temporal-pattern feature block from the dual feature vector sequence, constructing the traffic-rule prior feature block from the scene meta-information, and obtaining the scene embedding vector after block-wise normalization and weighted concatenation; S5, performing steps S1 to S4 on a plurality of strong-interaction driving scenes in a scene library to obtain the corresponding set of scene embedding vectors, clustering this set to obtain cluster labels, and computing the uniqueness score of each scene; S6, unified submodular greedy retrieval: when a query scene is received, computing the similarity between its scene embedding vector and the embedding vectors in the index library, constructing a marginal gain function that combines cluster-coverage gain, uniqueness score, and a redundancy penalty term, and greedily and iteratively selecting a preset number of scenes to form the retrieval result set S, each element being a candidate scene. A usage sketch chaining these steps appears at the end of this section.

Advantageous Effects

Compared with the prior art, the invention has the following advantages: (1) Focus-oriented interaction subject pairs yield interpretable scene embeddings. Frame-by-frame dual features aggregated by data-driven multi-head attention highlight the key conflict frames and suppress irrelevant segments, effectively characterizing the evolution of the interaction. (2) Introducing traffic-rule priors enhances topological consistency. Encoding rule information such as priority, speed limit, turning, and path relation as a prior feature block improves the ability to distinguish interaction topology types without relying on supervised training of the encoder. (3) The unified submodular retrieval framework accounts for both relevance and coverage.
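
As a usage illustration of the overall flow, the sketch below chains steps S2 to S6, assuming the functions from the sketches following the claims are collected in a hypothetical module named scene_encoding; the scene record fields and the clustering callback are placeholders, not names taken from the patent.

```python
import numpy as np
# Hypothetical module collecting the sketch functions printed after the
# claims; the module name does not appear in the patent.
from scene_encoding import (dual_feature_sequence, attention_feature_block,
                            scene_embedding, distance_matrix,
                            uniqueness_scores, greedy_retrieve)

def encode_scene(scene: dict) -> np.ndarray:
    """Steps S2-S4 for one scene record (all field names are placeholders)."""
    F = dual_feature_sequence(scene["subject_i"], scene["subject_j"])    # S2
    att = attention_feature_block(F)                                     # S3
    return scene_embedding(att, scene["pattern12"], scene["rule_cont"],  # S4
                           scene["path_rel"], scene["turn_combo"])

def build_and_query(library: list, e_q: np.ndarray, cluster_fn, K: int = 20):
    """Steps S5-S6: offline indexing, then unified submodular retrieval.
    `cluster_fn` is any density clusterer returning integer labels with -1
    for outliers (e.g. a DBSCAN wrapper)."""
    E = np.stack([encode_scene(s) for s in library])                     # S1-S4
    labels = cluster_fn(E)                                               # S5
    u = uniqueness_scores(distance_matrix(E), labels)
    return greedy_retrieve(e_q, E, labels, u, K=K)                       # S6
```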