CN-122024206-A - Automatic driving threat object detection method based on semantic-geometric pseudo tag
Abstract
The invention discloses an automatic driving threat object detection method based on semantic-geometric pseudo tags. The method constructs a threat object detection system consisting of a semantic branch candidate box generation module, a geometric branch candidate box generation module, a semantic-geometric complementary fusion module and a low-rank open-vocabulary object detector. The semantic and geometric branch modules each generate a candidate box set, and the semantic-geometric complementary fusion module fuses the two sets to generate a pseudo-tag set for unknown threat objects. The pseudo tags are combined with the ground-truth tags of the known classes to train the low-rank open-vocabulary object detector, so that the detector learns a universal threat-object representation. The trained system then detects threat objects: for known classes it generates a class name and a bounding box, and unknown classes it classifies as "unknown threat" and generates a bounding box. The invention can identify unknown threat objects and achieves high detection precision.
Inventors
- CHEN WEI
- HE HONGYU
- HE YULIN
- ZHOU WENJUAN
- JIAN ZHIKANG
- JIANG RAN
Assignees
- National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科技大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260225
Claims (10)
- 1. An automatic driving threat object detection method based on semantic-geometric pseudo tags, characterized by comprising the following steps:
First step, constructing a threat object detection system based on semantic-geometric dual-branch pseudo-tag generation and parameter-efficient fine-tuning, wherein the threat object detection system comprises a semantic branch candidate box generation module, a geometric branch candidate box generation module, a semantic-geometric complementary fusion module and a low-rank open-vocabulary object detector; the semantic branch candidate box generation module comprises a first multi-modal large language model and a first open-vocabulary object detector; the geometric branch candidate box generation module comprises a 3D foundation model and a point cloud processing module; the semantic-geometric complementary fusion module comprises a prompt generation module and a second multi-modal large language model; the low-rank open-vocabulary object detector comprises three low-rank adaptation modules and a second open-vocabulary object detector, the three low-rank adaptation modules being embedded in the second open-vocabulary object detector;
Second step, constructing a training set and a test set, as follows: the training set D = {(I_i, Y_i) | 1 ≤ i ≤ N} is constructed from the validation set of the autonomous-driving image dataset CODA, which contains corner cases, wherein N denotes the total number of images, I_i denotes the i-th image, Y_i = {(b_ij, c_ij) | 1 ≤ j ≤ n_i} denotes the ground-truth label set corresponding to I_i, n_i is the total number of annotated objects in I_i, b_ij = (x1_ij, y1_ij, x2_ij, y2_ij) denotes the bounding-box coordinates of the j-th annotated object, (x1_ij, y1_ij) denoting the upper-left corner pixel coordinates of b_ij and (x2_ij, y2_ij) its lower-right corner pixel coordinates, and c_ij denotes the class of the j-th annotated object; the test set is constructed from the test set of CODA;
Third step, initializing the threat object detection system, including: initializing the parameters of the semantic branch candidate box generation module; initializing the parameters of the geometric branch candidate box generation module, which include a horizontal boundary interval [X_min, X_max], a maximum radial distance R_max and a perspective-normalization weight factor λ; initializing the number threshold N_q with which the language-guided query selection module of the second open-vocabulary object detector filters query vectors; and initializing the ranks and scaling factors of the three low-rank adaptation modules;
Fourth step, the semantic branch candidate box generation module, the geometric branch candidate box generation module and the semantic-geometric complementary fusion module cooperate to generate, for the training set D, the pseudo-tag set P_pseudo of unknown threat objects, and P_pseudo is sent to the loss calculation module;
Fifth step, using the pseudo tags in the pseudo-tag set P_pseudo and the ground-truth tags in the training set D to perform fine-tuning training of the low-rank open-vocabulary object detector, as follows:
Step 5.1, initializing the weight parameters of the low-rank open-vocabulary object detector in the threat object detection system, adopting the pre-trained weights of the Grounding DINO model as the initialization weights;
Step 5.2, dividing D into B batches, each batch containing b images, b being a positive integer; each training batch is constructed as a four-dimensional tensor of shape (b, 3, H, W), wherein H and W denote the height and width of the images and the value 3 corresponds to the RGB color channels; constructing the text prompt sequence T = "pedestrian. cyclist. car. truck. bus. unknown threat", which contains the known class names and a generic unknown-threat-object cue, the symbol "." in T serving as the delimiter between categories;
Step 5.3, initializing the batch counter t = 1, initializing the learning rate to a floating-point number, and adopting AdamW as the training optimizer of the open-vocabulary object detector;
Step 5.4, the visual feature extraction module extracts, by the visual feature extraction method, the visual feature sequence F_v of the t-th batch of training images, and sends F_v to the cross-modal encoding module;
Step 5.5, the text feature extraction module takes T as the text prompt and extracts its text features F_t by the text feature extraction method; the threat-semantic adaptation features output by the first low-rank adaptation module are superimposed on F_t, and F_t is transmitted to the cross-modal encoding module;
Step 5.6, the cross-modal encoding module receives F_v from the visual feature extraction module and F_t from the text feature extraction module, performs feature enhancement on F_v and F_t by the feature enhancement method, superimposing the threat-aware enhancement adaptation features output by the second low-rank adaptation module, to obtain the enhanced visual features F'_v of the t-th batch of training images and the enhanced text features F'_t of the text prompt sequence; F'_v is sent to the language-guided query selection module and the cross-modal decoding module, and F'_t is transmitted to the language-guided query selection module, the cross-modal decoding module and the prediction generation module, respectively;
Step 5.7, the language-guided query selection module receives F'_v and F'_t from the cross-modal encoding module, and by the cross-modal query method selects from F'_v the indices of the features most relevant to F'_t and extracts the feature vectors at the corresponding positions as the initial object query vectors Q of the t-th batch of training images, transmitting Q to the cross-modal decoding module;
Step 5.8, the cross-modal decoding module receives F'_v and F'_t from the cross-modal encoding module and Q from the language-guided query selection module, performs attention calculation on F'_v, F'_t and Q by the cross-modal decoding method, extracts threat-object information from F'_v, and superimposes the threat-aware refinement adaptation features output by the third low-rank adaptation module, obtaining the refined region representations F_r of the t-th batch of training images; F_r is sent to the prediction generation module;
Step 5.9, the prediction generation module receives F'_t from the cross-modal encoding module and F_r from the cross-modal decoding module, and performs result prediction by the prediction generation method to generate the prediction result set P_t = {(b_k, c_k, s_k)}, wherein the triplet (b_k, c_k, s_k) represents the complete detection information of the k-th prediction result on an image of the t-th batch, b_k denotes the bounding-box coordinates, c_k denotes the predicted class and s_k denotes the confidence of the prediction;
Step 5.10, the loss calculation module calculates the total loss function L_total for back propagation;
Step 5.11, the AdamW optimizer optimizes the weight parameters of the first, second and third low-rank adaptation modules by the gradient back-propagation method, obtaining the weight parameters of the three low-rank adaptation modules after the t-th batch of training;
Step 5.12, if t < B, letting t = t + 1 and going to step 5.4; if t = B, the weight parameters of the trained first, second and third low-rank adaptation modules are obtained;
Sixth step, loading the weight parameters of the trained first, second and third low-rank adaptation modules into the first, second and third low-rank adaptation modules of the low-rank open-vocabulary detector, respectively, to obtain the trained threat object detection system;
Seventh step, the low-rank open-vocabulary detector of the trained threat object detection system detects threat objects in driving-scene images, as follows:
Step 7.1, the visual feature extraction module receives the image I to be detected in the driving scene, extracts its visual feature sequence F_v by the visual feature extraction method described in step 5.4, and transmits F_v to the cross-modal encoding module;
Step 7.2, the text feature extraction module takes T as the text prompt, extracts its text features F_t by the text feature extraction method described in step 5.5, and transmits F_t to the cross-modal encoding module;
Step 7.3, the cross-modal encoding module receives F_v from the visual feature extraction module and F_t from the text feature extraction module, performs feature enhancement by the feature enhancement method described in step 5.6 to obtain the enhanced visual features F'_v and enhanced text features F'_t of the image to be detected, and sends F'_v and F'_t to the language-guided query selection module and the cross-modal decoding module;
Step 7.4, the language-guided query selection module receives F'_v and F'_t from the cross-modal encoding module, and by the cross-modal query method of step 5.7 selects from F'_v the indices of the most relevant features and extracts the feature vectors at the corresponding positions as the initial object query vectors Q of the image to be detected, transmitting Q to the cross-modal decoding module;
Step 7.5, the cross-modal decoding module receives Q from the language-guided query selection module and F'_v and F'_t from the cross-modal encoding module, performs attention calculation on F'_v, F'_t and Q by the cross-modal decoding method described in step 5.8, extracts threat-object information from F'_v, obtains the refined region representations F_r of the image to be detected, and sends F_r to the prediction generation module;
Step 7.6, the prediction generation module receives F'_t from the cross-modal encoding module and F_r from the cross-modal decoding module, and performs result prediction by the prediction generation method described in step 5.9 to generate the prediction result set P = {(b_k, c_k, s_k) | 1 ≤ k ≤ K}, wherein the triplet (b_k, c_k, s_k) represents the complete detection information of the k-th prediction result for the image to be detected, b_k denotes the bounding-box coordinates, c_k denotes the predicted class, s_k denotes the confidence of the prediction, and K denotes the total number of prediction results.
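The low-rank adaptation used throughout claim 1 can be sketched as the standard additive LoRA update, in which the frozen pre-trained projection W0 is augmented by a trainable low-rank bypass B·A scaled by alpha/r. This is a minimal illustrative sketch, not the patented implementation; the matrix shapes and the zero initialization of B are common LoRA conventions assumed here (the rank 8 and scaling 16 match the first module's initialization in claim 4).

```python
# Minimal LoRA sketch: y = W0 x + (alpha / r) * B A x, with W0 frozen.
# Shapes are illustrative assumptions; only A and B would be trained.
import numpy as np

def lora_forward(x, W0, A, B, alpha, r):
    """Frozen base projection plus scaled low-rank bypass."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 32, 8, 16      # rank/scaling as in claim 4
W0 = rng.standard_normal((d_out, d_in))    # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # zero-init: adapter starts inert
x = rng.standard_normal(d_in)

y = lora_forward(x, W0, A, B, alpha, r)
# With B initialized to zero, the adapted output equals the frozen output.
assert np.allclose(y, W0 @ x)
```

With B zero-initialized, training begins exactly at the pre-trained model and the adapter only gradually injects the threat-specific adaptation features.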
- 2. The automatic driving threat object detection method based on semantic-geometric pseudo tags of claim 1, wherein the first multi-modal large language model adopts the Qwen-VL model, performs scene reasoning on the original images of the training set according to a chain-of-thought text prompt, and generates candidate categories to obtain a text description set containing text descriptions of the threat objects in the original image, sending the text description set to the first open-vocabulary object detector;
the 3D foundation model adopts the MoGe-2 model, receives the original images of the training set, converts each original image into pseudo point cloud data P and sends P to the point cloud processing module; the point cloud processing module receives P from the 3D foundation model, preprocesses and then clusters it, projects the clustering result back to the original 2D space to obtain the geometric branch candidate box set, and sends the geometric branch candidate box set to the semantic-geometric complementary fusion module;
the prompt generation module receives the semantic branch candidate box set from the semantic branch candidate box generation module and the geometric branch candidate box set from the geometric branch candidate box generation module, marks the candidate boxes on the original image to generate visual prompts, generates a specific text prompt for each candidate box, and sends the visual and text prompts to the second multi-modal large language model; the second multi-modal large language model adopts the Qwen-VL model, receives the visual and text prompts from the prompt generation module, performs binary verification on each semantic branch candidate box and each geometric branch candidate box under the guidance of the text prompt, judges whether a threat exists in each candidate box, generates the pseudo-tag set, and sends the pseudo-tag set to the low-rank open-vocabulary object detector;
the second open-vocabulary object detector adopts the Grounding DINO model, which comprises a visual feature extraction module, a text feature extraction module, a cross-modal encoding module, a language-guided query selection module, a cross-modal decoding module, a prediction generation module and a loss calculation module; the first low-rank adaptation module is inserted into the self-attention query projection layer and value projection layer of the text feature extraction module, the second low-rank adaptation module is inserted into the attention value projection layer and the feed-forward network layer of the cross-modal encoding module, and the third low-rank adaptation module is inserted into the attention value projection layer and the feed-forward network layer of the cross-modal decoding module;
the visual feature extraction module converts an input image into visual features F_v; the text feature extraction module converts the text prompt input by the user into text features F_t; the first low-rank adaptation module receives the same input as the self-attention query projection layer and the value projection layer and linearly superimposes its output threat-semantic adaptation features onto F_t, so as to learn the semantic representation of threat objects and enable the model to understand threat semantics, obtaining F_t with the threat-semantic adaptation features superimposed; the text feature extraction module transmits F_t to the cross-modal encoding module, which receives F_v and F_t, enhances F_v through a cross-attention mechanism to obtain the enhanced visual features F'_v and enhances F_t to obtain the enhanced text features F'_t; the second low-rank adaptation module receives the same input as the attention value projection layer and the feed-forward network layer and superimposes its output threat-aware enhancement adaptation features onto F'_v and F'_t, so as to fuse the visual and text features of the threat object and achieve visual-textual modal alignment of the model, obtaining the enhanced visual features F'_v and enhanced text features F'_t with the threat-aware enhancement adaptation features superimposed; F'_v is sent to the language-guided query selection module and the cross-modal decoding module, and F'_t is transmitted to the language-guided query selection module, the cross-modal decoding module and the prediction generation module, respectively;
the language-guided query selection module receives F'_v and F'_t, calculates the similarity between the two, selects the visual features with the highest similarity as the query vectors Q, and sends Q to the cross-modal decoding module; the cross-modal decoding module receives F'_v, F'_t and Q and decodes Q cross-modally to generate the refined visual features F_r; the third low-rank adaptation module receives the same input as the attention value projection layer and the feed-forward network layer and superimposes its output threat-aware refinement adaptation features onto the output features of the cross-modal decoding module, so as to perceive threat objects under the text prompt, obtaining the refined visual features F_r with the threat-aware refinement adaptation features superimposed; F_r is transmitted to the prediction generation module, which receives F'_t and F_r and decodes them into the prediction result set of the open-vocabulary object detector, including classes, confidences and bounding-box coordinates, sending the prediction results to the loss calculation module;
during the training phase, the loss calculation module receives the prediction results from the prediction generation module and the pseudo-tag set from the semantic-geometric complementary fusion module, extracts from the training set the ground-truth tag set corresponding to the training-set images, mixes the pseudo-tag set and the ground-truth tag set into a mixed tag set, calculates the loss, and updates the parameters of the three low-rank adaptation modules through a back-propagation algorithm based on the gradients computed from the loss value, while the original weights of the open-vocabulary object detector are kept frozen.
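The claim's rationale for freezing the detector backbone and training only the three adapters can be illustrated with a rough parameter count. The layer widths below are hypothetical placeholders (the patent does not state them); only the ranks 8/128/128 come from claim 4.

```python
# Rough parameter-count sketch for parameter-efficient fine-tuning:
# the frozen projection has d_in * d_out weights, while a LoRA adapter
# of rank r adds only r * d_in + d_out * r trainable weights.
def lora_param_count(d_in, d_out, r):
    return r * d_in + d_out * r  # A: (r, d_in), B: (d_out, r)

frozen = trainable = 0
# (d_in, d_out, rank) for the three insertion points -- illustrative dims.
for d_in, d_out, r in [(768, 768, 8), (1024, 1024, 128), (1024, 1024, 128)]:
    frozen += d_in * d_out                      # original layer, kept frozen
    trainable += lora_param_count(d_in, d_out, r)

ratio = trainable / (frozen + trainable)
assert trainable < frozen  # adapters are a small fraction of the weights
```

Under these assumed widths the adapters amount to well under a fifth of the touched weights, which is what makes the fine-tuning in step 5 "parameter efficient".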
- 3. An automatic driving threat object detection method based on semantic-geometric pseudo tags as defined in claim 1, wherein in the second step N = 4884; the training set D includes 5 known categories — pedestrian, rider, car, truck and bus — which are common categories in traffic scenes, and the ground-truth label sets include labels for the known classes as well as for 33 rare classes, the latter labeled as the "unknown threat" class.
- 4. A method of automatic driving threat object detection based on semantic-geometric pseudo tags as defined in claim 1, wherein the third step of initializing the threat object detection system is:
Step 3.1, initializing the parameters of the semantic branch candidate box generation module, mainly initializing the confidence threshold θ used to screen semantic candidate boxes;
Step 3.2, initializing the parameters of the geometric branch candidate box generation module, including initializing the point-cloud spatial filtering parameters: defining the horizontal boundary interval [X_min, X_max] (in meters), defining the maximum radial distance R_max (in meters) for screening the near-range point cloud around the ego vehicle, and initializing the perspective-normalization weight factor λ;
Step 3.3, initializing the parameters of the second open-vocabulary object detector, setting the number threshold N_q with which the language-guided query selection module filters query vectors to 100;
Step 3.4, initializing the parameters of the three low-rank adaptation modules, setting the rank of the first low-rank adaptation module to 8 with scaling factor 16, and setting the ranks of the second and third low-rank adaptation modules to 128 with scaling factor 256.
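The initialization of the third step can be collected into a single configuration sketch. Values stated in the claims (N_q = 100, the ranks and scaling factors) are taken from the text; the geometric-filter bounds, radial distance, weight factor and confidence threshold are placeholders (`None`) because the claim leaves their values unspecified.

```python
# Configuration sketch mirroring steps 3.1-3.4; None marks values the
# patent text does not disclose.
config = {
    "semantic": {"conf_threshold": None},        # theta, unspecified
    "geometric": {
        "x_bounds_m": (None, None),              # [X_min, X_max], meters
        "r_max_m": None,                         # max radial distance, meters
        "persp_weight": None,                    # lambda
    },
    "query_selection": {"num_queries": 100},     # N_q = 100 (step 3.3)
    "lora": [                                    # step 3.4
        {"module": "text_encoder",        "rank": 8,   "scaling": 16},
        {"module": "cross_modal_encoder", "rank": 128, "scaling": 256},
        {"module": "cross_modal_decoder", "rank": 128, "scaling": 256},
    ],
}
assert [m["rank"] for m in config["lora"]] == [8, 128, 128]
```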
- 5. The automatic driving threat object detection method based on semantic-geometric pseudo tags as defined in claim 1, wherein in the fourth step the method by which the semantic branch candidate box generation module, the geometric branch candidate box generation module and the semantic-geometric complementary fusion module cooperate to generate the pseudo tags of unknown threat objects for the training set D is:
Step 4.1, initializing the image index i = 1 and initializing the pseudo-tag set P_pseudo as an empty set;
Step 4.2, the semantic branch candidate box generation module generates the semantic branch candidate box set S_i of the i-th image I_i in D and sends S_i to the semantic-geometric complementary fusion module;
Step 4.3, the geometric branch candidate box generation module generates the geometric branch candidate box set G_i of the i-th image I_i in D and sends G_i to the semantic-geometric complementary fusion module;
Step 4.4, the semantic-geometric complementary fusion module merges the candidate boxes in S_i and G_i, judges the threat posed by the objects in the candidate boxes, and generates the pseudo-tag set P_pseudo for all the training data, as follows:
Step 4.4.1, merging the semantic branch candidate box set S_i and the geometric branch candidate box set G_i of I_i to obtain the total candidate box set B_i of I_i;
Step 4.4.2, initializing the candidate index b = 1, the b-th candidate box of the total candidate box set B_i having the coordinates (x1_b, y1_b, x2_b, y2_b), wherein (x1_b, y1_b) denotes its upper-left corner pixel coordinates and (x2_b, y2_b) its lower-right corner pixel coordinates;
Step 4.4.3, the prompt generation module in the semantic-geometric complementary fusion module generates a specific prompt for the b-th candidate box of B_i, marking the box on the image I_i to construct the visual prompt, the text prompt being: "The task is: for the bounding box marked in the image, evaluate the labeling quality and the true collision threat to the ego vehicle; please respond only with True or False";
Step 4.4.4, the second multi-modal large language model in the semantic-geometric complementary fusion module receives the visual and text prompts constructed by the prompt generation module, judges the threat of the object in the candidate box, and obtains the result v_b of whether the object in the b-th candidate box poses a threat, v_b ∈ {True, False}, v_b = True meaning the object in the b-th candidate box poses a threat and v_b = False meaning it does not;
Step 4.4.5, if v_b = True, adding the pseudo tag of the b-th candidate box of image I_i to the pseudo-tag set P_i; the pseudo tag is a dictionary structure, the pseudo tag of image I_i being {image number: i, bounding box: (x1_b, y1_b, x2_b, y2_b), category: unknown threat};
Step 4.4.6, if b < |B_i|, letting b = b + 1 and going to step 4.4.3; if b = |B_i|, the pseudo-tag generation for I_i is finished, the pseudo-tag set P_i of image I_i is obtained, P_i is put into P_pseudo, and the method goes to step 4.5;
Step 4.5, if i < N, letting i = i + 1 and going to step 4.2; if i = N, the pseudo-tag generation flow for the training data is finished and the pseudo-tag set P_pseudo for all training data is obtained;
Step 4.6, sending the pseudo-tag set P_pseudo to the loss calculation module of the open-vocabulary object detector.
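The fusion loop of step 4.4 can be sketched as a small function: the two candidate sets are merged, each box is passed to a binary verifier, and accepted boxes become "unknown threat" pseudo tags in the dictionary form of step 4.4.5. Here `verify` is a stand-in for the second multi-modal large language model, which the claim actually queries with visual and text prompts; the toy rule used below is purely illustrative.

```python
# Sketch of steps 4.4.1-4.4.6: merge candidates, verify each box,
# emit dictionary-structured "unknown threat" pseudo tags.
def fuse_to_pseudo_tags(image_id, semantic_boxes, geometric_boxes, verify):
    pseudo = []
    for box in semantic_boxes + geometric_boxes:     # step 4.4.1: union
        if verify(image_id, box):                    # steps 4.4.3-4.4.4
            pseudo.append({                          # step 4.4.5: dict tag
                "image": image_id,
                "bbox": box,
                "category": "unknown threat",
            })
    return pseudo

# Toy verifier: accept boxes wider than 10 px (the real system asks Qwen-VL).
verify = lambda img, b: (b[2] - b[0]) > 10
tags = fuse_to_pseudo_tags(7, [(0, 0, 50, 40)], [(5, 5, 12, 30)], verify)
assert len(tags) == 1 and tags[0]["category"] == "unknown threat"
```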
- 6. The method for automatic driving threat object detection based on semantic-geometric pseudo tags according to claim 5, wherein the method by which the semantic branch candidate box generation module in step 4.2 generates the semantic branch candidate box set of the i-th image I_i in the training set D is:
Step 4.2.1, initializing the semantic branch candidate box set S_i of the i-th image I_i in D as an empty set, and constructing the chain-of-thought text prompt: 'Task: identify all object categories that may pose a collision risk; steps: 1) examine the scene, 2) list potential collision objects, 3) retain only objects that actually exist and are well defined, 4) normalize the output format';
Step 4.2.2, the first multi-modal large language model in the semantic branch candidate box generation module takes the chain-of-thought text prompt together with the i-th image I_i of D as input, outputs the text description set T_i of I_i containing descriptions of potential threat objects, and sends T_i to the first open-vocabulary object detector;
Step 4.2.3, the first open-vocabulary object detector receives T_i from the first multi-modal large language model and, according to T_i, detects the objects contained in I_i to obtain the initial prediction result set R_i of I_i, which holds M_i prediction results, the m-th prediction result being a triplet (b_m, c_m, s_m), wherein b_m denotes the bounding-box coordinates of the m-th prediction of the open-vocabulary object detector for I_i, c_m denotes the class of the m-th prediction result and s_m its confidence score, 1 ≤ m ≤ M_i;
Step 4.2.4, initializing m = 1;
Step 4.2.5, for the prediction result (b_m, c_m, s_m): if c_m does not belong to a known class and s_m ≥ θ, θ being the confidence threshold, adding b_m to the semantic branch candidate box set S_i and going to step 4.2.6; otherwise going directly to step 4.2.6;
Step 4.2.6, if m < M_i, letting m = m + 1 and going to step 4.2.5; if m = M_i, the semantic branch candidate box set S_i of I_i is obtained and the method goes to step 4.2.7;
Step 4.2.7, sending S_i to the semantic-geometric complementary fusion module.
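The screening rule of steps 4.2.5-4.2.6 reduces to one filter: keep a detection only if its class is not a known class and its confidence reaches the threshold θ. A minimal sketch, assuming the known classes of claim 3 and a placeholder threshold value (the patent leaves θ unspecified):

```python
# Sketch of the semantic-branch screening of steps 4.2.3-4.2.6.
KNOWN = {"pedestrian", "rider", "car", "truck", "bus"}

def semantic_candidates(predictions, theta):
    """predictions: list of (bbox, class_name, confidence) triplets."""
    return [bbox for bbox, cls, conf in predictions
            if cls not in KNOWN and conf >= theta]

preds = [((0, 0, 10, 10), "car", 0.9),      # known class -> skipped
         ((5, 5, 20, 20), "debris", 0.8),   # unknown, confident -> kept
         ((1, 1, 4, 4), "debris", 0.2)]     # unknown, low conf -> skipped
assert semantic_candidates(preds, theta=0.5) == [(5, 5, 20, 20)]
```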
- 7. The method for automatic driving threat object detection based on semantic-geometric pseudo tags as defined in claim 5, wherein the method by which the geometric branch candidate box generation module in step 4.3 generates the geometric branch candidate box set G_i of the i-th image I_i in the training set D is:
Step 4.3.1, initializing the geometric branch candidate box set G_i of the i-th image I_i in D as an empty set; the 3D foundation model in the geometric branch candidate box generation module performs monocular geometric estimation on I_i and outputs the pseudo point cloud P = {p_k = (x_k, y_k, z_k) | 1 ≤ k ≤ K} and the camera intrinsic matrix K_c, wherein (x_k, y_k, z_k) denotes the coordinates of the k-th point of P in the camera coordinate system, K is the total number of points in the pseudo point cloud, and K_c contains the camera focal lengths (f_x, f_y) and principal-point parameters (c_x, c_y) corresponding to I_i; the pseudo point cloud P and the intrinsic matrix K_c are sent to the point cloud processing module, K_c being used for the point-cloud projection calculation of step 4.3.8;
Step 4.3.2, the point cloud processing module receives P and removes the ground points using the RANSAC algorithm, obtaining the non-ground point set P_ng = {q_u | 1 ≤ u ≤ U}, wherein q_u denotes the u-th point of P_ng, (x_u, y_u, z_u) its coordinates in the camera coordinate system and U the total number of points in P_ng;
Step 4.3.3, performing a spatial filtering operation on P_ng: according to the spatial filtering parameters initialized in the third step, retaining the points of P_ng that satisfy the horizontal boundary [X_min, X_max] and the maximum radial distance R_max, obtaining the valid point set P_val of points located in the near range and on the road, a point being in the near and road range when its radial distance is less than R_max and its lateral coordinate x lies within [X_min, X_max];
Step 4.3.4, performing perspective normalization on every point of P_val, the v-th point q_v being normalized according to formula (1) to obtain the normalized point q̃_v, the normalized points forming the point cloud set P̃:
q̃_v = λ · q_v / (z_v + ε)  (formula (1)),
wherein λ is the scaling factor and ε is a constant for numerical stability;
Step 4.3.5, clustering P̃ with the DBSCAN algorithm to obtain clusters, recording the number of clusters as M, and assigning cluster labels to the points of P̃, i.e. integer serial numbers from 1 to M, points in the same cluster having the same cluster label;
Step 4.3.6, according to the index order of P̃, assigning the same labels to the corresponding points of P_val, i.e. the point q_v receives the same cluster label as q̃_v;
Step 4.3.7, letting m = 1, m being the cluster number after clustering;
Step 4.3.8, extracting all points of the m-th cluster to form the point set C_m = {r_w | 1 ≤ w ≤ W_m}, wherein r_w denotes the w-th point of C_m, (x_w, y_w, z_w) its coordinates in the camera coordinate system and W_m the total number of points of C_m; computing the projection of every point of C_m through the camera intrinsic matrix K_c, the projection of the point r_w in the two-dimensional image plane being (u_w, v_w), computed by formula (2):
u_w = f_x · x_w / z_w + c_x,  v_w = f_y · y_w / z_w + c_y  (formula (2));
all two-dimensional projection points form the projection set Π_m;
Step 4.3.9, computing the minimum bounding rectangle of Π_m to obtain the bounding box g_m = (x1_m, y1_m, x2_m, y2_m), wherein (x1_m, y1_m) denotes the upper-left corner pixel coordinates of g_m and (x2_m, y2_m) its lower-right corner pixel coordinates;
Step 4.3.10, computing the intersection-over-union of the bounding box g_m with all boxes of the semantic branch candidate set S_i, taking the maximum of these values, and introducing the complementarity score s_m:
s_m = 1 − max IoU(g_m, S_i)  (formula (3));
Step 4.3.11, if s_m exceeds the preset threshold, adding g_m to the geometric branch candidate box set G_i, so as to capture potential threat objects missed by the semantic branch;
Step 4.3.12, if m < M, letting m = m + 1 and going to step 4.3.8; if m = M, the geometric branch candidate box set G_i of the i-th image I_i is obtained and the method goes to step 4.3.13;
Step 4.3.13, sending G_i to the semantic-geometric complementary fusion module.
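Steps 4.3.8-4.3.11 can be sketched directly: project a 3D cluster through the pinhole intrinsics, take the minimum bounding rectangle of the projections, and score complementarity against the semantic boxes. The pinhole projection is standard; the exact form of the complementarity score as 1 − max IoU is an assumption reconstructed from the surrounding text, and the intrinsic values fx, fy, cx, cy below are illustrative.

```python
# Sketch of cluster projection (formula (2)), bounding-rectangle
# extraction (step 4.3.9) and complementarity scoring (formula (3)).
def project(pt, fx, fy, cx, cy):
    x, y, z = pt
    return (fx * x / z + cx, fy * y / z + cy)   # pinhole projection

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def cluster_box_and_score(cluster, semantic_boxes, fx, fy, cx, cy):
    uv = [project(p, fx, fy, cx, cy) for p in cluster]
    us, vs = zip(*uv)
    box = (min(us), min(vs), max(us), max(vs))  # minimum bounding rectangle
    # Complementarity: high when no semantic box overlaps (assumed form).
    score = 1.0 - max((iou(box, s) for s in semantic_boxes), default=0.0)
    return box, score

cluster = [(-1.0, -0.5, 5.0), (1.0, 0.5, 5.0)]  # two 3D points at z = 5 m
box, score = cluster_box_and_score(cluster, [(0, 0, 10, 10)],
                                   fx=500, fy=500, cx=320, cy=240)
assert box == (220.0, 190.0, 420.0, 290.0)
assert score == 1.0  # no overlap with the semantic box -> fully complementary
```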
- 8. An automatic driving threat object detection method based on semantic-geometric pseudo tags as defined in claim 1, wherein step 5.3 sets the learning rate, and the visual feature extraction method in step 5.4 is as follows: the image tensor of each batch, in the four-dimensional tensor form described in step 5.2, is processed by a hierarchical moving-window attention mechanism to extract a multi-scale visual feature map; the multi-scale visual feature map is flattened and projected to a unified feature dimension to obtain a visual feature sequence, which is a three-dimensional tensor whose sequence-length dimension equals the total number of visual patches after the multi-scale visual feature map is flattened, and whose unified feature embedding dimension represents the high-dimensional information of each visual patch in the hidden-layer space. The text feature extraction method in step 5.5 is as follows: the text feature extraction module breaks the input text into a list containing multiple tokens, and then encodes the list into a text feature sequence, which is a three-dimensional tensor whose sequence-length dimension is the length of the text feature sequence, and whose unified feature embedding dimension represents the high-dimensional semantic information of each token of the input text list in the hidden-layer space. The first low-rank adaptation module incrementally introduces a set of learnable low-rank matrix parameters as a bypass of the text feature extraction module; the backbone parameters of the text feature extraction module are kept unchanged, and only the parameters of the first low-rank adaptation module are fine-tuned. The threat-semantic adaptation features output by the first low-rank adaptation module are superimposed on the output of the text feature extraction module, which is then passed to the cross-modal encoding module. The feature enhancement method in step 5.6 is as follows: an attention mechanism enhances the visual features and the text features to obtain enhanced visual features and enhanced text features, all three-dimensional tensors whose dimensions are the same as those of their inputs. The second low-rank adaptation module receives the same input as the attention value projection layer and the feed-forward neural network layer, and its output threat-perception enhancement adaptation features are superimposed on the visual features and text features of the cross-modal encoding module, yielding enhanced visual features and enhanced text features with the threat-perception enhancement adaptation features superimposed; the enhanced visual features are sent to the language-guided query selection module and the cross-modal decoding module, and the enhanced text features are sent to the language-guided query selection module, the cross-modal decoding module and the prediction generation module. The language-guided query selection method in step 5.7 is as follows: the semantic relatedness between the enhanced visual features and the enhanced text features is computed; the indices of the features with the highest response values are selected from the enhanced visual features, and the corresponding feature vectors are extracted as the query vectors input to the cross-modal decoding module, which form a three-dimensional tensor whose sequence-length dimension is the length of the filtered feature-vector sequence. The cross-modal decoding method in step 5.8 is as follows: the cross-modal decoding module performs attention computation over the query vectors, the enhanced visual features and the enhanced text features to obtain refined region representations. The third low-rank adaptation module receives the same input as the attention value projection layer and the feed-forward neural network layer, and its output threat-perception refinement adaptation features are superimposed on the output features of the cross-modal decoding module, yielding refined visual features with the threat-perception refinement adaptation features superimposed, which are sent to the prediction generation module. The refined visual features form a three-dimensional tensor whose dimensions are the same as those of the query vectors, representing the high-level features of the image regions that the open-vocabulary object detector deems "most likely to be objects of interest".
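The low-rank adaptation described in this claim, a frozen backbone projection plus an incremental learnable low-rank bypass whose output is superimposed on the backbone output, can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation; the shapes, rank `r`, and scaling factor `alpha` are illustrative assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen weight W plus a low-rank bypass B @ A (LoRA-style adaptation).

    x: (seq_len, d_in)  input feature sequence
    W: (d_out, d_in)    frozen backbone weight (kept unchanged during fine-tuning)
    A: (r, d_in), B: (d_out, r)  learnable low-rank matrices, r << min(d_in, d_out)
    The bypass output is superimposed on the frozen projection's output.
    """
    base = x @ W.T                   # frozen backbone path
    adapt = x @ (B @ A).T * alpha    # low-rank bypass; only A and B are trained
    return base + adapt

rng = np.random.default_rng(0)
d_in, d_out, r, seq = 64, 64, 4, 10
x = rng.standard_normal((seq, d_in))
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))   # B initialized to zero, so the bypass starts inactive
y = lora_forward(x, W, A, B)
```

With `B` initialized to zero the adapted output equals the frozen path exactly, so fine-tuning starts from the pre-trained detector's behavior; gradients then flow only into `A` and `B`.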
- 9. An automatic driving threat object detection method based on semantic-geometric pseudo tags as defined in claim 1, wherein the prediction generation method of step 5.9 is as follows: Step 5.9.1: the classification confidence matrix between the refined visual features and the enhanced text features is calculated, and each element is converted into a probability value between 0 and 1 by the Sigmoid function, with the calculation formula given by formula (4): σ(x) = 1 / (1 + e^(−x)) (formula (4)); wherein σ(·) denotes the Sigmoid function, which maps an arbitrary real number into the (0, 1) interval. The classification confidence matrix is a three-dimensional tensor whose second dimension is the length of the query vector sequence and whose third dimension is the length of the text feature sequence; each element represents the degree of matching between a query feature and a text feature. Step 5.9.2: the prediction generation module performs a dimension-reduction operation along the third dimension of the classification confidence matrix (i.e., the text-feature-sequence-length dimension), comprising the following steps. Step 5.9.2.1: the prediction generation module computes the maximum value along that dimension, obtaining a confidence tensor in which each element represents the maximum matching degree between one query feature of an image sample and all text features; this maximum matching degree is taken as the final prediction confidence of the prediction result corresponding to that feature. Step 5.9.2.2: the prediction generation module computes the index position of that maximum along the same dimension, generating a category index tensor in which each element is an integer pointing directly to a specific vocabulary position in the text features. Step 5.9.2.3: based on the category index tensor, the prediction generation module retrieves the corresponding text descriptions from the text prompt vocabulary to determine a predicted category label for each prediction result. Step 5.9.3: the prediction generation module uses a multi-layer perceptron to decode the refined visual features and predict the bounding box coordinate tensor of all potential threat objects; the second dimension of this tensor is the length of the query vector sequence, and the value 4 of its last dimension indicates that each predicted bounding box coordinate is 4-dimensional, comprising the pixel coordinates of the upper-left corner and of the lower-right corner of the predicted bounding box. Step 5.9.4: based on the index consistency of the confidence tensor, the category index tensor and the bounding box coordinate tensor, the prediction generation module combines the three to construct a final prediction result set, in which each triplet represents the complete detection information of one prediction result in an image, namely the bounding box coordinates, the predicted category and the prediction confidence.
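Steps 5.9.1 and 5.9.2 above (Sigmoid confidence matrix, then max and argmax reduction along the text dimension) can be sketched in NumPy as follows. The feature dimensions and the vocabulary are hypothetical, and the batch dimension is dropped for clarity.

```python
import numpy as np

def sigmoid(x):
    # formula (4): maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-x))

def generate_predictions(region_feats, text_feats, vocab):
    """Sketch of steps 5.9.1-5.9.2.

    region_feats: (Nq, D) refined region features (one row per query)
    text_feats:   (Nt, D) text features (one row per vocabulary token)
    vocab:        list of Nt text descriptions (the text prompt vocabulary)
    """
    logits = region_feats @ text_feats.T   # (Nq, Nt) raw matching degrees
    probs = sigmoid(logits)                # 5.9.1: classification confidence matrix
    conf = probs.max(axis=-1)              # 5.9.2.1: per-query prediction confidence
    cls_idx = probs.argmax(axis=-1)        # 5.9.2.2: category index tensor
    labels = [vocab[i] for i in cls_idx]   # 5.9.2.3: retrieve text descriptions
    return conf, labels

rng = np.random.default_rng(1)
feats = rng.standard_normal((3, 8))
texts = rng.standard_normal((5, 8))
vocab = ["car", "pedestrian", "rider", "cone", "unknown threat"]
conf, labels = generate_predictions(feats, texts, vocab)
```

Because the reduction is elementwise over the same `(Nq, Nt)` matrix, the confidence in step 5.9.2.1 and the index in step 5.9.2.2 always refer to the same query, which is the index consistency that step 5.9.4 relies on when assembling the final triplets.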
- 10. A method of automatic driving threat object detection based on semantic-geometric pseudo tags as defined in claim 1, wherein the method by which the loss calculation module calculates the total loss function for back propagation in step 5.10 is as follows: Step 5.10.1: the loss calculation module receives the prediction result set from the prediction generation module, extracts the pseudo labels of unknown classes from the pseudo label set and the real labels of known classes from the true-value labels of the training set, and merges them to obtain the hybrid annotations. Step 5.10.2: the loss calculation module calculates the loss function value between the prediction results and the hybrid annotations, with the calculation formula: L_total = L_cls + λ_L1 · L_L1 + λ_GIoU · L_GIoU (formula (5)); wherein the classification loss L_cls is calculated by Focal Loss, and the bounding box regression loss L_L1 is calculated by the L1 loss as the absolute error of the bounding box coordinates of the matched samples, with the formula: L_L1 = (1 / N_pos) Σ_{i=1}^{N_pos} ‖b_i − b̂_i‖₁ (formula (6)); wherein N_pos represents the number of positive sample pairs successfully matched in the current batch, the positive sample pairs being computed by the Hungarian algorithm; b_i represents the coordinate vector of the i-th predicted bounding box, comprising four components; b̂_i represents the coordinate vector of the bounding box in the i-th matched annotation, comprising four components; and ‖·‖₁ denotes the 1-norm. The GIoU loss L_GIoU is also adopted, with the loss weight hyper-parameters λ_L1 and λ_GIoU.
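The loss of step 5.10.2 combines a Focal classification term, an L1 box regression term over Hungarian-matched pairs, and a GIoU term. The following is a minimal single-box sketch, not the patented implementation, assuming scalar weights and (x1, y1, x2, y2) box format.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary Focal Loss for one predicted probability p and target y in {0, 1}."""
    pt = p if y == 1 else 1.0 - p
    w = alpha if y == 1 else 1.0 - alpha
    return -w * (1.0 - pt) ** gamma * np.log(pt)

def l1_box_loss(pred, gt):
    """Mean 1-norm of coordinate errors over matched positive pairs.

    pred, gt: (N_pos, 4) boxes for pairs matched by the Hungarian algorithm.
    """
    n_pos = pred.shape[0]
    return np.abs(pred - gt).sum() / n_pos

def giou_loss(b1, b2):
    """1 - GIoU for a pair of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    # smallest box enclosing both b1 and b2
    cx1, cy1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    cx2, cy2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    g = inter / union - (c - union) / c
    return 1.0 - g

# one matched pair with a perfect box prediction and a confident classification
pred = np.array([[0.1, 0.1, 0.5, 0.5]])
gt = np.array([[0.1, 0.1, 0.5, 0.5]])
lam_l1, lam_giou = 1.0, 1.0   # assumed weight hyper-parameters
l1 = l1_box_loss(pred, gt)
g = giou_loss(pred[0], gt[0])
total = focal_loss(0.9, 1) + lam_l1 * l1 + lam_giou * g
```

For identical boxes both the L1 and GIoU terms vanish, so the total loss reduces to the Focal classification term, which the `(1 - pt)^gamma` factor keeps small for well-classified samples.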
Description
Automatic driving threat object detection method based on semantic-geometric pseudo tags

Technical Field

The invention relates to the field of automatic driving environment perception, and in particular to an automatic driving universal threat object detection method based on semantic-geometric pseudo tags.

Background

The target detection algorithm, as one of the core components of the automatic driving environment sensing system, is required to accurately detect various objects that threaten automatic driving safety in complex and changeable road scenes. Conventional target detection algorithms are typically based on a Closed-Set Assumption: only categories predefined in the training set (e.g., pedestrians, vehicles, riders) can be detected. However, the autopilot scenario is a typical open-world scenario; during operation the target detection algorithm inevitably encounters unknown objects not included in the training set, and these objects beyond the predefined classes often pose potential security threats; if they cannot be effectively identified, serious safety accidents will result. As target detection algorithms move toward open-world scenarios, open-world target detection and open-vocabulary target detection have been widely applied in autopilot to enhance the ability to detect threat objects in unpredictable edge cases (i.e., low-probability, high-risk long-tail scenarios that are difficult to cover exhaustively in a training set and are extremely prone to induce false or missed detections by target detectors), thereby improving system security.

Open-world target detection (OWOD) aims to identify and locate known classes of objects while marking unknown objects as "unknown". The early work ORE (KJ Joseph et al., "Towards open world object detection," in CVPR, 2021) formally defined this task for the first time and introduced a contrastive clustering mechanism in the latent space to separate known and unknown classes. OW-DETR (Akshita Gupta et al., "OW-DETR: Open-world detection transformer," in CVPR, 2022) proposes an attention-driven pseudo-label generation scheme that selects object queries with high attention scores that do not match any known category as unknown candidate boxes. PROB (Orr Zohar et al., "PROB: Probabilistic objectness for open world object detection," in CVPR, 2023) integrates object probabilities into classification through probability estimation. UnSniffer (Wenteng Liang et al., "Unknown sniffer for object detection: Don't turn a blind eye to unknown objects," in CVPR, 2023) establishes a generic object confidence score to distinguish objects from background. SGROD (Yulin He et al., "Recalling unknowns without losing precision: An effective solution to large model-guided open world object detection," in IEEE TIP, 2024) attempts to assist the recall of unknown objects using a large segmentation model (e.g., SAM), improving the recall of threat objects in open driving scenarios. However, OWOD methods mainly rely on low-level visual cues or heuristic rules to discover unknown targets and fail to learn open-world semantic knowledge effectively, so accurate detection of threat objects in open scenes is difficult to achieve.

Open-vocabulary target detection (OVOD) aims to detect objects described by arbitrary text cues through large-scale vision-language pre-training that learns open-world semantic knowledge. GroundingDINO (Shilong Liu et al., "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection," in ECCV, 2022) fuses the visual and linguistic modalities at multiple stages, achieving powerful zero-shot detection capability. In the field of autopilot, AD-OWOD (Yulin He et al., "Sniffing threatening open-world objects in autonomous driving by open-vocabulary models," in ACM MM, 2024) proposes the concept of threat object detection and introduces a textual vocabulary related to driving scenarios to identify threats. However, existing OVOD methods are highly dependent on explicit class text cues, and due to the dynamics and long-tail distribution characteristics of threat object classes, it is almost impossible to enumerate all relevant text words. Furthermore, in the face of the semantically ambiguous concept of "threat object", it is difficult for the model to achieve effective alignment between visual features and threat semantics, resulting in false positives and false negatives. The human driver's perception mechanism in driving scenarios has a pronounced "threat-guided" character rather than a purely "class-guided" one, which makes detection paradigms based on predefined class cues difficult to adapt fully to the threat object perception task in automatic driving. In order to align