CN-122024198-A - Intelligent driving target detection method based on depth information perception
Abstract
The invention discloses an intelligent driving target detection method based on depth information perception, aiming to improve detection precision on unknown classes. A target detection system based on depth information perception is constructed from an image preprocessing module, a depth map processing module, a main feature extraction module, an image-text fusion module, a prediction module, a depth-based objectness assessment module, a depth-guided text embedding learning module and a loss function calculation module. The main feature extraction module is trained on a training set: the depth-based objectness assessment module generates depth scores for the prediction boxes produced by the prediction module, and the depth-guided text embedding learning module uses those depth scores to screen the prediction boxes and generate pseudo-labels that guide training. After training, the target detection system performs target detection on images to be detected to obtain the detection results. The invention improves the perception of unknown objects while maintaining detection precision on common objects.
Inventors
- CHEN WEI
- LI LIN
- HE YULIN
- ZHOU WENJUAN
- TANG MINGXIN
- WANG HAOTIAN
- DING RUIHUA
Assignees
- National University of Defense Technology of the Chinese People's Liberation Army (中国人民解放军国防科技大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-25
Claims (11)
- 1. An intelligent driving target detection method based on depth information perception, characterized by comprising the following steps:
  First step: construct a target detection system based on depth information perception, comprising an image preprocessing module, a depth map processing module, a main feature extraction module, an image-text fusion module, a prediction module, a depth-based objectness assessment module, a depth-guided text embedding learning module and a loss function calculation module.
  Second step: construct a training set and a test set from the nu-OWODB intelligent driving scene dataset. Based on its three major classes (vehicles, pedestrians and obstacles), divide the dataset into three task training sets: in the task-one training set the currently known class is the vehicle class and the unknown classes are the pedestrian and obstacle classes; in the task-two training set the currently known class is the pedestrian class, the past known class is the vehicle class, and the unknown class is the obstacle class; in the task-three training set the currently known class is the obstacle class and the past known classes are the vehicle and pedestrian classes, with no unknown class. The numbers of images in the task-one, task-two and task-three training sets are N₁, N₂ and N₃ respectively. Select T images outside the three task training sets from the nu-OWODB dataset to form the test set.
  Third step: the depth map processing module and the image preprocessing module apply enhancement processing to the three task training sets, obtaining for each task i an enhanced training set Dᵢ consisting of Nᵢ normalized images, normalized depth maps, normalized depth-variation maps, Nᵢ scaled position truth labels and the classification truth labels corresponding to those position truth labels. The Nᵢ scaled position truth labels and the corresponding classification truth labels of each task are sent to the loss function calculation module.
  Fourth step: the image preprocessing module applies an image scaling and normalization method to the T images of the test set and their labels, obtaining a new test set Test consisting of T normalized images, T scaled position truth labels and the classification truth labels corresponding to those position truth labels.
  Fifth step: train the text-embedding-vector parameters of the main feature extraction module of the system built in the first step by gradient back-propagation, freezing the network weight parameters of all other modules, as follows:
  Step 5.1: let the variable i denote the i-th task, i = 1, 2, 3.
  Step 5.2: initialize the network weight parameters of each module of the target detection system and the text embedding vectors W = {w₁, …, w_C} of the main feature extraction module, where C is the number of categories under consideration and w_j is the j-th text embedding, 1 ≤ j ≤ C.
  Step 5.3: set the training parameters of the target detection system, including selecting AdamW as the model training optimizer and setting the initial learning rate, the optimizer's weight-decay hyper-parameter, the batch size of network training, and the maximum number of training epochs E.
  Step 5.4: train the target detection system by taking the difference between the predicted positions and categories of known objects output by the current system and the truth values as the loss value, updating the text embedding vectors of the main feature extraction module by gradient back-propagation. The three tasks are executed in sequence; each task trains until E epochs complete, the text embedding vectors at the end of the E-th epoch of one task serve as the initial text embedding vectors of the next task, and the text embedding vectors at the end of the third task are the final text embedding vectors of the target detection system. The specific method is:
  Step 5.4.1: initialize the batch number b = 1 and the epoch number e = 1.
  Step 5.4.2: the visual feature extraction module of the main feature extraction module reads the b-th batch of images from Dᵢ; each of these images is recorded as a matrix of size H × W × 3, where H is the input image height, W the input image width, and 3 the RGB channels.
  Step 5.4.3: the visual feature extraction module extracts visual features from the batch with its Darknet backbone network, obtaining the multi-scale visual features X = {X₁, X₂, X₃}, where X₁, X₂ and X₃ are the first-, second- and third-scale visual feature maps and D is the visual feature dimension; X is sent to the image-text fusion module.
  Step 5.4.4: the text feature extraction module of the main feature extraction module sends the text embedding vectors W to the image-text fusion module.
  Step 5.4.5: the image-text fusion module receives X and W from the main feature extraction module and performs feature enhancement and alignment operations on them, obtaining the enhanced visual features X′, the splice-fusion features produced by splicing the weighted trunk-branch features with the cross-stage branch features, and the enhanced text embedding set W′, the text embeddings enhanced by fusing image information; X′ and W′ are sent to the prediction module.
  Step 5.4.6: the prediction module receives X′ and W′ from the image-text fusion module and applies a prediction method to them, obtaining the category information P and the position information B_box of the prediction boxes, namely:
  Step 5.4.6.1: the regression head network receives X′ from the image-text fusion module. On each scale of X′ it sets 3 anchor boxes of fixed aspect ratio with parameters (x_a, y_a, w_a, h_a), where x_a and y_a are the anchor center coordinates and w_a and h_a the anchor width and height. The regression head extracts from X′ the image embedding set E = {e₁, …, e_K} corresponding to all anchor boxes, where K is the number of anchor boxes and e_k is the k-th image embedding, 1 ≤ k ≤ K, together with all normalized offsets (dx, dy, dw, dh) relative to the anchor boxes: dx is the offset relative to the anchor center x-coordinate, dy the offset relative to the anchor center y-coordinate, dw the scaling offset relative to the anchor width, and dh the scaling offset relative to the anchor height. Based on each anchor's parameters, the offsets are converted into pixel-level prediction box positions B_box = (x̂, ŷ, ŵ, ĥ), where x̂ and ŷ are the prediction box center coordinates and ŵ and ĥ its width and height (a decode sketch follows this claim). E is sent to the classification head network, and B_box to the depth-guided text embedding learning module and the depth-based objectness assessment module.
  Step 5.4.6.2: the classification head network receives E from the regression head network and W′ from the image-text fusion module. It applies L2 normalization to each image embedding in E, computes the semantic similarity of each image embedding to all text embeddings in W′, and generates a similarity vector for each image embedding, obtaining the category similarity matrix S of all prediction boxes; S is normalized to obtain the category information P of all prediction boxes, and P is sent to the depth-guided text embedding learning module.
  Step 5.4.7: the depth-based objectness assessment module receives the enhanced depth map and depth-variation map from the image preprocessing module and the prediction box positions B_box from the prediction module. It counts the pixel proportions of each prediction box in the depth map and the depth-variation map, obtaining the depth score set S_d = {s₁, …, s_K}, and sends S_d to the depth-guided text embedding learning module. Here s_k = s′_k · e^(−γ) is the depth score of the k-th prediction box, γ is the exponential decay coefficient, s′_k is the initial depth score computed from the foreground probability and the objectness probability of the k-th prediction box, and e is the natural constant.
  Step 5.4.8: the depth-guided text embedding learning module receives the truth labels from the image preprocessing module, S_d from the depth-based objectness assessment module, and the category information P and position information B_box from the prediction module. It depth-screens the prediction boxes, obtaining the pseudo-label set U, and sends U together with P and B_box to the loss function calculation module.
  Step 5.4.9: the supervised-truth construction module in the loss function calculation module receives U from the depth-guided text embedding learning module and the truth labels from the image preprocessing module, and constructs the supervision truth of the b-th batch, generating the corresponding category-information supervision truth Y_cls and position-information supervision truth Y_pos; Y_cls, Y_pos, P and B_box are sent to the loss value calculation module.
  Step 5.4.10: the loss value calculation module receives Y_cls, Y_pos, P and B_box from the supervised-truth construction module, computes the loss function value
  loss = λ₁·L_cls + λ₂·L_reg + λ₃·L_IoU   (1)
  and updates the text embedding vectors W of the main feature extraction module by gradient back-propagation, where L_cls is the classification loss computed from P and Y_cls using Focal Loss; L_reg is the regression loss, using the mean-absolute-error L1 loss; L_IoU is the intersection-over-union loss, using the CIoU loss; and λ₁, λ₂ and λ₃ are the weights of the classification, regression and intersection-over-union losses respectively.
  Step 5.4.11: if b ≤ B, let b = b + 1 and go to step 5.4.2; if b > B, the main feature extraction module has completed B training steps for this epoch and the trained text embedding vectors of the epoch are obtained; let e = e + 1 and go to step 5.4.12.
  Step 5.4.12: if e ≤ E, let b = 1 and go to step 5.4.2; if e > E, training of the current task is complete; save the current task's text embedding vectors and go to step 5.4.13.
  Step 5.4.13: if i < 3, let i = i + 1 and go to step 5.2; if i = 3, all three tasks have completed training and the text embedding vectors of the text feature extraction module in the main feature extraction module are obtained; go to step 5.4.14.
  Step 5.4.14: load the trained text embedding vectors of the text feature extraction module into the target detection system, obtaining the trained target detection system.
  Sixth step: use the trained target detection system to detect the categories and positions in the image to be detected together with the text input by the user, obtaining the category labels and position information of the prediction boxes of the image set to be detected, as follows:
  Step 6.1: the image preprocessing module receives the image to be detected, the corresponding label and the text input by the user.
  Step 6.2: the image preprocessing module enhances the image to be detected and its corresponding label with the image scaling and normalization method of the fourth step, obtaining an image set to be detected composed of the normalized image, the scaled position truth labels and the classification truth labels corresponding to those position truth labels.
  Step 6.3: the visual feature extraction module of the main feature extraction module extracts visual features X from the image set to be detected with the visual feature extraction method of step 5.4.3 and sends X to the image-text fusion module.
  Step 6.4: the text feature extraction module of the main feature extraction module extracts text features from the user's input text with the pre-trained Transformer text encoder in YOLO-World and converts them into text embedding vectors; the text embedding vectors obtained by training in step 5.4.13 are spliced with the embedding vectors of the category texts input by the user, obtaining the text embedding vectors W corresponding to all texts; W is sent to the image-text fusion module.
  Step 6.5: the image-text fusion module receives X and W, performs the feature enhancement and alignment operations, obtaining the feature-enhanced visual features X′ and the feature-enhanced text embeddings W′, and sends X′ and W′ to the prediction module.
  Step 6.6: the prediction module receives X′ and W′ and, with the prediction method of step 5.4.6, predicts the category information P and position information B_box of the prediction boxes; for each prediction box the category of highest similarity in P is taken as its category label, i.e., the category label of the prediction boxes of the image set to be detected.
  Seventh step: output the category labels and position information B_box of the prediction boxes of the image set to be detected obtained in step 6.6; target detection ends.
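Step 5.4.6.1 converts normalized offsets (dx, dy, dw, dh) into pixel-level boxes but does not spell out the decode formula. Below is a minimal NumPy sketch assuming the common anchor parameterization (center shifted by a fraction of the anchor size, exponential width/height scaling); the function name and the formula itself are assumptions, not the patent's definition.

```python
import numpy as np

def decode_boxes(anchors: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Convert normalized offsets (dx, dy, dw, dh) relative to anchor boxes
    (x_a, y_a, w_a, h_a) into pixel-level prediction boxes (x, y, w, h).

    Uses the common Faster-R-CNN-style parameterization; the claim does not
    state the exact decode formula, so this is an assumption.
    """
    ax, ay, aw, ah = anchors.T          # anchor centers and sizes
    dx, dy, dw, dh = offsets.T          # regression-head outputs
    bx = ax + dx * aw                   # shift center x by a fraction of width
    by = ay + dy * ah                   # shift center y by a fraction of height
    bw = aw * np.exp(dw)                # scale width exponentially
    bh = ah * np.exp(dh)                # scale height exponentially
    return np.stack([bx, by, bw, bh], axis=1)

# Example: one anchor at (100, 100) sized 50x50, small predicted offsets.
anchors = np.array([[100.0, 100.0, 50.0, 50.0]])
offsets = np.array([[0.1, -0.2, 0.05, 0.0]])
print(decode_boxes(anchors, offsets))   # [[105.  90.  52.56...  50.]]
```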
- 2. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that, in the target detection system based on depth information perception:
  The depth map processing module is connected with the image preprocessing module and consists of a depth map generation module and a depth map preprocessing module. The depth map generation module consists of the monocular depth estimation model MoGe; during training it receives the images of the training set, generates the corresponding depth maps, and sends them to the image preprocessing module and the depth map preprocessing module. The depth map preprocessing module performs feature enhancement on the depth maps received from the depth map generation module during training, obtaining depth-variation maps, which it sends to the image preprocessing module.
  The image preprocessing module is connected with the main feature extraction module, the depth map generation module, the depth map preprocessing module, the depth-based objectness assessment module, the loss function calculation module and the depth-guided text embedding learning module. During training it receives the training-set images, the depth-variation maps from the depth map preprocessing module and the depth maps from the depth map generation module, applies enhancement processing to all three, sends the enhanced images to the main feature extraction module, sends the enhanced depth maps and depth-variation maps to the depth-based objectness assessment module, and sends the truth labels of the training images to the loss function calculation module and the depth-guided text embedding learning module. When performing target detection on images to be detected it receives the images to be detected and the text input by the user, enhances the images to obtain the image set to be detected, and sends this set and the user's text to the main feature extraction module.
  The main feature extraction module is connected with the image preprocessing module and the image-text fusion module and consists of a visual feature extraction module and a text feature extraction module. The visual feature extraction module consists of the Darknet neural network of YOLOv8 and a multi-scale feature pyramid path aggregation network; it behaves identically during training and during detection, receiving the enhanced images from the image preprocessing module, extracting visual features and sending them to the image-text fusion module.
  The image-text fusion module is connected with the main feature extraction module and the prediction module and consists of a visual feature enhancement module and a text feature enhancement module. Its text-image cross-attention network receives the text embedding vectors from the text feature extraction module and the visual features from the visual feature extraction module, splits the visual features into a trunk-branch feature and a cross-stage branch feature, computes the category similarity between the trunk-branch feature and the text embeddings, screens out the text semantics most relevant to the visual regions of the input picture, generates attention weights, dynamically weights the trunk-branch feature, splices the weighted trunk-branch feature with the cross-stage branch feature to obtain enhanced visual features fused with text information, and sends the enhanced visual features to the prediction module.
  The prediction module is connected with the image-text fusion module, the depth-based objectness assessment module and the depth-guided text embedding learning module and consists of a regression head network and a classification head network. The regression head network receives the enhanced visual features from the image-text fusion module, extracts from them the image embedding sets corresponding to all anchor boxes, computes the prediction box positions from the anchor box positions, sends the image embedding sets to the classification head network, and sends the prediction box positions to the depth-based objectness assessment module and the depth-guided text embedding learning module.
  The depth-based objectness assessment module is connected with the image preprocessing module, the prediction module and the depth-guided text embedding learning module. It receives the enhanced depth map and depth-variation map from the image preprocessing module and the prediction box positions from the prediction module, computes for each prediction box its pixel proportions in the enhanced depth map and the depth-variation map, obtaining the box's foreground probability and objectness probability, sums them with weights to obtain the box's initial depth score and, when the box's foreground probability or objectness probability falls below a threshold, applies an exponential penalty to the initial score, obtaining the box's depth score and thereby suppressing high-scoring false positives in background regions; the depth scores are sent to the depth-guided text embedding learning module.
  The depth-guided text embedding learning module is connected with the prediction module, the depth-based objectness assessment module and the loss function calculation module. It receives the truth labels from the image preprocessing module, the prediction box positions and category information from the prediction module, and the prediction box depth scores from the depth-based objectness assessment module; from the category information it judges whether a prediction box contains a target object, computes the intersection-over-union IoU between the boxes judged to contain targets and all labeled boxes, assigns unknown-class pseudo-labels to prediction boxes according to the IoU and depth scores, and sends the pseudo-labels together with the prediction box positions and category information to the loss function calculation module.
  The loss function calculation module is connected with the image preprocessing module and the depth-guided text embedding learning module and consists of a supervised-truth construction module and a loss value calculation module. The supervised-truth construction module receives the truth labels from the image preprocessing module and generates the supervision truths used for loss computation; the loss value calculation module receives the supervision truths from the supervised-truth construction module and the prediction box positions and category information from the depth-guided text embedding learning module, computes the loss value, and updates the text embedding vector parameters loaded in the target detection system by gradient back-propagation. (The overall data flow is sketched after this claim.)
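To make the module wiring above easier to follow, here is a hypothetical Python skeleton of the training-time data flow. Every entry in `modules` is a stand-in callable (none of these names appear in the patent); only the connections between modules follow the claim.

```python
# Hypothetical skeleton of the data flow described in claim 2. Each module
# below is a stub standing in for the corresponding network; only the wiring
# (which module feeds which) follows the patent text.

def run_training_step(image, text, modules):
    depth = modules["depth_gen"](image)                  # MoGe monocular depth
    depth_var = modules["depth_pre"](depth)              # depth-variation map
    img_n, depth_n, var_n, labels = modules["img_pre"](image, depth, depth_var)
    vis = modules["visual_backbone"](img_n)              # Darknet + PAN features
    txt = modules["text_encoder"](text)                  # text embeddings
    vis_e, txt_e = modules["fusion"](vis, txt)           # image-text fusion
    boxes, cls = modules["prediction"](vis_e, txt_e)     # regression/classification heads
    scores = modules["depth_assess"](depth_n, var_n, boxes)    # depth scores
    pseudo = modules["pseudo_label"](boxes, cls, scores, labels)
    return modules["loss"](boxes, cls, pseudo, labels)   # loss value for backprop
```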
- 3. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method of constructing the training set and the test set from the nu-OWODB intelligent driving scene dataset in the second step is:
  Step 2.1: construct the training set and the test set from the nu-OWODB intelligent driving scene dataset, where nu-OWODB is a target detection dataset built on the nuScenes dataset and contains the 3 major classes most relevant to intelligent driving scenes (vehicles, pedestrians and obstacles), 23 minor classes and 75549 images.
  Step 2.2: based on the 3 major classes of the nu-OWODB intelligent driving scene dataset, divide it into 3 task training sets, each training the detection system's perception of one class of semantically similar targets; training the three tasks in sequence builds a training logic that gradually introduces new classes. From the 75549 images, select the 53850 images whose currently known class is the vehicle class (unknown classes: pedestrian and obstacle) as the task-one training set; select the 34957 images whose currently known class is the pedestrian class (past known class: vehicle; unknown class: obstacle) as the task-two training set; and select the 25682 images whose currently known class is the obstacle class (past known classes: vehicle and pedestrian) and which contain no unknown class as the task-three training set; thus N₁ = 53850, N₂ = 34957 and N₃ = 25682.
  Step 2.3: the test sets of the three tasks are identical: select the 14884 images outside the three task training sets from the 75549 images of the nu-OWODB dataset, so the total number of test images is T = 14884. (A filtering sketch follows this claim.)
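A small sketch of how these task splits could be materialized. Representing each image as a dict with a set of major-class names, and the helper names, are assumptions; the known/past class assignments and the "no unknown class in task three" rule come from the claim.

```python
# Hypothetical materialization of the task splits in claim 3.

TASKS = {
    1: {"known": {"vehicle"},    "past": set()},
    2: {"known": {"pedestrian"}, "past": {"vehicle"}},
    3: {"known": {"obstacle"},   "past": {"vehicle", "pedestrian"}},
}

def build_task_set(images, task_id):
    known = TASKS[task_id]["known"]
    allowed = known | TASKS[task_id]["past"]
    selected = []
    for im in images:
        if not (im["classes"] & known):        # must show the current known class
            continue
        if task_id == 3 and not (im["classes"] <= allowed):
            continue                           # task three admits no unknown class
        selected.append(im)
    return selected

def build_test_set(images, task_sets):
    used = {id(im) for ts in task_sets for im in ts}
    return [im for im in images if id(im) not in used]   # images outside all tasks
```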
- 4. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that in the third step the method by which the depth map processing module and the image preprocessing module enhance the task-one, task-two and task-three training sets to obtain the enhanced training sets D₁, D₂ and D₃ is:
  Step 3.1: the depth map processing module and the image preprocessing module cooperate and apply an image enhancement processing method to the task-one training set, obtaining the enhanced task-one training set D₁ consisting of N₁ normalized images, normalized depth maps, normalized depth-variation maps, N₁ scaled position truth labels and the classification truth labels corresponding to the position truth labels of the task-one training set. The specific method is:
  Step 3.1.1: the depth map generation module predicts depth for the N₁ images of the task-one training set with the monocular depth estimation model MoGe, obtaining N₁ depth maps, which it sends to the image preprocessing module and the depth map preprocessing module.
  Step 3.1.2: the depth map preprocessing module receives the N₁ depth maps from the depth map generation module, fills tiny holes in each depth map by morphological operations, computes depth gradients with the Sobel operator to enhance the boundary features between targets and the ground-perpendicular background, obtains the depth-variation maps, and sends them to the image preprocessing module.
  Step 3.1.3: the image preprocessing module receives the depth maps from the depth map generation module and the depth-variation maps from the depth map preprocessing module, reads the N₁ images of the task-one training set and their corresponding labels, and enhances them, obtaining the enhanced task-one training set D₁, as follows:
  Step 3.1.3.1: let the variable m = 1 and initialize the enhanced task-one training set D₁ to be empty.
  Step 3.1.3.2: let the m-th image have G_m labeled boxes; the position truth label of the g-th labeled box is (x₁, y₁, x₂, y₂), where (x₁, y₁) are the coordinates of the upper-left corner and (x₂, y₂) the coordinates of the lower-right corner of the g-th labeled box, together with its category, 1 ≤ g ≤ G_m.
  Step 3.1.3.3: apply a random flipping method to the m-th image of the task-one training set, its corresponding depth map and depth-variation map, and its position truth labels, obtaining the m-th flipped image, depth map, depth-variation map and flipped position truth labels; the flipping probability of the random flipping method is set to 0.5 (the flip step is sketched after this claim).
  Step 3.1.3.4: apply a random cropping method to the m-th flipped image, depth map, depth-variation map and position truth labels, obtaining the m-th cropped image, depth map, depth-variation map and cropped position truth labels; the cropping probability of the random cropping method is set to 0.5.
  Step 3.1.3.5: apply a random scaling method to the m-th cropped image, depth map, depth-variation map and position truth labels, obtaining the m-th scaled image, depth map, depth-variation map and scaled position truth labels; the scaling probability of the random scaling method is set to 0.5.
  Step 3.1.3.6: apply an image normalization operation to the m-th scaled image, depth map and depth-variation map, obtaining the m-th normalized image, normalized depth map and normalized depth-variation map; put the m-th normalized image, normalized depth map, normalized depth-variation map, scaled position truth labels and the classification truth labels corresponding to the position truth labels of the task-one training set into D₁.
  Step 3.1.3.7: if m < N₁, let m = m + 1 and go to step 3.1.3.2; if m = N₁, the enhanced task-one training set D₁, consisting of N₁ normalized images, normalized depth maps, normalized depth-variation maps, N₁ scaled position truth labels and the classification truth labels corresponding to the position truth labels of the task-one training set, is obtained; go to step 3.2.
  Step 3.2: the depth map processing module and the image preprocessing module cooperate and apply the image enhancement processing method of step 3.1 to the task-two training set, obtaining the enhanced task-two training set D₂, consisting of N₂ normalized images, normalized depth maps, normalized depth-variation maps, N₂ scaled position truth labels and the classification truth labels corresponding to the position truth labels of the task-two training set.
  Step 3.3: the depth map processing module and the image preprocessing module cooperate and apply the image enhancement processing method of step 3.1 to the task-three training set, obtaining the enhanced task-three training set D₃, consisting of N₃ normalized images, normalized depth maps, normalized depth-variation maps, N₃ scaled position truth labels and the classification truth labels corresponding to the position truth labels of the task-three training set.
  Step 3.4: the image preprocessing module sends the N₁ scaled position truth labels and the corresponding classification truth labels of the task-one training set, the N₂ scaled position truth labels and the corresponding classification truth labels of the task-two training set, and the N₃ scaled position truth labels and the corresponding classification truth labels of the task-three training set to the loss function calculation module.
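One of the joint augmentations above is random flipping with probability 0.5, applied to the image, its depth map, its depth-variation map and its corner-format boxes together. A minimal NumPy sketch follows; the corner format (x₁, y₁, x₂, y₂) matches step 3.1.3.2, while the horizontal flip direction is an assumption (the claim only says "random flipping").

```python
import numpy as np

def random_hflip(image, depth, depth_var, boxes, p=0.5, rng=np.random):
    """Horizontally flip the image, its depth map, its depth-variation map,
    and the (x1, y1, x2, y2) boxes together, with probability p (0.5 in the
    patent). A minimal sketch of one joint augmentation from claim 4."""
    if rng.random() >= p:
        return image, depth, depth_var, boxes
    w = image.shape[1]
    image = image[:, ::-1]              # flip all spatial maps the same way
    depth = depth[:, ::-1]
    depth_var = depth_var[:, ::-1]
    boxes = boxes.copy()
    boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # mirror x1/x2 and swap them
    return image, depth, depth_var, boxes
```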
- 5. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method by which the image preprocessing module in the fourth step enhances the T images of the test set and their corresponding labels with the image scaling normalization method, obtaining the new test set Test consisting of the T normalized images, the T scaled position truth labels and the classification truth labels corresponding to the position truth labels of the test set, is:
  Step 4.1: let the variable t = 1 and initialize the new test set Test to be empty.
  Step 4.2: scale the t-th image of the test set to the required input size with a scaling operation, and scale the position truth labels of the t-th image accordingly, obtaining the t-th scaled image and the scaled position truth labels.
  Step 4.3: normalize the t-th scaled image with an image normalization operation, obtaining the t-th normalized image, and put the t-th normalized image, the t-th scaled position truth labels and the corresponding classification truth labels of the test set into the new test set Test.
  Step 4.4: if t < T, let t = t + 1 and go to step 4.2; if t = T, the new test set Test, consisting of the T normalized images, the T scaled position truth labels and the classification truth labels corresponding to the T position truth labels, is obtained; go to the fifth step. (A preprocessing sketch follows this claim.)
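A minimal sketch of this scaling normalization. The target size (640×640) and the normalization constants are placeholders, since the claim fixes neither; OpenCV's `resize` is used for illustration.

```python
import numpy as np
import cv2  # OpenCV for resizing; any image library would do

def scale_and_normalize(image, boxes, size=(640, 640),
                        mean=(0.0, 0.0, 0.0), std=(255.0, 255.0, 255.0)):
    """Scale an image and its (x1, y1, x2, y2) boxes to a fixed size, then
    normalize the pixels. Size and mean/std are assumed values."""
    h, w = image.shape[:2]
    sx, sy = size[0] / w, size[1] / h
    image = cv2.resize(image, size)
    boxes = boxes * np.array([sx, sy, sx, sy])   # scale box corners with the image
    image = (image.astype(np.float32) - mean) / std
    return image, boxes
```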
- 6. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method of initializing the network weight parameters of each module of the target detection system and the text embedding vectors W in the main feature extraction module in step 5.2 is: initialize the network weight parameters of the main feature extraction module, the image-text fusion module and the prediction module with the network weight parameters of the YOLO-World model; if i = 1, initialize the text embeddings of the currently known classes in W with the text embedding vectors corresponding to the currently known classes in the YOLO-World model, and initialize the unknown-class text embeddings in W with the text embedding vector corresponding to the text "object", obtaining all text embedding vectors; if i > 1, initialize the currently-known-class text embeddings in W with the text embedding vectors corresponding to the currently known classes in the YOLO-World model, initialize the past-known-class text embedding parameters in W with the currently-known-class text embeddings of the target detection system trained on task i − 1, and initialize the unknown-class text embeddings in W with the unknown-class text embedding of the previous task; the other text embedding vector parameters are kept fixed during training. Here W = {w₁, …, w_C}, where C is the number of categories under consideration and w_j is the j-th text embedding. (An initialization sketch follows this claim.)
  The method of setting the training parameters of the target detection system in step 5.3 is: set the initial learning rate to 0.001 and the weight-decay hyper-parameter of AdamW to 0.025, and set the batch size of network training and the maximum epoch number E.
  In step 5.4.10, λ₁ is set to 1, λ₂ is set to 5, and λ₃ is set to 2.
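A PyTorch sketch of this per-task embedding initialization. `yw_text_embed` stands in for the YOLO-World text encoder, `prev_embeds`/`prev_unknown` for the embeddings saved from the previous task; these names, and the single-unknown-slot default, are assumptions.

```python
import torch

def init_task_embeddings(yw_text_embed, known, past,
                         prev_embeds=None, prev_unknown=None):
    """Assemble the trainable text-embedding matrix W for one task (claim 6)."""
    rows = [yw_text_embed(name) for name in known]       # current known classes:
                                                         # YOLO-World text encoder
    rows += [prev_embeds[name] for name in past]         # past known classes: reuse
                                                         # the previous task's embeddings
    if prev_unknown is None:                             # task one: start the unknown
        prev_unknown = yw_text_embed("object")           # slot from the "object" text
    rows.append(prev_unknown.clone())                    # unknown-class embedding
    return torch.nn.Parameter(torch.stack(rows))         # (C, D), updated by backprop
```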
- 7. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method by which the image-text fusion module in step 5.4.5 receives X and W from the main feature extraction module and performs feature enhancement and alignment operations on them, obtaining the enhanced visual features X′ and the enhanced text embedding set W′, is:
  Step 5.4.5.1: the visual feature enhancement module in the image-text fusion module receives W from the text feature extraction module and X from the visual feature extraction module. It first divides X into two parts along the channel dimension by a channel splitting operation: the cross-stage branch feature X_c and the trunk branch feature X_t. It then uses a max-sigmoid attention mechanism to compute the category similarity between X_t and each text embedding in W, generates attention weights, dynamically weights X_t, and finally splices the weighted trunk-branch features with the cross-stage branch features, obtaining the enhanced visual features X′ fused with text information (see the fusion sketch after this claim). The steps are:
  Step 5.4.5.1.1: let l = 1 index the l-th scale visual feature and let j = 1 index the j-th text embedding.
  Step 5.4.5.1.2: compute the similarity s_{l,j} between the single-scale trunk-branch feature X_{t,l} and the text embedding w_j.
  Step 5.4.5.1.3: if j < C, let j = j + 1 and go to step 5.4.5.1.2; if j = C, take the maximum similarity s_l = max_j s_{l,j}, generate the attention weight a_l = sigmoid(s_l) with the sigmoid function, dynamically weight the visual features with a_l to obtain the weighted trunk-branch feature X′_{t,l} = a_l · X_{t,l}, and go to step 5.4.5.1.4.
  Step 5.4.5.1.4: splice X′_{t,l} with the cross-stage branch feature X_{c,l}, obtaining the splice-fusion feature X′_l, and go to step 5.4.5.1.5.
  Step 5.4.5.1.5: if l < 3, let l = l + 1 and j = 1 and go to step 5.4.5.1.2; if l = 3, the enhanced visual features X′ fused with text information are obtained; go to step 5.4.5.1.6.
  Step 5.4.5.1.6: send X′ to the prediction module.
  Step 5.4.5.2: the text feature enhancement module in the image-text fusion module receives W from the text feature extraction module and X from the visual feature extraction module. It applies a 3×3 max-pooling operation to each scale's visual feature map of X, generating the patch tokens at each scale, splices them into a patch token matrix and, using the text embeddings as queries and the patch token matrix as keys and values, updates W with a multi-head attention mechanism, obtaining the enhanced text embedding set W′:
  Step 5.4.5.2.1: let l = 1.
  Step 5.4.5.2.2: apply a 3×3 max-pooling computation to the single-scale visual feature map X_l, obtaining the l-th patch token.
  Step 5.4.5.2.3: if l < 3, let l = l + 1 and go to step 5.4.5.2.2; if l = 3, splice the patch tokens, obtaining the patch token matrix, and go to step 5.4.5.2.4.
  Step 5.4.5.2.4: with the text embeddings as queries and the patch token matrix as keys and values, update W with the multi-head attention mechanism, obtaining the enhanced text embedding set W′ fused with image information.
  Step 5.4.5.2.5: send W′ to the prediction module.
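A compact PyTorch sketch of the max-sigmoid attention of step 5.4.5.1 for a single scale: similarity of the trunk branch with every text embedding, max over texts, sigmoid into an attention weight, dynamic weighting, then splicing with the cross-stage branch. Tensor shapes and the dot-product similarity are assumptions consistent with the claim's description.

```python
import torch

def max_sigmoid_fuse(trunk, cross_stage, text_embeds):
    """Max-sigmoid text-to-image attention sketch (claim 7, step 5.4.5.1).

    trunk:       (B, Ch, H, W) trunk-branch visual features
    cross_stage: (B, Ch, H, W) cross-stage branch features
    text_embeds: (C, Ch) text embeddings
    """
    B, Ch, H, W = trunk.shape
    flat = trunk.flatten(2)                                  # (B, Ch, H*W)
    sim = torch.einsum("bcn,kc->bkn", flat, text_embeds)     # similarity per text
    weight = sim.max(dim=1).values.sigmoid()                 # max over texts, sigmoid
    weighted = trunk * weight.view(B, 1, H, W)               # dynamically weight trunk
    return torch.cat([weighted, cross_stage], dim=1)         # splice the two branches

# Example shapes: the fused output has 2*Ch channels.
fused = max_sigmoid_fuse(torch.randn(1, 64, 20, 20),
                         torch.randn(1, 64, 20, 20),
                         torch.randn(5, 64))
print(fused.shape)   # torch.Size([1, 128, 20, 20])
```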
- 8. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method by which the classification head network in step 5.4.6.2 applies L2 normalization to each image embedding in E received from the regression head network, computes the semantic similarity of each image embedding to all text embeddings, generates a similarity vector for each image embedding to obtain the category similarity matrix S of all prediction boxes, and normalizes S to obtain the category information P of all prediction boxes, is:
  Step 5.4.6.2.1: let k = 1 index the k-th image embedding and let j = 1 index the j-th text embedding.
  Step 5.4.6.2.2: apply L2 normalization to the k-th image embedding e_k in E to eliminate scale differences, obtaining the scale-free image embedding ē_k.
  Step 5.4.6.2.3: if k < K, let k = k + 1 and go to step 5.4.6.2.2; if k = K, all K image embeddings have had their scale differences eliminated; let k = 1 and go to step 5.4.6.2.4.
  Step 5.4.6.2.4: apply L2 normalization to the j-th text embedding in W′ to eliminate scale differences, obtaining the scale-free text embedding w̄_j, 1 ≤ j ≤ C.
  Step 5.4.6.2.5: if j < C, let j = j + 1 and go to step 5.4.6.2.4; if j = C, all C text embeddings have had their scale differences eliminated; let j = 1 and go to step 5.4.6.2.6.
  Step 5.4.6.2.6: compute the semantic similarity of ē_k and w̄_j as s_{k,j} = α·(ē_k · w̄_j) + β, where α is a scaling factor (0 < α) and β a bias factor (−5 ≤ β).
  Step 5.4.6.2.7: if j < C, let j = j + 1 and go to step 5.4.6.2.6; if j = C, the category similarity vector s_k of e_k has been generated; put s_k into the category similarity matrix S of all prediction boxes and go to step 5.4.6.2.8.
  Step 5.4.6.2.8: if k < K, let k = k + 1 and j = 1 and go to step 5.4.6.2.6; if k = K, the category similarity matrix S of all K prediction boxes has been generated; go to step 5.4.6.2.9.
  Step 5.4.6.2.9: apply Softmax normalization to S, obtaining the similarity score matrix of the K prediction boxes over the C text embeddings, called the prediction box category information P.
  Step 5.4.6.2.10: send P to the depth-guided text embedding learning module. (A vectorized sketch follows this claim.)
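Steps 5.4.6.2.2 through 5.4.6.2.9 in vectorized form: L2 normalization of both embedding sets, the scaled similarity α(ē·w̄) + β, then Softmax. The α and β defaults are placeholders, since the claim only bounds them.

```python
import torch
import torch.nn.functional as F

def classify_embeddings(img_embeds, txt_embeds, alpha=1.0, beta=0.0):
    """Per-box category information (claim 8): L2-normalize image and text
    embeddings, compute alpha*(e . w) + beta, then Softmax over the C texts.
    alpha/beta defaults are assumed values."""
    e = F.normalize(img_embeds, dim=-1)       # (K, D), scale differences removed
    w = F.normalize(txt_embeds, dim=-1)       # (C, D)
    sim = alpha * (e @ w.T) + beta            # (K, C) category similarity matrix S
    return sim.softmax(dim=-1)                # category information P

probs = classify_embeddings(torch.randn(4, 64), torch.randn(3, 64))
print(probs.sum(dim=-1))                      # each row sums to 1
```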
- 9. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that in step 5.4.7 the method by which the depth-based objectness assessment module counts the pixel proportions of the prediction boxes in the depth map and the depth-variation map to obtain the depth score set S_d is:
  Step 5.4.7.1: let k = 1 index the k-th prediction box.
  Step 5.4.7.2: compute the proportion of the k-th prediction box's pixels in the depth map whose value lies below a first threshold, as the foreground probability p_fg^k.
  Step 5.4.7.3: compute the proportion of the k-th prediction box's pixels in the depth-variation map whose value lies below a second threshold, obtaining the objectness probability p_obj^k.
  Step 5.4.7.4: compute the initial depth score s′_k = w₁·p_fg^k + w₂·p_obj^k, with first weight w₁ and second weight w₂.
  Step 5.4.7.5: optimize the reliability of the initial depth score: when p_fg^k or p_obj^k of the k-th prediction box falls below 0.5, apply an exponential penalty to the initial depth score, obtaining the depth score s_k = s′_k · e^(−γ), where e is the natural constant and γ the exponential decay coefficient; put s_k into the depth score set S_d.
  Step 5.4.7.6: if k < K, let k = k + 1 and go to step 5.4.7.2; if k = K, the depth scores of all prediction boxes have been generated, yielding the depth score set S_d = {s₁, …, s_K}; go to step 5.4.7.7.
  Step 5.4.7.7: send S_d to the depth-guided text embedding learning module. (A scoring sketch follows this claim.)
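A sketch of the per-box depth score above. The thresholds, weights and decay coefficient are unnamed hyper-parameters in the claim, and applying the penalty as a plain factor e^(−γ) is one reading of the exponential penalty.

```python
import numpy as np

def depth_score(depth_crop, var_crop, t1=0.5, t2=0.1,
                w1=0.5, w2=0.5, gamma=1.0):
    """Depth-based objectness score for one prediction box (claim 9).

    depth_crop / var_crop: the box's pixels cut from the normalized depth map
    and depth-variation map. t1, t2, w1, w2 and gamma are assumed values.
    """
    p_fg = float((depth_crop < t1).mean())    # foreground probability: share of
                                              # pixels below the first threshold
    p_obj = float((var_crop < t2).mean())     # objectness probability from the
                                              # depth-variation map
    s = w1 * p_fg + w2 * p_obj                # initial depth score
    if p_fg < 0.5 or p_obj < 0.5:             # penalize likely-background boxes
        s *= np.exp(-gamma)
    return s
```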
- 10. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method by which the depth-guided text embedding learning module in step 5.4.8 depth-screens the prediction boxes to obtain the pseudo-label set U is:
  Step 5.4.8.1: let k = 1 and let the screened pseudo-label set U be the empty set.
  Step 5.4.8.2: extract from P the similarity score s_unk^k between the unknown-class text embedding and the k-th prediction box, and the similarity score s_obj^k between the generic "object" text embedding and the k-th prediction box.
  Step 5.4.8.3: if the scores are below the hyper-parameter τ₁, let k = k + 1 and go to step 5.4.8.2; otherwise the k-th prediction box is judged to contain a target object; go to step 5.4.8.4.
  Step 5.4.8.4: compute the intersection-over-union IoU between the k-th prediction box and all labeled boxes of the truth labels, and record the maximum as IoU_max.
  Step 5.4.8.5: if IoU_max is above the hyper-parameter τ₂, let k = k + 1 and go to step 5.4.8.2; otherwise apply unknown-class pseudo-labeling to the selected bounding box using its depth score in S_d, add the pseudo-label to the pseudo-label set U, and go to step 5.4.8.6.
  Step 5.4.8.6: if k < K, let k = k + 1 and go to step 5.4.8.2; if k = K, the pseudo-label set U is obtained; go to step 5.4.8.7.
  Step 5.4.8.7: send U together with P and B_box to the loss function calculation module. (A screening sketch follows this claim.)
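A sketch of this depth-guided screening loop: a box must look like a target under the unknown/"object" similarity scores (step 5.4.8.3) and must not overlap any labeled box beyond the IoU threshold (step 5.4.8.5) before it receives an unknown pseudo-label carrying its depth score. The threshold values and the final ranking by depth score are assumptions.

```python
import numpy as np

def box_iou(a, b):
    """IoU between one box a and an array of boxes b, all (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def screen_pseudo_labels(boxes, sim_unknown, sim_object, depth_scores,
                         gt_boxes, tau_sim=0.3, tau_iou=0.5):
    """Depth-guided pseudo-label screening sketch (claim 10).
    tau_sim/tau_iou stand in for the two unnamed hyper-parameters."""
    pseudo = []
    for k, box in enumerate(boxes):
        score = max(sim_unknown[k], sim_object[k])      # does the box contain a target?
        if score < tau_sim:
            continue
        if len(gt_boxes) and box_iou(box, gt_boxes).max() >= tau_iou:
            continue                                    # overlaps a labeled object
        pseudo.append((box, depth_scores[k]))           # unknown pseudo-label
    pseudo.sort(key=lambda t: -t[1])                    # prefer high depth scores
    return pseudo
```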
- 11. The intelligent driving target detection method based on depth information perception according to claim 1, characterized in that the method by which the supervised-truth construction module in step 5.4.9 constructs the supervision truth of the b-th batch is:
  Step 5.4.9.1: the supervised-truth construction module uses the Hungarian algorithm to pair the classification truth labels of the truth labels one by one with the prediction box category information P, generating the corresponding category-information supervision truth Y_cls (see the matching sketch after this claim).
  Step 5.4.9.2: replace the classification truth values of Y_cls at the positions corresponding to the pseudo-labels in U with the corresponding depth scores in S_d.
  Step 5.4.9.3: pair the position truth labels of the truth labels one by one with the prediction box positions B_box, generating the corresponding position-information supervision truth Y_pos.
  Step 5.4.9.4: send Y_cls, Y_pos, P and B_box to the loss value calculation module.
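Step 5.4.9.1's one-to-one pairing, using SciPy's Hungarian solver. The definition of the matching cost is not given in the claim, so the example uses an arbitrary matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_truth_to_predictions(cost):
    """One-to-one pairing of truth labels with prediction boxes via the
    Hungarian algorithm (claim 11, step 5.4.9.1). cost[i, j] is the
    dissimilarity between truth i and prediction j; its definition (e.g.
    IoU- or classification-based) is an assumption left open by the claim."""
    rows, cols = linear_sum_assignment(cost)   # minimal-cost assignment
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Example: 2 truths vs 3 predictions.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.6]])
print(match_truth_to_predictions(cost))        # [(0, 0), (1, 1)]
```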
Description
Intelligent driving target detection method based on depth information perception

Technical Field

The invention relates to the field of intelligent driving target detection, in particular to a target detection method for intelligent driving scenes based on depth information perception.

Background

Target detection for intelligent driving scenes has long been a key task for achieving high-level automatic driving. However, the complex open-world environment and the large number of object categories are key factors restricting the deployment of target detection methods. Most existing target detection algorithms are based on predefined categories: they can only detect a fixed set of classes consistent with the training data and cannot detect unknown objects beyond it. Unusual unknown objects, such as a wild animal appearing suddenly or a stone in the middle of the road, therefore go undetected, which greatly affects the safety of intelligent driving and easily causes serious traffic accidents. To address the detection of arbitrary categories, visual-language target detection models have been proposed and have achieved powerful zero-shot recognition capability. A visual-language target detection model maps categories into a text semantic space using text or word representations and is trained on large-scale image-text data to acquire rich object and scene knowledge, making it a new research hotspot in the target detection field. The document "Cheng T, Song L, Ge Y, et al. YOLO-World: Real-time open-vocabulary object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 16901-16911." (YOLO-World) describes a method that represents the classes of a target detection dataset as text descriptions; based on the detector of "Varghese R, Sambath M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness[C]//2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS). IEEE, 2024: 1-6." (YOLOv8), it adds multiple image-text feature fusion layers and trains on multiple large-scale datasets, realizing a powerful zero-shot recognition capability that can detect diverse object classes, including classes never seen during training, by recognizing text vocabulary. Without training on the LVIS dataset (see "Gupta A, Dollar P, Girshick R. LVIS: A dataset for large vocabulary instance segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 5356-5364."), the visual-language target detection model YOLO-World reaches a precision of 35.4 mAP (mean Average Precision), ranking first in zero-shot recognition on LVIS, which proves that the model possesses rich object and scene knowledge and powerful zero-shot detection performance. YOLO-World also reaches an inference speed of 52.0 FPS on a V100 GPU, exhibiting significant advantages in both accuracy and real-time performance. However, YOLO-World is a generic model rather than a specialized one: although excellent on generic test datasets such as LVIS, it performs poorly on the intelligent driving dataset nu-OWODB (see "Li Z, Xiang Z, West J, et al. From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects[J]. arXiv preprint arXiv:2411.18207, 2024." (OVOW)) and is prone to missing unusual unknown objects (e.g., road blocks, signs). Therefore "Liu L, Feng J, Chen H, et al. YOLO-UniOW: Efficient universal open-world object detection[J]. arXiv preprint arXiv:2412.20645, 2024." (YOLO-UniOW), building on YOLO-World, detects unknown-class objects by designing unknown-class text embeddings that help the model through semantic information; however, the class texts available during training are limited, so not all unknown-class objects can be detected. In view of the above, further research into model design and training strategies is required to apply the current advanced YOLO-World model to intelligent driving scenarios. How to enhance the YOLO-World model's detection of unknown objects in intelligent driving scenes, and how to improve the model's ability to perceive unknown objects while maintaining its detection accuracy on known objects, is the key for an intelligent driving target detection method to cope with complex driving environments.

Disclosure of Invention

Aiming at the problem that existing visual-language target detection models (such as YOLO-UniOW) have poor detection precision on unknown objects in intelligent driving scenes, so that existing advanced detection technology cannot be applied to such scenes, the invention provides an intelligent driving target detection method based on depth information perception.