CN-121661334-B - Target detection method and electronic equipment


Abstract

The invention discloses a target detection method and an electronic device. The method comprises: acquiring an image for inference and a data set text prompt, wherein the data set text prompt comprises a plurality of text prompts containing scale information; and inputting the image for inference and the data set text prompt into a target detection model to obtain a predicted target type and a predicted target position output by the target detection model. The target detection model is constructed based on a deformable attention model and uses the text prompts containing scale information to perform target detection on the image. Because detection is guided by text prompts that carry scale information, small targets are described more accurately and the model's semantic understanding of small targets is improved. The invention thereby extracts and localizes small-target features more effectively, improving the detection accuracy for small targets.

Inventors

LI MINGZHU
YANG XUEJUN
HU YUNLONG
DU LINGSHUANG
LI DIXIONG
LI YUMIAO
XIE MENG

Assignees

SAIC General Motors Corporation Limited (上汽通用汽车有限公司)

Dates

Publication Date: 20260512
Application Date: 20260204

Claims (9)

1. A target detection method, comprising: acquiring an image for inference and a data set text prompt, wherein the data set text prompt comprises a plurality of text prompts containing scale information; and inputting the image for inference and the data set text prompt into a target detection model to obtain a predicted target type and a predicted target position output by the target detection model, wherein the target detection model is constructed based on a deformable attention model, the target detection model uses the text prompts containing scale information to perform target detection on the image, and the target detection model comprises a backbone network, a text encoding module, an encoder layer, a decoder layer and a detection head which are based on the deformable attention model, wherein: the backbone network is used for extracting image features of a plurality of scales from the image and inputting the image features of the plurality of scales into the encoder layer; the text encoding module is used for acquiring the data set text prompt, encoding the data set text prompt to obtain a text embedding, and adding the text embedding to an original target query to obtain a target query containing scale information, wherein the original target query comprises a plurality of query features; the encoder layer is used for encoding the image features of the plurality of scales to obtain image feature codes of the plurality of scales; the decoder layer is used for interacting, by means of an attention mechanism, the image feature codes of the plurality of scales output by the encoder layer with the target query to obtain a plurality of decoder layer output features; and the detection head is used for outputting the predicted target type and the predicted target position according to the decoder layer output features.
2. The target detection method of claim 1, wherein the scale information comprises a plurality of scale types, and wherein encoding the data set text prompt to obtain a text embedding and adding the text embedding to the original target query to obtain the target query containing scale information comprises: encoding the data set text prompt to obtain the text embedding; splicing first learnable weight matrices of the plurality of scale types to obtain a spliced weight matrix, and multiplying the spliced weight matrix by the text embedding to obtain a predicted text embedding; and adding the predicted text embedding to the original target query to obtain the target query containing scale information.
3. The target detection method of claim 1, wherein interacting, by means of the attention mechanism, the image feature codes of the plurality of scales output by the encoder layer with each target query containing scale information to obtain the plurality of decoder layer output features comprises: for each target query containing scale information, performing the following operations: multiplying the target query by a second learnable weight matrix of each scale to obtain a group of normalized offsets for each scale, wherein the number of normalized offsets differs between scales, and the smaller the scale, the larger the number of normalized offsets; superposing the normalized offsets of each scale on the reference position corresponding to the target query to obtain sampling points of each scale; acquiring an image feature code of each sampling point from the image feature code of each scale; multiplying the image feature codes of the sampling points of each scale by a third learnable weight matrix of each scale to obtain attention value vectors; and performing attention fusion based on the target query and the attention value vectors to obtain the decoder layer output features.
4. The target detection method of claim 1, wherein the target detection model further comprises a text embedding branch for outputting a predicted text embedding.
5. The target detection method of claim 4, wherein the target detection model further comprises a center point locating branch for outputting a Gaussian heat map.
6. The target detection method according to claim 5, further comprising: acquiring training data, wherein the training data comprises an image for training and a target label for training, and the target label comprises, for each target, a matching target type, a target position and a text prompt containing scale information; and inputting the training data into the target detection model for a plurality of training iterations, and executing the following steps in each training iteration: obtaining a predicted target type, a predicted target position, a predicted text embedding and a predicted Gaussian heat map which are output by the target detection model; taking the target type of the target label as a target type true value and taking the target position of the target label as a position true value; taking the text embedding obtained by encoding the text prompt of the target label as a text embedding true value; determining, according to the target position of the target label, a Gaussian heat map corresponding to the image for training as a Gaussian heat map true value; calculating a type loss value based on the predicted target type and the target type true value; calculating a position loss value based on the predicted target position and the position true value; calculating a text loss value based on the predicted text embedding and the text embedding true value; calculating a Gaussian heat map loss value based on the predicted Gaussian heat map and the Gaussian heat map true value; calculating a total loss value based on the type loss value, the position loss value, the text loss value and the Gaussian heat map loss value; and adjusting the target detection model based on the total loss value, and executing the next training iteration.
7. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions, when executed by the at least one processor, enabling the at least one processor to perform the target detection method of any one of claims 1 to 6.
8. A storage medium storing computer instructions which, when executed by a computer, cause the computer to carry out all the steps of the target detection method according to any one of claims 1 to 6.
9. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the target detection method according to any one of claims 1 to 6.
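As a non-limiting illustration, the per-scale sampling and attention fusion recited in claim 3 might be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the claimed implementation: the helper names (`matvec`, `bilinear`, `decode_query`), the scale labels, the point counts (8, 4, 2), the feature-map sizes, the tanh-bounded offset range `max_shift`, bilinear interpolation, and the dot-product softmax used for fusion are all hypothetical, because the source does not specify them. Random matrices stand in for the learnable weight matrices.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    """Random stand-in for a learnable weight matrix (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def bilinear(feat, x, y):
    """Bilinearly sample feat[row][col] (a C-vector) at normalized (x, y) in [0, 1]."""
    h, w = len(feat), len(feat[0])
    fx, fy = x * (w - 1), y * (h - 1)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = fx - x0, fy - y0
    return [
        (1 - ay) * ((1 - ax) * feat[y0][x0][c] + ax * feat[y0][x1][c])
        + ay * ((1 - ax) * feat[y1][x0][c] + ax * feat[y1][x1][c])
        for c in range(len(feat[0][0]))
    ]


def clamp01(v):
    """Keep a normalized coordinate inside the feature map."""
    return min(max(v, 0.0), 1.0)


def decode_query(query, ref, feats, w_off, w_val, max_shift=0.1):
    """Run one target query through per-scale sampling and attention fusion."""
    values, points = [], {}
    for s, feat in feats.items():
        raw = matvec(w_off[s], query)  # 2 * K_s numbers for K_s sampling points
        points[s] = []
        for k in range(0, len(raw), 2):
            # tanh bounds each normalized offset before it is added to the reference.
            px = clamp01(ref[0] + max_shift * math.tanh(raw[k]))
            py = clamp01(ref[1] + max_shift * math.tanh(raw[k + 1]))
            points[s].append((px, py))
            # Project the sampled feature code with the per-scale value matrix.
            values.append(matvec(w_val[s], bilinear(feat, px, py)))
    # Assumed fusion: softmax over scaled dot products of the query and each value.
    scores = [sum(q * v for q, v in zip(query, val)) / math.sqrt(len(query))
              for val in values]
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    out = [sum(e / total * val[d] for e, val in zip(exps, values))
           for d in range(len(query))]
    return out, points


rng = random.Random(0)
D, C = 8, 4  # query width and feature channels (assumed)
num_points = {"small": 8, "medium": 4, "large": 2}  # more points for smaller scales
map_side = {"small": 16, "medium": 8, "large": 4}   # assumed feature-map side lengths
feats = {s: [[[rng.random() for _ in range(C)] for _ in range(n)] for _ in range(n)]
         for s, n in map_side.items()}
w_off = {s: rand_matrix(2 * k, D, rng) for s, k in num_points.items()}
w_val = {s: rand_matrix(D, C, rng) for s in num_points}
query = [rng.uniform(-1, 1) for _ in range(D)]
out, points = decode_query(query, (0.5, 0.5), feats, w_off, w_val)
print({s: len(p) for s, p in points.items()}, len(out))
# prints {'small': 8, 'medium': 4, 'large': 2} 8
```

The only property taken directly from claim 3 is that the smaller scale receives more sampling points; which feature-map resolution corresponds to which scale is an assumption of this sketch.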

Description

Target detection method and electronic equipment

Technical Field

The present invention relates to the field of target detection in computer vision, and in particular to a target detection method, an electronic device, a storage medium, and a computer program product.

Background

Small target detection is a computer vision technique for identifying and locating targets that occupy only a small proportion of the image pixels, and it is widely used in the traffic field. Small targets are common in automotive scenes, for example traffic signs, traffic lights, and distant pedestrians and vehicles in automatic driving, and the accuracy with which they are detected affects traffic safety.

The core difficulty of small target detection lies in two related aspects. First, because of their low resolution, small targets have sparse visual detail and are more susceptible to image noise, which further blurs key features; in addition, the available contextual information is often insufficient, making it difficult for a model to extract feature representations that are sufficiently robust and discriminative. Second, the localization accuracy requirement for small targets is very strict: when the bounding box of a small target is offset by a single pixel, the resulting relative error is far greater than for a large target. Together, these two difficulties make small target detection a hard problem in computer vision.

Disclosure of Invention

Based on this, it is necessary to provide a target detection method, an electronic device, a storage medium, and a computer program product.
The invention provides a target detection method, comprising: acquiring an image for inference and a data set text prompt, wherein the data set text prompt comprises a plurality of text prompts containing scale information; and inputting the image for inference and the data set text prompt into a target detection model to obtain a predicted target type and a predicted target position output by the target detection model, wherein the target detection model is constructed based on a deformable attention model and uses the text prompts containing scale information to perform target detection on the image.

Further, the target detection model comprises a backbone network, a text encoding module, an encoder layer based on a deformable attention model, a decoder layer and a detection head, wherein: the backbone network is used for extracting image features of a plurality of scales from the image and inputting the image features of the plurality of scales into the encoder layer; the text encoding module is used for acquiring the data set text prompt, encoding the data set text prompt to obtain a text embedding, and adding the text embedding to an original target query to obtain a target query containing scale information, wherein the original target query comprises a plurality of query features; the encoder layer is used for encoding the image features of the plurality of scales to obtain image feature codes of the plurality of scales; the decoder layer is used for interacting, by means of an attention mechanism, the image feature codes of the plurality of scales output by the encoder layer with the target query to obtain a plurality of decoder layer output features; and the detection head is used for outputting the predicted target type and the predicted target position according to the decoder layer output features.
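By way of example only, the text encoding module's construction of target queries containing scale information might be sketched as follows, using the projection step recited in claim 2. The source gives no tensor shapes, so this sketch adopts one possible reading: the text embedding has one row per prompt (P x D), each scale type owns a Q_s x P weight matrix, and splicing stacks those matrices row-wise so that each query receives a weighted mix of prompt embeddings. The function names `matmul` and `build_scale_queries`, the scale labels, and the toy values are hypothetical; a real text encoder would supply `text_emb`.

```python
def matmul(a, b):
    """Multiply a (n x k) by b (k x m); both are lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def build_scale_queries(text_emb, weights_by_scale, queries):
    """Return target queries that carry scale information.

    text_emb:         P x D, one row per scale-bearing text prompt (assumed shape)
    weights_by_scale: {scale type: Q_s x P matrix}, stand-ins for learnable weights
    queries:          Q x D original target queries, with Q equal to the sum of Q_s
    """
    # Splice the per-scale matrices by stacking their rows into one Q x P matrix.
    spliced = [row for w in weights_by_scale.values() for row in w]
    if len(spliced) != len(queries):
        raise ValueError("spliced weight rows must match the number of queries")
    predicted = matmul(spliced, text_emb)  # Q x D predicted text embedding
    # Element-wise addition of the predicted text embedding to each query.
    return [[q + p for q, p in zip(q_row, p_row)]
            for q_row, p_row in zip(queries, predicted)]


text_emb = [[1.0, 0.0], [0.0, 1.0]]             # toy embeddings for two prompts
weights = {"small": [[1.0, 0.0], [1.0, 0.0]],   # two queries read prompt 0
           "large": [[0.0, 1.0]]}               # one query reads prompt 1
queries = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
scale_queries = build_scale_queries(text_emb, weights, queries)
print(scale_queries)  # [[1.5, 0.5], [1.5, 0.5], [0.5, 1.5]]
```

With one-hot weight rows as in this toy input, each query simply receives the embedding of its own scale's prompt; learned weights would instead blend the prompt embeddings.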
Still further, the scale information comprises a plurality of scale types, and encoding the data set text prompt to obtain a text embedding and adding the text embedding to the original target query to obtain the target query containing scale information comprises: encoding the data set text prompt to obtain the text embedding; splicing first learnable weight matrices of the plurality of scale types to obtain a spliced weight matrix, and multiplying the spliced weight matrix by the text embedding to obtain a predicted text embedding; and adding the predicted text embedding to the original target query to obtain the target query containing scale information.

Further, interacting, by means of the attention mechanism, the image feature codes of the plurality of scales output by the encoder layer with each target query containing scale information to obtain the plurality of decoder layer output features comprises: for each target query containing scale information, performing the following operations: multiplying the target query by a second learnable weight matrix of each scale to obtain a group of normalized offsets for each scale, wherein the number of normalized offsets differs between scales, and the smaller the scale, the larger the number of normalized offsets; superposing the normalized offset