CN-116363459-B - Target detection method, model training method, device, electronic equipment and medium


Abstract

The disclosure provides a target detection method, a model training method, a device, electronic equipment and a medium. It relates to the technical field of artificial intelligence, and in particular to computer vision, image processing, deep learning and the like. The method comprises: obtaining an image to be detected and performing feature extraction on it to obtain an image feature map of the image to be detected; encoding the image feature map through an encoder of a pre-trained target detection network to obtain global attention features of the image to be detected; performing feature mapping on the global attention features through a first decoder of the target detection network to obtain regression features of the image to be detected; and performing feature mapping on the global attention features through a second decoder of the target detection network to obtain classification features of the image to be detected.

Inventors

CHEN ZILIANG

Assignees

Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date: 20260512
Application Date: 20230327

Claims (12)

1. A target detection method, comprising: acquiring an image to be detected, and performing feature extraction on the image to be detected to obtain an image feature map of the image to be detected; encoding the image feature map through an encoder of a pre-trained target detection network to obtain a global attention feature of the image to be detected; determining, through a first decoder of the target detection network, a first content feature and a first key value feature according to the global attention feature, determining a first query feature based on a preset query feature vector, and performing cross-attention processing based on the first content feature, the first key value feature and the first query feature to complete feature mapping and obtain a regression feature of the image to be detected; determining, through a second decoder of the target detection network, a second content feature and a second key value feature according to the global attention feature, determining a second query feature based on a regression feature vector, and performing cross-attention processing based on the second content feature, the second key value feature and the second query feature to complete feature mapping and obtain a classification feature of the image to be detected, wherein the first decoder and the second decoder are cascaded such that the output of the first decoder is the input of the second decoder; and inputting the regression feature into a regression prediction layer of the target detection network to obtain a position of a prediction frame, and inputting the classification feature into a classification prediction layer of the target detection network to obtain a category of a target in the prediction frame.
2. The method of claim 1, wherein the first decoder is configured identically to the second decoder.
3. The method of claim 1, wherein performing feature extraction on the image to be detected to obtain the image feature map of the image to be detected comprises: performing feature extraction on the image to be detected through a backbone branch of the target detection network to obtain the image feature map of the image to be detected.
4. A training method of a target detection model, comprising: acquiring an image to be trained, a position of a target frame corresponding to a target in the image to be trained, and a category of the target in the image to be trained; acquiring a target detection network, performing feature extraction on the image to be trained to obtain an image feature map of the image to be trained, and encoding the image feature map through an encoder of the target detection network to obtain a global attention feature of the image to be trained; determining, through a first decoder of the target detection network, a first content feature and a first key value feature according to the global attention feature, determining a first query feature based on a preset query feature vector, and performing cross-attention processing based on the first content feature, the first key value feature and the first query feature to complete feature mapping and obtain a regression feature of the image to be trained; determining, through a second decoder of the target detection network, a second content feature and a second key value feature according to the global attention feature, determining a second query feature based on a regression feature vector, and performing cross-attention processing based on the second content feature, the second key value feature and the second query feature to complete feature mapping and obtain a classification feature of the image to be trained, wherein the first decoder and the second decoder are cascaded such that the output of the first decoder is the input of the second decoder; determining a regression loss and a classification loss based on the regression feature and the classification feature; and training the target detection network according to the regression loss and the classification loss.
5. The method of claim 4, wherein determining the regression loss and the classification loss based on the regression feature and the classification feature comprises: inputting the regression feature into a regression prediction layer of the target detection network to obtain a position of a prediction frame; inputting the classification feature into a classification prediction layer of the target detection network to obtain a category of a target in the prediction frame; and determining the regression loss according to the position of the prediction frame and the position of the target frame, and determining the classification loss according to the category of the target in the image to be trained and the category of the target in the prediction frame.
6. The method of claim 4, wherein the first decoder has a structure identical to that of the second decoder.
7. The method of claim 4, wherein the target detection network further comprises a backbone branch, and performing feature extraction on the image to be trained to obtain the image feature map of the image to be trained comprises: performing feature extraction on the image to be trained through the backbone branch of the target detection network to obtain the image feature map of the image to be trained.
8. A target detection apparatus, comprising: a backbone network module, configured to acquire an image to be detected and perform feature extraction on the image to be detected to obtain an image feature map of the image to be detected; an encoder module, configured to encode the image feature map through an encoder of a pre-trained target detection network to obtain a global attention feature of the image to be detected; a decoder module, configured to determine, through a first decoder of the target detection network, a first content feature and a first key value feature according to the global attention feature, determine a first query feature based on a preset query feature vector, and perform cross-attention processing based on the first content feature, the first key value feature and the first query feature to complete feature mapping and obtain a regression feature of the image to be detected, and to determine, through a second decoder of the target detection network, a second content feature and a second key value feature according to the global attention feature, determine a second query feature based on a regression feature vector, and perform cross-attention processing based on the second content feature, the second key value feature and the second query feature to complete feature mapping and obtain a classification feature of the image to be detected, wherein the first decoder and the second decoder are cascaded such that the output of the first decoder is the input of the second decoder; and a prediction module, configured to input the regression feature into a regression prediction layer of the target detection network to obtain a position of a prediction frame, and to input the classification feature into a classification prediction layer of the target detection network to obtain a category of a target in the prediction frame.
9. A training device for a target detection model, comprising: a data acquisition module, configured to acquire an image to be trained, a position of a target frame corresponding to a target in the image to be trained, and a category of the target in the image to be trained; a feature training module, configured to perform feature extraction on the image to be trained to obtain an image feature map of the image to be trained; a decoding training module, configured to determine, through a first decoder of the target detection network, a first content feature and a first key value feature according to the global attention feature, determine a first query feature based on a preset query feature vector, and perform cross-attention processing based on the first content feature, the first key value feature and the first query feature to complete feature mapping and obtain a regression feature of the image to be trained, and to determine, through a second decoder of the target detection network, a second content feature and a second key value feature according to the global attention feature, determine a second query feature based on a regression feature vector, and perform cross-attention processing based on the second content feature, the second key value feature and the second query feature to complete feature mapping and obtain a classification feature of the image to be trained, wherein the first decoder and the second decoder are cascaded such that the output of the first decoder is the input of the second decoder; and a back propagation module, configured to determine a regression loss and a classification loss based on the regression feature and the classification feature, and to train the target detection network according to the regression loss and the classification loss.
10. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the target detection method of any one of claims 1-3 and the training method of the target detection model of any one of claims 4-7.
11. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the target detection method according to any one of claims 1-3 and the training method of the target detection model according to any one of claims 4-7.
12. A computer program product comprising a computer program which, when executed by a processor, implements the target detection method according to any one of claims 1-3 and the training method of the target detection model according to any one of claims 4-7.
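
The cascaded decoders recited in claims 1, 4, 8 and 9 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes single-head dot-product cross-attention, reads the "key value feature" as attention keys and the "content feature" as attention values, and uses random weights. The names `CrossAttentionDecoder`, `detect_features`, `w_k`, `w_v` and `w_q` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vec):
    """Multiply an (out x in) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class CrossAttentionDecoder:
    """Single-head cross-attention layer (a hypothetical simplification)."""

    def __init__(self, dim, rng):
        self.dim = dim
        self.w_k = random_matrix(rng, dim, dim)  # memory -> key features (assumed)
        self.w_v = random_matrix(rng, dim, dim)  # memory -> content features (assumed)
        self.w_q = random_matrix(rng, dim, dim)  # input vectors -> query features

    def __call__(self, queries, memory):
        keys = [linear(self.w_k, m) for m in memory]
        values = [linear(self.w_v, m) for m in memory]
        outputs = []
        for q in queries:
            qp = linear(self.w_q, q)
            scale = math.sqrt(self.dim)
            scores = [sum(a * b for a, b in zip(qp, k)) / scale for k in keys]
            attn = softmax(scores)
            outputs.append(
                [sum(a * v[j] for a, v in zip(attn, values)) for j in range(self.dim)]
            )
        return outputs


def detect_features(memory, preset_queries, first_decoder, second_decoder):
    """Cascade: the first decoder's output is the second decoder's query input."""
    regression_features = first_decoder(preset_queries, memory)
    classification_features = second_decoder(regression_features, memory)
    return regression_features, classification_features


rng = random.Random(0)
dim, num_tokens, num_queries = 4, 6, 3
memory = random_matrix(rng, num_tokens, dim)           # stand-in for global attention features
preset_queries = random_matrix(rng, num_queries, dim)  # stand-in for preset query vectors
first_decoder = CrossAttentionDecoder(dim, rng)
second_decoder = CrossAttentionDecoder(dim, rng)       # same structure, separate weights
reg_feats, cls_feats = detect_features(memory, preset_queries, first_decoder, second_decoder)
```

Both decoders attend over the same encoder output, but only the first one sees the preset queries; the second derives its queries from the regression features, which is what makes the arrangement a cascade rather than two parallel branches.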

Description

Target detection method, model training method, device, electronic equipment and medium

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of computer vision, image processing, deep learning, and the like. Specifically, the disclosure relates to a target detection method, a model training method, a device, electronic equipment and a medium.

Background

A target detection task generally involves two subtasks: a classification task, which determines the class of a target object, and a regression task, which determines the size and location of the target object. The features required by the classification task and the regression task tend to be inconsistent and naturally conflict.

Disclosure of Invention

The disclosure provides a target detection method, a model training method, a device, electronic equipment and a medium.
According to a first aspect of the present disclosure, there is provided a target detection method comprising: acquiring an image to be detected, and performing feature extraction on the image to be detected to obtain an image feature map of the image to be detected; encoding the image feature map through an encoder of a pre-trained target detection network to obtain a global attention feature of the image to be detected; performing feature mapping on the global attention feature through a first decoder of the target detection network to obtain a regression feature of the image to be detected; performing feature mapping on the global attention feature through a second decoder of the target detection network to obtain a classification feature of the image to be detected; and inputting the regression feature into a regression prediction layer of the target detection network to obtain a position of a prediction frame, and inputting the classification feature into a classification prediction layer of the target detection network to obtain a category of a target in the prediction frame.
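
The final step of the first aspect, in which two prediction layers turn the regression and classification features into a frame position and a category, might look like the sketch below. The disclosure does not specify the form of these layers at this point, so everything here is an assumption: each layer is a single linear map, the frame is encoded as normalized (cx, cy, w, h) via a sigmoid, the category is chosen by softmax argmax, and the weights are hand-picked for illustration. The names `predict_frame` and `predict_category` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, vec):
    """Affine map: weights is (out x in), bias has length out."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def predict_frame(regression_feature, weights, bias):
    """Regression prediction layer (assumed): feature -> normalized (cx, cy, w, h)."""
    return [sigmoid(z) for z in linear(weights, bias, regression_feature)]


def predict_category(classification_feature, weights, bias):
    """Classification prediction layer (assumed): feature -> (category index, probability)."""
    probs = softmax(linear(weights, bias, classification_feature))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hand-picked illustrative weights for a 2-dimensional feature.
w_frame = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
b_frame = [0.0, 0.0, 0.0, 0.0]
w_cls = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]  # three hypothetical categories
b_cls = [0.0, 0.0, 0.0]

feature = [1.0, 0.0]
frame = predict_frame(feature, w_frame, b_frame)
category, confidence = predict_category(feature, w_cls, b_cls)
```

Keeping the two heads separate mirrors the motivation stated in the Background: the frame position is read only from the regression feature and the category only from the classification feature.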
According to a second aspect of the present disclosure, there is provided a training method of a target detection model, the method comprising: acquiring an image to be trained, a position of a target frame corresponding to a target in the image to be trained, and a category of the target in the image to be trained; acquiring a target detection network, performing feature extraction on the image to be trained to obtain an image feature map of the image to be trained, and encoding the image feature map through an encoder of the target detection network to obtain a global attention feature of the image to be trained; performing feature mapping on the global attention feature through a first decoder of the target detection network to obtain a regression feature of the image to be trained; performing feature mapping on the global attention feature through a second decoder of the target detection network to obtain a classification feature of the image to be trained; determining a regression loss and a classification loss based on the regression feature and the classification feature; and training the target detection network according to the regression loss and the classification loss.
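
The loss computation in the second aspect can be sketched as follows. The disclosure states only that a regression loss compares prediction-frame and target-frame positions and that a classification loss compares categories; the specific choices here (mean absolute error for frames, cross-entropy over class logits, a weighted sum as the training objective, and predictions already paired one-to-one with targets) are assumptions, and the names `l1_frame_loss`, `cross_entropy` and `detection_loss` are hypothetical.

```python
import math


def l1_frame_loss(pred_frame, target_frame):
    """Mean absolute error between predicted and target frame coordinates."""
    return sum(abs(p - t) for p, t in zip(pred_frame, target_frame)) / len(target_frame)


def cross_entropy(logits, target_class):
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_class]


def detection_loss(pred_frames, pred_logits, target_frames, target_classes,
                   reg_weight=1.0, cls_weight=1.0):
    """Return (regression loss, classification loss, weighted total).

    Assumes prediction i is already matched to target i.
    """
    n = len(target_frames)
    reg = sum(l1_frame_loss(p, t) for p, t in zip(pred_frames, target_frames)) / n
    cls = sum(cross_entropy(z, c) for z, c in zip(pred_logits, target_classes)) / n
    return reg, cls, reg_weight * reg + cls_weight * cls


# One target whose frame is predicted exactly but whose class logits are uninformative.
pred_frames = [[0.5, 0.5, 0.2, 0.3]]
target_frames = [[0.5, 0.5, 0.2, 0.3]]
pred_logits = [[0.0, 0.0]]
target_classes = [0]
reg_loss, cls_loss, total_loss = detection_loss(
    pred_frames, pred_logits, target_frames, target_classes
)
```

In a real training loop the total would be back-propagated through the prediction layers, both decoders, the encoder and the backbone; that step is omitted here because it depends on the chosen framework.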
According to a third aspect of the present disclosure, there is provided a target detection apparatus comprising: a backbone network module, configured to acquire an image to be detected and perform feature extraction on the image to be detected to obtain an image feature map of the image to be detected; an encoder module, configured to encode the image feature map through an encoder of a pre-trained target detection network to obtain a global attention feature of the image to be detected; a decoder module, configured to perform feature mapping on the global attention feature through a first decoder of the target detection network to obtain a regression feature of the image to be detected; and a prediction module, configured to input the regression feature into a regression prediction layer of the target detection network to obtain a position of a prediction frame, and to input the classification feature into a classification prediction layer of the target detection network to obtain a category of a target in the prediction frame.
According to a fourth aspect of the present disclosure, there is provided a training apparatus of a target detection model, the apparatus comprising: a data acquisition module, configured to acquire an image to be trained, a position of a target frame corresponding to a target in the image to be trained, and a category of the target in the image to be trained; a feature training module, configured to perform feature extraction on the image to be trained to obtain an image feature map of the image to be trained; a decoding training module, configured to perform feature mapping on the global attention feature through a first decoder of the target detection network to obtain the regres