CN-117011620-B - Target detection method and related equipment thereof


Abstract

The application discloses a target detection method and related equipment thereof, which can complete a target detection task quickly, improving the efficiency of the task and, in turn, the user experience. When target detection is required for a target image, the target image containing an object to be detected is first acquired and input into a target model. The target model performs feature extraction on the target image to obtain a first feature of the target image. The target model then encodes the first feature to obtain a second feature of the target image, and decodes the second feature based on a preset query vector to obtain a third feature of the target image. Finally, the target model acquires a detection result of the target image based on the third feature; the detection result may be used to determine position information of the object and a category of the object. Target detection for the target image is thus completed.

Inventors

CHEN XINGHAO
LI SIWEI
YANG YIJING
WANG YUNHE

Assignees

Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date: 2026-05-08
Application Date: 2023-06-29

Claims (20)

1. A target detection method, the method being implemented by a target model, the method comprising: acquiring a target image, wherein the target image comprises an object to be detected; performing feature extraction on the target image to obtain a first feature of the target image; encoding the first feature to obtain a second feature of the target image, wherein the encoding comprises at least one convolution and does not comprise attention-mechanism-based processing; decoding the second feature based on a preset query vector to obtain a third feature of the target image, wherein the decoding comprises at least one convolution and does not comprise attention-mechanism-based processing; and acquiring, based on the third feature, a detection result of the target image, wherein the detection result is used for determining position information of the object and a category of the object; wherein the decoding the second feature based on the preset query vector to obtain the third feature of the target image comprises: performing first processing on the preset query vector to obtain a fourth feature of the query vector, wherein the first processing comprises channel-by-channel convolution and point-by-point convolution; performing second processing on the second feature and the fourth feature to obtain a fifth feature of the target image, wherein the second processing comprises channel-by-channel convolution; and performing third processing on the fifth feature to obtain the third feature of the target image.
2. The method of claim 1, wherein the encoding comprises at least one of a channel-by-channel convolution or a point-by-point convolution.
3. The method of claim 1, wherein the performing first processing on the preset query vector to obtain the fourth feature of the query vector comprises: performing channel-by-channel convolution and point-by-point convolution on the preset query vector to obtain a sixth feature of the query vector; and adding the sixth feature and the query vector to obtain the fourth feature of the query vector.
4. The method of claim 1 or 3, wherein the performing second processing on the second feature and the fourth feature to obtain the fifth feature of the target image comprises: upsampling the fourth feature to obtain a seventh feature of the query vector; fusing the second feature and the seventh feature to obtain an eighth feature of the target image; performing channel-by-channel convolution on the eighth feature to obtain a ninth feature of the target image; and adding the ninth feature and the seventh feature to obtain the fifth feature of the target image.
5. The method of any one of claims 1 to 3, wherein the performing third processing on the fifth feature to obtain the third feature of the target image comprises: processing the fifth feature based on a feedforward neural network to obtain a tenth feature of the target image; adding the fifth feature and the tenth feature to obtain an eleventh feature of the target image; and pooling the eleventh feature to obtain the third feature of the target image.
6. The method of any one of claims 1 to 3, wherein the channel-by-channel convolution comprises at least one of a channel-by-channel standard convolution, a channel-by-channel deformable convolution, or a channel-by-channel dynamic convolution.
7. The method of any one of claims 1 to 3, wherein the point-by-point convolution comprises at least one of a point-by-point standard convolution, a point-by-point deformable convolution, or a point-by-point dynamic convolution.
8. The method of any one of claims 1 to 3, wherein the query vector contains a number of parameters, the number of parameters being associated with the number of objects.
9. A model training method, the method comprising: acquiring a training image, wherein the training image comprises an object to be detected; processing the training image through a model to be trained to obtain a detection result of the training image, wherein the detection result is used for determining position information of the object and a category of the object, and the model to be trained is used for: performing feature extraction on the training image to obtain a first feature of the training image; encoding the first feature to obtain a second feature of the training image, wherein the encoding comprises at least one convolution and does not comprise attention-mechanism-based processing; decoding the second feature based on a preset query vector to obtain a third feature of the training image, wherein the decoding comprises at least one convolution and does not comprise attention-mechanism-based processing; and obtaining the detection result of the training image based on the third feature; and training the model to be trained based on the detection result and a real detection result of the training image to obtain a target model; wherein the decoding the second feature based on the preset query vector to obtain the third feature of the training image comprises: performing first processing on the preset query vector to obtain a fourth feature of the query vector, wherein the first processing comprises channel-by-channel convolution and point-by-point convolution; performing second processing on the second feature and the fourth feature to obtain a fifth feature of the training image, wherein the second processing comprises channel-by-channel convolution; and performing third processing on the fifth feature to obtain the third feature of the training image.
10. The method of claim 9, wherein the encoding comprises at least one of a channel-by-channel convolution or a point-by-point convolution.
11. The method of claim 9, wherein the performing first processing on the preset query vector to obtain the fourth feature of the query vector comprises: performing channel-by-channel convolution and point-by-point convolution on the preset query vector to obtain a sixth feature of the query vector; and adding the sixth feature and the query vector to obtain the fourth feature of the query vector.
12. The method of claim 9 or 11, wherein the performing second processing on the second feature and the fourth feature to obtain the fifth feature of the training image comprises: upsampling the fourth feature to obtain a seventh feature of the query vector; fusing the second feature and the seventh feature to obtain an eighth feature of the training image; performing channel-by-channel convolution on the eighth feature to obtain a ninth feature of the training image; and adding the ninth feature and the seventh feature to obtain the fifth feature of the training image.
13. The method of any one of claims 9 to 11, wherein the performing third processing on the fifth feature to obtain the third feature of the training image comprises: processing the fifth feature based on a feedforward neural network to obtain a tenth feature of the training image; adding the fifth feature and the tenth feature to obtain an eleventh feature of the training image; and pooling the eleventh feature to obtain the third feature of the training image.
14. The method of any one of claims 9 to 11, wherein the channel-by-channel convolution comprises at least one of a channel-by-channel standard convolution, a channel-by-channel deformable convolution, or a channel-by-channel dynamic convolution.
15. The method of any one of claims 9 to 11, wherein the point-by-point convolution comprises at least one of a point-by-point standard convolution, a point-by-point deformable convolution, or a point-by-point dynamic convolution.
16. The method of any one of claims 9 to 11, wherein the query vector contains a number of parameters, the number of parameters being associated with the number of objects.
17. A target detection device, the device comprising a target model, the device comprising: an acquisition module, configured to acquire a target image, wherein the target image comprises an object to be detected; an extraction module, configured to perform feature extraction on the target image to obtain a first feature of the target image; an encoding module, configured to encode the first feature to obtain a second feature of the target image, wherein the encoding comprises at least one convolution and does not comprise attention-mechanism-based processing; a decoding module, configured to decode the second feature based on a preset query vector to obtain a third feature of the target image, wherein the decoding comprises at least one convolution and does not comprise attention-mechanism-based processing; and a detection module, configured to acquire, based on the third feature, a detection result of the target image, wherein the detection result is used for determining position information of the object and a category of the object; wherein the decoding module is configured to: perform first processing on the preset query vector to obtain a fourth feature of the query vector, wherein the first processing comprises channel-by-channel convolution and point-by-point convolution; perform second processing on the second feature and the fourth feature to obtain a fifth feature of the target image, wherein the second processing comprises channel-by-channel convolution; and perform third processing on the fifth feature to obtain the third feature of the target image.
18. A model training apparatus, the apparatus comprising: an acquisition module, configured to acquire a training image, wherein the training image comprises an object to be detected; a processing module, configured to process the training image through a model to be trained to obtain a detection result of the training image, wherein the detection result is used for determining position information of the object and a category of the object, and the model to be trained is used for: performing feature extraction on the training image to obtain a first feature of the training image; encoding the first feature to obtain a second feature of the training image, wherein the encoding comprises at least one convolution and does not comprise attention-mechanism-based processing; decoding the second feature based on a preset query vector to obtain a third feature of the training image, wherein the decoding comprises at least one convolution and does not comprise attention-mechanism-based processing; and obtaining the detection result of the training image based on the third feature; and a training module, configured to train the model to be trained based on the detection result and a real detection result of the training image to obtain a target model; wherein the model to be trained is used for: performing first processing on the preset query vector to obtain a fourth feature of the query vector, wherein the first processing comprises channel-by-channel convolution and point-by-point convolution; performing second processing on the second feature and the fourth feature to obtain a fifth feature of the training image, wherein the second processing comprises channel-by-channel convolution; and performing third processing on the fifth feature to obtain the third feature of the training image.
19. A target detection device comprising a memory and a processor, wherein the memory stores code, the processor is configured to execute the code, and the target detection device performs the method of any one of claims 1 to 16 when the code is executed.
20. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method of any one of claims 1 to 16.
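
The decoding recited in claims 1 and 3 to 5 can be illustrated with a minimal, hedged sketch. It is not the patented implementation: the one-dimensional channels-by-positions layout, the plain-list representation, fusion by element-wise addition, nearest-neighbour upsampling, average pooling, the ReLU inside the feedforward network, and all function and parameter names (`decode`, `params`, `factor`, and so on) are assumptions made only for illustration.

```python
def depthwise_conv1d(x, kernels):
    """Channel-by-channel convolution: channel c is filtered only by kernels[c].

    x is a list of C channels, each a list of L floats. Kernels have odd length
    and zero padding keeps the output length L (cross-correlation form).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(channel))
        ])
    return out


def pointwise_conv1d(x, weight):
    """Point-by-point (1x1) convolution: mixes channels at each position.

    weight is a C_out x C_in matrix given as a list of rows.
    """
    length = len(x[0])
    return [
        [sum(w * x[c][i] for c, w in enumerate(row)) for i in range(length)]
        for row in weight
    ]


def add(a, b):
    """Element-wise addition of two equally shaped features."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def relu(x):
    return [[max(0.0, v) for v in channel] for channel in x]


def upsample(x, factor):
    """Nearest-neighbour upsampling along the position axis (assumed scheme)."""
    return [[v for v in channel for _ in range(factor)] for channel in x]


def avg_pool(x, factor):
    """Average pooling with window and stride equal to factor (assumed scheme)."""
    return [
        [sum(channel[i:i + factor]) / factor
         for i in range(0, len(channel), factor)]
        for channel in x
    ]


def decode(second, query, params, factor):
    """Attention-free decoding: second feature + query vector -> third feature."""
    # First processing (claim 3): depthwise + pointwise convolution, then residual.
    sixth = pointwise_conv1d(depthwise_conv1d(query, params["dw_q"]), params["pw_q"])
    fourth = add(sixth, query)
    # Second processing (claim 4): upsample, fuse, depthwise convolution, residual.
    seventh = upsample(fourth, factor)
    eighth = add(second, seventh)  # fusion assumed to be element-wise addition
    ninth = depthwise_conv1d(eighth, params["dw_f"])
    fifth = add(ninth, seventh)
    # Third processing (claim 5): feedforward network, residual, pooling.
    hidden = relu(pointwise_conv1d(fifth, params["ffn1"]))
    tenth = pointwise_conv1d(hidden, params["ffn2"])
    eleventh = add(fifth, tenth)
    return avg_pool(eleventh, factor)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    params = {
        "dw_q": [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
        "pw_q": identity,
        "dw_f": [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
        "ffn1": zeros,
        "ffn2": zeros,
    }
    query = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]   # 2 channels x 3 queries
    second = [[0.0] * 6 for _ in range(2)]        # 2 channels x 6 positions
    print(decode(second, query, params, factor=2))
```

With the identity kernels and zero feedforward weights used in the demo, each residual addition doubles the query signal, so the returned third feature equals four times the query vector. In a trained model all of these weights would be learned.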

Description

Target detection method and related equipment thereof

Technical Field

The embodiments of the application relate to artificial intelligence (AI) technology, and in particular to a target detection method and related equipment thereof.

Background

Target detection is one of the most basic computer vision tasks and is critical for many practical applications. In recent years, the Transformer model and its variants have shown outstanding performance in image classification tasks, so they can be migrated to target detection tasks to improve the results of those tasks. Currently, when an object contained in a target image needs to be detected, the target image may be input into a Transformer model. The Transformer model first extracts initial features of the target image. It then processes the initial features based on a self-attention mechanism to obtain intermediate features of the target image. Next, it performs self-attention-based processing and cross-attention-based processing on the intermediate features to obtain final features of the target image. Finally, the Transformer model acquires a detection result of the target image based on the final features, from which the position information of the object and the category of the object can be determined. In the above process, because the Transformer model is built mainly on the attention mechanism, performing target detection with it incurs a large computational cost. If the device carrying the Transformer model has low computing power, target detection is slow, which reduces target detection efficiency and degrades the user experience.

Disclosure of Invention

The embodiments of the application provide a target detection method and related equipment thereof, which can complete target detection tasks quickly, improving the efficiency of those tasks and, in turn, the user experience. A first aspect of the embodiments of the application provides a target detection method, which may be implemented by a target model. The method includes the following. When target detection is required for a target image, the target image and a preset query vector are first acquired, where the content presented by the target image contains one or more objects to be detected. After the target image and the query vector are obtained, they can be input into the target model. The target model first performs feature extraction on the target image to obtain a first feature of the target image. The target model then encodes the first feature to obtain a second feature of the target image. Next, the target model decodes the second feature using the preset query vector to obtain a third feature of the target image. Finally, the target model further processes the third feature to obtain a detection result of the target image. Note that the encoding performed by the target model on the first feature of the target image may comprise at least one convolution but no attention-mechanism-based processing, and likewise the decoding performed by the target model on the second feature of the target image may comprise at least one convolution but no attention-mechanism-based processing.
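
The four-stage flow described above (feature extraction, encoding, decoding with the preset query vector, and detection) can be sketched as a simple composition of stages. This is only an illustration: the `Detection` type, the `run_target_model` function, the bounding-box layout, and the stage callables are hypothetical names introduced here, and the stub stages in the demo stand in for the convolution-based, attention-free modules the text describes.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container for the detection result)."""
    box: Tuple[float, float, float, float]  # position information, assumed (x1, y1, x2, y2)
    category: str                           # category of the object


def run_target_model(
    image: Any,
    query: Any,
    extract: Callable[[Any], Any],
    encode: Callable[[Any], Any],
    decode: Callable[[Any, Any], Any],
    head: Callable[[Any], List[Detection]],
) -> List[Detection]:
    """Run the stages in the order given in the text.

    encode and decode are expected to use convolutions only, with no
    attention-mechanism-based processing; that is not enforced here.
    """
    first = extract(image)          # first feature of the target image
    second = encode(first)          # second feature
    third = decode(second, query)   # third feature, guided by the preset query vector
    return head(third)              # detection result


if __name__ == "__main__":
    # Stub stages, for illustration only; they do no real image processing.
    detections = run_target_model(
        image=[[0.0]],
        query=[[1.0]],
        extract=lambda img: img,
        encode=lambda feat: feat,
        decode=lambda feat, q: (feat, q),
        head=lambda third: [Detection(box=(0.0, 0.0, 1.0, 1.0), category="example")],
    )
    print(detections)
```

Passing the stages in as callables keeps the sketch independent of any particular backbone, encoder, or decoder design.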
The detection result of the target image comprises the position information of at least one object detected by the model, the category of the at least one object, and the confidence of the at least one object. The position information and categories of objects with lower confidence can therefore be removed, while those of objects with higher confidence are retained and used as the position information and categories of the objects finally detected by the model. Target detection for the target image is thus completed. As can be seen from the above method, when target detection is required for a target image, the target image containing the object to be detected can be acquired first and input into the target model. The target model may then perform feature extraction on the target image, resulting in a first feature of the target image. The target model may then encode the first feature of the target image to obtain a second feature of the target image. The target model may then decode the second feature of the target image based on the preset query vector, resulting in a thi