EP-4742189-A1 - OBJECT DETECTION METHOD AND RELATED DEVICE THEREOF
Abstract
Embodiments of this application disclose an object detection method and a related device. During object detection, comprehensive factors are considered, and therefore object detection can be accurately completed. The method in this application includes: First, a target image including a to-be-detected object may be obtained, and the target image is input to a target model. Then the target model may perform feature extraction on the target image to obtain a first feature, and further perform feature extraction on the first feature to obtain a second feature. Then the target model may perform first fusion on the first feature and the second feature to obtain a first fusion result. Then the target model may enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature. Finally, the target model may perform detection by using the enhanced first feature and the enhanced second feature to obtain location information of the object in the target image.
Inventors
- WANG, Chengcheng
- HE, Wei
- NIE, Ying
- LIU, Chuanjian
- WANG, Yunhe
- HAN, Kai
Assignees
- Huawei Technologies Co., Ltd.
Dates
- Publication Date: 20260513
- Application Date: 20240725
Claims (20)
- An object detection method, wherein the method is implemented by a target model, and the method comprises: obtaining a target image, wherein the target image comprises a to-be-detected object; performing feature extraction on the target image to obtain a first feature, and performing feature extraction on the first feature to obtain a second feature; performing first fusion on the first feature and the second feature to obtain a first fusion result; enhancing the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and obtaining location information of the object in the target image based on the enhanced first feature and the enhanced second feature.
- The method according to claim 1, wherein enhancing, based on the first fusion result, the first feature and the second feature to obtain the enhanced first feature and the enhanced second feature comprises: injecting the first fusion result into the first feature to obtain the enhanced first feature, and determining the second feature as the enhanced second feature; or injecting the first fusion result into the first feature to obtain the enhanced first feature, and injecting the first fusion result into the second feature to obtain the enhanced second feature; or determining the first feature as the enhanced first feature, and injecting the first fusion result into the second feature to obtain the enhanced second feature.
- The method according to claim 2, wherein injecting the first fusion result into the first feature to obtain the enhanced first feature comprises: processing the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.
- The method according to claim 2 or 3, wherein injecting the first fusion result into the second feature to obtain the enhanced second feature comprises: processing the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.
- The method according to claim 3, wherein the method further comprises: preprocessing the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing comprises at least one of the following: alignment, splicing, or convolution; and processing the first fusion result and the first feature based on the cross-attention mechanism to obtain the enhanced first feature comprises: processing the first fusion result and the preprocessed first feature based on the cross-attention mechanism to obtain the enhanced first feature.
- The method according to claim 4, wherein the method further comprises: preprocessing the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing comprises at least one of the following: alignment, splicing, or convolution; and processing the first fusion result and the second feature based on the cross-attention mechanism to obtain the enhanced second feature comprises: processing the first fusion result and the preprocessed second feature based on the cross-attention mechanism to obtain the enhanced second feature.
- The method according to any one of claims 1 to 6, wherein the first fusion comprises at least one of the following: alignment, splicing, or convolution.
- The method according to any one of claims 1 to 7, wherein the method further comprises: performing second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; and obtaining the location information of the object in the target image based on the enhanced first feature and the enhanced second feature comprises: enhancing, based on the second fusion result, the enhanced first feature and the enhanced second feature to obtain a first feature with secondary enhancement and a second feature with secondary enhancement; and obtaining the location information of the object in the target image based on the first feature with secondary enhancement and the second feature with secondary enhancement.
- The method according to claim 8, wherein the second fusion comprises at least one of the following: alignment, splicing, self-attention mechanism-based processing, feedforward network-based processing, or addition.
- A model training method, wherein the method comprises: obtaining a training image, wherein the training image comprises a to-be-detected object; processing the training image by using a to-be-trained model to obtain location information of the object in the training image, wherein the to-be-trained model is configured to: perform feature extraction on the training image to obtain a first feature, perform feature extraction on the first feature to obtain a second feature, perform first fusion on the first feature and the second feature to obtain a first fusion result, enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature, and obtain the location information based on the enhanced first feature and the enhanced second feature; and training the to-be-trained model based on the location information and real location information of the object in the training image to obtain a target model.
- The method according to claim 10, wherein the to-be-trained model is configured to: inject the first fusion result into the first feature to obtain the enhanced first feature, and determine the second feature as the enhanced second feature; or inject the first fusion result into the first feature to obtain the enhanced first feature, and inject the first fusion result into the second feature to obtain the enhanced second feature; or determine the first feature as the enhanced first feature, and inject the first fusion result into the second feature to obtain the enhanced second feature.
- The method according to claim 11, wherein the to-be-trained model is configured to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.
- The method according to claim 11 or 12, wherein the to-be-trained model is configured to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.
- The method according to claim 12, wherein the to-be-trained model is further configured to preprocess the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing comprises at least one of the following: alignment, splicing, or convolution; and the to-be-trained model is configured to process the first fusion result and the preprocessed first feature based on the cross-attention mechanism to obtain the enhanced first feature.
- The method according to claim 13, wherein the to-be-trained model is further configured to preprocess the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing comprises at least one of the following: alignment, splicing, or convolution; and the to-be-trained model is configured to process the first fusion result and the preprocessed second feature based on the cross-attention mechanism to obtain the enhanced second feature.
- The method according to any one of claims 10 to 15, wherein the first fusion comprises at least one of the following: alignment, splicing, or convolution.
- The method according to any one of claims 10 to 16, wherein the to-be-trained model is further configured to perform second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; and the to-be-trained model is configured to: enhance, based on the second fusion result, the enhanced first feature and the enhanced second feature to obtain a first feature with secondary enhancement and a second feature with secondary enhancement; and obtain the location information of the object in the training image based on the first feature with secondary enhancement and the second feature with secondary enhancement.
- The method according to claim 17, wherein the second fusion comprises at least one of the following: alignment, splicing, self-attention mechanism-based processing, feedforward network-based processing, or addition.
- An object detection apparatus, wherein the apparatus comprises a target model, and the apparatus comprises: an obtaining module, configured to obtain a target image, wherein the target image comprises a to-be-detected object; an extraction module, configured to perform feature extraction on the target image to obtain a first feature, and perform feature extraction on the first feature to obtain a second feature; a fusion module, configured to perform first fusion on the first feature and the second feature to obtain a first fusion result; an enhancement module, configured to enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and a detection module, configured to obtain location information of the object in the target image based on the enhanced first feature and the enhanced second feature.
- A model training apparatus, wherein the apparatus comprises: an obtaining module, configured to obtain a training image, wherein the training image comprises a to-be-detected object; a processing module, configured to process the training image by using a to-be-trained model to obtain location information of the object in the training image, wherein the to-be-trained model is configured to: perform feature extraction on the training image to obtain a first feature, perform feature extraction on the first feature to obtain a second feature, perform first fusion on the first feature and the second feature to obtain a first fusion result, enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature, and obtain the location information based on the enhanced first feature and the enhanced second feature; and a training module, configured to train the to-be-trained model based on the location information and real location information of the object in the training image to obtain a target model.
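The cross-attention-based injection recited in claims 3 and 4, in which the feature being enhanced supplies the query and the first fusion result supplies the key and value, can be sketched as follows. This is a minimal single-head NumPy illustration, not the claimed implementation: the random matrices standing in for learned projection weights, the token counts, the channel width, and the residual addition are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_inject(query_feat, kv_feat, d_model):
    """Inject kv_feat (e.g. the first fusion result) into query_feat
    (e.g. the first feature): query from the feature being enhanced,
    key/value from the fusion result. Random projections stand in for
    learned weights in this sketch."""
    wq = rng.standard_normal((query_feat.shape[-1], d_model)) / np.sqrt(d_model)
    wk = rng.standard_normal((kv_feat.shape[-1], d_model)) / np.sqrt(d_model)
    wv = rng.standard_normal((kv_feat.shape[-1], d_model)) / np.sqrt(d_model)
    q, k, v = query_feat @ wq, kv_feat @ wk, kv_feat @ wv
    attn = softmax(q @ k.T / np.sqrt(d_model))   # (Nq, Nkv) attention weights
    return query_feat + attn @ v                  # residual 'injection'

first_feature = rng.standard_normal((64, 32))   # 64 spatial tokens, 32 channels
fusion_result = rng.standard_normal((16, 32))   # coarser fusion-result tokens
enhanced = cross_attention_inject(first_feature, fusion_result, 32)
print(enhanced.shape)   # (64, 32)
```

The same call with the roles of the two inputs adjusted would cover the second-feature branch of claim 4; the residual addition here is one simple way to realize "injection" and is only an assumption.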
Description
This application claims priority to Chinese Patent Application No. 202310940169.4, filed with the China National Intellectual Property Administration on July 27, 2023, and entitled "OBJECT DETECTION METHOD AND RELATED DEVICE", which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

Embodiments of this application relate to artificial intelligence (AI) technologies, and in particular, to an object detection method and a related device.

BACKGROUND

As a basic computer vision task, object detection is required in a growing number of scenarios. To meet users' object detection requirements across these application scenarios, the detection task may be completed by a neural network model in the AI field, which provides an object detection result for the user to view and use, thereby improving user experience.

In a related technology, when an object needs to be located in a scene, a target image presenting the scene is first obtained and input to the neural network model. The neural network model performs feature extraction on the target image to obtain features at different levels, fuses the features at the different levels to obtain a feature fusion result, and then performs detection based on the feature fusion result to obtain location information of the object in the target image. This is equivalent to obtaining location information of the object in the scene.

In the foregoing process, the neural network model obtains the location information of the object directly from the feature fusion result, so only a single factor is considered. Consequently, the accuracy of the location information finally output by the model is low, and object detection cannot be accurately completed.

SUMMARY

Embodiments of this application provide an object detection method and a related device.
During object detection, comprehensive factors are considered. Therefore, the finally obtained location information of an object is sufficiently accurate, and object detection can be accurately completed.

A first aspect of embodiments of this application provides an object detection method. The method may be implemented by a target model and includes the following steps. When object detection needs to be performed in a scene, the scene may first be photographed to obtain a target image presenting the scene, where the scene presented by the target image includes a to-be-detected object. After the target image is obtained, it may be input to the target model.

The target model first performs feature extraction on the target image to obtain a first feature, and then performs further feature extraction on the first feature to obtain a second feature. It should be noted that the target model may extract features at a plurality of levels from the target image, and the first feature and the second feature may be features at two adjacent levels among those. For example, the first feature is the feature at the second-to-last level, and the second feature is the feature at the last level.

After obtaining the first feature and the second feature, the target model performs first fusion on the two features to obtain a first fusion result. The target model then enhances the first feature and the second feature by using the first fusion result to obtain an enhanced first feature and an enhanced second feature. Finally, the target model performs detection by using the enhanced first feature and the enhanced second feature to obtain location information of the object in the target image, and outputs the location information. This is equivalent to obtaining the location of the object in the scene.
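As a minimal sketch only, and not the claimed implementation, the extraction and first-fusion steps described above can be illustrated in NumPy as follows. Average pooling stands in for a real backbone stage, random matrices stand in for learned 1x1 convolutions, and the additive injection at the end is one simple assumed form of enhancement; all shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract(x, out_ch):
    """Toy 'feature extraction': stride-2 average pooling plus a random
    1x1 projection, standing in for a real backbone stage."""
    b, c, h, w = x.shape
    pooled = x.reshape(b, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))
    w_proj = rng.standard_normal((out_ch, c)) / np.sqrt(c)
    return np.einsum('oc,bchw->bohw', w_proj, pooled)

def first_fusion(f1, f2):
    """Alignment (upsample f2 to f1's resolution), splicing (channel
    concatenation), then a 1x1 convolution, mirroring the 'first fusion'."""
    f2_up = f2.repeat(2, axis=2).repeat(2, axis=3)        # alignment
    spliced = np.concatenate([f1, f2_up], axis=1)         # splicing
    c = spliced.shape[1]
    w_conv = rng.standard_normal((f1.shape[1], c)) / np.sqrt(c)
    return np.einsum('oc,bchw->bohw', w_conv, spliced)    # 1x1 convolution

image = rng.standard_normal((1, 3, 16, 16))   # target image
f1 = extract(image, 8)                        # first feature,  (1, 8, 8, 8)
f2 = extract(f1, 8)                           # second feature, (1, 8, 4, 4)
fused = first_fusion(f1, f2)                  # first fusion result
enhanced_f1 = f1 + fused                      # simple additive 'injection'
print(f1.shape, f2.shape, fused.shape)
```

A detection head would then predict object locations from the enhanced features; that final step is omitted here.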
It can be learned from the foregoing method that the target model obtains the location information of the object in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, while the first fusion result represents low-dimensional global information of the target image. Therefore, the target model considers comprehensive factors during object detection, and the location information of the object that it finally outputs is sufficiently accurate, so that object detection can be accurately completed.

In a possible implementation, enhancing the first feature and the second feature based on the first fusion result to obtain the enhanced first feature and the enhanced second feature includes: i