CN-121685938-B - Target detection method and device
Abstract
Embodiments of the application disclose a target detection method and device. The method acquires a visible light image and a corresponding infrared image of the same scene and performs feature extraction on each to obtain an initial visible light feature map and an initial infrared feature map. Saliency detection is performed on the visible light image, and three groups of spatial attention weights are generated from the resulting saliency map. The initial visible light and infrared feature maps are each fed into three parallel feature extraction branches to obtain a visible light multi-scale feature group and an infrared multi-scale feature group. Based on the three groups of spatial attention weights, the feature maps output by the branches in each group are fused by element-wise weighting to obtain enhanced visible light features and enhanced infrared features; these are then cross-fused into a cross-fusion map, and target detection is performed on the cross-fusion map. The embodiments thereby improve the accuracy of target detection.
Inventors
- ZENG QINYONG
- YIN XIAOJIE
- LIANG SIPING
Assignees
- Chengdu Haofu Technology Co., Ltd. (成都浩孚科技有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20260209
Claims (9)
- 1. A target detection method, comprising: acquiring a visible light image and a corresponding infrared image of the same scene, and performing feature extraction on each to obtain an initial visible light feature map and an initial infrared feature map; performing saliency detection on the visible light image to generate a corresponding saliency map, and generating three groups of spatial attention weights based on the saliency map, wherein the three groups of spatial attention weights are non-negative at each pixel position and sum to 1; wherein generating the three groups of spatial attention weights based on the saliency map comprises: normalizing the saliency map to obtain a normalized saliency map; processing the normalized saliency map with a first preset function to obtain a first spatial attention weight, wherein the first spatial attention weight is used to enhance the detail expression of high-saliency regions in the image, and the first preset function is a nonlinear mapping function; processing the normalized saliency map with a second preset function to obtain a second spatial attention weight, wherein the second spatial attention weight is used to extract structural information of salient regions in the image, and the second preset function is a linear transformation function; processing the normalized saliency map with a third preset function to obtain a third spatial attention weight, wherein the third spatial attention weight is used to capture the overall contour information of low-saliency regions, and the third preset function is a low-pass filtering function; inputting the initial visible light feature map and the initial infrared feature map into three parallel feature extraction branches to obtain a corresponding visible light multi-scale feature group and an infrared multi-scale feature group, wherein the three feature extraction branches comprise a depthwise separable convolution branch, a dilated convolution branch and a standard convolution branch, and the three feature extraction branches correspond one-to-one with the three groups of spatial attention weights; based on the three groups of spatial attention weights, performing element-wise weighted fusion on the feature maps output by the branches in the visible light multi-scale feature group to obtain enhanced visible light features, and performing element-wise weighted fusion on the feature maps output by the branches in the infrared multi-scale feature group to obtain enhanced infrared features; performing feature cross-fusion processing on the enhanced visible light features and the enhanced infrared features to obtain a cross-fusion map; and performing target detection according to the cross-fusion map to obtain a target detection result.
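The weight-generation and weighted-fusion steps of claim 1 can be sketched as follows. The three concrete preset functions here are illustrative stand-ins, since the claim fixes only their types: squaring as the nonlinear map, identity as the linear transform, and a 3x3 box blur as the low-pass filter. A per-pixel softmax enforces the claimed constraint that the three weights are non-negative and sum to 1 at every pixel.

```python
import numpy as np

def spatial_attention_weights(saliency):
    """Generate three spatial attention weight maps from a saliency map.

    Stand-in preset functions (assumptions, not from the patent):
    squaring (nonlinear), identity (linear), 3x3 box blur (low-pass).
    A per-pixel softmax makes the weights non-negative and sum to 1.
    """
    # Normalize the saliency map to [0, 1]
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    w1 = s ** 2                          # detail of high-saliency regions
    w2 = s                               # structural information
    p = np.pad(s, 1)                     # 3x3 box blur: low-saliency contours
    h, w = s.shape
    w3 = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    stack = np.stack([w1, w2, w3])                # (3, H, W)
    e = np.exp(stack - stack.max(axis=0))         # per-pixel softmax
    return e / e.sum(axis=0)

def weighted_branch_fusion(branch_outputs, weights):
    """Element-wise weighted sum of the three branch feature maps (H, W)."""
    return sum(wt * feat for wt, feat in zip(weights, branch_outputs))
```

In the claimed pipeline the three branch outputs would come from the depthwise separable, dilated, and standard convolution branches; here any three feature maps of matching shape can be fused.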
- 2. The method as recited in claim 1, further comprising: performing feature projection processing on the initial visible light feature map based on the channel dimension of the enhanced visible light features to obtain visible light projection features, and performing feature projection processing on the initial infrared feature map based on the channel dimension of the enhanced infrared features to obtain infrared projection features; performing feature fusion on the visible light projection features and the enhanced visible light features to obtain optimized enhanced visible light features, and performing feature fusion on the infrared projection features and the enhanced infrared features to obtain optimized enhanced infrared features; wherein performing feature cross-fusion processing on the enhanced visible light features and the enhanced infrared features to obtain a cross-fusion map comprises: performing feature cross-fusion processing on the optimized enhanced visible light features and the optimized enhanced infrared features to obtain a cross-fusion map.
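The projection-and-fusion refinement of claim 2 amounts to a channel projection (a 1x1-convolution-style linear map) followed by element-wise addition. A minimal sketch, with a fixed random matrix standing in for the learned 1x1 convolution weights:

```python
import numpy as np

def project_and_fuse(initial, enhanced, proj=None):
    """Project `initial` (C_in, H, W) onto the channel dimension of
    `enhanced` (C_out, H, W), then fuse by element-wise addition.
    `proj` stands in for learned 1x1 conv weights (an assumption); a
    fixed random matrix is used only so the sketch runs."""
    c_in = initial.shape[0]
    c_out = enhanced.shape[0]
    if proj is None:
        proj = np.random.default_rng(0).standard_normal((c_out, c_in)) / np.sqrt(c_in)
    # 1x1 convolution expressed as a matrix product over the channel axis
    projected = np.einsum('oc,chw->ohw', proj, initial)
    return projected + enhanced
```

The same function covers both modalities: call it once with the visible light pair and once with the infrared pair.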
- 3. The method of claim 2, wherein performing feature cross-fusion processing on the optimized enhanced visible light features and the optimized enhanced infrared features to obtain a cross-fusion map comprises: flattening the optimized enhanced visible light features in row-major order (top to bottom, left to right) to obtain a first feature sequence; flattening the optimized enhanced infrared features in the same row-major order to obtain a second feature sequence; splicing the first feature sequence and the second feature sequence position by position along the feature dimension to form a combined feature sequence; and reshaping the combined feature sequence into a two-dimensional tensor to obtain a cross-fusion map.
- 4. The method of claim 3, wherein reshaping the combined feature sequence into a two-dimensional tensor to obtain a cross-fusion map comprises: inputting the combined feature sequence into a selective state space model for long-range dependency modeling to obtain selected features; performing a depthwise convolution operation on the combined feature sequence to extract local spatial context information and obtain local features; and adding the selected features and the local features, then reshaping the sum into a two-dimensional tensor to obtain a cross-fusion map.
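The two parallel paths of claim 4 can be sketched with greatly simplified stand-ins: a diagonal state-space recurrence h_t = a·h_{t-1} + x_t in place of the full selective state space model, and a uniform-kernel per-channel 1-D convolution in place of the learned depthwise convolution. Both simplifications are assumptions made so the sketch stays self-contained.

```python
import numpy as np

def selective_scan(seq, a=0.9):
    """Diagonal state-space recurrence h_t = a*h_{t-1} + x_t over a
    (C, L) sequence; a simplified stand-in for the selective state
    space model used for long-range dependency modeling."""
    out = np.empty_like(seq)
    h = np.zeros(seq.shape[0])
    for t in range(seq.shape[1]):
        h = a * h + seq[:, t]
        out[:, t] = h
    return out

def depthwise_conv1d(seq, k=3):
    """Per-channel 1-D convolution with a uniform kernel, extracting
    local context along the sequence (zero padding keeps the length)."""
    pad = k // 2
    p = np.pad(seq, ((0, 0), (pad, pad)))
    return sum(p[:, i:i + seq.shape[1]] for i in range(k)) / k

def fuse_and_reshape(combined_seq, h, w):
    """Add the selected (scanned) and local (convolved) features, then
    reshape the sum into a (2C, H, W) cross-fusion map."""
    fused = selective_scan(combined_seq) + depthwise_conv1d(combined_seq)
    return fused.reshape(combined_seq.shape[0], h, w)
```

The causal scan accumulates context from arbitrarily far back in the sequence, while the convolution sees only a fixed local window; adding the two mixes long-range and local information, which is the point of the claimed structure.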
- 5. The method of claim 1, wherein performing target detection according to the cross-fusion map to obtain a target detection result comprises: separating, from the cross-fusion map, visible light sub-features related to the visible light state and infrared sub-features related to the infrared state; computing global average pooling over the visible light sub-features and the infrared sub-features respectively to obtain a visible light global feature vector and an infrared global feature vector; adding the visible light global feature vector and the infrared global feature vector element-wise and performing normalization to generate a gating map, wherein the value at each pixel position in the gating map represents the fusion weight of visible light modality information relative to infrared modality information at that position; performing an element-wise weighted fusion operation on the visible light sub-features and the infrared sub-features based on the gating map to obtain a composite feature map; and performing target detection according to the composite feature map to obtain a target detection result.
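The gating step of claim 5 can be sketched as below. Two assumptions are made where the claim is unspecific: the normalization is taken to be a sigmoid, and the gate derived from the pooled vectors is per-channel, broadcast back over all pixel positions to form the gating map.

```python
import numpy as np

def gated_fusion(vis_sub, ir_sub):
    """Gate-weighted fusion of two (C, H, W) sub-features.

    Global average pooling yields one vector per modality; their
    element-wise sum, squashed by a sigmoid (an assumed normalization),
    gives gates g in (0, 1). The composite map weights the visible
    sub-feature by g and the infrared sub-feature by 1 - g.
    """
    gv = vis_sub.mean(axis=(1, 2))           # visible light global feature vector (C,)
    gi = ir_sub.mean(axis=(1, 2))            # infrared global feature vector (C,)
    g = 1.0 / (1.0 + np.exp(-(gv + gi)))     # sigmoid gate per channel
    g = g[:, None, None]                     # broadcast over H, W as the gating map
    return g * vis_sub + (1.0 - g) * ir_sub
```

Because g lies strictly in (0, 1), every output value is a convex combination of the two modalities at that position.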
- 6. The method of claim 1, wherein performing saliency detection on the visible light image to generate a corresponding saliency map comprises: performing image analysis processing on the visible light image to determine the saliency degree of each pixel position in the visible light image; and generating a saliency map based on the saliency degrees, wherein the pixel values of the saliency map characterize the likelihood that the corresponding pixel positions belong to a target-relevant region.
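Claim 6 fixes only that each pixel gets a saliency score; any saliency detector fits. A minimal stand-in (an assumption, not the patent's detector) scores each pixel by its contrast to the global mean intensity:

```python
import numpy as np

def saliency_map(gray):
    """Minimal saliency sketch: absolute contrast of each pixel to the
    global mean intensity, min-max normalized to [0, 1] so that higher
    values read as higher likelihood of belonging to a target-relevant
    region."""
    s = np.abs(gray - gray.mean())
    return (s - s.min()) / (s.max() - s.min() + 1e-8)
```

On a dark image with a bright target region, the target pixels deviate most from the mean and therefore score near 1 while the background scores near 0.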
- 7. The method of claim 1, wherein performing target detection according to the cross-fusion map to obtain a target detection result comprises: performing multi-scale context feature extraction on the cross-fusion map to obtain a context-aware feature map; performing feature fusion on the context-aware feature map and a high-level feature map up-sampled by bilinear interpolation to generate a dynamic up-sampling feature map; performing feature recombination on the dynamic up-sampling feature map to obtain a target feature map; and performing target detection processing on the target feature map to obtain a target detection result.
- 8. The method of claim 7, wherein performing multi-scale context feature extraction on the cross-fusion map to obtain a context-aware feature map comprises: performing feature extraction on the cross-fusion map with three groups of dilated convolution layers having different dilation rates to obtain three context feature maps at different scales; and splicing the three context feature maps along the channel dimension to form a context-aware feature map.
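The multi-scale context extraction of claim 8 can be sketched with a toy dilated convolution. The three-tap uniform kernel, the 1-D (width-axis) application, and the dilation rates (1, 2, 4) are all assumptions standing in for the patent's learned 2-D dilated convolution layers; only the structure (three rates, channel-wise concatenation) follows the claim.

```python
import numpy as np

def dilated_conv(x, rate):
    """Three-tap dilated (atrous) convolution with a uniform kernel,
    applied along the width axis of a (C, H, W) tensor with zero
    padding, so the spatial size is preserved."""
    p = np.pad(x, ((0, 0), (0, 0), (rate, rate)))
    w = x.shape[-1]
    return (p[..., :w] + p[..., rate:rate + w] + p[..., 2 * rate:2 * rate + w]) / 3.0

def context_aware_features(x, rates=(1, 2, 4)):
    """Extract context features at three dilation rates and splice them
    along the channel axis: (C, H, W) -> (3C, H, W)."""
    return np.concatenate([dilated_conv(x, r) for r in rates], axis=0)
```

Larger dilation rates widen the receptive field without adding parameters, which is why stacking several rates captures context at multiple scales.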
- 9. A target detection device, comprising: an image processing unit for acquiring a visible light image and a corresponding infrared image of the same scene, and performing feature extraction on each to obtain an initial visible light feature map and an initial infrared feature map; a weight generation unit for performing saliency detection on the visible light image, generating a corresponding saliency map, and generating three groups of spatial attention weights based on the saliency map, wherein the three groups of spatial attention weights are non-negative at each pixel position and sum to 1; wherein generating the three groups of spatial attention weights based on the saliency map comprises: normalizing the saliency map to obtain a normalized saliency map; processing the normalized saliency map with a first preset function to obtain a first spatial attention weight, wherein the first spatial attention weight is used to enhance the detail expression of high-saliency regions in the image, and the first preset function is a nonlinear mapping function; processing the normalized saliency map with a second preset function to obtain a second spatial attention weight, wherein the second spatial attention weight is used to extract structural information of salient regions in the image, and the second preset function is a linear transformation function; processing the normalized saliency map with a third preset function to obtain a third spatial attention weight, wherein the third spatial attention weight is used to capture the overall contour information of low-saliency regions, and the third preset function is a low-pass filtering function; a feature extraction unit for inputting the initial visible light feature map and the initial infrared feature map into three parallel feature extraction branches to obtain a corresponding visible light multi-scale feature group and an infrared multi-scale feature group, wherein the three feature extraction branches comprise a depthwise separable convolution branch, a dilated convolution branch and a standard convolution branch, and correspond one-to-one with the three groups of spatial attention weights; a weighted fusion unit for performing, based on the three groups of spatial attention weights, element-wise weighted fusion on the feature maps output by the branches in the visible light multi-scale feature group to obtain enhanced visible light features, and element-wise weighted fusion on the feature maps output by the branches in the infrared multi-scale feature group to obtain enhanced infrared features; a cross-fusion unit for performing feature cross-fusion processing on the enhanced visible light features and the enhanced infrared features to obtain a cross-fusion map; and a target detection unit for performing target detection according to the cross-fusion map to obtain a target detection result.
Description
Target detection method and device

Technical Field

The application relates to the technical field of image processing, and in particular to a target detection method and device.

Background

Object detection is a fundamental task in computer vision that aims to locate and identify one or more objects of interest in a given image. Specifically, object detection must not only determine whether objects of specific classes (such as people, cars, or animals) are present in the image, but also precisely mark the position of each object, usually in the form of a bounding box, and assign each bounding box a class label and a confidence score. Target detection is widely applied in intelligent monitoring, automatic driving, unmanned aerial vehicle inspection, remote sensing analysis, robot perception and other fields, and is one of the key technologies for realizing environment perception and intelligent decision-making. However, existing methods still have obvious limitations in multi-modal target detection. On the one hand, if independent detection models are used for visible light images and infrared images respectively, resource consumption for model storage and operation increases significantly. On the other hand, if both modalities are input into the same model and fused by simple channel concatenation or element-wise addition, the contributions of different image regions to the detection task cannot be distinguished, so useful cross-modal information is not effectively integrated and redundant or conflicting features are treated equally.
In addition, existing single-model architectures generally aggregate features in a fixed manner, lack the ability to dynamically adjust the multi-branch feature fusion strategy according to image content, and struggle to fully exploit the complementary strengths of features at different scales for expressing local detail and structure, so detection accuracy is limited.

Disclosure of Invention

Embodiments of the application provide a target detection method and device that can improve the accuracy of target detection. An embodiment of the application provides a target detection method comprising the following steps: acquiring a visible light image and a corresponding infrared image of the same scene, and performing feature extraction on each to obtain an initial visible light feature map and an initial infrared feature map; performing saliency detection on the visible light image to generate a corresponding saliency map, and generating three groups of spatial attention weights based on the saliency map, wherein the three groups of spatial attention weights are non-negative at each pixel position and sum to 1; inputting the initial visible light feature map and the initial infrared feature map into three parallel feature extraction branches to obtain a corresponding visible light multi-scale feature group and an infrared multi-scale feature group, wherein the three feature extraction branches comprise a depthwise separable convolution branch, a dilated convolution branch and a standard convolution branch, and correspond one-to-one with the three groups of spatial attention weights; based on the three groups of spatial attention weights, performing element-wise weighted fusion on the feature maps output by the branches in the visible light multi-scale feature group to obtain enhanced visible light features, and performing element-wise weighted fusion on the feature maps output by the branches in the infrared multi-scale feature group to obtain enhanced infrared features; performing feature cross-fusion processing on the enhanced visible light features and the enhanced infrared features to obtain a cross-fusion map; and performing target detection according to the cross-fusion map to obtain a target detection result. An embodiment of the application also provides a target detection device, comprising: an image processing unit for acquiring a visible light image and a corresponding infrared image of the same scene, and performing feature extraction on each to obtain an initial visible light feature map and an initial infrared feature map; and a weight generation unit for performing saliency detection on the visible light image, generating a corresponding saliency map, and generating three groups of spatial attention weights based on the saliency map, wherein the three groups of spatial attention weights are non-negative at each pixel position and sum to 1