CN-121982463-A - Multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion
Abstract
The invention provides a multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion, belonging to the technical fields of computer vision and multimodal information processing. The method comprises: acquiring a multimodal image-pair dataset formed of infrared and visible-light images; preprocessing the dataset and dividing it into a training set, a validation set, and a test set; constructing an end-to-end infrared/visible-light target detection model; inputting the preprocessed training set into the model and training the model parameters end to end by minimizing the target bounding-box regression loss and the target classification loss; and evaluating the training process on the validation set to obtain the optimal model parameters. The method is suitable for infrared and visible-light multimodal target detection in complex environments.
Inventors
- CHEN HAIHUA
- ZHOU XIAOLONG
- CHAN SIXIAN
- ZHOU WENHUI
Assignees
- 杭州电子科技大学 (Hangzhou Dianzi University)
- 衢州学院 (Quzhou University)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-15
Claims (7)
- 1. A multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion, characterized by comprising the following steps. Step 1: acquire a multimodal image-pair dataset formed of infrared and visible-light images captured in different scenes, preprocess it, construct a sample set containing target category labels and bounding-box labels, and divide the sample set into a training set, a validation set, and a test set. Step 2: build an end-to-end infrared/visible-light target detection model comprising a feature extraction backbone network, a hierarchical feature enhancement module, a cross-modal collaborative fusion module, and a detection decoder; the feature extraction backbone network extracts multi-level, multi-scale image features from the infrared image and the visible-light image respectively; the hierarchical feature enhancement module adaptively enhances the infrared and visible-light features at different levels and outputs enhanced image features; the cross-modal collaborative fusion module performs cross-modal and cross-scale collaborative fusion on the enhanced multi-scale infrared and visible-light features to generate a fused feature representation; and the detection decoder performs position regression and category prediction on targets based on the fused features and outputs the target detection results. Step 3: input the infrared images, visible-light images, and corresponding labels of the training set into the target detection model, train the model parameters end to end by minimizing a joint loss function of the target bounding-box regression loss and the target classification loss, and evaluate the training process on the validation set to obtain the optimal model parameters.
- 2. The multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion according to claim 1, wherein step 1 comprises the following sub-steps. Step 1.1: perform spatial registration on the infrared images and visible-light images in the dataset so that images of different modalities are aligned in the same pixel coordinate system. Step 1.2: apply a uniform scale adjustment to the registered infrared and visible-light images so that the image size meets the input-resolution requirement of the target detection network. Step 1.3: apply data augmentation to the multimodal image pairs, the augmentation comprising at least one of random flipping, random scaling, and random cropping applied with 50% probability.
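The augmentation in step 1.3 must be applied identically to both modalities, or the registration from step 1.1 is destroyed. A minimal sketch of that idea, using a paired random horizontal flip on plain 2-D lists standing in for registered image tensors (the function name and list representation are illustrative, not from the patent):

```python
import random

def paired_random_hflip(ir, vis, p=0.5, rng=None):
    """Flip both modalities together so spatial alignment is preserved.

    ir, vis: 2-D lists (H x W) standing in for a registered image pair.
    The single random draw is shared, so either both images flip or neither does.
    """
    rng = rng or random
    if rng.random() < p:
        ir = [row[::-1] for row in ir]
        vis = [row[::-1] for row in vis]
    return ir, vis

ir = [[1, 2, 3], [4, 5, 6]]
vis = [[10, 20, 30], [40, 50, 60]]
# With p=1.0 the flip always fires; both images mirror together.
ir2, vis2 = paired_random_hflip(ir, vis, p=1.0, rng=random.Random(0))
```

Random scaling and cropping would follow the same pattern: draw the transform parameters once and apply them to both images.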
- 3. The multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion according to claim 1, wherein the feature extraction backbone network in step 2 adopts a pre-trained ResNet image encoder and comprises an infrared feature extraction branch and a visible-light feature extraction branch; the two branches have identical network structures and parameter configurations, and extract multi-scale features from the original images through progressive downsampling, providing the basic feature input for the subsequent hierarchical feature enhancement and cross-modal collaborative fusion.
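The "progressive downsampling" in claim 3 means each backbone stage halves the spatial resolution, yielding a feature pyramid. A toy sketch of that behaviour with 2x average pooling on 2-D lists (a real ResNet stage would also apply learned convolutions; the names here are illustrative):

```python
def downsample2x(feat):
    """Average-pool a 2-D feature map by a factor of 2 in each dimension."""
    h, w = len(feat), len(feat[0])
    return [[(feat[i][j] + feat[i][j + 1] + feat[i + 1][j] + feat[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def pyramid(image, levels=3):
    """Progressive downsampling: returns a list of increasingly coarse feature maps,
    analogous to the multi-scale outputs each backbone branch feeds forward."""
    feats, cur = [], image
    for _ in range(levels):
        cur = downsample2x(cur)
        feats.append(cur)
    return feats
```

Since the infrared and visible branches share structure and configuration, the same `pyramid` would be applied independently to each modality.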
- 4. The multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion according to claim 1, wherein the hierarchical feature enhancement module in step 2 is deployed between the feature extraction backbone network and the cross-modal collaborative fusion module and performs hierarchical enhancement on the multi-scale features extracted from the original images, with different feature enhancement units set for different feature levels. For low-level features, the module adopts an enhancement unit comprising a cascade structure, a cross-stage partial connection structure, and a RepBlock, and outputs low-level features with enhanced semantic consistency; for middle-level features, the module adopts an enhancement unit comprising a cascade structure, a cross-stage partial connection structure, and a C3k block, and outputs middle-level features with enhanced semantic consistency; for high-level features, the module adopts a multi-head self-attention mechanism and outputs high-level features with enhanced semantic consistency.
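The high-level enhancement unit in claim 4 is built on multi-head self-attention. A single-head, pure-Python sketch of the underlying scaled dot-product attention, with queries, keys, and values all taken from the same token list (the patent's actual head count, dimensions, and projections are not specified, so none are assumed here):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens, d):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    tokens: list of d-dimensional feature vectors (flattened high-level features).
    Each output vector is a weighted mix of all tokens, letting every position
    attend to global context -- the property the enhancement unit exploits.
    """
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) * scale for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return out
```

A multi-head version would run several such attentions on learned linear projections of the tokens and concatenate the results.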
- 5. The multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion according to claim 1, wherein the cross-modal collaborative fusion module in step 2 performs cross-modal and cross-scale collaborative fusion on the multi-level image features processed by the hierarchical feature enhancement module, the collaborative fusion comprising a top-down information propagation path and a bottom-up feature enhancement path. A first-type fusion unit is arranged in the top-down information propagation path; it upsamples the high-level visible-light features, fuses them with the infrared and visible-light features of the same level, and outputs preliminary fused features; the first-type fusion unit comprises a cascade structure, a cross-stage partial connection structure, and a RepBlock. A second-type fusion unit is arranged in the bottom-up feature enhancement path; it downsamples the lower-level fused features, fuses them with the same-level visible-light features, and outputs the final fused features; the second-type fusion unit comprises a cascade structure, a cross-stage partial connection structure, and a C3k block.
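The top-down step of claim 5 can be sketched in miniature: upsample a coarse visible-light map to the resolution of the next level, then combine it with the same-level infrared and visible-light features. Here elementwise averaging stands in for the patent's concatenation plus RepBlock processing, so this is an illustrative simplification, not the claimed unit:

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse_top_down(high_vis, mid_ir, mid_vis):
    """First-type fusion (simplified): upsampled high-level visible features
    merged with same-level IR and visible features by elementwise averaging."""
    up = upsample2x(high_vis)
    return [[(up[i][j] + mid_ir[i][j] + mid_vis[i][j]) / 3.0
             for j in range(len(up[0]))]
            for i in range(len(up))]
```

The bottom-up path mirrors this with 2x downsampling of the fused features before merging with the same-level visible-light features.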
- 6. The multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion according to claim 1, wherein the detection decoder in step 2 comprises a query selection module and an attention-based detection decoding module. The query selection module generates initial target query vectors from the multi-scale fused features output by the cross-modal collaborative fusion module, screens them according to the target confidence or category prediction scores corresponding to the initial query vectors, and selects the query vectors whose scores meet a preset condition as the effective detection queries. The detection decoding module takes the effective detection queries as input, decodes targets based on the attention-driven interaction between the queries and the fused features, progressively updates the query features through multi-layer decoding, and outputs the category prediction and corresponding bounding-box position of each target, thereby realizing an end-to-end target detection decoding process.
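The screening step in claim 6 is essentially a top-k selection over confidence scores. A minimal sketch (the preset condition here is assumed to be "keep the k highest-scoring queries"; the patent does not fix the exact rule):

```python
def select_queries(queries, scores, k):
    """Keep the k initial queries whose confidence scores are highest,
    returned in their original order so downstream indexing is stable."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])
    return [queries[i] for i in keep]

# Four candidate query vectors (placeholders) with their confidence scores;
# the two highest-scoring queries survive as effective detection queries.
effective = select_queries(['q0', 'q1', 'q2', 'q3'], [0.1, 0.9, 0.4, 0.7], k=2)
```

The surviving queries would then be refined layer by layer through cross-attention against the fused features.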
- 7. The multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion according to claim 1, wherein the model training process in step 3 comprises end-to-end optimization of the target detection model under a multi-task joint loss function, the joint loss function comprising at least a target position regression loss and a target category classification loss. The target position regression loss constrains the difference between the predicted target bounding-box positions and the ground-truth bounding-box positions to improve localization accuracy, and the target category classification loss constrains the consistency between the predicted category probability distribution and the ground-truth category labels to improve recognition accuracy.
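Claim 7's joint loss can be illustrated with common choices for its two terms: an IoU-based box regression loss and a cross-entropy classification loss. The patent does not name the specific loss functions or weights, so these particular forms and the `w_reg`/`w_cls` weights are assumptions for the sake of the sketch:

```python
import math

def iou_loss(pred, gt):
    """1 - IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return 1.0 - inter / union if union > 0 else 1.0

def cls_loss(probs, label):
    """Cross-entropy of the predicted class distribution vs. the true label."""
    return -math.log(max(probs[label], 1e-12))

def joint_loss(pred_box, gt_box, probs, label, w_reg=1.0, w_cls=1.0):
    """Weighted sum of localization and classification terms (per matched target)."""
    return w_reg * iou_loss(pred_box, gt_box) + w_cls * cls_loss(probs, label)
```

A perfect prediction (exact box, probability 1 on the true class) yields zero joint loss; any localization or classification error increases it, which is what drives the end-to-end optimization.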
Description
Multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion
Technical Field
The invention relates to the technical fields of computer vision and multimodal information processing, in particular to an infrared and visible-light multimodal target detection method, and especially to a multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion, which can be used for target perception and recognition in complex environments.
Background
Target detection is an important research direction in computer vision. Most existing target detection methods rely on visible-light images, which are easily affected by illumination changes and environmental interference, causing target features to degrade. Infrared imaging, by contrast, is based on the thermal radiation of the target and can stably capture target contour information. Fusing infrared and visible-light images therefore improves the robustness of target detection in complex scenes. Existing infrared/visible-light target detection methods generally adopt a feature-level fusion strategy. In complex environments, however, environmental interference easily causes the semantic representations of features from different modalities to diverge, introducing invalid or noisy information into the fused features and degrading detection accuracy; moreover, existing methods describe the cross-modal semantic dependencies and structural complementarity insufficiently, which limits further improvement of multimodal target detection performance.
Therefore, a target detection method is needed that enhances feature semantic consistency across multiple scale levels and effectively models the cross-modal cooperative relationship between infrared and visible-light features, so as to improve the accuracy and robustness of infrared/visible-light target detection in complex scenes.
Disclosure of Invention
The invention provides a multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion, aiming to overcome the defects of the prior art in complex environments, namely that multimodal features are easily interfered with, semantically inconsistent, and of limited fusion effectiveness. The method introduces a hierarchical feature enhancement mechanism before feature fusion, applying targeted enhancement and reconstruction to the infrared and visible-light features at different scale levels, thereby suppressing environmental noise interference and improving the semantic consistency of the multimodal features. At the same time, a cross-modal collaborative fusion mechanism performs cross-modal and cross-scale collaborative modeling on the enhanced multi-scale features, fully mining the semantic dependencies and structural complementarity between infrared and visible-light features to obtain a more consistent and more discriminative fused feature representation, providing a reliable feature basis for subsequent target detection.
In order to achieve the above purpose, the invention provides a multimodal target detection method based on hierarchical feature enhancement and cross-modal collaborative fusion, comprising the following steps. Step 1: acquire a multimodal image-pair dataset formed of infrared and visible-light images captured in different scenes, preprocess it, construct a sample set containing target category labels and bounding-box labels, and divide the sample set into a training set, a validation set, and a test set. Step 2: build an end-to-end infrared/visible-light target detection model comprising a feature extraction backbone network, a hierarchical feature enhancement module, a cross-modal collaborative fusion module, and a detection decoder; the feature extraction backbone network extracts multi-level, multi-scale image features from the infrared image and the visible-light image respectively; the hierarchical feature enhancement module adaptively enhances the infrared and visible-light features at different levels and outputs enhanced image features; the cross-modal collaborative fusion module performs cross-modal and cross-scale collaborative fusion on the enhanced multi-scale infrared and visible-light features to generate a fused feature representation; and the detection decoder is used for carrying out position