CN-121639491-B - Infrared and visible light image fusion method based on text semantic consistency guidance
Abstract
The invention provides an infrared and visible light image fusion method based on text semantic consistency guidance, relating to the technical field of multi-modal image fusion. The method comprises the following steps: fine-grained text semantic descriptions are generated for the infrared and visible light images respectively and mapped into a unified embedding space; the text semantics are bidirectionally compensated and enhanced through a cross-modal attention mechanism to construct a unified text semantic prior; a structure-intensity decoupled dual-branch encoder extracts structural and textural features from the visible light image and intensity-salient features from the infrared image; under the guidance of the text semantic prior, the two modalities' visual features are aligned in a shared semantic space through an explicit semantic consistency constraint and an implicit semantic distribution consistency constraint; finally, the text semantic prior serves as a global modulation signal, and the aligned features are adaptively weighted, fused, and decoded to generate the fusion image. The method addresses the insufficient semantic modeling and poor consistency of fusion results in existing methods.
Inventors
- LIN YUYANG
- XIAO JINGJING
- Qi Zhencheng
- WANG ZHAORUI
- CHEN XILIN
Assignees
- Xiamen University of Technology (厦门理工学院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-02-04
Claims (8)
- 1. An infrared and visible light image fusion method based on text semantic consistency guidance, characterized by comprising the following steps: Obtaining a visible light image and an infrared image, performing fine-grained text semantic generation on each to obtain a text description set, mapping the text description set into a unified text embedding space, and generating multi-modal text semantic features; Modeling the cross-modal semantic relations of the multi-modal text semantic features, wherein the semantic interaction between the visible light text and the infrared text is modeled so that the two text semantics mutually compensate and enhance each other, and a unified text semantic prior is constructed, specifically: From the visible light text semantic features T_v and the infrared text semantic features T_i, attention weights in the infrared direction and the visible direction are computed as: A_{v→i} = softmax((T_v W_Q^v)(T_i W_K^i)^T / √d)(T_i W_V^i), A_{i→v} = softmax((T_i W_Q^i)(T_v W_K^v)^T / √d)(T_v W_V^v), wherein A_{v→i} is the infrared semantic attention-weighted representation aggregated by the visible light text, softmax is the exponential normalization function, W_Q^v, W_K^i, W_V^i, W_Q^i, W_K^v, W_V^v are all learnable linear mapping matrices, d is the embedding dimension, √d is the scaling factor, and A_{i→v} is the visible semantic attention-weighted representation aggregated by the infrared text; From the representations A_{v→i} and A_{i→v}, semantic consistency compensation signals of the visible and infrared modalities are generated as: ΔT_v = α·A_{v→i}, ΔT_i = β·A_{i→v}, wherein α and β are both learnable linear mapping coefficients, ΔT_v is the visible light compensation signal, and ΔT_i is the infrared compensation signal; The compensation signals ΔT_v and ΔT_i are added to the visible light text semantic features T_v and the infrared text semantic features T_i respectively, yielding the semantically enhanced visible light text semantics T̂_v = T_v + ΔT_v and the semantically enhanced infrared text semantics T̂_i = T_i + ΔT_i; The semantically enhanced visible light text semantics T̂_v and the semantically enhanced infrared text semantics T̂_i are fused to obtain the unified text semantic prior T_u = φ([T̂_v; T̂_i]), wherein φ is a lightweight mapping function; Based on a structure-intensity decoupled dual-branch coding strategy, visual feature coding is performed on the visible light image and the infrared image respectively, wherein visible light visual features dominated by structure and texture information are extracted from the visible light image, and infrared visual features dominated by intensity distribution and thermal target response are extracted from the infrared image; Under the guidance of the unified text semantic prior, bimodal semantic alignment is performed on the visible light visual features and the infrared visual features, wherein the two sets of features gradually converge in a unified semantic space through an explicit semantic consistency constraint and an implicit semantic distribution consistency constraint; Semantic modulation fusion is then performed on the aligned visible light visual features, the aligned infrared visual features, and the unified text semantic prior to generate fusion features, and the fusion features are input into a preset decoder to obtain the final fusion image.
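The cross-modal compensation step of claim 1 can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the six mapping matrices are random stand-ins for learnable parameters, the compensation coefficients α and β are fixed scalars, and `tanh` over a concatenation stands in for the unspecified lightweight mapping function φ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (the exponential normalization in the claim).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_tokens, context_tokens, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: query tokens aggregate context tokens."""
    Q = query_tokens @ Wq
    K = context_tokens @ Wk
    V = context_tokens @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # attention weights, rows sum to 1
    return A @ V

rng = np.random.default_rng(0)
n, d = 4, 8                                 # 4 text tokens, embedding dimension 8
Tv = rng.standard_normal((n, d))            # visible-light text semantic features
Ti = rng.standard_normal((n, d))            # infrared text semantic features
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(6)]  # six "learnable" maps

Av = cross_attend(Tv, Ti, *W[:3])           # visible queries aggregate infrared semantics
Ai = cross_attend(Ti, Tv, *W[3:])           # infrared queries aggregate visible semantics

alpha, beta = 0.5, 0.5                      # compensation coefficients (fixed here)
Tv_hat = Tv + alpha * Av                    # semantically enhanced visible text
Ti_hat = Ti + beta * Ai                     # semantically enhanced infrared text
T_prior = np.tanh(np.concatenate([Tv_hat, Ti_hat], axis=-1))  # unified prior (stand-in for φ)
```

In practice the six matrices and the coefficients would be trained jointly with the rest of the network; the sketch only shows the data flow.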
- 2. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 1, wherein obtaining a visible light image and an infrared image, performing fine-grained text semantic generation on them to obtain a text description set, and mapping the text description set into a unified text embedding space to generate multi-modal text semantic features specifically comprises: obtaining a visible light image and an infrared image as the multi-modal input to be fused, wherein the visible light image provides detail information, including scene structures, textures, and edges, and the infrared image provides target thermal radiation intensity distribution and salient target information; inputting the visible light image and the infrared image into a large language model to obtain a fine-grained visible light text description D_v and a fine-grained infrared text description D_i, and constructing a text description set from D_v and D_i; mapping the text description set into a preset unified text embedding space with a BLIP visual-language encoder to generate the visible light text semantic features T_v = E(D_v) and the infrared text semantic features T_i = E(D_i), and obtaining the multi-modal text semantic features from T_v and T_i, wherein E is the mapping function implemented by the encoder.
- 3. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 2, wherein performing visual feature coding on the visible light image and the infrared image respectively based on the structure-intensity decoupled dual-branch coding strategy, extracting visible light visual features dominated by structure and texture information from the visible light image and infrared visual features dominated by intensity distribution and thermal target response from the infrared image, specifically comprises: the structure-intensity decoupled dual-branch coding strategy adopts a visible light branch and an infrared branch, wherein the visible light branch focuses on extracting scene structure, edge, and texture information through multi-scale convolutions and a spatial attention mechanism, and the infrared branch focuses on extracting thermal target intensity distribution and salient region information; the visible light image is input into the visible light branch to extract texture and structural features at different scales, wherein the visible light branch consists of three convolution layers with different kernel sizes, yielding the output features F_v^1, F_v^2, and F_v^3 of the three layers; the output features F_v^1, F_v^2, and F_v^3 of the three convolution layers are summed and input into a preset spatial attention module to obtain the visible light visual features F_v; the infrared image is input into the infrared branch to extract its intensity distribution and thermal response target features and obtain the infrared visual features F_i, wherein the infrared branch consists of three identical infrared feature extraction convolution modules and a channel attention module, each infrared feature extraction convolution module comprising one convolution layer of a fixed kernel size and a ReLU function.
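The visible-light branch of claim 3 (multi-scale features summed, then spatially gated) can be sketched in NumPy. This is an illustrative stand-in only: the three multi-scale convolution outputs are random arrays, and the learned convolution over the stacked channel statistics is replaced by a simple sum before the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(features):
    """Gate a (C, H, W) feature map with a per-pixel weight built from
    channel-wise mean and max statistics (a common spatial-attention form)."""
    avg = features.mean(axis=0)           # (H, W) channel-wise mean
    mx = features.max(axis=0)             # (H, W) channel-wise max
    gate = sigmoid(avg + mx)              # stand-in for a learned conv over [avg; max]
    return features * gate[None, :, :]    # broadcast the gate over channels

rng = np.random.default_rng(1)
# Stand-ins for the outputs F_v^1, F_v^2, F_v^3 of the three multi-scale conv layers
f1, f2, f3 = (rng.standard_normal((16, 8, 8)) for _ in range(3))
Fv = spatial_attention(f1 + f2 + f3)      # element-wise sum, then spatial attention
```

Since the gate lies in (0, 1), the module can only attenuate each spatial position, never amplify it.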
- 4. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 3, wherein the bimodal semantic alignment comprises two mechanisms, explicit semantic alignment and implicit semantic alignment: the explicit semantic alignment constrains, in a shared semantic space, the similarity between each modality's visual features and its corresponding text semantic features, enforcing consistency between visual features and the corresponding text semantics; the implicit semantic alignment constrains, in the shared semantic space, the consistency of the semantic distributions of the visible light and infrared visual features under text guidance, enforcing consistency of the semantic distributions of the different modalities' visual features.
- 5. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 4, wherein performing the bimodal semantic alignment on the visible light and infrared visual features under the guidance of the unified text semantic prior, with the two sets of features gradually converging in the unified semantic space through the explicit and implicit consistency constraints, specifically comprises: mapping the visible light visual features F_v and the infrared visual features F_i into a preset shared semantic space via Z_v = W_v F_v and Z_i = W_i F_i, wherein W_v and W_i are both learnable linear transformation coefficients, Z_v is the visible light visual feature mapped into the shared semantic space, and Z_i is the infrared visual feature mapped into the shared semantic space; in the explicit semantic alignment, an explicit loss function L_exp is computed to constrain the similarity between the visual features and the corresponding text semantic features, specifically: from Z_v, Z_i, the visible light text semantic features T_v, and the infrared text semantic features T_i, the similarities are computed as s_v = cos(Z_v, T_v) and s_i = cos(Z_i, T_i), wherein s_v is the similarity between the visible light visual features and the text semantic features, s_i is the similarity between the infrared visual features and the text semantic features, and cos(·,·) is the cosine similarity function; based on s_v and s_i, the explicit semantic consistency constraint is constructed as L_exp = (1 − s_v) + (1 − s_i).
- 6. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 5, further comprising, in the implicit semantic alignment, computing an implicit loss function L_imp to enforce the consistency of the semantic distributions of the visible light and infrared visual features under text guidance, specifically: computing the similarities between the visible light visual features Z_v and infrared visual features Z_i mapped into the shared semantic space and the visible light and infrared text semantic features T_v and T_i, and constraining the gap between the two modality-to-text similarities to obtain the semantic distribution consistency constraint L_imp = |cos(Z_v, T_v) − cos(Z_i, T_i)|.
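The alignment losses of claims 5 and 6 can be sketched as follows. A minimal NumPy illustration under assumptions: the shared-space features and text embeddings are random vectors, the explicit loss is taken as the sum of (1 − cosine similarity) terms, and the implicit loss as the absolute gap between the two similarities, matching the reconstructed formulas above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(2)
d = 16
Zv = rng.standard_normal(d)   # visible visual feature mapped into the shared space
Zi = rng.standard_normal(d)   # infrared visual feature mapped into the shared space
Tv = rng.standard_normal(d)   # visible text semantic features
Ti = rng.standard_normal(d)   # infrared text semantic features

s_v = cosine(Zv, Tv)          # visible visual <-> visible text similarity
s_i = cosine(Zi, Ti)          # infrared visual <-> infrared text similarity

L_exp = (1.0 - s_v) + (1.0 - s_i)   # explicit: pull each modality toward its text
L_imp = abs(s_v - s_i)              # implicit: match the two similarity levels
```

Minimizing `L_exp` drives both cosine similarities toward 1, while `L_imp` discourages one modality from aligning much better than the other.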
- 7. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 6, wherein performing semantic modulation fusion on the aligned visible light visual features, the aligned infrared visual features, and the unified text semantic prior to generate fusion features, and inputting the fusion features into a preset decoder to obtain the final fusion image, specifically comprises: expanding the unified text semantic prior T_u by spatial replication to the same spatial dimensions as the aligned visual features, G = Expand(T_u), wherein G is the global semantic information and Expand denotes expanding the original text semantic prior to the shared spatial dimensions; applying an MLP linear mapping to the aligned visible light and infrared visual features respectively to generate the visible light adaptive weight w_v and the infrared adaptive weight w_i, wherein MLP is a linear mapping function; using the visible light adaptive weight w_v and the infrared adaptive weight w_i as modulation factors, performing semantic-guided gating and residual compensation on the visual features of the corresponding modalities, and combining the global semantic information to construct the fusion features F_f = w_v ⊙ Z_v + w_i ⊙ Z_i + G, wherein ⊙ denotes element-wise multiplication; inputting the fusion features F_f into a preset decoder to generate the final fusion image I_f = D(F_f), wherein D is the decoder.
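The modulation-fusion step of claim 7 can be sketched in NumPy. This is an illustrative data-flow sketch only: random arrays stand in for the aligned features and the text prior, and a sigmoid over a channel mean stands in for the unspecified MLP that produces the adaptive weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
C, H, W = 8, 4, 4
Fv = rng.standard_normal((C, H, W))      # aligned visible-light visual features
Fi = rng.standard_normal((C, H, W))      # aligned infrared visual features
t_prior = rng.standard_normal(C)         # unified text semantic prior (one value per channel)

# Spatial replication: expand the prior to the feature maps' spatial dimensions.
G = np.broadcast_to(t_prior[:, None, None], (C, H, W))

# Adaptive weights in (0, 1); a sigmoid over a channel mean stands in for the MLP.
wv = sigmoid(Fv.mean(axis=0, keepdims=True))
wi = sigmoid(Fi.mean(axis=0, keepdims=True))

# Gated fusion with the global semantic prior added as a residual term.
F_fused = wv * Fv + wi * Fi + G
```

The fused tensor would then be passed to the decoder to reconstruct the output image.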
- 8. The infrared and visible light image fusion method based on text semantic consistency guidance according to claim 7, wherein an infrared intensity retention constraint is introduced during training to reduce the loss of the infrared energy distribution in the fused image, its formula being L_int = ||I_f − I_ir||_1, wherein L_int is the infrared energy distribution loss, I_ir is the original infrared image, and ||·||_1 is the L1 norm.
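The intensity-retention constraint of claim 8 is a plain L1 penalty and can be written directly. One sketch, with the norm normalized by pixel count (a common convention; the claim itself does not specify normalization):

```python
import numpy as np

def infrared_intensity_loss(fused, infrared):
    """L1 distance between the fused image and the source infrared image,
    averaged over pixels, penalizing loss of infrared energy distribution."""
    return float(np.abs(fused - infrared).mean())

ir = np.zeros((4, 4))                 # toy infrared image
fused = np.full((4, 4), 0.25)         # toy fused image
loss = infrared_intensity_loss(fused, ir)
```

During training this term would be added, typically with a weighting coefficient, to the explicit and implicit alignment losses of claims 5 and 6.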
Description
Infrared and visible light image fusion method based on text semantic consistency guidance

Technical Field

The invention relates to the technical field of multi-modal image fusion, and in particular to an infrared and visible light image fusion method based on text semantic consistency guidance.

Background

With the rapid development of computer vision and artificial intelligence, multi-modal image fusion plays an increasingly important role in fields such as intelligent surveillance, autonomous driving, remote sensing, and military reconnaissance. Infrared and visible light image fusion integrates the infrared image's strong detection capability for thermal targets with the visible light image's rich texture and structural details to produce fused images that are more informative and visually interpretable, effectively improving scene perception and understanding in complex environments (such as night, haze, and camouflage). Early infrared and visible image fusion methods were based mainly on traditional image processing techniques such as multiscale transformations (e.g., pyramid decomposition, wavelet transformation), saliency analysis, and rule-based feature selection and weighted fusion. These methods rely on hand-designed features and fusion rules; while intuitive in principle, their performance depends heavily on parameter tuning, they adapt poorly to complex and changing scenes, and they struggle to reconstruct visible light detail textures with high quality while preserving infrared target saliency, so fusion results often suffer from edge blurring, unbalanced contrast, and loss of semantic information.
In recent years, with the advent of deep learning, fusion methods based on convolutional neural networks (CNNs), generative adversarial networks (GANs), and Transformers have become mainstream. These methods automatically learn the mapping from source images to the fused image in an end-to-end manner and can extract deeper, more robust feature representations, significantly improving both objective metrics and subjective visual quality. However, most existing deep learning methods remain essentially a "visual to visual" mapping process: the fusion decision depends mainly on statistical regularities learned from pixels or low-level features and lacks explicit modeling and use of the images' high-level semantic information. In complex scenes, the fusion network struggles to accurately distinguish foreground objects from background clutter, which easily causes semantic confusion; for example, a high-temperature background may be mistakenly fused into the target, or a low-temperature but structurally important target may be weakened. The fused image may thus improve visual quality while damaging semantic consistency and its usefulness for downstream high-level vision tasks (such as object detection and recognition). To introduce semantic guidance, some studies have attempted to use textual descriptions or cross-modal attention mechanisms, for example providing the fusion process with rough semantic cues via image annotation data or simple class labels.
However, these methods generally have obvious limitations. First, the text descriptions used are often coarse-grained (such as a single object category) and cannot provide fine-grained semantic information about scene structures, target attributes, and their relations, so the guidance effect is limited. Second, the cross-modal interaction is often simplistic (such as direct concatenation or shallow attention) and cannot deeply model the semantic-level association and complementarity of the two heterogeneous infrared and visible light modalities, making stable and accurate semantic consistency constraints difficult to build. Third, existing methods generally mix-encode the features of the different modalities and cannot effectively decouple the structural texture information rich in the visible light modality from the intensity-saliency information specific to the infrared modality. This feature coupling makes it difficult for the network to reconcile the two information streams at fusion time and is prone to semantic imbalance: over-stressing intensity may destroy structural integrity, while over-stressing structure may weaken target saliency. In view of this, the present application has been proposed.

Disclosure of Invention

The invention provides an infrared and visible light image fusion method based on text semantic consistency guidance, which can at least partially improve on the above problems. In order to achieve the above purpose, the present invention adopts the following technical scheme: An infrared and visible light image fusion method based on text semantic consistency guidance