CN-122024239-A - Counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method
Abstract
The invention discloses a counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method, belonging to the technical field of remote sensing multi-modal intelligent understanding and generation. The method comprises: obtaining a single optical remote sensing image, a paragraph-style description text, and an instruction text; setting non-causal imaging perturbation types irrelevant to the description semantics and sampling a perturbation for each sample; generating a target-region protection mask with a segmentation model and applying the perturbation only to the background region to construct a counterfactual image, forming a sample pair with consistent semantics but different imaging conditions; generating paragraph descriptions for the original image and the counterfactual image under the same instruction condition; jointly training the model with a generation supervision loss and a semantic embedding consistency loss; and, in the inference stage, inputting an image and outputting its paragraph description. The method reduces the model's dependence on non-causal perturbations and improves the stability and transferability of the description results under different imaging conditions.
Inventors
- ZHAO ZUOPENG
- LI LU
- ZHENG XIANGYUN
- NING MAOCAI
- SUN SHOUDU
Assignees
- China University of Mining and Technology (中国矿业大学)
- Jiangsu Bitda Information Technology Co., Ltd. (江苏比特达信息技术有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20260408
Claims (7)
- 1. A counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method, characterized by comprising the following steps: S1, acquiring a training sample comprising a single optical remote sensing image and a corresponding target description text, and acquiring an instruction text corresponding to the optical remote sensing image, the instruction text being used to limit the content range or expression form of the output description; S2, setting a non-causal imaging perturbation type set, and selecting a non-causal imaging perturbation type and its perturbation degree for the training sample; S3, performing target-region segmentation on the optical remote sensing image with a segmentation model to obtain a target-region mask, performing morphological processing on the target-region mask to generate a target-region protection mask, and thereby determining the background region; under the constraint of the target-region protection mask, applying the non-causal imaging perturbation to the background region to construct a counterfactual image, so that the target-region content of the counterfactual image remains identical to that of the optical remote sensing image while the imaging conditions of the background region differ due to the applied perturbation; S4, under the same instruction text condition, inputting the optical remote sensing image and the counterfactual image respectively into a remote sensing multi-modal large model to generate corresponding description results; S5, supervising the description generation process based on the target description text, introducing a counterfactual consistency constraint that constrains the description results of the optical remote sensing image and the counterfactual image to remain semantically consistent, and jointly optimizing the remote sensing multi-modal large model; S6, using the optimized remote sensing multi-modal large model to generate, for a single optical remote sensing image to be described, an output description text under the instruction text condition.
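As a rough, non-authoritative sketch of how steps S1-S5 fit together, the following Python toy mirrors one training step: construct a counterfactual by perturbing only unmasked background pixels, describe both images under the same instruction, and combine a supervised generation loss with a counterfactual consistency term. Every callable name (`segment`, `perturb`, `generate`, `gen_loss`, `embed`, `cos_sim`) and the 1-D list "images" are illustrative assumptions, not the patent's implementation.

```python
# Illustrative toy of one S1-S5 training step; every callable here is an
# assumed stand-in for a real component (segmentation model, perturbation
# operator, multimodal large model, losses). Images are 1-D lists of floats.
def train_step(image, target_text, instruction, *,
               segment, perturb, generate, gen_loss, embed, cos_sim, lam=0.1):
    mask = segment(image)                       # S3: protection mask (1 = protected)
    cf_image = [m * x + (1 - m) * p             # S3: perturb background pixels only
                for m, x, p in zip(mask, image, perturb(image))]
    desc = generate(image, instruction)         # S4: original-branch description
    cf_desc = generate(cf_image, instruction)   # S4: counterfactual-branch description
    l_gen = gen_loss(desc, target_text)         # S5: supervised description loss
    l_cons = 1.0 - cos_sim(embed(desc), embed(cf_desc))  # S5: consistency term
    return l_gen + lam * l_cons
```

With a fully protective mask the counterfactual equals the original, both branches produce identical descriptions, and the consistency term vanishes, leaving only the generation loss.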
- 2. The counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method according to claim 1, wherein the step of acquiring the training samples and instruction texts in S1 comprises: S1.1, constructing a training sample set and organizing the training data into the paired form "image-target description-instruction text", ensuring consistency and traceability of supervision signals and conditional inputs in subsequent training; then performing sample integrity verification so that each sample comprises at least a single optical remote sensing image, its corresponding target description text, and its corresponding instruction text, providing a unified data interface for counterfactual sample construction and consistency constraint calculation; wherein D = {(x_i, y_i, c_i)}, i = 1, ..., N denotes the training sample set, (x_i, y_i, c_i) denotes the i-th training sample triplet, x_i denotes the single optical remote sensing image input, y_i denotes the target description text corresponding to x_i, c_i denotes the instruction text corresponding to x_i, N denotes the number of training samples, and i is the sample index; S1.2, performing paragraph-structured representation of the target description text, so that the target description explicitly expresses the hierarchy of a paragraph composed of multiple sentences and a sentence composed of a token sequence, facilitating paragraph-level semantic supervision in subsequent training and, when required, consistency constraints on inter-sentence organization and intra-sentence expression, thereby improving the stability and controllability of paragraph generation; wherein y_i = (s_{i,1}, ..., s_{i,K_i}), K_i denotes the number of sentences in the paragraph, s_{i,k} denotes the k-th sentence, s_{i,k} = (w_{i,k,1}, ..., w_{i,k,T_{i,k}}), and w_{i,k,t} denotes the t-th token of the k-th sentence; this representation performs unified modeling of paragraph-level supervision signals in the training stage and provides standardized text objects for subsequent semantic consistency calculation; S1.3, acquiring and normalizing the instruction text: the instruction text is expressed as a token sequence and subjected to format unification and field standardization, the normalization eliminating instruction noise and unifying instruction expression forms, so that the same task constraint has consistent conditional semantics across samples and the model produces comparable paragraph description results when the original image and the counterfactual image are input under the same condition; wherein c_{i,l} denotes the l-th token of the instruction text, L_i denotes the instruction token length, Norm(·) denotes the instruction normalization operator, c̃_i = Norm(c_i^raw), and c_i^raw denotes the original instruction text; the normalization operator realizes instruction templating, redundancy reduction, and unified expression of key constraint fields; S1.4, parsing the normalized instruction text to obtain structured constraint information, used as the conditional control quantity in the paragraph description generation process to realize unified control over sentence-count structure, paragraph organization mode, objects of interest, and expression style; wherein Parse(·) denotes the instruction parsing operator and g_i = Parse(c̃_i) = (g_i^num, g_i^struct, g_i^obj, g_i^style) denotes the structured constraint vector, in which g_i^num denotes sentence-count or length constraint information, g_i^struct denotes paragraph structure constraint information, g_i^obj denotes object-of-interest constraint information, and g_i^style denotes expression style constraint information; the structured constraint vector controls the paragraph description output form and content emphasis consistently in both the training and inference stages; the single optical remote sensing image x_i, the normalized instruction text c̃_i, and the structured constraint vector g_i then form a standardized input unit; wherein u_i = (x_i, c̃_i, g_i) denotes the standardized input unit, constructed from the single optical remote sensing image x_i, the normalized instruction text c̃_i, and the structured constraint vector g_i; since the training data set takes the triplet form (x_i, y_i, c_i), where y_i denotes the target description text corresponding to x_i, each triplet can be mapped through instruction normalization and parsing to the supervised training unit (u_i, y_i), so that subsequent counterfactual sample construction, two-branch paragraph description generation, and consistency constraint calculation are completed under a unified input-output interface.
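The instruction normalization and parsing of S1.3-S1.4 can be sketched as a small, hedged Python example. The operator names (`norm`, `parse`) and the specific constraint fields and regex patterns are assumptions for illustration only; the patent does not specify an implementation.

```python
import re

def norm(raw: str) -> str:
    """Instruction normalization: lowercase and collapse whitespace so that
    equivalent instructions share one canonical form."""
    return re.sub(r"\s+", " ", raw.strip().lower())

def parse(instruction: str) -> dict:
    """Instruction parsing into a structured constraint vector
    (sentence count, paragraph structure, object of interest, style)."""
    g = {"num_sentences": None, "structure": "free",
         "object": None, "style": "neutral"}
    m = re.search(r"(\d+)\s+sentences?", instruction)
    if m:
        g["num_sentences"] = int(m.group(1))      # sentence-count constraint
    m = re.search(r"focus on (\w+)", instruction)
    if m:
        g["object"] = m.group(1)                  # object-of-interest constraint
    if "formal" in instruction:
        g["style"] = "formal"                     # expression-style constraint
    return g
```

The resulting dictionary plays the role of the structured constraint vector: it conditions generation identically at training and inference time.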
- 3. The counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method according to claim 1, wherein the step of setting the non-causal imaging perturbation type set and selecting a perturbation type and degree for the training sample in S2 comprises: S2.1, constructing a non-causal imaging perturbation type set representing imaging perturbation factors that are irrelevant to target semantics but change image appearance, and binding each perturbation type to an executable perturbation operator family so that the corresponding perturbation can be invoked through a unified interface in subsequent counterfactual image construction; wherein T = {t_1, ..., t_M} denotes the non-causal imaging perturbation type set, t_m denotes the m-th perturbation type, M denotes the number of perturbation types, and m is the perturbation type index; F = {F_{t_1}, ..., F_{t_M}} denotes the set of perturbation operator families, and F_{t_m} denotes the perturbation operator family or perturbation generation rule corresponding to type t_m, used to apply the selected perturbation to a designated image region; S2.2, selecting for each training sample at least one perturbation type from the set according to the sample content and instruction conditions, and determining the corresponding perturbation degree, which controls the influence amplitude of the perturbation on the background appearance so as to form a counterfactual sample distribution covering different imaging condition changes; wherein t_i denotes the perturbation type selected for the i-th sample, δ_i denotes the perturbation degree control quantity corresponding to that type, p(t | x_i, c_i) denotes the perturbation type selection distribution given the image x_i and instruction text c_i, and p(δ | t_i, x_i) denotes the perturbation degree selection distribution given the perturbation type and sample condition; these distributions can be realized by preset rules, random mechanisms, or learned policies, ensuring that perturbation selection is controllable and extensible; S2.3, based on the selected perturbation type and degree, invoking the corresponding perturbation operator family to generate a perturbation operator instance for subsequent counterfactual construction, and taking the perturbation type, the operator instance, and its control quantity as the control conditions for counterfactual image generation, so that the perturbation is applied uniformly to the background region under the target-region protection constraint; wherein f_i denotes the perturbation operator instance corresponding to the i-th sample, f_i(x_i) denotes the perturbation result obtained by applying the operator instance to the input image x_i, and f_i = F_{t_i}(· ; δ_i) indicates that the operator family of type t_i, under control of degree δ_i, transforms the input image, satisfying f_i(x_i) = F_{t_i}(x_i ; δ_i); the set of executable perturbation operator instances {f_i} is used in the subsequent step to apply the non-causal imaging perturbation to the background region to construct counterfactual images.
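A minimal sketch of S2, under stated assumptions: the perturbation names (`brightness`, `haze`, `noise_bias`), the pixel-wise formulas, and the uniform degree range are all illustrative stand-ins chosen here, not the patent's operator families.

```python
import random

# Assumed non-causal imaging perturbation type set: each type is bound to an
# executable pixel-wise operator family parameterized by a degree delta.
PERTURBATIONS = {
    "brightness": lambda px, d: min(1.0, max(0.0, px + d)),        # illumination shift
    "haze":       lambda px, d: px * (1 - d) + 0.8 * d,            # scatter toward gray
    "noise_bias": lambda px, d: min(1.0, max(0.0, px + d * 0.1)),  # sensor bias
}

def sample_perturbation(rng: random.Random):
    """Select a type t and degree delta; p(t|x,c) and p(delta|t) are
    modeled as uniform here purely for illustration."""
    t = rng.choice(sorted(PERTURBATIONS))
    delta = rng.uniform(0.1, 0.3)
    return t, delta

def make_operator(t: str, delta: float):
    """Instantiate the operator f(.) = F_t(.; delta), acting pixel-wise."""
    fam = PERTURBATIONS[t]
    return lambda image: [fam(px, delta) for px in image]
```

For example, `make_operator("haze", 0.5)` maps pixel 0.0 to 0.4 and pixel 1.0 to 0.9, a blend toward a fixed gray level.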
- 4. The counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method according to claim 1, wherein the step of constructing the counterfactual image under the target-region protection mask constraint in S3 comprises: S3.1, performing region division of the optical remote sensing image based on the target-region protection mask, defining the spatial extents of the target region and the background region, taking the target region as the protected region for semantic preservation and the background region as the action region for the non-causal imaging perturbation; the region division ensures that the counterfactual construction changes only imaging appearance factors irrelevant to the target semantics, so that the counterfactual sample is consistent with the original sample at the target semantic level but differs controllably at the imaging condition level; wherein x_i denotes the single optical remote sensing image of the i-th sample, M_i denotes the target-region protection mask, B_i = 1 − M_i denotes the background region mask, and 1 denotes an all-ones matrix of the same size as the image; this division partitions the image into target and background regions by masks, the background region mask and the target-region protection mask being complementary, and provides the constraint basis for applying the perturbation only to the background region; S3.2, invoking the perturbation operator instance corresponding to the selected non-causal imaging perturbation type and applying the non-causal imaging perturbation to the background region under the constraint of the perturbation degree control quantity, obtaining the background-perturbed image content; the perturbation operation acts only on pixel positions covered by the background region mask, avoiding damage to key structures and semantic cues of the target region and allowing the perturbation effect to be reused consistently to construct comparable counterfactual sample pairs; wherein x_i^b = f_i(x_i) denotes the background perturbation result obtained for the i-th sample, f_i denotes the executable perturbation operator instance corresponding to the i-th sample, and f_i = F_{t_i}(· ; δ_i) indicates that the operator instance, as the family of type t_i under degree δ_i, transforms the input image x_i; the background perturbation result x_i^b is fused with the target-region content in the subsequent step to construct the counterfactual image; S3.3, fusing the original image content of the target region with the perturbed image content of the background region to obtain the counterfactual image, the fusion following the combination rule "target region preserved, background region perturbed" at the pixel/feature level, so that the counterfactual image preserves the target semantics while imaging condition changes are explicitly introduced to support counterfactual contrast training and causal-effect enhancement; wherein x_i^cf = M_i ⊙ x_i + B_i ⊙ x_i^b denotes the counterfactual image corresponding to the i-th sample, x_i^b denotes the background-perturbed image content, and the fused x_i^cf remains consistent with the original image x_i in the target region while reflecting, in the background region, the imaging appearance change controlled by the perturbation type and degree, forming a counterfactual sample pair usable for consistency training; S3.4, forming sample pairs of the counterfactual image and the original image and establishing correspondence under the same instruction condition, thereby ensuring that the conditional inputs of the original branch and the counterfactual branch are consistent in subsequent description generation and consistency constraint calculation, so that model training proceeds around the objective "description semantics invariant to non-causal perturbation"; wherein P = {(x_i, x_i^cf, c_i)} denotes the counterfactual sample pair set, each pair aligned under the same instruction text c_i; the sample pair set is used for two-branch description generation and counterfactual consistency constraint calculation in the subsequent steps.
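The mask-protected fusion of S3.3 reduces to one line per pixel. The sketch below follows the "target kept, background perturbed" rule read from the claim; the 1-D list images and the 0/1 mask are simplifications introduced here for illustration.

```python
# Hypothetical sketch of S3.3: fuse original target-region pixels with
# perturbed background pixels, i.e. x_cf = M*x + (1 - M)*f(x).
def counterfactual(image, protect_mask, operator):
    perturbed = operator(image)          # background perturbation f(x), computed everywhere
    return [m * x + (1 - m) * p          # keep protected pixels, swap in perturbed background
            for m, x, p in zip(protect_mask, image, perturbed)]
```

Protected pixels come out bit-identical to the original, so the resulting pair differs only in background imaging conditions, which is what makes the two branches comparable.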
- 5. The counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method according to claim 1, wherein the step of generating paragraph-style description results for the original image and the counterfactual image respectively under the same instruction text condition in S4 comprises: S4.1, generating structure planning information for the paragraph description under the constraint of the instruction text, the structure planning information explicitly characterizing the organization mode and content emphasis among paragraphs and sentences, so that the subsequent generation process remains consistent in sentence-count structure, inter-sentence cohesion, objects of interest, expression style, and the like, ensuring that the outputs of the original branch and the counterfactual branch are comparable and providing a stable semantic alignment basis for the counterfactual consistency constraint; wherein z_i denotes the paragraph structure planning information of the i-th sample, Plan_θ(·) denotes the planning generation operator determined by the remote sensing multi-modal large model parameters θ, x_i denotes the single optical remote sensing image, and z_i = Plan_θ(x_i, c̃_i, g_i); the planning information can limit the inter-sentence organization mode, content coverage, or expression form of the paragraph, improving the structural stability and conditional consistency of paragraph generation; S4.2, performing multi-modal representation extraction and fusion on the image and the instruction text, mapping the image content and the instruction constraint into a unified conditional semantic space, so that the generator can exploit both image semantic cues and instruction constraints during decoding, the fused conditional representation serving as the conditional input of paragraph description generation to ensure controllability and consistency of the generated content under the same instruction condition; wherein Enc_v(·) denotes the visual encoding operator, Enc_c(·) denotes the instruction text encoding operator, v_i = Enc_v(x_i) denotes the image representation, r_i = Enc_c(c̃_i) denotes the instruction representation, Fuse(·,·) denotes the cross-modal fusion operator, and h_i = Fuse(v_i, r_i) denotes the fused conditional semantic representation; the fused representation simultaneously constrains the generated content and the paragraph organization mode during decoding, improving the stability of paragraph description output; S4.3, outputting the paragraph description text by sequence generation under the constraint of the fused conditional representation and the paragraph structure planning information, the paragraph description text consisting of multiple sentences and being further expressible as a sentence-sequence / token-sequence hierarchy, so that paragraph-level semantics and inter-sentence structure can be uniformly supervised in training; wherein Dec(·) denotes the text decoding generation operator, ŷ_i = Dec(h_i, z_i) denotes the paragraph-style description result generated for the i-th sample, ŷ_i = (ŝ_{i,1}, ..., ŝ_{i,K̂_i}), K̂_i denotes the number of generated sentences, ŝ_{i,k} denotes the k-th generated sentence, and ŵ_{i,k,t} denotes the t-th token of the k-th sentence; this hierarchy represents the inter-sentence organization and intra-sentence expression of the paragraph output, so that the subsequent consistency constraint can be stably aligned at the paragraph semantic level; S4.4, executing the above paragraph generation process on the original image and the counterfactual image respectively under the same instruction text condition, obtaining the original-branch description result and the counterfactual-branch description result, and establishing a one-to-one correspondence between them under the same condition constraint for the subsequent counterfactual consistency constraint calculation and joint optimization; wherein G_θ(·,·) denotes the conditional paragraph generation mapping of the remote sensing multi-modal large model, ŷ_i = G_θ(x_i, c_i) denotes the paragraph description result of the original image x_i under instruction text c_i, and ŷ_i^cf = G_θ(x_i^cf, c_i) denotes that of the counterfactual image x_i^cf; the two description results are generated under the same instruction text condition, so that their difference stems mainly from the imaging condition change rather than from instruction differences, providing reliable input for the subsequent counterfactual consistency training.
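Two pieces of S4 lend themselves to a short sketch: the sentence-sequence / token-sequence hierarchy of S4.3, and the two-branch generation of S4.4 under one shared instruction. The period-based sentence splitting and the `generate` stand-in are assumptions for illustration, not the patent's decoder.

```python
def paragraph_hierarchy(text):
    """Paragraph -> list of sentences -> list of tokens per sentence
    (naive split on '.', purely illustrative)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s.split() for s in sentences]

def two_branch_describe(generate, image, cf_image, instruction):
    """Run both branches with the *same* instruction so that output
    differences stem only from the imaging perturbation."""
    return generate(image, instruction), generate(cf_image, instruction)
```

Holding the instruction fixed across branches is what licenses attributing any output difference to the non-causal perturbation rather than to the conditioning.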
- 6. The counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method according to claim 1, wherein the step of supervising the paragraph description generation process based on the target description text and introducing the counterfactual consistency constraint to jointly optimize the remote sensing multi-modal large model in S5 comprises: S5.1, constructing a description generation supervision constraint based on the target description text and the generation result of the original-image branch, so that the model generates, under the instruction text condition, a description text that is semantically consistent with the target and has a reasonable paragraph structure; the supervision constraint takes the token sequence of the paragraph text as the supervision object and constrains the conditional probability of each generation step under an autoregressive framework, ensuring semantic alignment between the generated content and the target description; wherein L_gen denotes the description generation supervision loss, θ denotes the trainable parameters of the remote sensing multi-modal large model, D denotes the training sample set, and P_θ(y_i | x_i, c_i) denotes the conditional probability of the model generating the target paragraph description y_i given the input image x_i and instruction text c_i, with L_gen(θ) = −Σ_{i∈D} log P_θ(y_i | x_i, c_i); this supervision loss constrains the consistency between the model output and the target description text and provides a stable generation baseline before the counterfactual consistency is introduced; S5.2, constructing the counterfactual consistency constraint based on the two description results generated by the original image and the counterfactual image under the same instruction text condition, so that the model remains, in the causal sense, semantically invariant to the non-causal imaging perturbation; the constraint is realized by mapping the two descriptions into a unified semantic space and constraining the consistency of their semantic embedding representations, reducing the model's sensitivity to imaging appearance changes and preventing the non-causal perturbation from serving as a basis for description decisions; wherein E_φ(·) denotes the semantic embedding encoder, φ denotes its parameters, ŷ_i denotes the paragraph description result of the original-image branch, ŷ_i^cf denotes that of the counterfactual-image branch, and e_i = E_φ(ŷ_i) and e_i^cf = E_φ(ŷ_i^cf) are their semantic embeddings; the semantic embedding encoder can be realized by the text encoding module of the model and provides comparable semantic representations; wherein L_cons denotes the semantic embedding consistency constraint loss, P denotes the counterfactual sample pair set, cos(·,·) denotes cosine similarity, and L_cons(θ, φ) = Σ_{(x_i, x_i^cf)∈P} (1 − cos(e_i, e_i^cf)); by penalizing inconsistency of the semantic embeddings, this loss drives the model to keep the paragraph description semantics stable when background imaging conditions change, realizing the training objective of counterfactual causal enhancement; S5.3, combining the generation supervision constraint and the counterfactual consistency constraint and updating the model parameters with a joint objective that balances generation accuracy and causal robustness, the joint objective controlling the relative contribution of the two constraints through a trade-off term, so that the model learns a reliable image-text mapping while keeping its semantic output invariant under non-causal imaging perturbation; wherein L(θ, φ) = L_gen(θ) + λ·L_cons(θ, φ) denotes the joint training objective, λ denotes the consistency weight coefficient, and the model parameters θ (and optionally the semantic embedding encoder parameters φ) are jointly updated by minimizing the joint objective, completing the joint optimization of the remote sensing multi-modal large model; wherein the parameter solution θ* is used in the subsequent inference stage to generate paragraph-style description text under the instruction text condition while maintaining semantic consistency when imaging conditions change.
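The S5 joint objective can be sketched directly. The cosine-based consistency term and the weighted sum follow the claim; passing `l_gen` in as a precomputed number (rather than evaluating a real autoregressive model) is a simplification for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def joint_loss(l_gen, emb, cf_emb, lam=0.1):
    """L = L_gen + lambda * L_cons, with L_cons = 1 - cos(e, e_cf)."""
    l_cons = 1.0 - cosine(emb, cf_emb)   # counterfactual consistency term
    return l_gen + lam * l_cons
```

Identical embeddings leave only the generation loss; orthogonal embeddings add the full weighted penalty, so the gradient pushes the two branch descriptions toward a shared semantic embedding.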
- 7. The counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method according to claim 1, wherein the step of using the jointly optimized remote sensing multi-modal large model to generate the output paragraph-style description text for the single optical remote sensing image to be described under the instruction text condition in S6 comprises: S6.1, acquiring the instruction text corresponding to the single optical remote sensing image to be described, and performing normalization and parsing consistent with the training stage to obtain the structured constraint information controlling the paragraph output form; the structured constraint information limits the sentence-count structure, inter-sentence organization mode, objects of interest, and expression style of the paragraph output in the inference stage, so that the model stably outputs compliant paragraph description results across application scenarios; wherein c^test denotes the instruction text input in the inference stage, Norm(·) denotes the instruction normalization operator, c̃ = Norm(c^test) denotes the normalized instruction text, Parse(·) denotes the instruction parsing operator, and g = Parse(c̃) denotes the structured constraint vector obtained by parsing; the structured constraint vector controls the output form and content emphasis of the paragraph description and keeps the conditional semantics consistent with the training stage; S6.2, forming a standardized input unit from the image to be described, the normalized instruction text, and its structured constraint information, and inputting it into the jointly optimized remote sensing multi-modal large model to generate the paragraph-style description text under the same conditional control; the standardized input unit ensures that the data interface of the inference stage is consistent with the training stage, allowing the model to directly reuse the conditional generation capability learned in training and avoiding unstable output caused by input format differences; wherein x denotes the single optical remote sensing image to be described and u = (x, c̃, g) denotes the standardized input unit of the inference stage, composed of the image x, the normalized instruction text c̃, and the structured constraint vector g; the standardized input unit serves as the conditional input triggering the paragraph-style description generation process; S6.3, invoking the jointly optimized remote sensing multi-modal large model on the standardized input unit to generate the paragraph-style description text and outputting it as the inference result; the generation process outputs a multi-sentence paragraph under the constraint of the structure planning information and the multi-modal fused conditional representation, so that the output text conforms to the instruction constraints in inter-sentence structure and semantic content and maintains stability and consistency of semantic expression when imaging conditions change; wherein G_{θ*}(·) denotes the jointly optimized remote sensing multi-modal large model with parameters θ*, and ŷ = G_{θ*}(u) denotes the output paragraph-style description text, which has a multi-sentence structure under the instruction text constraint and serves as the final description result of the image to be described.
Description
Counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method Technical Field The invention relates to the technical field of remote sensing image understanding and natural language generation, in particular to a counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method. Background With the rapid development of remote sensing observation platforms, sensor systems, and multi-modal large model technology, remote sensing image understanding and natural language generation show increasingly important application value in scenarios such as land cover monitoring, disaster emergency assessment, fine-grained urban management, and resource surveys. Meanwhile, multi-source auxiliary information (such as task instructions and prior-knowledge texts) can provide semantic constraints for description generation, so that the model meets application requirements in aspects such as objects of interest, description granularity, and expression style. Multi-modal image description for single optical remote sensing images has therefore become an important direction for advancing intelligent remote sensing interpretation and automatic report generation. Traditional remote sensing image description methods mostly rely on hand-designed feature extraction and templated language generation strategies, making it difficult to fully describe the multi-scale structure and spatial relationships of ground objects in complex scenes; the generated text is prone to monotonous expression, incomplete coverage, and one-sided semantics.
In recent years, the development of deep learning and vision-language models has provided new solutions for remote sensing image description: generative models based on the encoder-decoder architecture can automatically learn image representations and generate natural language descriptions, and attention mechanisms and cross-modal interaction modules can, to a certain extent, improve attention to key regions and its correlation with the described content. However, remote sensing images exhibit significant appearance differences across regions, seasons, and imaging conditions, such as atmospheric scattering, radiation and illumination changes, resolution degradation and blurring, compression and resampling artifacts, and sensor noise and banding. These non-causal imaging perturbations change the image appearance but should not determine scene semantics, which makes the model prone to learning pseudo-correlated features or shortcut cues tied to the target description during training, leading to unstable descriptions, detail drift, and even semantic expressions inconsistent with the real scene in cross-domain application. Furthermore, paragraph descriptions generally need to satisfy requirements such as inter-sentence structural organization, information coverage completeness, and semantic consistency simultaneously, and are more susceptible to imaging perturbation and data distribution shift than single-sentence descriptions. Existing methods mostly adopt conventional data augmentation or simple consistency regularization to improve robustness, but these struggle to guarantee strict consistency between the augmented sample and the original sample at the target semantic level; in particular, when the semantics of the target region are easily damaged, extra noisy supervision is readily introduced, weakening the stability and transferability of the model.
Therefore, a training method is needed that can explicitly model and intervene on non-causal imaging perturbations while keeping the target semantics unchanged, so that the model learns the causally robust representation "description semantics remain stable under imaging condition changes", improving the reliability and generalization of remote sensing multi-modal image description in complex imaging environments and under cross-domain distributions. Disclosure of Invention The invention aims to provide a counterfactual- and causal-effect-enhanced remote sensing multi-modal image description generation method, which, by introducing counterfactual contrast training and a causal consistency constraint, explicitly models and intervenes on non-causal imaging perturbations such as atmospheric scattering, radiation and illumination changes, resolution degradation and blurring, compression and resampling artifacts, and sensor noise and banding while keeping the target semantics unchanged, solving the problems that conventional remote sensing image description models are easily affected by imaging condition changes, generalize insufficiently across domains, and produce unstable paragraph descriptions, and improving the