
CN-122023831-A - Salient object ranking method based on cross-modal graph structure reasoning

CN 122023831 A

Abstract

The invention discloses a salient object ranking method based on cross-modal graph structure reasoning. The method comprises: extracting multi-scale instance-level visual features for each image in a base data source; using EMAM modules to achieve cross-modal semantic injection for those images; using SA-CAG modules to apply cross-modal semantic guidance to the instance features; constructing a ranking graph network over all images in the base data source; and training the cross-modal ranking model on the images and their semantic description texts, so as to obtain an optimal saliency ranking model. By combining cross-modal semantic injection, attention guidance, and graph structure reasoning, the method overcomes the weaknesses of conventional purely visual models when ranking objects in complex scenes, improving both saliency ranking accuracy and model robustness.

Inventors

  • Wang Jing
  • He Ning
  • Xie Wensi
  • Zhong Chengzhi
  • Zhang Zhen
  • Ma Ping
  • Han Yanling
  • Song Ge

Assignees

  • Shanghai Ocean University

Dates

Publication Date
2026-05-12
Application Date
2026-02-03

Claims (10)

  1. A salient object ranking method based on cross-modal graph structure reasoning, characterized by comprising the following steps: S1, constructing a base data source for model training from a salient object ranking data set, and setting a semantic description text for each image in the base data source; S2, extracting multi-scale instance-level visual features for each image in the base data source; S3, for each image in the base data source, realizing cross-modal semantic injection using EMAM modules; S4, for each image in the base data source, using an SA-CAG module to apply cross-modal semantic guidance to the instance features; S5, constructing a ranking graph network over all images in the base data source; S6, training the cross-modal ranking model on the images in the base data source and their semantic description text data, so as to obtain the optimal saliency ranking model.
  2. The salient object ranking method based on cross-modal graph structure reasoning of claim 1, wherein the EMAM module realizes hierarchical, multi-scale stepwise injection of visual features and image-text interaction features from the BLIP model by embedding a series of lightweight adapters at each stage of the Swin Transformer. Each adapter is composed of a visual feature injection module and a cross-modal semantic injection module, which respectively introduce pure visual semantic features from the BLIP Vision Encoder and image-text interactive representations from the Cross-modal Encoder; through collaborative injection of the two types of semantics, the backbone obtains cross-modal consistency at different spatial resolutions and semantic levels, forming a representation that is more sensitive to the semantics of salient objects.
  3. The salient object ranking method based on cross-modal graph structure reasoning of claim 2, wherein the SA-CAG module comprises: a) instance-level prediction masks output by the Mask2Former decoder; b) a cross-modal attention heatmap extracted by the BLIP vision-language model, characterizing the spatial regions of interest under the text semantics; c) a linear projection and feature fusion unit for mapping the saliency responses and injecting them into the decoder query representations. The SA-CAG module uses the text-guided cross-modal attention as spatial weights to perform weighted aggregation over the prediction mask of each instance query, thereby explicitly measuring the saliency response strength of each instance under the text semantics; the saliency response is then mapped into the query feature space and fused with the original query representation, so that each query obtains saliency guidance from the text semantics while retaining its own visual discrimination capability.
  4. The salient object ranking method based on cross-modal graph structure reasoning as claimed in claim 3, wherein step S1 further comprises: constructing the semantic description texts with a consistency text generation strategy, in which a large-scale vision-language model is guided to generate semantically consistent texts through dual-image comparison prompts.
  5. The salient object ranking method based on cross-modal graph structure reasoning as recited in claim 4, wherein step S2 further comprises: when extracting the multi-scale instance-level visual features, Mask2Former is adopted to encode the input image, and EMAM modules are embedded at each stage of the Swin Transformer backbone network; the cross-modal semantic injection module performs dimension alignment, normalization, and linear projection on the image-text interaction features from the BLIP cross-modal encoder, so that they can be injected into the spatial features of the visual backbone; the visual feature injection module performs Patch-Merging and linear adjustment on the intermediate-layer outputs of the BLIP visual encoder, so that they match the different-scale features of the Swin Transformer backbone network, enhancing the semantic sensitivity of the salient regions through residual connections.
  6. The salient object ranking method based on cross-modal graph structure reasoning as set forth in claim 5, wherein step S3 further comprises: inserting the lightweight E-Adapter and E-Injector modules into each layer of the Swin Transformer backbone network, and injecting the visual features output by the BLIP visual encoder and the text-image interaction features of the cross-modal encoder into the visual features at different scales in stages, thereby enhancing the model's sensitivity to salient object semantics.
  7. The salient object ranking method based on cross-modal graph structure reasoning as recited in claim 6, wherein step S4 further comprises: realizing semantic guidance of salient regions using cross-modal attention, generating Attention-Guided CAM heatmaps through the BLIP model, mapping the heatmaps into the instance query space to form semantically aligned saliency attention weights, and fusing these weights with the instance features of the Mask2Former decoding stage, so that under the constraint of text semantics the model can focus on the truly salient regions across the multi-scale features.
  8. The salient object ranking method based on cross-modal graph structure reasoning of claim 7, wherein the SA-CAG module is realized by: upsampling the Attention-Guided CAM heatmaps generated by the BLIP text encoder so that their spatial dimensions are consistent with the instance mask predictions output by the Mask2Former decoder; applying Softmax normalization to the heatmaps to obtain a cross-modal saliency weight distribution; multiplying the attention weights and the instance masks point by point to obtain semantically guided feature responses; projecting the responses into the query feature space through linear mapping; and finally fusing them with the original instance query features in a residual manner.
  9. The salient object ranking method based on cross-modal graph structure reasoning as recited in claim 8, wherein step S5 further comprises: using the multi-scale fused instance features as graph nodes, and learning the semantic competition relations and relative importance among instances with a graph attention mechanism; the ranking graph network is constructed as follows: the SA-CAG-processed multi-scale instance features are averaged to form the node features; the semantic relation weight between any two instance nodes is computed with a graph attention mechanism (GAT), and the adjacency attention matrix is formed by Softmax normalization; the instance nodes are weighted and aggregated according to the attention weights, modeling the saliency competition relations; the graph reasoning process is stabilized through residual structures and LayerNorm; and the saliency score of each instance is finally output by a linear regression head.
  10. The salient object ranking method based on cross-modal graph structure reasoning as recited in claim 1, wherein step S6 further comprises: during model training, freezing all parameters of the Mask2Former backbone and the BLIP model, and training only the EMAM modules and the ranking graph network; using an AdamW optimizer and a cosine annealing learning rate schedule during training, so that the loss gradually converges after a number of epochs; stopping training once the SA-SOR and SOR metrics have stabilized on the validation set, to obtain the optimal model weights; the training results output by the final system comprise the salient object ranking scores, instance mask predictions, training logs, and performance curves.
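The EMAM injection described in claims 2, 5, and 6 (Patch-Merging for scale alignment, normalization, linear projection, then residual injection into the backbone features) can be illustrated in isolation. The following is a minimal NumPy sketch, not the patented implementation: function names, tensor shapes, and the single-stage setup are illustrative assumptions, and the real modules operate on learned Swin Transformer and BLIP features rather than random arrays.

```python
import numpy as np

def patch_merge(feat):
    """2x2 patch merging (Swin-style): halves spatial size, concatenates channels."""
    H, W, C = feat.shape
    f = feat[: H // 2 * 2, : W // 2 * 2].reshape(H // 2, 2, W // 2, 2, C)
    return f.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)

def emam_inject(backbone_feat, blip_feat, W_proj):
    """Align an external (e.g. BLIP) feature map to the backbone scale and
    inject it through a residual connection.
    backbone_feat: (H/2, W/2, D) backbone features at the target stage
    blip_feat:     (H, W, C) higher-resolution external features
    W_proj:        (4C, D) linear projection to the backbone channel dim
    """
    merged = patch_merge(blip_feat)                   # match backbone resolution
    mu = merged.mean(-1, keepdims=True)
    sd = merged.std(-1, keepdims=True)
    aligned = ((merged - mu) / (sd + 1e-5)) @ W_proj  # normalize + linear projection
    return backbone_feat + aligned                    # residual injection
```

In the claimed design this injection is repeated at every backbone stage, with one adapter per stage for the pure visual features and one for the image-text interaction features.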
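The SA-CAG aggregation of claims 3, 7, and 8 reduces to: softmax-normalize the cross-modal heatmap, weight each instance mask by it, pool to a per-instance saliency response, project that response into the query space, and fuse residually. A minimal NumPy sketch of that arithmetic follows; the function name, shapes, and the scalar-response projection are illustrative assumptions, with the heatmap assumed already upsampled to mask resolution.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sa_cag_fuse(cam, masks, queries, W_proj):
    """Fuse a text-guided attention heatmap into instance query features.
    cam:     (H, W) cross-modal attention heatmap (upsampled to mask resolution)
    masks:   (N, H, W) per-instance mask probabilities
    queries: (N, D) instance query features
    W_proj:  (1, D) linear map from scalar saliency response to query space
    """
    w = softmax(cam.ravel()).reshape(cam.shape)   # normalized saliency weights
    resp = (masks * w[None]).sum(axis=(1, 2))     # per-instance weighted response
    injected = resp[:, None] @ W_proj             # project to query feature space
    return queries + injected, resp               # residual fusion
```

An instance whose mask overlaps the text-attended region receives a larger response, and hence a stronger saliency signal is injected into its query.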
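The ranking graph network of claim 9 (Softmax-normalized attention adjacency, weighted aggregation, residual + LayerNorm, linear regression head) can likewise be sketched. This is a single-head, fully connected NumPy approximation under assumed shapes, not the patented GAT layer; in particular, real GAT layers typically use learned edge-wise attention with a LeakyReLU, whereas this sketch uses a simpler dot-product score.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def gat_rank(X, Wq, Wk, Wv, w_score):
    """One attention pass over a fully connected instance graph, followed by
    a linear regression head emitting one saliency score per node.
    X: (N, D) node features (pooled multi-scale instance features)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)   # row-softmax adjacency attention matrix
    H = layer_norm(X + A @ V)              # weighted aggregation + residual + LayerNorm
    return H @ w_score                     # (N,) per-instance saliency scores
```

Sorting the returned scores yields the predicted saliency ranking over the instances in an image.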

Description

Salient object ranking method based on cross-modal graph structure reasoning

Technical Field

The invention relates to multi-modal intelligent perception and computer vision technology, and in particular to a salient object ranking method based on cross-modal graph structure reasoning.

Background

With the rapid development of computer vision technology, machine vision systems are now able to identify and understand image content in complex scenes. Among these capabilities, saliency detection and salient object ranking (Salient Object Ranking, SOR) serve as core tasks for simulating the human visual attention mechanism, and have gradually become an important research direction in the vision field. Conventional saliency detection (Salient Object Detection, SOD) is mainly concerned with segmenting the most salient foreground region from the image, typically producing a binary saliency map. Such methods can locate the region of an image that most readily draws the human eye, but cannot further distinguish the relative importance among multiple salient objects, and therefore struggle to satisfy the multi-level attention allocation mechanism of human visual cognition. As research has progressed, it is increasingly recognized that salience is not an absolute quantity but a relative attribute with a hierarchical structure: when an image contains multiple salient objects, a saliency competition and ranking relationship, namely Salient Object Ranking (SOR), forms between them. Existing SOR methods can be broadly divided into two categories. The first class infers the object ranks by computing the saliency intensity of each segmented region from a pixel-level saliency map. These methods are structurally simple, but lack the ability to model semantic relations and context dependence among objects, and easily confuse objects with similar semantics or close scales.
The second class is based on instance-level feature relation modeling: each salient object is taken as a node, and the importance level of each object is inferred through a graph structure, an attention mechanism, or a ranking network; representative methods include ASSR, IRSR, and QAGNet. These methods can infer the saliency differences among objects to a certain extent, but still rely mainly on single-modality visual information and have difficulty capturing the high-level semantic cues inherent in natural language. Meanwhile, with the development of large vision-language models (VLMs), the cross-modal semantic alignment capability between images and texts has improved rapidly. Numerous studies have shown that vision-language models have significant advantages in object semantic recognition, contextual understanding, and attention distribution. However, current mainstream ranking algorithms still depend mainly on single-modality visual features and do not fully exploit the auxiliary signal that text semantics provide about object importance, so that ranking performance in complex scenes (multi-object occlusion, high semantic similarity, large scale differences, and the like) still shows obvious shortcomings. For example, some studies have proposed enhancing the modeling of relations between instances by means of Transformers, graph networks, or attention transfer, but these approaches still suffer from two core problems: (1) lacking semantic alignment signals, they cannot effectively understand the semantic roles, importance levels, and scene logic among objects; (2) lacking cross-modally guided datasets: existing SOR datasets such as ASSR contain no textual descriptions, making cross-modal training difficult. In addition, the current SOR evaluation system focuses mainly on ranking precision, but lacks metrics for measuring the stability and robustness of a model.
For example, in a multi-object scenario, some models may produce valid ranking results on only a small fraction of samples yet still receive high scores, which does not truly reflect their reliability in large-scale applications. Thus, in multi-object saliency ranking tasks, relying on visual features alone can no longer meet the requirements of high-precision semantic reasoning and ranking in complex scenes. In this context, a new salient object ranking method is needed with the following capabilities: 1. fusing high-level visual and textual semantic information to complement the insufficient expressiveness of visual features; 2. introducing cross-modal guidance in the feature encoding stage to improve the model's semantic sensitivity to salient objects; 3. explicitly modeling semantic competition relations among instances during ranking; 4. providing a more complete performance and robustness evaluation scheme. In summary, the main problems to be solved by the present invention are as follows: In the existing significa