CN-121997069-A - Semantic matching method for image and text


Abstract

The invention discloses a semantic matching method for images and texts, belonging to the field of artificial intelligence. In particular, the invention provides an image-text matching method based on trusted heuristic graph learning, which aims to solve the problems of weak interpretability, poor portability, and insufficient reliability of matching results that are common in existing image-text matching methods. The method first maps images and texts into a public graph representation space, then searches for and mines causal structures related to cross-modal semantics through a heuristic counterfactual graph search, and finally integrates evidence and quantifies matching credibility through a cross-modal trusted alignment mechanism. The method is applicable to image-text mutual retrieval, cross-modal semantic understanding, and downstream task enhancement, and offers an interpretable decision process, convenient plug-and-play module integration, and robust, reliable matching results.

Inventors

  • LI BO
  • HE LUDAN
  • WEI XING
  • RAO XUEFENG
  • LI BIAO
  • YANG HUA
  • YAN LIANG
  • LI XUNZHANG
  • CHEN JINGWEN
  • LI NING

Assignees

  • Guilin University of Aerospace Technology (桂林航天工业学院)

Dates

Publication Date
20260508
Application Date
20260122

Claims (10)

  1. A semantic matching method for images and text, characterized by comprising the following steps: Step S1, acquiring an image sample and a text sample to be matched, and respectively extracting image region features of the image sample and text character features of the text sample; Step S2, constructing a visual graph based on the image region features, constructing a text graph based on the text character features, and mapping the visual graph and the text graph into a public graph representation space to obtain a public graph; Step S3, performing a counterfactual graph search on the public graph to generate at least one counterfactual graph, and acquiring counterfactual evidence based on the counterfactual graph, wherein the counterfactual graph and the public graph have opposite semantic classification results under a preset graph classification model; and Step S4, fusing the initial evidence obtained from the public graph with the counterfactual evidence, performing cross-modal trusted alignment, and outputting a matching result and a corresponding credibility for the image sample and the text sample.
  2. The method for semantic matching of images and text according to claim 1, further comprising a pooling step: rearranging the image region features and the text character features respectively in descending order by value; inputting the rearranged features into a graph convolution network to learn a weight for each feature element; and performing weighted fusion of the rearranged features based on the learned weights, using the fused optimized features to update or construct the public graph.
  3. The method for semantic matching of images and text according to claim 1, wherein performing a counterfactual graph search on the public graph comprises: inputting the public graph into a graph neural network classifier to obtain a first classification result; and iteratively performing editing operations on the edge set of the public graph until the second classification result of the edited candidate graph under the graph neural network classifier is opposite to the first classification result; wherein the editing operations comprise adding edges or deleting edges, and each editing operation selects an edge based on a preset weight function.
  4. The method for semantic matching of images and text according to claim 3, wherein the weight function is defined based on the consistency between the occurrence of edges in historical data and their classification labels: for the public graph G = (E, N), E is its edge set, N is its vertex set, and e is an edge in E or in N²\E; D⁺(e) denotes the set of training graphs that contain edge e and whose labels are consistent with the first classification result; D⁻(e) denotes the set of training graphs that contain edge e but whose labels are inconsistent with the first classification result; the weight function assigns each candidate edge e a weight based on the sizes of D⁺(e) and D⁻(e).
  5. The method for semantic matching of images and text according to claim 3, wherein iteratively performing editing operations on the edge set of the public graph comprises: before each editing operation, deciding with equal probability, via a random decision function, whether to perform an add-edge or a delete-edge operation; maintaining a list of already-operated edges to avoid repeatedly editing the same edge; and, when either adding an edge to or deleting an edge from the edge set of the public graph is applicable, preferentially performing the add-edge operation.
  6. The method for semantic matching of images and text according to claim 4, wherein the goal of the counterfactual graph search is to find the counterfactual graph with the smallest edit distance from the public graph, the edit distance being measured by the symmetric difference of the edge sets of the two graphs.
  7. The method for semantic matching of images and text according to claim 1, wherein in step S4 the initial evidence is obtained by: mapping the similarity score s between the image region features and the text character features to a non-negative evidence value through an evidence transformation function g(·): e = g(s) = P(s/τ), wherein P(·) is a nonlinear activation function such as a ReLU function, an exponential function, or a Softplus function, and τ is a temperature parameter with value range (0, 1).
  8. The method for semantic matching of images and text according to claim 7, wherein the cross-modal trusted alignment in step S4 specifically comprises: adding the initial evidence and the counterfactual evidence to obtain total evidence; based on the total evidence, respectively calculating the consistency loss L_i2t in the image-to-text direction and the consistency loss L_t2i in the text-to-image direction, wherein L_i2t is calculated from the difference between the evidence output by the image-to-text retrieval model and the evidence output by the text-to-image retrieval model on the same task, and the consistency loss constrains the uncertainty estimates of the two prediction directions to be consistent; and adding the image-to-text consistency loss and the text-to-image consistency loss to obtain a total loss, and updating the model parameters by optimizing the total loss.
  9. An electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for semantic matching of images and text as claimed in any one of claims 1 to 8.
  10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method for semantic matching of images and text as claimed in any one of claims 1 to 8.
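The iterative edge-editing search of claims 3 to 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the classifier is an arbitrary callable, the training-graph sets are plain edge sets, and the specific weight formula (the fraction of label-inconsistent occurrences of an edge in historical graphs) is an assumption, since claim 4 does not fix the exact expression.

```python
import random

def edge_weight(e, pos_graphs, neg_graphs):
    """Hypothetical weight for candidate edge e: the fraction of historical
    training graphs containing e whose label disagrees with the first
    classification result (claim 4 leaves the exact formula open)."""
    n_pos = sum(e in g for g in pos_graphs)  # label-consistent occurrences
    n_neg = sum(e in g for g in neg_graphs)  # label-inconsistent occurrences
    total = n_pos + n_neg
    return n_neg / total if total else 0.0

def counterfactual_search(edges, nodes, classify, pos_graphs, neg_graphs,
                          max_steps=100, seed=0):
    """Edit the public graph's edge set until the classifier's label flips
    (claims 3 and 5): each step adds or deletes one edge, chosen with equal
    probability, never touching the same edge twice."""
    rng = random.Random(seed)
    g = set(edges)
    first_label = classify(g)
    all_pairs = {(u, v) for u in nodes for v in nodes if u < v}
    edited = set()                       # list of already-operated edges
    for _ in range(max_steps):
        add = rng.random() < 0.5         # equal-probability add/delete decision
        candidates = (all_pairs - g if add else set(g)) - edited
        if not candidates:
            continue
        # select the candidate edge with the highest weight
        e = max(candidates, key=lambda c: edge_weight(c, pos_graphs, neg_graphs))
        (g.add if add else g.discard)(e)
        edited.add(e)
        if classify(g) != first_label:   # classification flipped: counterfactual found
            return g
    return None
```

Because the `edited` set freezes each touched edge, the search stays close to the original graph, in the spirit of the minimal symmetric-difference criterion of claim 6.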
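The evidence transformation of claim 7 and the bidirectional consistency loss of claim 8 can likewise be sketched. The squared-difference form of the loss is an assumption (claim 8 only states that the loss is based on the difference between the two directions' evidence); the activation choices and the temperature follow claim 7.

```python
import math

def evidence(s, tau=0.5, act="softplus"):
    """Map a similarity score s to a non-negative evidence value
    e = P(s / tau), where P is ReLU, exponential, or Softplus and
    tau lies in (0, 1), as in claim 7."""
    x = s / tau
    if act == "relu":
        return max(x, 0.0)
    if act == "exp":
        return math.exp(x)
    return math.log1p(math.exp(x))   # Softplus

def consistency_loss(e_i2t, e_t2i):
    """Assumed squared-difference sketch of claim 8: penalize disagreement
    between image-to-text and text-to-image evidence on the same pair; the
    total loss sums both directions."""
    l_i2t = (e_i2t - e_t2i) ** 2
    l_t2i = (e_t2i - e_i2t) ** 2
    return l_i2t + l_t2i
```

A smaller τ sharpens the mapping by amplifying the similarity score before the activation, so confident matches accumulate evidence faster.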

Description

Semantic matching method for image and text

Technical Field

The invention relates to the field of artificial intelligence, and in particular to a semantic matching method for images and texts.

Background

With the explosive growth of internet multimedia data, accurate semantic matching between images and texts has become a core technology for numerous applications such as information retrieval, content recommendation, and intelligent question answering. As the core scenario of image-text retrieval, the technology aims to mine the semantic associations inherent in heterogeneous image and text carriers and to return semantically matching samples. Image-text matching is widely applied in internet information retrieval, and its advances also provide important references for downstream cross-modal tasks such as visual question answering and image generation. Existing image-text matching methods mainly follow two technical paradigms: global-level methods and region-level methods. Global-level methods use deep neural networks to learn global dense representations of image and text samples and draw semantically matching cross-modal pairs closer through coarse-grained alignment in a latent common space. Such methods are structurally simple and computationally efficient, but struggle to capture fine-grained semantic associations. In contrast, region-level methods focus on establishing fine-grained correspondences between image regions and text words, explicitly learning and integrating cross-modal associations between local features through elaborate attention mechanisms or interaction models.
Region-level methods generally achieve better matching performance because they mine sample details more effectively, and they have become the main research direction in the field. Although fine-grained region-word approaches have made significant progress in matching accuracy, they still face challenges and inherent drawbacks in practical high-reliability applications. First, most existing methods rely on deep fine-grained classifiers to mine statistical correlations from data, but lack modeling and analysis of intrinsic causal relationships. Although this learning paradigm can capture surface associations, it can hardly reach the causal mechanism behind semantic matching; the model generalizes poorly on out-of-distribution samples or noisy data, is easily misled by spurious associations in the data, and its decision process is poorly interpretable. Second, effectively characterizing the complex semantic links among massive samples has become an important bottleneck restricting the development of the technology. Existing methods typically treat image regions and text tokens as separate feature vectors or sets and model their associations through an attention weight matrix. This approach fails to explicitly model the inherent structured semantic relationships within and across modalities, such as the spatial and functional relationships between objects in images, or the grammatical and logical dependencies between words in text. For matching tasks that require understanding complex scenes with multiple entities and their interactions, this flattened representation limits the model's ability to understand deep semantics. Furthermore, existing methods generally lack the ability to quantitatively evaluate the confidence of matching decisions.
Most existing models output only a single determinate matching score and cannot evaluate the uncertainty or confidence of the decision itself. When processing samples with semantic ambiguity, information loss, or noise, such a model cannot give a reliability measure, which may introduce potential risks and limits the application of the technology in fields with high reliability requirements. From an engineering practice perspective, most current advanced matching models are highly customized end-to-end architectures whose designs are tightly coupled with specific tasks, datasets, and even hardware environments. Such tight coupling results in poor portability and flexibility: when new techniques need to be integrated into existing systems, or the model needs to adapt to new business fields, extensive model reconstruction and parameter retuning, or even complete retraining, are often required. This not only incurs high development and deployment costs, but also hinders the rapid transfer and large-scale application of technical results. In particular, against the backdrop of rapidly developing multi-modal large-model technology, how to enhance existing systems with advanced semantic reasoning and trusted computing capability