CN-122020207-A - Multi-level collaborative selection image-text matching method based on linguistic guidance and computer system
Abstract
The application relates to the technical field of computer vision, and in particular to a multi-level collaborative selection image-text matching method based on linguistic guidance, together with a computer system. On the one hand, the method uses interpretable, linguistically guided word-importance weights as a control signal for the whole pipeline, driving cross-modal enhancement, adaptive region selection and feature-dimension selection, which improves the adaptability of the model to texts of different complexity. On the other hand, a multi-level collaborative selection mechanism that combines the region dimension and the feature dimension simultaneously suppresses irrelevant regions and redundant feature dimensions, which improves fine-grained discrimination capability. The method aims to reduce the risk of misjudgment when matching images against texts of different complexity.
Inventors
- ZHANG ZETAO
- Ma Dinan
- FAN JING
- MA WEI
- WANG YIXIN
Assignees
- 云南日报报业集团 (Yunnan Daily Press Group)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-10
Claims (10)
- 1. The multi-level collaborative selection image-text matching method based on linguistic guidance is characterized by comprising the following steps: S10, respectively preprocessing and encoding an input image and an input text to obtain an image region feature vector set and a text word-level feature sequence; S20, generating word importance weight values for the words in the input text according to the text word-level feature sequence, wherein the word importance weight values are obtained by weighting preset word-level prior weights, inverse document frequency weights and multi-layer perceptron network weights generated from the text word-level feature sequence; S30, carrying out weighted aggregation of the text word-level feature sequence according to the word importance weight values and mapping it to the visual space to generate a text global feature, calculating an image global feature corresponding to the image region feature vector set, splicing the text global feature with the image global feature and with the image region feature vectors respectively, and obtaining a joint importance tensor after gating-network learning; S40, screening enhanced image features from the image region feature vector set based on the joint importance tensor and the word importance weight values; S50, carrying out a segment-wise fusion update of the text word-level feature sequence according to the enhanced image features to obtain an enhanced text word-level feature sequence; S60, constructing an image-text similarity matrix based on the enhanced text word-level feature sequence and the image region feature vector set, and performing image-text matching based on the image-text similarity matrix.
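The end-to-end flow of claim 1 can be summarized as the following skeleton. This is an illustrative sketch only: the encoder and module names are hypothetical placeholders for the stages S10-S60, each of which is sketched in more detail after the later claims.

```python
# Illustrative skeleton of the S10-S60 pipeline in claim 1.
# All module and function names are hypothetical placeholders, not from the patent.
def match_image_text(image, text, encoders, modules):
    # S10: preprocess and encode both modalities
    regions = encoders["image"](image)          # (R, D) region feature vectors
    words = encoders["text"](text)              # (L, D) word-level feature sequence

    # S20: linguistically guided word-importance weights (see claim 3)
    alpha = modules["word_importance"](words)   # (L,)

    # S30: global features + joint importance tensor from gating (see claims 4-5)
    G = modules["joint_importance"](words, regions, alpha)

    # S40: co-select regions and feature dimensions (see claim 6)
    regions_hat = modules["co_select"](regions, G, alpha)

    # S50: segment-wise fusion update of word features (see claim 7)
    words_hat = modules["segment_fuse"](words, regions_hat)

    # S60: similarity computation and matching (see claims 8-9)
    return modules["similarity"](words_hat, regions)
```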
- 2. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 1, wherein in S10 the preprocessing and encoding comprises: S11, performing word segmentation and indexing on the input text to obtain a word identifier sequence of the text and its length information, and padding the word identifier sequence to a preset maximum length; S12, performing word-vector embedding on the padded word identifier sequence and performing context modeling on the embedded sequence with a context encoding network to obtain the text word-level feature sequence; and S13, determining candidate salient regions in the input image, extracting visual features from each salient region, and mapping the extracted visual features to a preset feature dimension to obtain a preset number of image region feature vectors.
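A minimal sketch of the preprocessing and encoding in claim 2, assuming PyTorch and pre-extracted salient-region detector features; the embedding sizes, the BiGRU context encoder and the detector feature dimension are illustrative assumptions, not values fixed by the patent.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512, max_len=50):
        super().__init__()
        self.max_len = max_len
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # S12: word embedding
        self.gru = nn.GRU(emb_dim, hid_dim // 2, batch_first=True,
                          bidirectional=True)                           # context encoding

    def forward(self, token_ids):              # token_ids: (B, L) padded to max_len (S11)
        feats, _ = self.gru(self.embed(token_ids))
        return feats                            # (B, L, hid_dim) word-level feature sequence

class RegionProjector(nn.Module):
    def __init__(self, det_dim=2048, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(det_dim, out_dim)  # S13: map detector features to the preset dim

    def forward(self, region_feats):             # (B, R, det_dim) salient-region features
        return self.proj(region_feats)           # (B, R, out_dim) image region feature vectors
```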
- 3. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 1, wherein S20 specifically comprises: S21, extracting, for each word token in the text word-level feature sequence, a word-vector feature, a part-of-speech embedding feature, an absolute-position embedding feature and a relative-position embedding feature, and concatenating the four types of features into a multi-source linguistic feature sequence; the part-of-speech embedding feature is obtained by looking up the part-of-speech tag of each word in a pre-built part-of-speech dictionary and embedding the tag into a fixed dimension, and the absolute-position embedding feature is obtained by position-encoding the absolute position index of the word in the sentence; S22, inputting the multi-source linguistic feature sequence into a bidirectional gated recurrent unit network for bidirectional context encoding to obtain a context-aware linguistic feature representation; S23, performing long-distance dependency modeling on the context-aware linguistic feature representation with a multi-head self-attention mechanism and adding a residual connection to the original feature to obtain an enhanced linguistic feature representation $h_i$; S24, inputting the enhanced linguistic feature representation into a multi-layer perceptron network to obtain the multi-layer perceptron network weight of each word, $w_i^{\mathrm{mlp}} = \mathrm{MLP}(h_i)$, wherein $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron network; S25, obtaining the part-of-speech prior weight $w_i^{\mathrm{pos}}$ of the $i$-th word, wherein $w_i^{\mathrm{pos}}$ is the preset importance weight associated, in the pre-built part-of-speech dictionary, with the part-of-speech category of the $i$-th word, and calculating the inverse document frequency weight of the $i$-th word, $w_i^{\mathrm{idf}} = \log\left(N / n_i\right)$, wherein $N$ is the total number of documents in the training corpus and $n_i$ is the number of documents containing the $i$-th word; S26, calculating the word importance weight value $\alpha_i$ of the $i$-th word from the multi-layer perceptron network weight, the part-of-speech prior weight and the inverse document frequency weight: $\alpha_i = \lambda_1 w_i^{\mathrm{mlp}} + \lambda_2 w_i^{\mathrm{pos}} + \lambda_3 w_i^{\mathrm{idf}}$, wherein $\lambda_1$, $\lambda_2$, $\lambda_3$ are learnable fusion coefficients that are adaptively updated by back-propagation during training and are normalized before fusion so that each coefficient is non-negative and they sum to 1.
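A hedged sketch of the word-importance weighting in claim 3: an MLP weight, a part-of-speech prior and an IDF weight are fused with learnable coefficients constrained to be non-negative and sum to 1. The MLP architecture, the sigmoid on its output and the helper names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordImportance(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim // 2), nn.ReLU(),
                                 nn.Linear(feat_dim // 2, 1))        # S24
        self.lmbda = nn.Parameter(torch.zeros(3))                    # fusion coefficients (S26)

    def forward(self, enhanced_feats, pos_prior, idf):
        # enhanced_feats: (L, D) from BiGRU + self-attention (S22-S23)
        # pos_prior, idf: (L,) looked up from the POS dictionary and corpus statistics (S25)
        w_mlp = torch.sigmoid(self.mlp(enhanced_feats)).squeeze(-1)  # (L,)
        lam = F.softmax(self.lmbda, dim=0)                           # non-negative, sums to 1
        return lam[0] * w_mlp + lam[1] * pos_prior + lam[2] * idf    # word importance (L,)

# IDF of word i: log(N / n_i), N = documents in the corpus, n_i = documents containing word i
def idf_weight(num_docs, doc_freq):
    return torch.log(torch.tensor(float(num_docs)) / doc_freq.clamp(min=1.0))
```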
- 4. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 1, wherein the expression of the joint importance tensor is $G_{i,d} = s_i \cdot c_d$, wherein $G$ denotes the joint importance tensor; $s_i$ denotes the spatial importance of the $i$-th region, output by a spatial gating network that takes as input the $i$-th image region feature spliced with the text global feature; and $c_d$ denotes the channel importance of the $d$-th feature dimension, output by a channel gating network that takes as input the image global feature spliced with the text global feature; wherein $s_i = \sigma\left(g_s([v_i; t_g])\right)$ and $c = \sigma\left(g_c([v_g; t_g])\right)$, in which $t_g$ denotes the text global feature, $v_g$ denotes the image global feature, $v_i$ is the feature vector of the $i$-th image region, $[\cdot;\cdot]$ denotes feature concatenation, $g_c$ is the channel gating network, $g_s$ is the spatial gating network, and $\sigma$ is the Sigmoid activation function.
- 5. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 4, wherein the text global feature $t_g$ is calculated as $t_g = W_p\left(\sum_{i=1}^{L} \alpha_i t_i\right)$, wherein $t_i$ is the text feature of the $i$-th word, $L$ is the sentence length, $W_p$ is a linear projection layer and $\alpha_i$ is the $i$-th word importance weight; and the image global feature $v_g$ is calculated as $v_g = \frac{1}{R}\sum_{j=1}^{R} v_j$, wherein $R$ is the total number of image regions.
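A sketch of claims 4-5, assuming PyTorch: the text global feature is a weighted, projected sum of word features, the image global feature is the mean of region features, and the joint importance tensor combines a spatial gate with a channel gate. Layer shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class JointImportance(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)          # W_p in claim 5
        self.spatial_gate = nn.Linear(2 * dim, 1)     # g_s in claim 4
        self.channel_gate = nn.Linear(2 * dim, dim)   # g_c in claim 4

    def forward(self, words, regions, alpha):
        # words: (L, dim), regions: (R, dim), alpha: (L,) word importance weights
        t_glob = self.text_proj((alpha.unsqueeze(-1) * words).sum(dim=0))   # claim 5
        v_glob = regions.mean(dim=0)                                        # claim 5
        s = torch.sigmoid(self.spatial_gate(
            torch.cat([regions, t_glob.expand_as(regions)], dim=-1)))       # (R, 1) spatial gate
        c = torch.sigmoid(self.channel_gate(torch.cat([v_glob, t_glob])))   # (dim,) channel gate
        return s * c                                                        # (R, dim) joint tensor
```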
- 6. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 1, wherein S40 comprises: S41, normalizing the word importance weights into a probability distribution $p_i = \alpha_i / \sum_{j} \alpha_j$ and calculating its information entropy as the semantic complexity $H = -\sum_{i} p_i \log p_i$, wherein a larger entropy value indicates more complex text semantics; S42, normalizing the semantic complexity to obtain $\hat{H} = (H - H_{\min}) / (H_{\max} - H_{\min})$ and determining, according to $\hat{H}$, the number of regions to retain $K_r = \lfloor K_{\min} + \hat{H}\,(K_{\max} - K_{\min}) \rfloor$ and the number of feature dimensions to retain $K_d = \lfloor D_{\min} + \hat{H}\,(D_{\max} - D_{\min}) \rfloor$, wherein $H_{\min}$ and $H_{\max}$ are respectively the minimum and maximum entropy values in the current batch, $K_{\max}$ and $K_{\min}$ are respectively the maximum and minimum numbers of regions allowed to be retained in the region dimension, $D_{\max}$ and $D_{\min}$ are respectively the maximum and minimum numbers of feature dimensions allowed to be retained, and $\lfloor \cdot \rfloor$ denotes rounding down; S43, averaging the joint importance tensor along the feature dimension to obtain the comprehensive importance weight $\bar{g}_i = \frac{1}{D}\sum_{d} G_{i,d}$ of the $i$-th region, giving the $K_r$ regions with the highest weights a weight of 1.0 and giving the remaining regions the attenuated weight $\gamma$, yielding the region soft mask $m^{r}_{i}$, wherein $m^{r}$ is the region soft mask, $\gamma$ is the attenuation coefficient and $D$ is the feature dimension; S44, averaging the joint importance tensor along the region dimension to obtain the comprehensive importance weight $\bar{g}_d = \frac{1}{R}\sum_{i} G_{i,d}$ of the $d$-th feature dimension, giving the $K_d$ feature dimensions with the highest weights a weight of 1.0 and giving the remaining dimensions the attenuated weight $\gamma$, yielding the feature-dimension soft mask $m^{c}_{d}$; S45, combining the region soft mask $m^{r}$ and the feature-dimension soft mask $m^{c}$ with the image region features to obtain the collaboratively screened enhanced image features $\hat{v}_{i,d} = m^{r}_{i}\, m^{c}_{d}\, v_{i,d}$, wherein $v_{i,d}$ is the value of the $d$-th feature dimension of the $i$-th region in the original image region features.
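A sketch of the entropy-guided collaborative selection in claim 6. The retention bounds, the attenuation coefficient gamma and the batch entropy statistics are passed in as assumed hyperparameters and inputs.

```python
import torch

def co_select(regions, G, alpha, h_min, h_max,
              k_region=(8, 36), k_dim=(256, 1024), gamma=0.1):
    # S41: entropy of the normalized word importance as semantic complexity
    p = alpha / alpha.sum()
    H = -(p * torch.log(p.clamp(min=1e-8))).sum()
    # S42: normalize within the batch and interpolate the retention counts
    h = ((H - h_min) / (h_max - h_min + 1e-8)).clamp(0.0, 1.0)
    R, D = regions.shape
    n_r = int(k_region[0] + h * (k_region[1] - k_region[0]))
    n_d = int(k_dim[0] + h * (k_dim[1] - k_dim[0]))
    # S43: region soft mask from the region-wise mean of the joint tensor
    region_score = G.mean(dim=1)                          # (R,)
    mask_r = torch.full((R,), gamma)
    mask_r[region_score.topk(min(n_r, R)).indices] = 1.0
    # S44: feature-dimension soft mask from the dimension-wise mean
    dim_score = G.mean(dim=0)                             # (D,)
    mask_d = torch.full((D,), gamma)
    mask_d[dim_score.topk(min(n_d, D)).indices] = 1.0
    # S45: combine both masks with the original region features
    return regions * mask_r.unsqueeze(1) * mask_d.unsqueeze(0)
```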
- 7. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 1, wherein S50 comprises: S51, calculating, based on the enhanced image features, the self-similarity matrix between image regions $A = \hat{V} V^{\top}$ and the cross-modal similarity matrix between text and image $C = T \hat{V}^{\top}$, wherein $V$ is the original image region feature matrix, $\hat{V}$ is the collaboratively screened enhanced image feature matrix and $T$ is the text word-level feature matrix; S52, applying learnable projective transformations to the self-similarity matrix and the cross-modal similarity matrix respectively and performing a matrix multiplication to obtain the cross-modal structure propagation features $P = \phi(C)\,\psi(A)$, wherein $\phi$ and $\psi$ are the learnable projective transformations; S53, calculating a statistic of the cross-modal structure propagation features, normalizing it as $z_i = (p_i - \mu)/\sigma_P$, wherein $\mu$ is the mean of the cross-modal structure propagation features and $\sigma_P$ is their standard deviation, and classifying the word-level features into a stable region, a boundary region and a noise region according to the absolute value of the normalized result: the $i$-th word belongs to the stable region $\Omega_s$ if $|z_i| < \tau_1$, to the boundary region $\Omega_b$ if $\tau_1 \le |z_i| < \tau_2$, and to the noise region $\Omega_n$ otherwise; S54, carrying out weighted fusion of the cross-modal structure propagation features and the text word-level feature sequence with different fusion coefficients for the three types of regions to obtain the enhanced text word-level feature sequence $\hat{t}_i = t_i + \beta_k P_i$ for $i \in \Omega_k$, wherein $t_i$ is the text word-level feature sequence, $\beta_s$, $\beta_b$, $\beta_n$ are the fusion coefficients of the three types of regions and $\tau_1$, $\tau_2$ are the region division thresholds.
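A sketch of the segment-wise fusion in claim 7. The exact forms of the similarity matrices, the projection shapes, the thresholds and the per-group fusion coefficients are assumptions consistent with, but not fixed by, the claim wording.

```python
import torch
import torch.nn as nn

class SegmentFusion(nn.Module):
    def __init__(self, n_regions, dim, tau1=0.5, tau2=1.5, betas=(1.0, 0.5, 0.1)):
        super().__init__()
        self.proj_self = nn.Linear(n_regions, n_regions)   # learnable projections (S52)
        self.proj_cross = nn.Linear(n_regions, n_regions)
        self.to_word = nn.Linear(n_regions, dim)            # map propagated scores to word dim
        self.tau1, self.tau2, self.betas = tau1, tau2, betas

    def forward(self, words, regions, regions_hat):
        # S51: region self-similarity and word-region cross-modal similarity
        A = regions_hat @ regions.t()                        # (R, R)
        C = words @ regions_hat.t()                          # (L, R)
        # S52: project both matrices and multiply to propagate structure to the text side
        P = self.to_word(self.proj_cross(C) @ self.proj_self(A))   # (L, dim)
        # S53: normalize a per-word statistic and split words into stable / boundary / noise
        stat = P.mean(dim=1)
        z = (stat - stat.mean()) / (stat.std() + 1e-8)
        b = torch.tensor(self.betas)
        beta = torch.where(z.abs() < self.tau1, b[0],
                           torch.where(z.abs() < self.tau2, b[1], b[2]))
        # S54: group-specific weighted fusion with the original word features
        return words + beta.unsqueeze(1) * P
```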
- 8. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 1, wherein S60 comprises: S61, calculating the word-region attention matrix $A^{kj}$ of the $k$-th text and the $j$-th image, $A^{kj} = \hat{T}^{(k)} (V^{(j)})^{\top}$, wherein $\hat{T}^{(k)}$ is the enhanced text word-level feature sequence of the $k$-th text and $V^{(j)}$ are the features of the $j$-th image; S62, averaging the attention matrix over the word dimension to obtain the similarity of the $k$-th text and the $j$-th image, $S_{kj} = \frac{1}{L_k}\sum_{u=1}^{L_k} f\left(A^{kj}_{u}\right)$, wherein $A^{kj}_{u}$ is the $u$-th row vector of the attention matrix $A^{kj}$, representing the attention weight vector of the $u$-th word of the $k$-th text over all regions of the $j$-th image, $L_k$ is the length of the $k$-th text, and $f(\cdot)$ is the attention aggregation function, which includes positive and negative similarity weighting; and S63, sorting the candidate images or candidate texts according to the similarity and outputting the image-text matching result.
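A sketch of the pairwise similarity in claim 8. The cosine word-region attention and the sign-weighted aggregation shown here stand in for the patent's attention aggregation function with positive/negative similarity weighting, whose exact form is not specified.

```python
import torch.nn.functional as F

def pair_similarity(words_hat, regions):
    # words_hat: (L, D) enhanced word features of one text
    # regions:   (R, D) region features of one image
    w = F.normalize(words_hat, dim=-1)
    r = F.normalize(regions, dim=-1)
    attn = w @ r.t()                               # (L, R) word-region attention matrix (S61)
    per_word = attn.max(dim=1).values              # best-matching region score for each word
    pos = per_word.clamp(min=0).sum()              # reward positive word-region evidence
    neg = per_word.clamp(max=0).sum()              # penalize negative evidence more mildly
    return (pos + 0.5 * neg) / words_hat.shape[0]  # S62: word-averaged, sign-weighted similarity
```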
- 9. The multi-level collaborative selection image-text matching method based on linguistic guidance according to claim 8, wherein S60 further comprises: optimizing the image-text matching result with a bidirectional ranking loss function, the expression of which is $\mathcal{L} = \mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}$, wherein $\mathcal{L}_{v \to t} = \sum_{\hat{t}} \max\left(0,\ \delta - S_{kk} + S(v_k, \hat{t})\right)$ is the image-to-text retrieval loss and $\mathcal{L}_{t \to v} = \sum_{\hat{v}} \max\left(0,\ \delta - S'_{kk} + S(t_k, \hat{v})\right)$ is the text-to-image retrieval loss, in which $\delta$ is the margin hyper-parameter, $S_{kk}$ denotes the similarity between the $k$-th image and its corresponding matching text, $S'_{kk}$ denotes the similarity between the $k$-th text and its corresponding matching image, and $\hat{t}$, $\hat{v}$ denote non-matching texts and images.
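A sketch of the bidirectional ranking loss in claim 9, using hardest in-batch negatives; the claim fixes only the two-direction margin-hinge form, so the negative-mining strategy here is an assumption.

```python
import torch

def bidirectional_ranking_loss(sim, margin=0.2):
    # sim: (B, B) similarity matrix; sim[i, i] is the matched image-text pair
    pos = sim.diag()
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    neg_inf = torch.finfo(sim.dtype).min
    # image -> text: hardest negative caption for each image (row-wise)
    hard_t = sim.masked_fill(mask, neg_inf).max(dim=1).values
    # text -> image: hardest negative image for each caption (column-wise)
    hard_v = sim.masked_fill(mask, neg_inf).max(dim=0).values
    loss_i2t = (margin - pos + hard_t).clamp(min=0)
    loss_t2i = (margin - pos + hard_v).clamp(min=0)
    return (loss_i2t + loss_t2i).mean()
```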
- 10. A computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the steps of the multi-level collaborative selection image-text matching method based on linguistic guidance according to any one of claims 1-9.
Description
Multi-level collaborative selection image-text matching method based on linguistic guidance and computer system

Technical Field

The application relates to the technical field of computer vision, and in particular to a multi-level collaborative selection image-text matching method based on linguistic guidance and a computer system.

Background

Image-text cross-modal retrieval, i.e. image-text matching, aims to map images and natural-language descriptions into a comparable common representation space, so that functions such as retrieving images from text can be realized. Conventional image-text matching methods usually adopt a dual-encoder structure with an image side and a text side: the image side extracts a global vector representation, the text side encodes the sentence into a global vector representation, a matching score is then computed with cosine similarity, and the model is trained with a margin-based contrastive or ranking loss. Such methods retrieve efficiently and make it easy to build an offline database and perform fast recall. However, because each image and sentence is compressed into a single vector, fine-grained semantics are easily weakened during compression, and when matching texts of different complexity in practice, pairs that are semantically similar but differ in detail are easily still judged to be highly similar. In view of this, the application provides a multi-level collaborative selection image-text matching method based on linguistic guidance: on the one hand, linguistic guidance steers the cross-modal processing on the text side, improving compatibility with image-text matching at different text complexities; on the other hand, an innovative multi-level collaborative selection mechanism suppresses irrelevant regions and redundant feature dimensions, improving the fine-grained discrimination capability of the model.

Disclosure of Invention

The main purpose of the application is to provide a multi-level collaborative selection image-text matching method based on linguistic guidance, which aims to solve the problem of how to reduce the risk of misjudgment when matching images against texts of different complexity.
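The dual-encoder baseline described in the Background reduces each image and sentence to a single global vector and scores them by cosine similarity; a minimal sketch of that scoring step (function and argument names assumed) is:

```python
import torch.nn.functional as F

def dual_encoder_scores(image_vecs, text_vecs):
    # image_vecs: (N, D) global image vectors; text_vecs: (M, D) global sentence vectors
    img = F.normalize(image_vecs, dim=-1)
    txt = F.normalize(text_vecs, dim=-1)
    return img @ txt.t()   # (N, M) cosine matching scores, used with margin-based ranking training
```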
In order to achieve the above purpose, the application provides a multi-level collaborative selection image-text matching method based on linguistic guidance, which comprises the following steps: S10, respectively preprocessing and encoding an input image and an input text to obtain an image region feature vector set and a text word-level feature sequence; S20, generating word importance weight values for the words in the input text according to the text word-level feature sequence, wherein the word importance weight values are obtained by weighting preset word-level prior weights, inverse document frequency weights and multi-layer perceptron network weights generated from the text word-level feature sequence; S30, carrying out weighted aggregation of the text word-level feature sequence according to the word importance weight values and mapping it to the visual space to generate a text global feature, calculating an image global feature corresponding to the image region feature vector set, splicing the text global feature with the image global feature and with the image region feature vectors respectively, and obtaining a joint importance tensor after gating-network learning; S40, screening enhanced image features from the image region feature vector set based on the joint importance tensor and the word importance weight values; S50, carrying out a segment-wise fusion update of the text word-level feature sequence according to the enhanced image features to obtain an enhanced text word-level feature sequence; S60, constructing an image-text similarity matrix based on the enhanced text word-level feature sequence and the image region feature vector set, and performing image-text matching based on the image-text similarity matrix. Optionally, in S10, the preprocessing and encoding comprises: S11, performing word segmentation and indexing on the input text to obtain a word identifier sequence of the text and its length information, and padding the word identifier sequence to a preset maximum length; S12, performing word-vector embedding on the padded word identifier sequence and performing context modeling on the embedded sequence with a context encoding network to obtain the text word-level feature sequence; and S13, determining candidate salient regions in the input image, extracting visual features from each salient region, and mapping the extracted visual features to a preset feature dimension to obtain a preset number of image region feature vectors.