CN-116628246-B - Image-text retrieval method based on semantic filtering and self-adaptive adjustment
Abstract
The invention relates to the technical field of image-text retrieval, in particular to an image-text retrieval method based on semantic filtering and self-adaptive adjustment. The method comprises: extracting features from an input image and an input sentence with a feature representation module to obtain feature representations of each image region and each word; performing image-text matching in both the text-to-image and image-to-text directions with a filtering attention module, while filtering out irrelevant image-region and word feature representations to obtain a global image region-word correlation representation; and guiding image region-word alignment in a cyclic manner with a self-adaptive adjustment module, gradually optimizing and updating the alignment weights between image regions and words to obtain a global image-text similarity score. The invention reduces the interference of irrelevant samples on correlation learning and highlights the key information in the data that deserves the most focus.
Inventors
- JIN RAN
- Hou Tengda
- JIN TAO
- YUAN JIE
- GU XIAOZHE
Assignees
- Zhejiang Wanli University (浙江万里学院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-06-29
Claims (7)
- 1. An image-text retrieval method based on semantic filtering and self-adaptive adjustment, characterized by comprising the following steps: constructing an adaptive attention filtering model, wherein the adaptive attention filtering model comprises a feature representation module, a filtering attention module and an adaptive adjustment module; respectively extracting the features of the input image and the input sentence based on the feature representation module to obtain an image feature representation of each image region and a text feature representation of each word; based on the filtering attention module, performing image-text matching in the text-to-image and image-to-text directions, and simultaneously filtering out irrelevant image-region feature representations and word feature representations to obtain a global image region-word correlation representation; guiding image region-word alignment in a cyclic manner based on the self-adaptive adjustment module, and gradually optimizing and updating the alignment weights between image regions and words to obtain a global image-text similarity score; in the filtering attention module, for the text-to-image direction, fixing each word as a shared semantic, matching each word to its related image regions, and filtering out irrelevant image regions to obtain the related image-region features of each word; obtaining the global image region-word correlation representation based on the filtered related image-region features and the filtered related word features; for the text-to-image direction, the step of the filtering attention module matching each word to its related image regions comprises: for each word's text feature, computing the cosine similarity between the word and each image region one by one to obtain the similarity values of that word for the different image regions; normalizing the obtained similarity values to the [0,1] interval through a softmax activation function, thereby completing the pre-allocation of attention scores; comparing the relative importance between two image regions: if the pre-allocated attention score of the compared image region is larger than that of the image region it is compared with, the score of the compared image region is greater than 0 and it is a relevant image region, otherwise it is an irrelevant image region; according to the calculation of the function H, the image regions with scores smaller than 0 are set to 0 and the image regions with scores larger than 0 are set to 1, yielding a reassigned attention matrix; multiplying the image features of each image region by the new attention matrix, and filtering out irrelevant image regions to obtain the relevant image regions of each word; for the image-to-text direction, the step of the filtering attention module matching each image region to its relevant words is analogous to the text-to-image process.
- 2. The image-text retrieval method based on semantic filtering and adaptive adjustment according to claim 1, wherein extracting features of the input image based on the feature representation module to obtain image feature representations of different regions comprises: extracting the image features of each image region of the input image based on a pre-trained Faster R-CNN model; mapping the image features into d-dimensional vectors through a fully connected layer to generate the feature representations of the local regions; and applying a self-attention mechanism to the local regions, wherein the self-attention mechanism uses the average feature as the query and aggregates the feature representations of all image regions to generate a global image feature representation of the input image.
- 3. The image-text retrieval method based on semantic filtering and adaptive adjustment according to claim 1, wherein extracting features of the input text based on the feature representation module to obtain the text feature representation of each word comprises: splitting the input sentence into a plurality of words, and feeding the words in order into a Bi-GRU model; the text feature representation of each word is obtained by averaging the forward and backward hidden states at each time step.
- 4. The image-text retrieval method based on semantic filtering and adaptive adjustment according to claim 1, wherein for the text-to-image direction the pre-allocated attention score is calculated as: a_ij = softmax_j(λ u_i^T v_j), wherein softmax(·) represents the softmax activation function; λ is a scaling factor that further increases the gap between related and unrelated image regions; u_i represents the text feature of the i-th word; T represents the transpose; v_j denotes the image feature of the j-th image region; m denotes that the input sentence contains m words in total; and n denotes that the input image contains n image regions in total. The relative importance between two image regions is given by: v_ij^t = a_ij − a_it, wherein v_ij^t represents, for the i-th word, the relative attention of the j-th image region with respect to the t-th image region, and a_it represents the confidence of the compared image region, the confidence score of the t-th region being set equal to its correlation with the i-th word; the overall relative score of the j-th region is v_ij = Σ_{t=1}^{n} v_ij^t. The reassigned attention matrix ā is expressed as: ā_ij = H(v_ij) · a_ij, wherein H(v_ij) indicates whether the j-th image region selected based on the i-th word is correlated, being 1 if correlated and 0 if uncorrelated. For the i-th word, the matched relevant image-region feature is: v_i' = Σ_{j=1}^{n} ā_ij v_j. The global image-text correlation is expressed as: S(u, v) = (1/m) Σ_{i=1}^{m} R(u_i, v_i'), wherein R(·,·) represents a relevance function giving the relevance score between the u and v modalities.
- 5. The image-text retrieval method based on semantic filtering and adaptive adjustment according to claim 4, wherein for the image-to-text direction, the step of the filtering attention module matching each image region to its related words comprises: for the image feature of each image region, computing the cosine similarity between the image region and each word one by one to obtain the similarity values of that image region for the different words; normalizing the obtained similarity values to the [0,1] interval through a softmax activation function, thereby completing the pre-allocation of attention scores; comparing the relative importance between two words: if the pre-allocated attention score of the compared word is greater than that of the word it is compared with, the score of the compared word is greater than 0 and it is a related word, otherwise it is an irrelevant word; according to the calculation of the function H, the words with scores smaller than 0 are set to 0 and the words with scores larger than 0 are set to 1, yielding a reassigned attention matrix; and multiplying the text feature of each word by the new attention matrix, filtering out irrelevant words to obtain the related word features of each image region.
- 6. The image-text retrieval method based on semantic filtering and adaptive adjustment according to claim 5, wherein for the image-to-text direction, the pre-allocated attention score is calculated as: a'_ji = softmax_i(λ v_j^T u_i). The relative importance between two words is given by: u_ji^t = a'_ji − a'_jt, wherein u_ji^t represents, for the j-th image region, the relative attention of the i-th word with respect to the t-th word, and a'_jt represents the confidence of the compared word; the overall relative score of the i-th word is u_ji = Σ_{t=1}^{m} u_ji^t. The reassigned attention matrix ā' is expressed as: ā'_ji = H(u_ji) · a'_ji, wherein H(u_ji) indicates whether the i-th word matched with the j-th image region is relevant, being relevant when the value is 1 and irrelevant when the value is 0. For the j-th image region, the matched relevant word feature is: u_j' = Σ_{i=1}^{m} ā'_ji u_i. The global image-text correlation is expressed as: S(u, v) = (1/n) Σ_{j=1}^{n} R(v_j, u_j'), wherein R(v_j, u_j') represents the similarity between an image region in the image and its corresponding words in the sentence.
- 7. The image-text retrieval method based on semantic filtering and adaptive adjustment according to claim 1, wherein the overall objective function of the adaptive attention filtering model is: L = [α − S(u, v) + S(u, v̂)]_+ + [α − S(u, v) + S(û, v)]_+, wherein S(·,·) represents a similarity function for calculating the similarity score between the two modalities; (u, v) represents an aligned image region-word positive example; û and v̂ represent the hard negative word example and the hard negative image-region example, respectively; α is the margin parameter controlling the difference between positive and negative pairs, the similarity of a positive pair being required to be higher than the similarity of a negative pair; and the function [x]_+ is equivalent to max(x, 0).
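Claim 7's objective reads as a bidirectional hinge (triplet) loss over hard negatives, with [x]_+ = max(x, 0) and margin α. Below is a minimal NumPy sketch under that reading, not the patent's implementation; the batch similarity-matrix formulation and hardest-negative mining are assumptions.

```python
import numpy as np

def triplet_loss_hard(S, alpha=0.2):
    """Bidirectional hinge loss with hard negatives (sketch).

    S: (B, B) similarity matrix for a batch, where S[i, i] is the
    matched (positive) image-text pair. For each positive pair, the
    hardest negative sentence and hardest negative image are used.
    """
    B = S.shape[0]
    pos = np.diag(S)                        # similarities of positive pairs
    off = S - np.eye(B) * 1e9               # exclude positives from negatives
    hard_text = off.max(axis=1)             # hardest negative sentence per image
    hard_img = off.max(axis=0)              # hardest negative image per sentence
    # [alpha - S(u, v) + S(u, v_hat)]_+ + [alpha - S(u, v) + S(u_hat, v)]_+
    loss = (np.maximum(0.0, alpha - pos + hard_text)
            + np.maximum(0.0, alpha - pos + hard_img))
    return loss.mean()
```

With a well-separated similarity matrix the loss vanishes; when a negative pair scores higher than its positive, the margin violation is penalized on both retrieval directions.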
Description
Image-text retrieval method based on semantic filtering and self-adaptive adjustment
Technical Field
The invention relates to the technical field of image-text retrieval, in particular to an image-text retrieval method based on semantic filtering and self-adaptive adjustment.
Background
Image-text matching refers to measuring the semantic similarity between an image and a text, which is increasingly important for various vision-and-language tasks. Humans judge whether an image and a sentence are similar by attending to the objects (or regions) of interest. For example, given the sentence "a dog running on green grass near a wooden fence", when the words are associated with the corresponding picture, humans recall the image through keywords such as "dog", "wooden fence", "green grass" and "running". At present, image-text retrieval mostly adopts methods based on an attention mechanism, but current attention-based methods ignore the semantic associations within a single modality. In addition, they lack effective filtering of irrelevant information in the image data, and this irrelevant information interferes with the matching of image and text. Therefore, much subsequent work has proposed the idea of semantic filtering: irrelevant region-word pairs are filtered out before similarity reasoning, so that data within each modality and across modalities can interact better, further improving matching accuracy. However, among the aligned pairs that remain after semantic filtering, some are only weakly correlated with the subject; although they play a role in the matching process, their importance is relatively lower than that of other key elements, so they waste part of the resources, yet they cannot be discarded entirely.
Therefore, how to further mine the deep semantics between data pairs and focus on the semantics within the data, so as to achieve more accurate cross-modal matching results, has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides an image-text retrieval method based on semantic filtering and self-adaptive adjustment, which can reduce the interference of irrelevant samples on correlation learning and can highlight the key information in the data that deserves the most focus. In order to achieve the above purpose, the present invention adopts the following technical scheme. An image-text retrieval method based on semantic filtering and self-adaptive adjustment comprises the following steps: constructing an adaptive attention filtering model, wherein the adaptive attention filtering model comprises a feature representation module, a filtering attention module and an adaptive adjustment module; respectively extracting the features of the input image and the input sentence based on the feature representation module to obtain an image feature representation of each image region and a text feature representation of each word; based on the filtering attention module, performing image-text matching in the text-to-image and image-to-text directions, and simultaneously filtering out irrelevant image-region feature representations and word feature representations to obtain a global image region-word correlation representation; and guiding image region-word alignment in a cyclic manner based on the self-adaptive adjustment module, gradually optimizing and updating the alignment weights between image regions and words to obtain the global similarity score of the image and the text.
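The filtering attention module's text-to-image steps (cosine similarity, softmax pre-allocation of attention scores, relative-importance comparison between image regions, the step function H, and masking out irrelevant regions) can be sketched as follows. This is a hedged NumPy illustration rather than the patent's implementation: the scaling factor `lam` and the use of the per-word mean attention as the comparison baseline are assumptions not fixed by the text.

```python
import numpy as np

def filter_attention_t2i(U, V, lam=9.0):
    """Text-to-image filtering attention (sketch).

    U: (m, d) word text features; V: (n, d) image-region features.
    Returns one filtered, attended region feature per word, (m, d).
    """
    # 1. Cosine similarity between every word and every image region.
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    sim = Un @ Vn.T                                   # (m, n)
    # 2. Softmax (scaled by lam) pre-allocates attention scores in [0, 1].
    e = np.exp(lam * sim)
    a = e / e.sum(axis=1, keepdims=True)              # (m, n)
    # 3. Relative importance: a region's score is positive when its
    #    pre-allocated attention exceeds the baseline (assumed: row mean).
    rel = a - a.mean(axis=1, keepdims=True)
    # 4. Step function H: scores < 0 -> 0, scores > 0 -> 1,
    #    giving the reassigned attention matrix.
    H = (rel > 0).astype(float)
    # 5. Mask out irrelevant regions and aggregate the survivors.
    masked = a * H
    masked = masked / np.maximum(masked.sum(axis=1, keepdims=True), 1e-12)
    return masked @ V                                 # relevant-region feature per word
```

The image-to-text direction is symmetric: fix each image region, pre-allocate attention over the words, and mask out the words whose relative score falls below zero.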
Further, extracting features of the input image based on the feature representation module to obtain image feature representations of different regions comprises: extracting the image features of each image region of the input image based on a pre-trained Faster R-CNN model; mapping the image features into d-dimensional vectors through a fully connected layer to generate the feature representations of the local regions; and applying a self-attention mechanism to the local regions, wherein the self-attention mechanism uses the average feature as the query and aggregates the feature representations of all image regions to generate a global image feature representation of the input image. Further, extracting features of the input text based on the feature representation module to obtain the text feature representation of each word comprises: splitting the input sentence into a plurality of words, and feeding the words in order into a Bi-GRU model; the text feature representation of each word is obtained by averaging the forward and backward hidden states at each time step. Further, in the filtering attention module, for the text-to-image direction, each word is fixed as a shared semantic, each word is matched to its related image regions, and irrelevant image regions are filtered out to obtain the related image-region features of each word.
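The feature representation module described above can be sketched as follows, assuming pre-extracted Faster R-CNN region features and pre-computed Bi-GRU hidden states. All dimensions and the projection matrix `W` are hypothetical stand-ins for learned parameters; this is a minimal illustration, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_regions(region_feats, W):
    """Image branch (sketch): project region features to d dimensions
    with a fully connected layer W, then pool them with a self-attention
    whose query is the average projected feature."""
    X = region_feats @ W                      # FC layer: (n, d_in) -> (n, d)
    q = X.mean(axis=0)                        # average feature as the query
    scores = X @ q / np.sqrt(X.shape[1])      # scaled dot-product attention
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return X, w @ X                           # local features, global image feature

def encode_words(h_forward, h_backward):
    """Text branch (sketch): a word's text feature is the average of the
    Bi-GRU forward and backward hidden states at its time step."""
    return (h_forward + h_backward) / 2.0

# Hypothetical sizes: 36 regions of 2048-dim features projected to d = 256,
# and a 5-word sentence with 256-dim GRU hidden states.
feats = rng.standard_normal((36, 2048))
W = rng.standard_normal((2048, 256)) * 0.02
local, global_img = encode_regions(feats, W)
words = encode_words(np.ones((5, 256)), np.zeros((5, 256)))
```

The local region features and per-word text features produced here are the inputs the filtering attention module consumes; the global image feature summarizes the whole image.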