CN-121982746-A - Image block replacement and cross-modal identity alignment-based image-text pedestrian retrieval method

CN121982746A

Abstract

The invention belongs to the fields of computer vision and cross-modal retrieval, and relates to an image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment. The method obtains a public dataset of images and text descriptions, constructs a pedestrian re-identification model PRCIA, inputs the dataset into the PRCIA model for training and verification, and iteratively updates the training weight file through forward and backward propagation to obtain a trained PRCIA model. An inference-stage model is then constructed that retains the dual encoder and fuses global features to preserve computational efficiency; the test set is input into this model to obtain retrieval results, realizing text-based pedestrian detection. The PR module establishes fine-grained associations between image blocks and text phrases, and the CIA module strengthens cross-modal identity feature expression, so that the accuracy and robustness of text-to-image pedestrian retrieval are markedly improved and the method better meets the demands of real-world scenarios.

Inventors

  • GENG XIA
  • JIA XIBEI
  • YANG ZHI

Assignees

  • Jiangsu University (江苏大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-27

Claims (6)

  1. An image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment, characterized by comprising the following steps:
S1, acquiring a public dataset containing images and corresponding text descriptions, and preprocessing it;
S2, constructing a CLIP-based pedestrian re-identification model PRCIA; the PRCIA model comprises a backbone network, a PR module, and a CIA module. The backbone network comprises a dual encoder together with an identity loss, a semantic alignment loss, and a masked language modeling loss; the dual encoder comprises an image encoder and a text encoder; the identity loss performs intra-class clustering of identity features within a single modality, the semantic alignment loss optimizes cross-modal feature distribution matching, and the masked language modeling loss enhances the model's fine-grained understanding of text and image details. The PR module comprises a cross-modal fusion encoder, which fuses image block features and text features, and a PR head, which judges whether an image block has been replaced. The CIA module comprises a cross-modal fusion encoder, which deeply interacts image global features with text global features, and a CIA head, which outputs pedestrian identity classification results;
S3, dividing the preprocessed dataset proportionally into a training set, a validation set, and a test set;
S4, inputting the training set and validation set into the PRCIA model for training and verification, iteratively updating the training weight file through forward and backward propagation, and stopping training after the maximum number of iterations is reached to obtain the trained PRCIA model;
S5, constructing an inference-stage model and loading the optimal trained weight file; the inference-stage architecture is adjusted by removing the PR module and its classification head, which are used only during training, while retaining the trained dual encoder;
S6, inputting the test set into the inference-stage model and obtaining image-text pedestrian retrieval results through similarity calculation between text features and image features, thereby realizing text-based pedestrian detection.
  2. The image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment according to claim 1, wherein the dual encoder consists of an image encoder and a text encoder whose weights are inherited directly from pre-trained CLIP and updated end-to-end during subsequent training; the image encoder adopts CLIP-ViT-B/16 and comprises 12 Transformer layers, each containing multi-head attention and an FFN, and the text encoder adopts the CLIP text Transformer and likewise comprises 12 layers.
  3. The image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment according to claim 1, wherein the PR module performs a random patch replacement operation on the image and then judges the authenticity of each image block through the text information, establishing correspondences between image blocks and words without introducing semantic tags; the replacement operation comprises the following steps:
acquiring an image set with batch size B from the public dataset; for each image i ∈ {1, 2, ..., B} in the batch, deciding with probability P whether patch replacement is performed on image i, modeled as a Bernoulli trial: x_i ~ Bernoulli(P), where x_i is an indicator variable for whether image i undergoes replacement;
if x_i = 1, then for each patch k ∈ {1, 2, ..., p}, where p is the total number of image blocks in a single image excluding the [CLS] feature, deciding with probability R whether to replace it: y_{i,k} ~ Bernoulli(R), where y_{i,k} indicates whether the k-th patch of image i is replaced; if y_{i,k} = 1, another image j (j ≠ i) is drawn uniformly from the batch and the k-th patch feature of image j replaces the k-th patch feature of image i;
labeling each patch position, with 0 for a true patch and 1 for a replaced patch; the loss function of the PR module is
L_PR = -(1/(N·p)) Σ_{i=1}^{N} Σ_{j=1}^{p} Σ_{k∈{0,1}} y_{i,j,k} · log softmax(z_{i,j})_k,
where N is the total number of images in the batch participating in the replacement operation, p is the number of patches per image, y_{i,j,k} is the one-hot real label of the j-th patch of the i-th image, z_{i,j} is the model's logits for the j-th patch of image i, and softmax(z_{i,j})_k is the predicted probability that the patch belongs to class k (0 or 1).
  4. The image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment according to claim 1, wherein the cross-modal fusion encoder consists of a multi-head cross-attention layer and M Transformer blocks, M being a natural number, and the workflow of the fusion attention module comprises the following steps:
inputting an image v and dividing it into a series of equal-sized, non-overlapping image blocks; replacing part of the image blocks with the same-position patches of a random image in the batch through the PR module's replacement operation, with replacement ratio r, so that the total number of replaced image blocks is r·B·P, where B is the batch size and P is the number of image blocks in a single image;
inputting the divided and replaced image blocks into the image encoder of the dual encoder to obtain the replaced image feature representation F_v and the text feature representation F_t, where L is the length of the input text;
feeding F_v and F_t into the cross-modal fusion encoder, with the image block features F_v serving as the query Q and the text features F_t serving as the key K and the value V;
after layer normalization of Q, K, and V, applying the multi-head cross-attention (MCA) layer for feature interaction and deep feature fusion to obtain the fused context representation H = MCA(LN(Q), LN(K), LN(V)), where H is the context representation of the fused text and replaced image, LN(·) denotes layer normalization, and MCA(·) denotes multi-head cross-attention, computed as
Attention(Q, K, V) = softmax(QK^T / √d) V,
where d is the embedding dimension of the replaced tokens and QK^T gives the pairwise similarity between image blocks and text tokens.
  5. The image block replacement and cross-modal identity alignment-based image-text pedestrian retrieval method according to claim 1, wherein the workflow of the CIA module is as follows:
feature input and fusion: inputting the replaced image features and the text features into the cross-modal fusion encoder and, through multi-head cross-attention and Transformer block processing, generating a fused feature matrix F containing image-text interaction information, where D is the feature dimension;
global feature pooling: applying a pooling operation to F to remove the local feature dimension, obtaining a global fused feature matrix;
classification: a classification head composed of a fully connected layer maps the global fused features to the pedestrian-ID dimension, obtaining a prediction matrix G, where N is the total number of pedestrian IDs; the classification head enables the model to output, for each sample, a probability distribution over pedestrian IDs;
during training, the difference between G and the real pedestrian ID labels is measured with a cross-entropy loss:
L_CIA = -(1/B) Σ_{i=1}^{B} Σ_{k=1}^{N} q_{i,k} · log p_{i,k},
where q_{i,k} is the real label of sample i with respect to pedestrian ID k and p_{i,k} is the predicted probability.
  6. The image block replacement and cross-modal identity alignment-based image-text pedestrian retrieval method according to claim 1, wherein the loss calculation and total loss optimization flow is as follows:
identity loss: the global feature f_i^v output by the image encoder and the global feature f_i^t output by the text encoder are each fed into an identity classifier with shared weights, composed of a fully connected layer and a Softmax layer, to obtain predicted identity probability distributions; the cross-entropy loss between the predicted distributions and the real identity label y is
L_id = -(1/B) Σ_{i=1}^{B} [ log p(y_i | f_i^v) + log p(y_i | f_i^t) ],
where f_i^v and f_i^t are the image and text global CLS features of the i-th sample, and p(y_i | ·) is the classifier's predicted probability that the sample belongs to its true identity;
semantic alignment loss: computing the cosine similarity matrix S between the image encoder outputs and the text encoder outputs within a batch, divided by a temperature parameter τ: S_{i,j} = (f_i^v · f_j^t) / (‖f_i^v‖ ‖f_j^t‖ τ); applying Softmax normalization to S to obtain the predicted matching distribution p_{i,j}, and constructing the target distribution q_{i,j}, i.e. the real labels, where the text corresponding to an image is the positive sample with probability q = 1 and the rest are negative samples with probability q = 0; the loss is the KL divergence between p and q:
L_sdm = Σ_{i} Σ_{j} p_{i,j} · log( p_{i,j} / (q_{i,j} + ε) ),
where p_{i,j} is the predicted probability that the i-th image matches the j-th text, q_{i,j} is the real label, and ε avoids division by zero;
masked language modeling loss: performing a random masking operation on the input text T to generate the masked text T', feeding T' into the text encoder and performing cross-modal interaction with the image features to obtain fused features; the original words at the masked positions are predicted through an MLM prediction head (a fully connected layer), and the cross-entropy loss between the predicted word distribution and the real words is
L_mlm = -(1/|M|) Σ_{t∈M} log p_t(w_t),
where M is the set of masked tokens, w_t is the real word at position t, and p_t(w_t) is the model's predicted probability of the correct word at that position;
total loss: the final total loss function for back-propagation is obtained by weighting and summing all the above loss functions:
L_total = L_id + λ_1 · L_sdm + λ_2 · L_mlm + λ_3 · L_PR + λ_4 · L_CIA,
where the λ coefficients are the respective loss weights.
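The inference stage of claim 1 (step S6) reduces to ranking gallery images by the cosine similarity of their global features to a text query's global feature. The following is a minimal NumPy sketch under assumed shapes and L2-normalized cosine similarity; the patent does not fix these implementation details:

```python
import numpy as np

def retrieve(text_feat, image_feats):
    """Rank gallery images for one text query by cosine similarity.

    text_feat: (D,) global text embedding; image_feats: (N, D) gallery
    embeddings. Returns (indices sorted best-to-worst, raw similarities).
    Shapes and names are illustrative; the patent's dual encoder (CLIP
    image and text towers) would produce these global features.
    """
    t = text_feat / np.linalg.norm(text_feat)
    g = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sims = g @ t                       # cosine similarity per gallery image
    return np.argsort(-sims), sims
```

In practice the gallery features would be precomputed once, so each text query costs only one matrix-vector product and a sort.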
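The two-level Bernoulli sampling of claim 3 (an image-level draw with probability P, then a patch-level draw with probability R) can be sketched as below; the feature shapes, parameter names, and loop-based batch handling are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def replace_patches(patches, P=0.5, R=0.3, rng=None):
    """Random patch replacement (PR) sketch.

    patches: (B, P_num, D) patch features excluding [CLS]. For each image i,
    a Bernoulli(P) draw decides whether it undergoes replacement at all;
    for each of its patches, a Bernoulli(R) draw swaps in the same-position
    patch from a uniformly chosen other image j != i. Returns the replaced
    features and 0/1 labels (0 = true patch, 1 = replaced patch).
    """
    rng = np.random.default_rng() if rng is None else rng
    B, P_num, _ = patches.shape
    out = patches.copy()
    labels = np.zeros((B, P_num), dtype=np.int64)
    for i in range(B):
        if rng.random() >= P:            # image-level Bernoulli(P) draw
            continue
        for k in range(P_num):
            if rng.random() < R:         # patch-level Bernoulli(R) draw
                j = rng.choice([x for x in range(B) if x != i])
                out[i, k] = patches[j, k]   # swap in same-position patch
                labels[i, k] = 1
    return out, labels
```

The returned labels are exactly the supervision targets the PR head is trained against with the cross-entropy loss L_PR.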
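The core attention computation of claim 4, softmax(QK^T/√d)V, can be illustrated with a single-head NumPy sketch; a real MCA layer would add learned projections, multiple heads, and the layer normalization of Q, K, V, all omitted here for brevity:

```python
import numpy as np

def cross_attention(q_img, kv_txt, d=None):
    """Single-head cross-attention: image patches attend to text tokens.

    q_img: (P, D) replaced image-patch features used as queries Q;
    kv_txt: (L, D) text features used as both keys K and values V.
    Returns the fused context representation of shape (P, D).
    """
    D = q_img.shape[1] if d is None else d
    scores = q_img @ kv_txt.T / np.sqrt(D)       # QK^T/sqrt(d): patch-token similarity
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over text tokens
    return attn @ kv_txt                         # weighted sum of values V
```

Each output row is a convex combination of text-token features, which is what lets the PR head judge a patch's authenticity from the text it attends to.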
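Claim 5's pool-then-classify flow (global pooling of the fused matrix, a fully connected ID head, cross-entropy against the real pedestrian IDs) can be sketched as follows; the shapes and the raw weight matrix standing in for the CIA head are illustrative assumptions:

```python
import numpy as np

def cia_identity_loss(fused, labels, W, b):
    """CIA module sketch: pool fused features, classify pedestrian ID, CE loss.

    fused: (B, T, D) fused feature matrix from the cross-modal encoder;
    labels: (B,) integer pedestrian IDs in [0, N); W: (D, N), b: (N,)
    parameters of the fully connected classification head.
    """
    g = fused.mean(axis=1)                       # global pooling over tokens
    logits = g @ W + b                           # (B, N) pedestrian-ID logits
    logits -= logits.max(axis=1, keepdims=True)  # stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # per-sample ID distribution
    picked = p[np.arange(len(labels)), labels]   # probability of the true ID
    return -np.mean(np.log(picked + 1e-12))      # cross-entropy loss
```

Mean pooling is one common choice for removing the local (token) dimension; the claim only requires that some pooling operation produce a global fused feature.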
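Of the losses in claim 6, the semantic alignment term is the least standard: a temperature-scaled cosine similarity matrix, softmax-normalized, compared against a one-hot target. Under one plausible reading (with a strictly one-hot target, the KL divergence reduces to cross-entropy on the matched diagonal pairs), it can be sketched as:

```python
import numpy as np

def sdm_loss(img_feats, txt_feats, tau=0.02):
    """Semantic-alignment loss sketch: image i should match text i.

    img_feats, txt_feats: (B, D) global CLS features from the two encoders;
    tau: temperature. Returns -mean(log p_ii), the one-hot-target reduction
    of the KL divergence in claim 6. Illustrative, not the patent's exact
    formulation.
    """
    v = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    t = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    S = (v @ t.T) / tau                   # cosine similarity / temperature
    S -= S.max(axis=1, keepdims=True)     # stable softmax per image row
    p = np.exp(S)
    p /= p.sum(axis=1, keepdims=True)     # predicted matching distribution
    return -np.mean(np.log(np.diag(p) + 1e-12))
```

With perfectly aligned matched features and a small temperature, the diagonal probabilities approach 1 and the loss approaches 0, which is the behavior the claim's target distribution demands.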

Description

Image block replacement and cross-modal identity alignment-based image-text pedestrian retrieval method

Technical Field

The invention belongs to the technical fields of computer vision and cross-modal retrieval, and relates to an image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment, suitable for surveillance scenarios such as security and social media search.

Background

Text-to-image person retrieval is an important research area of artificial intelligence, with practical applications spanning surveillance systems, social media monitoring, and image databases. In a surveillance scenario, this technique enables law enforcement agencies to locate and identify individuals using witness accounts or textual descriptions in reports. It also allows users to search efficiently for specific images on social media platforms using natural language queries, improving user experience and content management. As an emerging and highly challenging field, text-to-image person retrieval aims to identify and retrieve specific pedestrian images from text descriptions, potentially improving the accuracy and efficiency of image search; this makes it particularly important in applications such as social media monitoring, image recognition, and security. Conventional solutions typically employ deep learning and neural network techniques to map images and text into a unified embedding space, enabling more efficient image retrieval; other approaches implement local feature learning indirectly through an attention mechanism. However, text-to-image person retrieval is inherently complex: models must not only match a person image to a given textual description but also distinguish subtle differences in detail and attributes between highly similar categories, which increases the difficulty of the search. Although recent approaches have made progress in multimodal feature fusion and global alignment, they remain limited in fine-grained alignment of local features and in exploiting pedestrian identity information. Specifically, these methods neither adequately capture local changes in the image nor fully utilize the pedestrian identity information that is critical to improving retrieval accuracy. To overcome these limitations, the invention provides an image-text pedestrian retrieval method based on image block replacement identification and cross-modal identity alignment.

Disclosure of Invention

The invention aims to remedy the shortcomings of existing text-to-image pedestrian retrieval methods in fine-grained feature alignment and identity information utilization, and provides a retrieval method that effectively captures local image feature changes and fully utilizes pedestrian identity information, thereby improving retrieval precision.
To realize the above, the technical scheme of the invention is an image-text pedestrian retrieval method based on image block replacement and cross-modal identity alignment, comprising the following steps:
S1, acquiring a public dataset containing images and corresponding text descriptions, and preprocessing it;
S2, constructing a CLIP-based pedestrian re-identification model PRCIA; the PRCIA model comprises a backbone network, a PR module, and a CIA module. The backbone network comprises a dual encoder together with an identity loss (id loss), a semantic alignment loss (sdm loss), and a masked language modeling loss (mlm loss); the dual encoder comprises an image encoder and a text encoder; the identity loss performs intra-class clustering of identity features within a single modality, the semantic alignment loss optimizes cross-modal feature distribution matching, and the masked language modeling loss enhances the model's fine-grained understanding of text and image details. The PR module comprises a cross-modal fusion encoder, which fuses image block features and text features, and a PR head, which judges whether an image block has been replaced. The CIA module comprises a cross-modal fusion encoder, which deeply interacts image global features with text global features, and a CIA head, which outputs pedestrian identity classification results;
S3, dividing the preprocessed dataset proportionally into a training set, a validation set, and a test set;
S4, inputting the training set and validation set into the PRCIA model for training and verification, carrying out iterative updating on a training