
CN-121980051-A - Weak supervision pedestrian retrieval method, device and equipment based on text anchoring

CN121980051A

Abstract

The invention discloses a weak supervision pedestrian retrieval method, device and equipment based on text anchoring. The method comprises: S101, obtaining a training data set; S102, constructing an alignment network model based on text anchoring and training it on the training data set to obtain a trained alignment network model, wherein the alignment network model comprises an image encoder, a text encoder, a selective semantic mining module and a distribution consistency regularization module; and S103, performing the text-description-based pedestrian image retrieval task with the trained alignment network model. The method addresses the clustering bias caused by visual features overshadowing text semantics in a weakly supervised setting, and markedly improves the alignment accuracy and robustness of images and texts on both fine-grained attributes and the global distribution.

Inventors

  • Du Xia
  • Bao Zhuosen
  • Xie Wangze
  • Xie Xiaozhu
  • Xu Qizhen
  • Zhu Shunzhi

Assignees

  • Xiamen University of Technology (厦门理工学院)

Dates

Publication Date
2026-05-05
Application Date
2026-01-09

Claims (10)

  1. A text-anchoring-based weak supervision pedestrian retrieval method, comprising the steps of: S101, acquiring a training data set, wherein the training data set comprises a plurality of groups of image-text pairs formed by pedestrian images and descriptive texts; S102, constructing an alignment network model based on text anchoring, and training the alignment network model on the training data set to obtain a trained alignment network model, wherein the alignment network model comprises an image encoder, a text encoder, a selective semantic mining module and a distribution consistency regularization module; the image encoder extracts image features from a pedestrian image, and the text encoder extracts word-level features from the text description that is physically paired with the pedestrian image; during training, the selective semantic mining module computes the semantic relevance between the image features and each word-level feature of the text, dynamically selects key semantic words according to that relevance to construct a supervision signal that guides fine-grained alignment between the image features and the key semantic words of the text, and accordingly computes a semantic alignment loss for each image-text pair; and S103, performing the text-description-based pedestrian image retrieval task with the trained alignment network model.
  2. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 1, wherein in step S102, to alleviate inter-modality label granularity inconsistencies, only the image features are clustered to generate initial pseudo labels, which are then transferred to the physically paired text descriptions, specifically: before each training round begins, the image encoder extracts a sequence of image features for the pedestrian images in a training batch containing N image-text pairs; a density-based clustering algorithm is applied to these features to assign pseudo labels, where the i-th image feature in the batch is marked as noise if it is not assigned to any cluster, and otherwise its pseudo label is set to the index of the cluster it is assigned to; the one-to-one correspondence between each pedestrian image and its descriptive text is then used to transmit the image's pseudo label to the text modality, yielding the pseudo label of the descriptive text; thus, for each training batch of N image-text pairs, the pseudo labels of each image and of its paired text are guaranteed to be equal.
  3. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 2, wherein the semantic alignment loss is calculated as follows: pre-trained image and text encoders extract the image features and the word-level feature sequences, respectively, where T is the number of samples in the training data set; for each image feature, the cosine similarity between that feature and each word-level feature is computed and converted into a similarity distribution by a Softmax function; the squared entropy of the similarity distribution is computed and a confidence weight for semantic alignment is defined from it, with a small constant added to prevent numerical instability; the Top-K words with the highest confidence weights are selected as the most reliable visual-semantic anchor points, and a dynamic mask is constructed to mark the key semantic positions; the anchor mechanism then guides the fine-grained conversion from the image classification feature to text semantics: the image feature is replicated to form a multi-input representation, which is mapped to the vocabulary space through a shared linear projection layer, and the text tokens at the positions specified by the dynamic mask are predicted to obtain predicted log-probabilities; this prediction is the mapping of the image classification feature onto the text semantic anchor points and, combined with the true text tokens, is used to compute the semantic alignment loss of the image-text pair.
  4. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 3, wherein the semantic alignment loss is computed by summing, over the masked key semantic positions, the negative predicted log-probability of the true text token at each position.
  5. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 3, wherein the distribution consistency regularization module is configured to: construct a cross-modal soft matching distribution by computing, within a training batch, the normalized cosine similarity between each image feature and each word-level feature, scaled by a temperature coefficient, to obtain a cross-modal matching probability distribution; construct a pseudo-label-based target distribution, using the pseudo labels generated by image clustering and their propagation to the text modality to build the target alignment distribution within the training batch; and compute the bidirectional KL divergence between the predicted distribution and the target distribution, obtaining a bidirectional KL divergence loss that constrains the consistency of image and text in the global semantic space, the bidirectional KL divergence loss being the sum of the image-to-text KL divergence loss and the text-to-image KL divergence loss.
  6. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 5, wherein the distribution consistency regularization module is further configured to: for each image anchor sample, search the training batch for positive samples with the same pseudo label and for the hard negative sample with a different pseudo label and the highest similarity, and construct a hard-sample contrastive loss in the image-to-text direction from a margin hyperparameter, the anchor, the positive sample and the hard negative sample; symmetrically compute the text-to-image hard-sample contrastive loss; and sum the two directions to obtain the final hard-sample contrastive loss.
  7. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 6, wherein the model's total loss combines the semantic alignment loss, the bidirectional KL divergence loss and the hard-sample contrastive loss, and the parameters of the image encoder and the text encoder are jointly and iteratively optimized by minimizing the total loss.
  8. The text-anchoring-based weak supervision pedestrian retrieval method according to claim 1, wherein the training data set comprises at least one of CUHK-PEDES, ICFG-PEDES or RSTPReid.
  9. A text-anchoring-based weak supervision pedestrian retrieval device, comprising: a data set acquisition unit for acquiring a training data set, the training data set comprising a plurality of groups of image-text pairs formed by pedestrian images and descriptive texts; an alignment network model construction unit for constructing an alignment network model based on text anchoring and training it on the training data set to obtain a trained alignment network model, wherein the alignment network model comprises an image encoder, a text encoder, a selective semantic mining module and a distribution consistency regularization module; the image encoder extracts image features from a pedestrian image, and the text encoder extracts word-level features from the text description that is physically paired with the pedestrian image; during training, the selective semantic mining module computes the semantic relevance between the image features and each word-level feature of the text, dynamically selects key semantic words according to that relevance to construct a supervision signal that guides fine-grained alignment between the image features and the key semantic words of the text, and accordingly computes a semantic alignment loss for each image-text pair; and a pedestrian retrieval unit for performing the text-description-based pedestrian image retrieval task with the trained alignment network model.
  10. Text-anchoring-based weak supervision pedestrian retrieval equipment, comprising a memory, a processor, and computer program instructions stored in the memory and executable by the processor, wherein the processor, when executing the computer program instructions, implements the text-anchoring-based weak supervision pedestrian retrieval method according to any one of claims 1-8.
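As a concrete illustration of claims 1-2, the pseudo-label generation and propagation step can be sketched in Python. The patent's density-based clustering algorithm is replaced here by a trivial one-dimensional threshold stand-in, and the use of -1 as the noise marker is an assumption; this is a sketch of the scheme, not the patented implementation.

```python
# Sketch of pseudo-label generation (claim 2): cluster image features only,
# then copy each image's pseudo label to its physically paired text.

def assign_pseudo_labels(features, radius=0.5):
    """Greedy grouping as a stand-in for a density-based clusterer.
    Returns one pseudo label per feature; unclustered points get -1 (assumed
    noise convention). Real features would be vectors, not scalars."""
    labels = [-1] * len(features)
    next_cluster = 0
    for i, f in enumerate(features):
        if labels[i] != -1:
            continue
        # collect the not-yet-labeled points within `radius` of f
        members = [j for j, g in enumerate(features)
                   if labels[j] == -1 and abs(g - f) <= radius]
        if len(members) >= 2:  # at least 2 points are needed to form a cluster
            for j in members:
                labels[j] = next_cluster
            next_cluster += 1
    return labels

def propagate_to_text(image_labels):
    """Claim 2: each text simply inherits its paired image's pseudo label,
    so the label pair of every image-text pair is equal by construction."""
    return list(image_labels)

image_feats = [0.1, 0.2, 5.0, 5.1, 9.9]  # toy 1-D "features"
y_v = assign_pseudo_labels(image_feats)   # -> two clusters plus one noise point
y_t = propagate_to_text(y_v)
```

With the toy features above, the first two and the middle two points form clusters 0 and 1, while the isolated last point stays marked as noise, and the text labels match the image labels pair-by-pair.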
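The distribution consistency regularization of claim 5 can likewise be sketched for one image anchor. Because the source elides the exact formulas, two details are assumptions: the target distribution spreads its mass uniformly over same-pseudo-label texts, and the KL term is taken as KL(target || prediction); the text-to-image term would be computed symmetrically.

```python
import math

def softmax(xs, tau=0.1):
    """Temperature-scaled softmax (tau is the temperature coefficient)."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon for numerical stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def i2t_consistency_loss(sim_row, label_v_i, labels_t, tau=0.1):
    """Image-to-text KL term for one image anchor: the predicted matching
    distribution over the batch's texts vs. a pseudo-label target that puts
    uniform mass on same-label texts (uniform target is an assumption)."""
    pred = softmax(sim_row, tau)                      # cross-modal matching distribution
    same = [1.0 if lt == label_v_i else 0.0 for lt in labels_t]
    target = [s / sum(same) for s in same]            # pseudo-label target distribution
    return kl_div(target, pred)
```

When the anchor's highest similarity falls on a same-pseudo-label text the loss is near zero; when it falls on a different-label text the loss grows large, pulling the matching distribution back toward the pseudo-label structure.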
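A minimal sketch of claim 6's hard-sample contrastive loss in the image-to-text direction. Two choices are assumptions not fixed by the source: the positive is taken to be the physically paired text (the claim only requires a same-pseudo-label positive), and the margin value of 0.3 is illustrative.

```python
def hard_sample_contrastive_i2t(sim, labels_v, labels_t, margin=0.3):
    """Image-to-text hard-sample contrastive loss (claim 6 sketch).
    sim[i][j] is the similarity of image i to text j within the batch."""
    total, n = 0.0, len(labels_v)
    for i in range(n):
        s_pos = sim[i][i]  # paired text as the positive (assumption)
        # hardest negative: most similar text with a DIFFERENT pseudo label
        negs = [sim[i][j] for j in range(n) if labels_t[j] != labels_v[i]]
        if not negs:
            continue
        s_neg = max(negs)
        total += max(0.0, margin - s_pos + s_neg)  # hinge with margin
    return total / n
```

The text-to-image direction is obtained symmetrically (rows and columns of `sim` swapped), and the final loss of claim 6 is the sum of the two directions.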

Description

Weak supervision pedestrian retrieval method, device and equipment based on text anchoring

Technical Field

The invention relates to the field of natural language models, and in particular to a weak supervision pedestrian retrieval method, device and equipment based on text anchoring.

Background

Text-based person search aims to match target pedestrians across camera scenes through natural language descriptions. In practical applications, weakly supervised learning has gradually become a research hotspot because identity labels are extremely expensive to acquire. Existing weakly supervised methods generally rely on joint image-text clustering to generate pseudo labels for cross-modal matching. However, the prior art suffers from a significant distribution bias problem: dense visual features tend to overshadow coarse-grained text semantics. As a result, the clustering structure is severely biased toward the visual distribution, the text supervision signal degenerates into a purely visual clustering prior, and fine-grained semantic alignment between image and text is severely weakened. How to strengthen the dominance of text semantics in pseudo-label generation and feature alignment is therefore a key challenge in improving weakly supervised pedestrian search performance.

Disclosure of Invention

In view of the above, the invention aims to provide a weak supervision pedestrian retrieval method, device and equipment based on text anchoring, so as to address the above problems.
The invention provides a weak supervision pedestrian retrieval method based on text anchoring, which comprises the following steps: S101, acquiring a training data set, wherein the training data set comprises a plurality of groups of image-text pairs formed by pedestrian images and descriptive texts; S102, constructing an alignment network model based on text anchoring, and training the alignment network model on the training data set to obtain a trained alignment network model, wherein the alignment network model comprises an image encoder, a text encoder, a selective semantic mining module and a distribution consistency regularization module; the image encoder extracts image features from a pedestrian image, and the text encoder extracts word-level features from the text description that is physically paired with the pedestrian image; during training, the selective semantic mining module computes the semantic relevance between the image features and each word-level feature of the text, dynamically selects key semantic words according to that relevance to construct a supervision signal that guides fine-grained alignment between the image features and the key semantic words of the text, and accordingly computes a semantic alignment loss for each image-text pair; and S103, performing the text-description-based pedestrian image retrieval task with the trained alignment network model.
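The selective semantic mining step described above can be sketched for a single image and its caption's word-level features. The source elides the exact squared-entropy and confidence-weight formulas, so the Tsallis-2 entropy form and the reciprocal confidence used below are assumptions; the point of the sketch is the pipeline: cosine similarities, a Softmax distribution over words, an entropy-based confidence, and a Top-K dynamic mask over key semantic words.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_semantic_anchors(img_feat, word_feats, k=2, eps=1e-8):
    """Similarity distribution over words -> squared-entropy confidence ->
    Top-K binary mask marking the key semantic positions."""
    sims = [cosine(img_feat, w) for w in word_feats]
    p = softmax(sims)                          # similarity distribution over words
    sq_entropy = 1.0 - sum(q * q for q in p)   # "squared entropy" (assumed form)
    confidence = 1.0 / (sq_entropy + eps)      # peaked distribution => high confidence
    # Top-K most similar words become the visual-semantic anchor points
    order = sorted(range(len(p)), key=lambda j: p[j], reverse=True)
    mask = [1 if j in order[:k] else 0 for j in range(len(p))]
    return p, confidence, mask

# toy 2-D features: the image aligns with word 0 strongly and word 2 partially
p, conf_w, mask = select_semantic_anchors([1.0, 0.0],
                                          [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In the toy run, the dynamic mask selects words 0 and 2 as anchors; only the masked positions would then be predicted from the image feature to form the semantic alignment loss.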
Preferably, in step S102, to mitigate inter-modality label granularity inconsistencies, only the image features are clustered to generate initial pseudo labels, which are then transferred to the physically paired text descriptions, specifically: before each training round begins, the image encoder extracts a sequence of image features for the pedestrian images in a training batch containing N image-text pairs; a density-based clustering algorithm is applied to these features to assign pseudo labels, where the i-th image feature in the batch is marked as noise if it is not assigned to any cluster, and otherwise its pseudo label is set to the index of the cluster it is assigned to; the one-to-one correspondence between each pedestrian image and its descriptive text is then used to transmit the image's pseudo label to the text modality, yielding the pseudo label of the descriptive text; thus, for each training batch of N image-text pairs, the pseudo labels of each image and of its paired text are guaranteed to be equal. Preferably, the semantic alignment loss is calculated as follows: pre-trained image and text encoders extract the image features and the word-level feature sequences, respectively, where T is the number of samples in the training data set; the cosine similarity between each image feature and each word-level feature is computed and converted into a similarity distribution by a Softmax function; the squared entropy of the similarity distribution is computed and a confidence weight for semantic alignment is defined from it, with a small constant added to prevent numerical instability; Top-K words are selected from high to low according to confidence weight to serve as the most reliable visual-semantic anchor points, and a