CN-121982649-A - Unsupervised pedestrian re-identification method based on local enhancement Transformer and multi-granularity contrastive learning

Abstract

The invention discloses an unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning. The method mainly comprises the following steps: a pedestrian re-identification data set is input into a constructed model; in the local enhancement Transformer feature extraction network, a dual-coding feature enhancement layer extracts global attention features and local enhancement features, and a dual-path feature dynamic aggregation module combines them into a preliminary feature representation; an adaptive region division module divides the feature map into regions and optimizes the feature expression using spatial attention and a constraint mechanism, suppressing background interference; a confidence-aware update unit adaptively adjusts the update momentum of the memory bank features according to sample confidence, and pseudo-labels are generated in combination with clustering; multi-granularity contrastive learning is then carried out, with a multi-granularity contrastive loss function designed to iteratively optimize the network parameters and complete pedestrian image matching. The method addresses the fragmentation of multi-granularity representations of pedestrian images and improves the accuracy and robustness of pedestrian re-identification.

Inventors

LIAN WEIQI
ZHANG YUNZUO
WANG HUI
GENG PENG

Assignees

Shijiazhuang Tiedao University (石家庄铁道大学)

Dates

Publication Date: 20260505
Application Date: 20260130

Claims (9)

1. An unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning, characterized by comprising the following steps:
S1, acquiring a pedestrian re-identification data set to be processed, and inputting the data set into a constructed model;
S2, in a local enhancement Transformer feature extraction network, using a dual-coding feature enhancement layer to simultaneously extract global attention features and local enhancement features and generate a multi-scale feature representation, and using a dual-path feature dynamic aggregation module to combine the global features and the local features into a preliminary feature representation; wherein the dual-coding feature enhancement layer comprises a global association modeling branch, a multi-scale local spatial coding branch, and the dual-path feature dynamic aggregation module; the global association modeling branch applies layer normalization to the input features and then performs a multi-head self-attention operation to establish associations among positions in the feature sequence; the multi-scale local spatial coding branch reshapes the sequence features into a two-dimensional spatial format, extracts local features through multi-scale depthwise separable convolutions, and enhances the local spatial features through channel recalibration;
S3, dividing the feature map into K region features using an adaptive region division module, and optimizing the feature expression using a spatial attention mechanism and a constraint mechanism, so as to focus on key local regions of the pedestrian and suppress background interference; wherein the adaptive region division module comprises a spatial attention network, a spatial constraint unit, and a diversity region feature constraint unit; the spatial attention network analyzes the spatial distribution of the feature map through multi-scale context fusion, taking both local detail and global context information into account, to generate attention maps reflecting the importance of different regions; the spatial constraint unit applies a spatial smoothness constraint to the attention maps to ensure that the attention values of adjacent regions change continuously, applies a contour constraint to strengthen attention on the pedestrian contour region, and applies an area constraint to ensure that the regions remain relatively balanced;
S4, using a confidence-aware update unit to adaptively adjust the update momentum of the features in the memory bank according to sample confidence, and generating final pseudo-labels in combination with clustering;
S5, in a multi-granularity contrastive learning module, iteratively adjusting the network parameters by means of a designed multi-granularity contrastive loss function so as to optimize the performance of the model; wherein the multi-granularity contrastive learning module comprises a multi-granularity similarity measure and the multi-granularity contrastive loss function, the multi-granularity similarity measure comprises semantic similarity, appearance similarity, and structural similarity, and the three similarities are fused by weights to form a comprehensive similarity measure; and
S6, in each training stage, comparing the similarity between the query pedestrian image and the pedestrian images in the image gallery, and retrieving the images of the same pedestrian.
2. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the overall model structure comprises the local enhancement Transformer feature extraction network, the adaptive region division module, and the multi-granularity contrastive learning module.
3. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the multi-scale local spatial coding branch comprises a 3x3 depthwise separable convolution, a 5x5 depthwise separable convolution, a 7x7 depthwise separable convolution, concatenation, channel recalibration, and a linear projection.
4. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 3, wherein the channel recalibration comprises average pooling, sigmoid, ReLU, a fully connected layer, and element-wise multiplication.
5. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the dual-path feature dynamic aggregation module comprises average pooling, sigmoid, element-wise multiplication, and element-wise addition.
6. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the spatial attention network comprises a 1x1 convolution, a 3x3 convolution, average pooling, feature expansion, and concatenation.
7. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the spatial constraint unit comprises a smoothing convolution, a Sobel convolution, and gradient calculation.
8. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the diversity region feature constraint unit comprises weighted pooling, element-wise multiplication, L2 regularization, and matrix multiplication.
9. The unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning according to claim 1, wherein the loss function is calculated from the multi-granularity contrastive loss and an auxiliary constraint loss, and the result is used to train and optimize the model parameters.
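
By way of non-limiting illustration, steps S4 and S5 of claim 1 might be sketched as follows. The claims state only that the update momentum is adjusted according to sample confidence and that three similarities are fused by weights; the linear confidence-to-momentum mapping, its bounds `m_min` and `m_max`, the direction of the mapping (higher confidence gives a larger update), the fusion weights, and all function names are assumptions made for this example and are not taken from the source.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def confidence_momentum(confidence, m_min=0.1, m_max=0.9):
    """Hypothetical mapping: confidence in [0, 1] -> momentum in [m_min, m_max].

    Assumption: a confident sample gets a LOW momentum, so the memory entry
    moves further toward the new feature; an uncertain sample barely moves it.
    """
    c = min(max(confidence, 0.0), 1.0)
    return m_max - (m_max - m_min) * c


def update_memory(memory, feature, confidence):
    """Momentum update of one memory-bank entry, followed by re-normalization."""
    m = confidence_momentum(confidence)
    mixed = [m * old + (1.0 - m) * new for old, new in zip(memory, feature)]
    return l2_normalize(mixed)


def fused_similarity(semantic, appearance, structural, weights=(0.5, 0.3, 0.2)):
    """Weighted fusion of three similarity scores (weights are illustrative)."""
    total = sum(weights)
    w_sem, w_app, w_str = (w / total for w in weights)
    return w_sem * semantic + w_app * appearance + w_str * structural


# Usage: the same new feature shifts the memory more when confidence is high.
memory_entry = [1.0, 0.0]
new_feature = [0.0, 1.0]
print(update_memory(memory_entry, new_feature, confidence=0.9))
print(update_memory(memory_entry, new_feature, confidence=0.1))
print(fused_similarity(semantic=0.8, appearance=0.6, structural=0.7))
```

In this sketch the fusion weights are normalized to sum to one, so the comprehensive similarity stays in the same range as its three inputs.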

Description

Unsupervised pedestrian re-identification method based on local enhancement Transformer and multi-granularity contrastive learning

Technical Field

The invention relates to an unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning, and belongs to the technical field of computer vision.

Background

With accelerating urbanization and growing public safety demands, intelligent video surveillance systems play an increasingly important role in public safety management. Pedestrian re-identification, a core component of intelligent surveillance systems, aims to retrieve all appearances of a specific pedestrian from images or videos captured by different cameras in a large-scale surveillance network. The technology has broad application prospects in fields such as public security investigation, intelligent security, and smart cities. As surveillance networks expand and data volumes surge, supervised learning methods that rely on large amounts of manual annotation face bottlenecks in cost and scalability. Unsupervised pedestrian re-identification, that is, learning robust feature representations from data without identity labels, has therefore become a key research direction. In recent years, deep-learning-based methods have made significant progress in this area, but the prior art still has notable shortcomings in several respects.

Vision Transformer and its variants effectively model the global context of an image through self-attention and have shown potential beyond conventional convolutional neural networks in many visual tasks. However, the standard Transformer architecture has significant shortcomings in fine-grained recognition tasks such as pedestrian re-identification.
Its global self-attention mechanism can capture long-range dependencies but lacks explicit modeling of the local detail features that are critical for distinguishing pedestrian identities. Existing Transformer variants mostly divide the image into fixed-size patches, lack adaptive perception of the semantic structure of pedestrian images, and cannot dynamically adjust the granularity of feature extraction to the poses and appearances of different pedestrians.

Contrastive learning, which learns feature representations by pulling positive sample pairs together and pushing negative sample pairs apart, has become a mainstream framework for unsupervised learning. However, most methods compare only global features and ignore the inherently multi-granularity nature of pedestrian images. Some studies introduce local features but generally adopt a fixed, predefined partitioning strategy; such rigid partitioning cannot adapt to pose changes and viewpoint differences, so the same body part may fall into different partitions in different images, destroying the consistency of the local features. In addition, existing methods lack explicit constraints on the consistency among features of different granularities, so the global and local features learn inconsistent representations, which reduces the discriminative ability of the model. Finally, when contrastive sample pairs are constructed, there is no fine-grained measure of semantic similarity between samples; usually only simple cosine similarity is used, which cannot comprehensively account for multi-aspect information such as color, texture, and structure.
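
For reference, the conventional global-feature contrastive objective discussed above can be sketched as follows. The source does not name a specific loss; the InfoNCE form, the temperature value, and the function names are assumptions chosen for this example. It uses only cosine similarity between global feature vectors, which is the limitation the background describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: small when the query is close to its positive
    and far from every negative. The temperature value is illustrative."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


# Usage: a well-aligned positive yields a lower loss than a misaligned one.
query = [1.0, 0.0]
print(info_nce(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
print(info_nce(query, [0.0, 1.0], [[0.9, 0.1]]))
```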
In summary, although deep-learning-based research on unsupervised pedestrian re-identification has made remarkable progress, current methods generally share two core bottlenecks: insufficient capture of local detail and fragmented multi-granularity semantic representation. As a result, models struggle with practical challenges such as pose variation and occlusion, and their discriminative performance and generalization in complex scenes are limited. An unsupervised pedestrian re-identification method focused on local detail enhancement and multi-granularity semantic perception is therefore needed to solve these problems.

Disclosure of Invention

The invention aims to provide an unsupervised pedestrian re-identification method based on a local enhancement Transformer and multi-granularity contrastive learning, which comprises the following steps:
S1, acquiring a pedestrian re-identification data set to be processed, and inputting the data set into a constructed model;
S2, in a local enhancement Transformer feature extraction network, a dual-coding feature enhancement layer is used for simultaneously extracting global attention features and local enhancement features, multi-scale featu