CN-121982759-A - Cross-mode pedestrian re-identification method and device
Abstract
The invention discloses a cross-modal pedestrian re-identification method and device, relating to the technical field of computer vision. The method comprises the steps of: obtaining a visible light image and an infrared image to be identified; preprocessing the visible light image and the infrared image to obtain an input image; and inputting the input image into a trained pedestrian re-identification network to obtain a recognition result. The pedestrian re-identification network comprises a convolutional neural network, a structural perception enhancement module arranged at a shallow layer of the convolutional neural network, and a collaborative low-rank decomposition module arranged at the tail end of the convolutional neural network; the structural perception enhancement module is used for extracting shallow features, and the collaborative low-rank decomposition module is used for extracting deep features. The method effectively addresses cross-modal feature misalignment and noise interference, and can remarkably improve the accuracy and robustness of cross-modal pedestrian re-identification.
Inventors
- WANG YI
- XIE TAO
Assignees
- 南京邮电大学 (Nanjing University of Posts and Telecommunications)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-15
Claims (10)
- 1. A cross-modal pedestrian re-identification method, characterized by comprising the following steps: obtaining visible light images and infrared images to be identified, constructing training batches, and preprocessing them to obtain input images; inputting the input images into a trained pedestrian re-identification network to obtain a recognition result; wherein the pedestrian re-identification network comprises a convolutional neural network, a structural perception enhancement module and a collaborative low-rank decomposition module, the structural perception enhancement module is arranged between a first convolutional layer and a second convolutional layer of the convolutional neural network, and the collaborative low-rank decomposition module is arranged at the tail end of the convolutional neural network; the structural perception enhancement module comprises a parallel explicit gradient feature extraction branch, an implicit complementary feature extraction branch, a shallow feature fusion layer, a convolutional layer, a batch normalization layer, an attention weight mapping layer, and a feature output layer with a residual connection to the input features, and is used for extracting shallow features; the collaborative low-rank decomposition module comprises a coefficient generation branch and a shared basis matrix, wherein the coefficient generation branch comprises a convolutional layer, a batch normalization layer, an attention weight mapping layer, and a feature output layer with a residual connection to the features input to the collaborative low-rank decomposition module, and is used for extracting deep features.
- 2. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the training process of the pedestrian re-identification network comprises: inputting the input images into the convolutional neural network; obtaining structure-enhanced features through the structural perception enhancement module; feeding the structure-enhanced features into the subsequent levels and the collaborative low-rank decomposition module to obtain output features; and, according to the output features, updating the parameters of the pedestrian re-identification network through a hybrid loss function until convergence, to obtain the trained pedestrian re-identification network.
- 3. The cross-modal pedestrian re-identification method of claim 1, wherein the training batch is constructed by: adopting an identity-based PK sampling strategy to randomly select P pedestrian identities from the visible light and infrared image datasets, and randomly drawing K visible light images and K infrared images for each identity, forming batch data of size 2 × P × K.
- 4. The cross-modal pedestrian re-identification method of claim 1, wherein the structural perception enhancement module processes its input as follows: the input image is fed into the convolutional neural network to obtain an input feature tensor X; the explicit gradient extraction branch constructs a horizontal convolution kernel and a vertical convolution kernel from the Scharr detection operator, convolves X with them to compute the horizontal gradient response G_h and the vertical gradient response G_v respectively, and extracts the gradient magnitude map M = sqrt(G_h^2 + G_v^2) for extracting explicit contours; the implicit complementary feature learning branch processes X with a learnable depthwise separable convolution and extracts an implicit complementary feature map F_imp, which mines latent structural information that the predefined operator cannot cover; a learnable adaptive balance factor α is introduced, a dynamic fusion weight w = σ(α) is generated through an activation function, and the features of the explicit gradient extraction branch and the implicit complementary feature learning branch are fused by weighting to obtain the fused feature F_fuse = w·M + (1 − w)·F_imp; finally, the fused feature is passed through a convolutional layer, a batch normalization layer and a Sigmoid activation function to generate a structural attention weight A, which acts on the input feature X through a residual connection to obtain the structurally enhanced feature X′ = X + A ⊙ X, where ⊙ denotes element-wise multiplication.
- 5. The cross-modal pedestrian re-identification method according to claim 1, wherein the collaborative low-rank decomposition module operates in three stages: coefficient prediction, shared-basis reconstruction, and residual fusion. The features enhanced by the structural perception enhancement module are fed into the subsequent levels of the network to obtain a deep feature tensor X of dimension C × H × W. First, the channel dimension C of the input feature X is compressed to K (with K ≪ C) to construct a low-rank feature bottleneck, and a coefficient matrix A is generated through batch normalization and a Sigmoid activation function. Second, a cross-modal shared basis matrix D of dimension C × K is predefined as a common dictionary connecting the two modalities, storing shared human-body structural elements; the basis matrix is combined with the coefficient matrix by matrix multiplication to reconstruct a low-rank feature map X_lr = D·A. In this process, by the low-rank approximation principle, high-frequency background noise that cannot be linearly represented by the few basis vectors is automatically filtered out, and only salient human-body structural features are retained. Finally, a learnable weighting parameter β is introduced, and the denoised low-rank feature map is superposed on the original input features X to obtain the output feature map Y = X + β·X_lr.
- 6. The cross-modal pedestrian re-identification method of claim 2, wherein the hybrid loss function consists of an identity feature learning loss and a collaborative decomposition constraint loss; the identity feature learning loss comprises a label-smoothing cross-entropy loss and a weighted regularization triplet loss; the collaborative decomposition constraint loss comprises a coefficient consistency loss and an orthogonality constraint loss.
- 7. The cross-modal pedestrian re-identification method of claim 6, wherein the label-smoothing cross-entropy loss uses the pedestrian identity labels for supervised learning and focuses on the classification capability of the model, formulated as L_id = −Σ_{i=1}^{N} q_i log p_i, where L_id is the identity loss, p_i is the model's predicted probability for the i-th category, q_i is the ground-truth label after label smoothing, and N is the total number of categories. The weighted regularization triplet loss establishes a direct optimization relationship between positive and negative samples, formulated as L_tri = log(1 + exp(Σ_p w_p d(f_a, f_p) − Σ_n w_n d(f_a, f_n))), where L_tri is the triplet loss, f_a, f_p and f_n are the anchor, positive-sample and negative-sample features respectively, w_p and w_n are the weights of the positive and negative pairs, and d(·,·) denotes the Euclidean distance.
- 8. The cross-modal pedestrian re-identification method according to claim 6, wherein the coefficient consistency loss adopts an identity-centroid alignment strategy: it computes the globally average-pooled centroids of the coefficient matrices of all samples of the same identity in the visible light and infrared modalities, and minimizes the Euclidean distance between the visible light centroid and the infrared centroid, formulated as L_cc = (1/P) Σ_{i=1}^{P} ||c_i^vis − c_i^ir||^2, where L_cc is the coefficient consistency loss, P is the number of pedestrian identities, and c_i^vis and c_i^ir are the coefficient centroids of the i-th identity in the visible light and infrared modalities respectively. The orthogonality constraint loss constrains the column vectors of the shared basis matrix D to be mutually orthogonal, formulated as L_orth = ||DᵀD − I||_F^2, where L_orth is the orthogonality constraint loss, I is the identity matrix, and ||·||_F denotes the Frobenius norm.
- 9. The cross-modal pedestrian re-identification method of claim 6, wherein the hybrid loss function L_total is expressed as L_total = L_id + L_tri + λ1·L_cc + λ2·L_orth, where L_id is the identity loss, L_tri is the triplet loss, L_cc is the coefficient consistency loss, L_orth is the orthogonality constraint loss, and λ1 and λ2 are the hyper-parameter weight factors of the coefficient consistency loss and the orthogonality constraint loss, respectively.
- 10. A cross-modal pedestrian re-identification device, comprising: a data acquisition module, used for obtaining visible light images and infrared images to be identified, constructing training batches, and preprocessing them to obtain input images; and a pedestrian re-identification module, used for inputting the input images into a trained pedestrian re-identification network to obtain a recognition result; wherein the pedestrian re-identification network comprises a convolutional neural network, a structural perception enhancement module and a collaborative low-rank decomposition module, the structural perception enhancement module is arranged between a first convolutional layer and a second convolutional layer of the convolutional neural network, and the collaborative low-rank decomposition module is arranged at the tail end of the convolutional neural network; the structural perception enhancement module comprises a parallel explicit gradient feature extraction branch, an implicit complementary feature extraction branch, a shallow feature fusion layer, a convolutional layer, a batch normalization layer, an attention weight mapping layer, and a feature output layer with a residual connection to the input features, and is used for extracting shallow features; the collaborative low-rank decomposition module comprises a coefficient generation branch and a shared basis matrix, wherein the coefficient generation branch comprises a convolutional layer, a batch normalization layer, an attention weight mapping layer, and a feature output layer with a residual connection to the features input to the collaborative low-rank decomposition module, and is used for extracting deep features.
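Not part of the patent text: claim 3's identity-based PK sampling can be sketched as follows. The dict-of-lists input format, the function name, and the `(identity, modality, image)` tuple layout are all illustrative assumptions, not the patent's specification.

```python
import random

def pk_sample(vis_by_id, ir_by_id, P, K, seed=0):
    """Identity-based PK sampling: draw P identities, then K visible
    and K infrared images per identity, for a batch of size 2*P*K.
    Inputs are assumed to be dicts mapping identity -> list of images."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(vis_by_id), P)
    batch = []
    for pid in ids:
        batch += [(pid, "vis", im) for im in rng.sample(vis_by_id[pid], K)]
        batch += [(pid, "ir", im) for im in rng.sample(ir_by_id[pid], K)]
    return batch
```

Each batch thus contains both modalities of the same P identities, which is what makes the cross-modal losses of claims 7 and 8 computable within a single batch.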
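Also not part of the patent: a minimal numpy sketch of the structural perception enhancement forward pass of claim 4, on a single 2-D feature map. The `sigmoid(fused - fused.mean())` line is a crude stand-in for the claimed Conv + BatchNorm + Sigmoid attention head, and the implicit branch output is taken as a given input rather than a learned depthwise separable convolution.

```python
import numpy as np

SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=float)
SCHARR_Y = SCHARR_X.T

def conv2d(img, k):
    """'Same' cross-correlation with zero padding (3x3 kernel assumed)."""
    p = np.pad(img, 1)
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def structure_enhance(x, f_implicit, alpha=0.0):
    """Sketch of claim 4: Scharr gradients -> magnitude M; blend with
    the implicit-branch output via w = sigmoid(alpha); attention A;
    residual output X' = X + A * X."""
    gx, gy = conv2d(x, SCHARR_X), conv2d(x, SCHARR_Y)
    m = np.sqrt(gx ** 2 + gy ** 2)        # explicit gradient magnitude
    w = sigmoid(alpha)                     # learnable balance factor
    fused = w * m + (1 - w) * f_implicit   # weighted branch fusion
    a = sigmoid(fused - fused.mean())      # stand-in for Conv+BN+Sigmoid
    return x + a * x                       # residual structural output
```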
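The three stages of claim 5 (coefficient prediction, shared-basis reconstruction, residual fusion) can be sketched in numpy as below; this is an illustrative reading, not the patent's implementation. `W_coef` stands in for the claimed 1×1 compression convolution, and a plain sigmoid replaces the BatchNorm + Sigmoid coefficient head.

```python
import numpy as np

def cld_forward(x, D, W_coef, beta=0.1):
    """Collaborative low-rank decomposition on a C x H x W tensor x.
    W_coef (K x C) compresses channels to the rank-K bottleneck;
    D (C x K) is the shared basis ('dictionary'); beta weights the
    denoised low-rank residual."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                    # C x HW
    coef = 1.0 / (1.0 + np.exp(-(W_coef @ flat))) # K x HW coefficients
    x_lr = (D @ coef).reshape(C, H, W)            # low-rank reconstruction
    return x + beta * x_lr                        # residual fusion
```

Because `D @ coef` has rank at most K, any component of `x` outside the span of D's columns (the claimed "high-frequency background noise") cannot survive the reconstruction step.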
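The loss terms of claims 7-9 can be sketched numerically as follows. This is not the patent's code: the softmax/softmin pair weighting in the triplet term follows the common reading of "weighted regularization triplet loss", and the `lam1`/`lam2` defaults are illustrative values, not values disclosed in the patent.

```python
import numpy as np

def label_smooth_ce(logits, label, eps=0.1):
    """Label-smoothing cross entropy for one sample with N classes:
    target q_i = eps/N + (1 - eps) * [i == label]."""
    n = logits.size
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # softmax probabilities
    q = np.full(n, eps / n)
    q[label] += 1.0 - eps               # smoothed one-hot target
    return -np.sum(q * np.log(p))

def weighted_reg_triplet(d_ap, d_an):
    """Triplet term for one anchor: positive distances weighted by
    softmax, negatives by softmin, combined in a softplus so no
    margin hyper-parameter is needed."""
    wp = np.exp(d_ap) / np.exp(d_ap).sum()
    wn = np.exp(-d_an) / np.exp(-d_an).sum()
    return np.log1p(np.exp(wp @ d_ap - wn @ d_an))

def coef_consistency(cent_vis, cent_ir):
    """Mean squared Euclidean distance between per-identity coefficient
    centroids of the two modalities (P x K arrays)."""
    return np.mean(np.sum((cent_vis - cent_ir) ** 2, axis=1))

def orthogonality(D):
    """|| D^T D - I ||_F^2 on the shared basis matrix D (C x K)."""
    g = D.T @ D - np.eye(D.shape[1])
    return np.sum(g ** 2)

def hybrid_loss(l_id, l_tri, l_cc, l_orth, lam1=0.1, lam2=0.01):
    """Claim 9's weighted sum; lam1/lam2 values are illustrative."""
    return l_id + l_tri + lam1 * l_cc + lam2 * l_orth
```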
Description
Cross-mode pedestrian re-identification method and device

Technical Field

The invention relates to a cross-modal pedestrian re-identification method and device, and belongs to the technical field of computer vision.

Background

With the rapid construction of safe cities and intelligent security systems, pedestrian re-identification, as a core technology for tracking target pedestrians across cameras, plays an increasingly important role in fields such as intelligent criminal investigation, missing person search, and security control. Traditional pedestrian re-identification methods rely mainly on visible light images, constructing discriminative identity representations by extracting appearance features such as color and texture, thereby achieving identity matching across cameras. However, at night or in poorly illuminated environments, visible light cameras cannot capture clear images, causing the system to fail. To achieve all-weather monitoring, existing security systems usually switch automatically to infrared cameras. Therefore, achieving cross-modal pedestrian re-identification between visible light images and infrared images has become a difficulty that must be solved to realize all-weather intelligent monitoring. Although cross-modal pedestrian re-identification has important application value, infrared images lack color information and contain only single-channel thermal radiation information, so a huge "modality gap" exists between the two modalities in data distribution, texture expression, and visual characteristics.
The prior art addresses these problems mainly through three approaches: image generation methods based on generative adversarial networks (GANs), which attempt to convert infrared images into pseudo-color images of a unified style; metric learning methods, such as designing heterogeneous-center losses to shorten intra-class distances; and feature alignment methods, which use dual-stream networks to extract modality-shared features. However, the prior art has obvious drawbacks. GAN-based generation methods can alleviate visual differences, but they often introduce artifacts and noise, their training is unstable, and it is difficult to guarantee that the generated features are sufficiently discriminative. Existing feature alignment methods typically compute the relevance of global pixels based on Euclidean distance. This full-rank modeling strategy assumes by default that all information in the image has potential value, so while enhancing feature interaction, the model cannot distinguish deterministic human semantic information from random background noise. When the background of an infrared image is complex or occluded, background clutter (such as trees and vehicle heat sources) may be wrongly given high weight and participate in feature alignment, severely damaging the purity of the features. Although the pixel distributions of visible light and infrared images differ greatly, their underlying semantic structures (such as human posture and limb proportions) exhibit mathematical low-rankness (Low-Rankness), i.e., they can be linearly represented by a small number of basis vectors, whereas modality-specific background noise and artifacts generally exhibit high-rank or sparse distributions. The prior art lacks a mechanism for feature decoupling that exploits this rank-attribute difference, and it is difficult to automatically strip out high-rank noise in the feature extraction stage.
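The low-rankness argument above can be illustrated with a small numerical experiment (not part of the patent): a rank-1 "structure" matrix concentrates its energy in the leading singular value, while additive noise only spreads a thin tail across the spectrum, which a rank-limited reconstruction discards. Sizes and noise scale are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-1 stand-in for shared human-body structure: an outer product,
# so it is exactly representable by a single basis vector.
structure = np.outer(rng.standard_normal(64), rng.standard_normal(64))
noise = 0.1 * rng.standard_normal((64, 64))   # modality-specific clutter

s_clean = np.linalg.svd(structure, compute_uv=False)
s_noisy = np.linalg.svd(structure + noise, compute_uv=False)

# Clean structure: all energy in the top singular value.
# Noisy version: the tail stays small relative to the leading value.
print(s_clean[1] / s_clean[0])   # numerically ~0 (rank 1)
print(s_noisy[5] / s_noisy[0])   # small tail ratio
```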
Disclosure of the Invention

The invention aims to provide a cross-modal pedestrian re-identification method and device, which solve the problems of cross-modal feature misalignment and noise interference through multi-task joint training involving a structural perception enhancement module, a collaborative low-rank decomposition module, and collaborative decomposition constraint losses, and which improve the recognition accuracy of the model. In order to achieve the above purpose, the invention adopts the following technical scheme. In one aspect, the invention provides a cross-modal pedestrian re-identification method comprising the following steps: obtaining visible light images and infrared images to be identified, constructing training batches, and preprocessing them to obtain input images; inputting the input images into a trained pedestrian re-identification network to obtain a recognition result; wherein the pedestrian re-identification network comprises a convolutional neural network, a structural perception enhancement module and a collaborative low-rank decomposition module, the structural perception enhancement module is arranged between a first convolutional layer and a second convolutional layer of the convolutional neural network, and the collaborative low-rank decomposition module is arranged at the tail end of the convolutional neural network