CN-119942636-B - Pedestrian re-identification method and system based on dual-attention network

CN119942636B

Abstract

The invention relates to the technical field of computer vision and provides a pedestrian re-identification method and system based on a dual-attention network. The method comprises: extracting image data from a pedestrian re-identification data set and standardizing the extracted image data to obtain a processed training set and test set; constructing a basic convolutional network for pedestrian re-identification; constructing a dual-attention module comprising a first attention module and a second attention module; inserting the dual-attention module into the basic convolutional network to obtain a dual-attention network; training the dual-attention network on the processed training set; and verifying the trained dual-attention network on the processed test set to realize pedestrian re-identification.

Inventors

  • Li Zhihui
  • Shi Ming
  • Miao Jipu
  • Hu Wenli
  • Ding Xiaomin

Assignees

  • Qilu University of Technology (Shandong Academy of Sciences)
  • Shandong Artificial Intelligence Institute

Dates

Publication Date
2026-05-08
Application Date
2024-12-27

Claims (8)

  1. A pedestrian re-identification method based on a dual-attention network, the method comprising: step 1, extracting image data from a pedestrian re-identification data set and standardizing the extracted image data to obtain a processed training set and test set; step 2, constructing a basic convolutional network for pedestrian re-identification; step 3, constructing a dual-attention module comprising a first attention module and a second attention module; step 4, inserting the dual-attention module into the basic convolutional network to obtain a dual-attention network; step 5, training the dual-attention network on the processed training set and verifying the trained dual-attention network on the processed test set to realize pedestrian re-identification; wherein the first attention module in step 3 operates as follows: first, an input image feature map x is reduced by adaptive average pooling and a 1×1 convolution to obtain a feature y, expressed as y = Conv1×1(AvgPool(x)); the feature y then passes through a 1×1 convolution and a Softmax activation function to obtain channel weighting coefficients A1, expressed as A1 = Softmax(Conv1×1(y)); next, an identity matrix A0 and a parameter matrix A2 for adjusting the weighting coefficient of each channel are constructed, giving the final weighting coefficient matrix A = A0 + A2 ⊙ A1; the feature y is multiplied by this weighting coefficient matrix and passed sequentially through a 1×1 convolution, a ReLU activation, a second 1×1 convolution, and a Sigmoid activation to obtain the final weighting matrix y′ = Sigmoid(Conv1×1(ReLU(Conv1×1(A · y)))); finally, the feature map x is multiplied by the weighting matrix y′ to obtain the final weighted feature x′ = x ⊙ y′.
  2. The method according to claim 1, wherein step 2 comprises: the basic convolutional network is a deep residual network comprising a plurality of residual blocks, each residual block comprising a plurality of residual units, and each residual unit comprising two 3×3 convolution layers activated by a ReLU activation function; the main structure of the basic convolutional network comprises, in order, an input layer, an initial convolution layer, a maximum pooling layer, a first residual block, a second residual block, a third residual block, a fourth residual block, a global average pooling layer, a fully connected layer, and an output layer, wherein the first residual block comprises 3 residual units with 256 output channels each, the second residual block comprises 4 residual units with 512 output channels each, the third residual block comprises 6 residual units with 1024 output channels each, and the fourth residual block comprises 3 residual units with 2048 output channels each.
  3. The method of claim 1, wherein the second attention module in step 3 comprises channel attention and spatial attention; for spatial attention, given the number of input channels d, the number of groups g, and an image feature map X, a grouping operation first divides the channels into the preset number of groups, each group handling d/g channels, yielding grouped features Xg; global average pooling is then applied to each group along the height direction and the width direction, with the specific formulas z_c^H(h) = (1/W) Σ_{0≤i<W} Xg_c(h, i) and z_c^W(w) = (1/H) Σ_{0≤j<H} Xg_c(j, w), where Xg_c represents the input features of the c-th channel and H and W respectively represent the height and width of the features; the two pooling results are spliced together, and a 1×1 convolution, an activation function Sigmoid, and a group normalization operation generate two channel weighting coefficients A^H and A^W; finally, the weighting coefficients are multiplied with the original feature map to obtain a new weighted feature map X1 = Xg ⊙ A^H ⊙ A^W; for channel attention, the grouped features Xg are processed by a 3×3 convolution to extract a feature map X2 of the image; next, global average pooling and a Softmax operation are applied to the feature map X1 and the feature map X2 respectively, yielding feature maps S1 and S2, where the specific formula of Softmax is Softmax(z_c) = exp(z_c) / Σ_i exp(z_i), in which z_c is the pooling result of the c-th channel and z_i is the pooling result of the i-th channel; then, the feature map X1 and the feature map X2 are reshaped to obtain feature maps R1 and R2, and the coefficients are combined by matrix multiplication to generate a weighting matrix M, with the specific formula M = S1 · R2 + S2 · R1; finally, the obtained weighting matrix M is passed through the activation function Sigmoid, multiplied with the grouped features Xg, and reshaped to obtain the final weighted feature map X′.
  4. The method according to claim 1, wherein step 4 comprises: setting the input channel numbers of the first attention module to 256 and 2048 and inserting the module into the first residual block and the fourth residual block of the basic convolutional network respectively; setting the input channel numbers of the second attention module to 512 and 1024 and inserting the module into the second residual block and the third residual block of the basic convolutional network respectively; and taking the basic convolutional network with the first attention module and the second attention module inserted as the final dual-attention network, so that pedestrian re-identification is realized through the dual-attention network.
  5. The method according to claim 1, wherein step 5 comprises: inputting the processed training set into the dual-attention network for training, the combination of the first attention module and the second attention module capturing key channel information and spatial information in the image; the triplet loss formula is L_tri = max(d(a, p) − d(a, n) + α, 0), where d(x, y) represents the Euclidean distance between samples x and y; a is the anchor image, i.e. the reference image selected from the image library; p is a positive sample image, i.e. an image belonging to the same person as the anchor image; n is a negative sample image, i.e. an image belonging to a different person from the anchor image; and α is a hyperparameter representing the minimum distance margin between positive and negative samples; the cross-entropy loss function formula is L_ce = −(1/N) Σ_{i=1}^{N} log p_{y_i}(f_i), where N represents the total number of images in the data set, y_i represents the true label of image i, p_u(f_i) represents the predicted probability that the input image belongs to category u among U total categories, and f_i represents the features of the input image; the final loss function formula is L = L_tri + λ · L_ce, where λ is a weight coefficient balancing the two losses; and the performance of the model on the test set is visualized by showing the top-K images matched to each test image.
  6. A dual-attention-network-based pedestrian re-identification system for implementing the dual-attention-network-based pedestrian re-identification method of claim 1, the system comprising: a data processing module for extracting image data from a pedestrian re-identification data set and standardizing the extracted image data to obtain a processed training set and test set; a model construction module for constructing a basic convolutional network for pedestrian re-identification, constructing a dual-attention module comprising a first attention module and a second attention module, and inserting the dual-attention module into the basic convolutional network to obtain a dual-attention network; a model training module for training the dual-attention network on the processed training set; and a result visualization module for verifying the trained dual-attention network on the processed test set to realize pedestrian re-identification.
  7. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when executed, controls a device in which the computer-readable storage medium is located to perform the dual-attention-network-based pedestrian re-identification method of any one of claims 1 to 5.
  8. An electronic device comprising one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions that, when executed by the device, cause the device to perform the dual-attention-network-based pedestrian re-identification method of any one of claims 1 to 5.
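The combined objective in claim 5 can be sketched in plain NumPy. The margin, the toy feature vectors, and the balance weight `lam` below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def euclidean(a, b):
    """d(x, y): Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """L_tri = max(d(a, p) - d(a, n) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax over logits."""
    z = logits - logits.max()                     # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same identity: close to the anchor
negative = np.array([2.0, 0.0])   # different identity: far from the anchor
l_tri = triplet_loss(anchor, positive, negative)

logits = np.array([3.0, 0.5, -1.0])
l_ce = cross_entropy(logits, label=0)

lam = 0.5                          # weight balancing the two losses (assumption)
total = l_tri + lam * l_ce
```

Here the negative is far enough from the anchor that the triplet term is already zero, so the total loss reduces to the weighted cross-entropy term.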

Description

Pedestrian re-identification method and system based on dual-attention network

Technical Field

The invention relates to the technical field of computer vision, and in particular to a pedestrian re-identification method and system based on a dual-attention network.

Background

With the rapid development of science and technology, processing large volumes of video and image data has become critical. Such large-scale processing cannot be carried out manually, which gives rise to the task of pedestrian re-identification by computer: given a query image of a person, a pedestrian re-identification system searches a data set and returns images of the same person. Traditional pedestrian re-identification systems rely mainly on supervised learning, which requires extensive manual annotation and therefore cannot be deployed at scale. Unsupervised learning methods have consequently become popular: without labeled data, they assign pseudo-labels through clustering or similar methods and then train a neural network for recognition. Most existing unsupervised training methods address noise such as the inherent background of an image by capturing important information with a single attention mechanism or feature-learning method, but a single enhancement method cannot eliminate the influence of such noise comprehensively.

Disclosure of Invention

In view of the above, the invention provides a pedestrian re-identification method and system based on a dual-attention network, which improve the accuracy and stability of re-identification and enhance model performance.
In a first aspect, the present invention provides a pedestrian re-identification method based on a dual-attention network, the method comprising: step 1, extracting image data from a pedestrian re-identification data set and standardizing the extracted image data to obtain a processed training set and test set; step 2, constructing a basic convolutional network for pedestrian re-identification; step 3, constructing a dual-attention module comprising a first attention module and a second attention module; step 4, inserting the dual-attention module into the basic convolutional network to obtain a dual-attention network; and step 5, training the dual-attention network on the processed training set and verifying the trained dual-attention network on the processed test set to realize pedestrian re-identification.

Optionally, step 2 includes: the basic convolutional network is a deep residual network comprising a plurality of residual blocks, each residual block comprising a plurality of residual units, and each residual unit comprising two 3×3 convolution layers activated by a ReLU activation function. The main structure of the basic convolutional network comprises, in order: an input layer, an initial convolution layer, a maximum pooling layer, a first residual block, a second residual block, a third residual block, a fourth residual block, a global average pooling layer, a fully connected layer, and an output layer. The first residual block comprises 3 residual units with 256 output channels each; the second residual block comprises 4 residual units with 512 output channels each; the third residual block comprises 6 residual units with 1024 output channels each; and the fourth residual block comprises 3 residual units with 2048 output channels each.
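The block layout above can be traced with a short script. The stride-2 stem and the per-block downsampling are standard ResNet-50 conventions assumed here, since the text specifies only unit counts and channel widths:

```python
# Sketch of the ResNet-50-style backbone layout described in step 2.
# Unit counts and output channels per residual block follow the text;
# the stride-2 stem and stride-2 downsampling in blocks 2-4 are standard
# ResNet assumptions not stated explicitly in the patent.

BLOCKS = [  # (num_residual_units, output_channels)
    (3, 256),   # first residual block
    (4, 512),   # second residual block
    (6, 1024),  # third residual block
    (3, 2048),  # fourth residual block
]

def feature_shapes(h, w):
    """Trace the (channels, height, width) after each residual block."""
    h, w = h // 2, w // 2          # 7x7 stride-2 initial convolution
    h, w = h // 2, w // 2          # 3x3 stride-2 max pooling
    shapes = []
    for i, (_units, out_c) in enumerate(BLOCKS):
        if i > 0:                  # blocks 2-4 halve the spatial size
            h, w = h // 2, w // 2
        shapes.append((out_c, h, w))
    return shapes

shapes = feature_shapes(256, 128)  # a common person-ReID input resolution
# After global average pooling, the pedestrian descriptor is 2048-dimensional.
```

The 3-4-6-3 unit counts and 256/512/1024/2048 channel widths match the text, which is why a 2048-dimensional descriptor feeds the fully connected layer.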
Optionally, the first attention module in step 3 operates as follows: first, an input image feature map x is reduced by adaptive average pooling and a 1×1 convolution to obtain a feature y, expressed as y = Conv1×1(AvgPool(x)). The feature y then passes through a 1×1 convolution and a Softmax activation function to obtain channel weighting coefficients A1, expressed as A1 = Softmax(Conv1×1(y)). Next, an identity matrix A0 and a parameter matrix A2 for adjusting the weighting coefficient of each channel are constructed, giving the final weighting coefficient matrix A = A0 + A2 ⊙ A1. The feature y is multiplied by this weighting coefficient matrix and passed sequentially through a 1×1 convolution, a ReLU activation, a second 1×1 convolution, and a Sigmoid activation to obtain the final weighting matrix y′ = Sigmoid(Conv1×1(ReLU(Conv1×1(A · y)))). Finally, the feature map x is multiplied by the weighting matrix y′ to obtain the final weighted feature x′ = x ⊙ y′. Optionally, the second attention module in step 3 includes channel attention and spatial attention; for spatial attention, given the number of input channels d, the number of groups g, and the image feature map, a grouping operation first divides the channels of the image into the preset number of groups, where
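The first attention module described above can be sketched as a minimal NumPy implementation. The 1×1 convolutions are modelled as per-channel linear maps with random weights (W1 through W4), and all weight values are illustrative assumptions, not the patented parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_attention(x, d):
    """Channel-attention sketch: x has shape (d, H, W); returns weighted x'.

    W1..W4 stand in for the 1x1 convolutions; A0, A1, A2 follow the text.
    """
    W1, W2, W3, W4 = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
    # adaptive average pooling to 1x1, then a 1x1 convolution -> feature y
    y = W1 @ x.mean(axis=(1, 2))                # shape (d,)
    # 1x1 convolution + Softmax -> channel weighting coefficients A1
    A1 = softmax(W2 @ y)
    A0 = np.ones(d)                             # identity weighting
    A2 = rng.standard_normal(d) * 0.1           # per-channel adjustment (learned)
    A = A0 + A2 * A1                            # final weighting coefficients
    # 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid -> final weighting matrix y'
    y_prime = sigmoid(W4 @ np.maximum(W3 @ (A * y), 0.0))
    # weight the original feature map channel-wise
    return x * y_prime[:, None, None]

x = rng.standard_normal((8, 4, 4))
x_weighted = first_attention(x, d=8)
```

Because the Sigmoid keeps every channel weight in (0, 1), the module can only attenuate channels, never amplify them, which matches its role of suppressing background noise.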