
CN-121617047-B - Crowd density estimation method and system based on feature-aware weighted contrastive learning

CN121617047B

Abstract

The invention provides a crowd density estimation method and system based on feature-aware weighted contrastive learning, relating to the technical field of crowd density estimation. The method comprises: inputting a basic feature map into each branch of a constructed multi-level parallel dilated convolution layer; normalizing the branches' importance scores to obtain a fusion weight coefficient for each branch and thereby a final fused feature map; constructing a regression network to generate a low-resolution density map; inputting that density map into a lightweight convolution network to generate a scaling mask; and multiplying the bilinearly upsampled density map by the scaling mask element by element to generate a predicted density map. The multi-level parallel dilated convolution layer lets the network capture local details and wide-range context simultaneously, channel attention automatically strengthens person-related feature channels, and spatial attention accurately locates high-density areas. The density map is calibrated through loss functions so that each pixel holds an actual per-pixel people count, and the total count for an image is obtained by direct summation.

Inventors

  • WANG JIAOYU
  • SONG JUNFANG
  • FAN SINUO
  • ZHANG XIN

Assignees

  • Xizang Minzu University (西藏民族大学)

Dates

Publication Date
2026-05-08
Application Date
2026-02-03

Claims (9)

  1. A crowd density estimation method based on feature-aware weighted contrastive learning, characterized by comprising the following steps: S1, standardizing an image to be analyzed and inputting the standardized image into a pre-trained feature extraction network, wherein the feature extraction network generates a basic feature map corresponding to the image to be analyzed by analyzing spatial features and channel features; S2, inputting the basic feature map into a multi-level parallel dilated convolution layer, performing a convolution operation on each parallel branch with dilated convolution kernels of different scales to generate a branch feature map, and performing a pooling operation on the branch feature map output by each branch; S3, processing each pooled branch feature map through an attention mechanism to generate enhanced feature maps, and calculating a corresponding importance score based on the response intensity of each enhanced feature map; normalizing the importance scores of all branches to obtain fusion weight coefficients, and performing a weighted summation of the fusion weight coefficients and the corresponding enhanced feature maps to obtain a fused feature map corresponding to the image to be analyzed; S4, inputting the fused feature map into a pre-trained regression network to generate a low-resolution predicted density map, upsampling the low-resolution predicted density map to obtain a high-resolution density map, inputting the low-resolution predicted density map into a lightweight network to generate a scaling mask, and multiplying the high-resolution density map and the scaling mask element by element to obtain a predicted density map; S5, applying a non-negative constraint to the generated predicted density map, and summing the crowd density values of all pixel points to obtain the crowd density estimation output for the image to be analyzed; the total crowd count of the whole image is obtained as follows: a non-negative constraint is applied to the predicted density map so that the crowd density value of each pixel point is not less than zero: D(i, j) = max(ε, P(i, j)), wherein ε denotes the non-negative constraint coefficient (typically ε = 0); P(i, j) denotes the predicted density value at pixel point (i, j); D(i, j) denotes the crowd density value at pixel point (i, j); P denotes the predicted density map; i denotes the horizontal pixel index of the predicted density map; j denotes the vertical pixel index of the predicted density map; the crowd density values of all pixel points are then summed to obtain the total crowd count of the whole image: N = Σ_{i=1}^{W} Σ_{j=1}^{H} D(i, j), wherein W denotes the width of the predicted density map; H denotes the height of the predicted density map; N denotes the total number of people.
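The counting rule at the end of claim 1 reduces to a clamp followed by a sum. A minimal sketch in plain Python, using a tiny 2×2 map as illustrative data (not taken from the patent):

```python
# Sketch of claim 1, step S5: clamp the predicted density map so every pixel
# is non-negative, then sum all pixel values to get the estimated head count.

def count_from_density_map(pred):
    """Apply D(i, j) = max(0, P(i, j)), then N = sum over all pixels."""
    clamped = [[max(0.0, v) for v in row] for row in pred]
    total = sum(sum(row) for row in clamped)
    return clamped, total

pred = [[0.2, -0.1], [0.5, 0.4]]   # the negative value simulates regression noise
clamped, total = count_from_density_map(pred)
print(total)  # 1.1 (the -0.1 is clamped to 0 before summing)
```

Because each pixel stores a fractional per-pixel count, the image-level total needs no further calibration beyond this summation.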
  2. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 1, wherein the basic feature map is obtained by the following specific steps: obtaining an original crowd image; decomposing the pixel values of the original image into red, green and blue color channels to obtain per-channel pixel values; subtracting a preset target mean from the pixel values of each channel and dividing the result by a preset target standard deviation to obtain a standardized image; and inputting the standardized image into the feature extraction network, wherein the feature extraction network is a pre-trained ResNet-50 network and the feature map output by its last convolution layer is extracted as the basic feature map.
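The per-channel standardization of claim 2 is the usual subtract-mean/divide-by-std scheme. A sketch for one pixel, where the mean and standard deviation values are the commonly used ImageNet statistics, assumed here for illustration rather than taken from the patent:

```python
# Per-channel standardization as described in claim 2. MEAN and STD are
# assumed ImageNet statistics, not values specified by the patent.

MEAN = (0.485, 0.456, 0.406)  # assumed per-channel target means (R, G, B)
STD = (0.229, 0.224, 0.225)   # assumed per-channel target standard deviations

def normalize_pixel(rgb):
    """Normalize one pixel given as (r, g, b) floats in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

out = normalize_pixel((0.485, 0.456, 0.406))
print(out)  # (0.0, 0.0, 0.0): a pixel equal to the channel means maps to zero
```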
  3. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 2, wherein the branch feature map output by each branch is obtained by the following specific steps: applying to the basic feature map a dilated convolution kernel with dilation rate 2, normalizing the features of each channel of the convolution output to zero mean and unit variance through a batch normalization layer, and scaling the batch-normalized pixel points with a ReLU activation function to obtain a first branch feature map; applying to the basic feature map a dilated convolution kernel with dilation rate 4, normalizing the features of each channel of the convolution output to zero mean and unit variance through a batch normalization layer, and scaling the batch-normalized pixel points with a ReLU activation function to obtain a second branch feature map; applying to the basic feature map a dilated convolution kernel with dilation rate 6, normalizing the features of each channel of the convolution output to zero mean and unit variance through a batch normalization layer, and scaling the batch-normalized pixel points with a ReLU activation function to obtain a third branch feature map.
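The effect of the dilation rates in claim 3 can be seen in one dimension: a dilation rate d samples input taps d positions apart, widening the receptive field without adding weights. A minimal 1-D sketch (kernel size 3 is assumed for illustration; the patent's branches are 2-D):

```python
# Minimal 1-D dilated convolution illustrating the dilation rates of claim 3.

def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D convolution with the given dilation rate."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # effective receptive field
    out = []
    for i in range(len(signal) - span + 1):
        out.append(sum(kernel[j] * signal[i + j * dilation] for j in range(k)))
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8]
print(dilated_conv1d(x, [1, 1, 1], dilation=2))  # taps 2 apart: [9, 12, 15, 18]
```

With dilation 2 each output sums taps spanning 5 input positions, versus 3 positions at dilation 1, which is why rates 2, 4 and 6 give the parallel branches progressively larger context.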
  4. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 3, wherein the branch feature maps are processed by the attention mechanism through the following specific steps: performing global average pooling and global max pooling on the feature map output by each branch to obtain two corresponding feature maps; sequentially inputting the two feature maps into a weight-shared multi-layer perceptron consisting of two fully connected layers with a ReLU activation function in between for nonlinear transformation; adding the perceptron outputs element by element; and normalizing the sum to between 0 and 1 with a Sigmoid function to generate a channel attention weight vector characterizing each branch feature map; meanwhile, concatenating the pooled feature maps along the channel dimension and passing the concatenated feature map through convolution layers with dilation rates 1 and 3 respectively, each followed by batch normalization, to obtain a local detail feature map and a context detail feature map respectively; performing a weighted fusion of the local detail feature map and the context detail feature map through a learnable scalar parameter initialized to 0.5, specifically multiplying each pixel point of the local detail feature map by the learnable scalar parameter and each pixel point of the context detail feature map by 1 minus the learnable scalar parameter, and adding the two to obtain a weighted fusion feature map; and inputting the weighted fusion feature map into a convolution layer that compresses the number of feature channels to 1 and normalizing the result with a Sigmoid function to obtain a spatial attention weight matrix for each branch feature map.
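The channel-attention half of claim 4 can be sketched with plain lists. The shared two-layer MLP is replaced here by an identity function for brevity (an assumption, not the patent's trained perceptron), so only the pooling-plus-Sigmoid skeleton is shown:

```python
import math

# Channel attention sketch for claim 4: per-channel global average and max
# pooling, a shared MLP (identity stand-in here), then Sigmoid to (0, 1).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, mlp=lambda v: v):
    """feature_maps: list of 2-D lists, one per channel."""
    weights = []
    for ch in feature_maps:
        flat = [v for row in ch for v in row]
        avg = sum(flat) / len(flat)
        mx = max(flat)
        weights.append(sigmoid(mlp(avg) + mlp(mx)))  # shared-weight MLP, summed
    return weights

fmap = [[[0.0, 0.0], [0.0, 0.0]],   # inactive channel
        [[1.0, 2.0], [3.0, 2.0]]]   # strongly responding channel (illustrative)
w = channel_attention(fmap)
print(w)  # the second channel gets a weight near 1, the first stays at 0.5
```

The spatial branch works analogously but keeps the H×W layout, compressing channels to one before the Sigmoid.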
  5. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 4, wherein the enhanced feature map is generated by the following specific steps: based on the output feature map of each branch, expanding the channel attention weight vector and the spatial attention weight matrix by outer product and then multiplying element by element to achieve attention enhancement: F'_k = (c_k ⊗ S_k) ⊙ F_k, wherein F'_k denotes the enhanced feature map of the k-th branch; F_k denotes the k-th branch feature map; ⊗ denotes the outer product expansion operation; ⊙ denotes element-by-element multiplication; c_k denotes the channel attention weight vector of the k-th branch feature map; k denotes the branch index, k ∈ {1, 2, 3}; S_k denotes the spatial attention weight matrix of the k-th branch feature map.
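The enhancement in claim 5 expands a length-C channel vector and an H×W spatial matrix into a C×H×W weight volume and gates the branch features with it. A nested-list sketch with a single channel:

```python
# Sketch of claim 5: F'_k = (c_k outer S_k) elementwise-times F_k.

def enhance(feature, c, S):
    """feature: C x H x W nested lists; c: length-C vector; S: H x W matrix."""
    return [[[c[k] * S[i][j] * feature[k][i][j]
              for j in range(len(S[0]))]
             for i in range(len(S))]
            for k in range(len(c))]

feature = [[[1.0, 1.0], [1.0, 1.0]]]     # one channel, 2 x 2, all ones
c = [0.5]                                 # channel attention weight
S = [[1.0, 0.0], [0.0, 1.0]]              # spatial attention mask
out = enhance(feature, c, S)
print(out)  # [[[0.5, 0.0], [0.0, 0.5]]]
```

Each output element is the product of its channel weight, its spatial weight, and the original feature value, so channel and spatial attention act jointly rather than in sequence.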
  6. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 5, wherein the importance score of the enhanced feature map of each branch is calculated by the following specific steps: calculating the Pearson correlation coefficient between the spatial attention weights and the channel attention weights of each branch feature map: ρ_k = Cov(vec(c_k), vec(S_k)) / (σ(c_k) · σ(S_k)), wherein ρ_k denotes the Pearson correlation coefficient of the k-th branch feature map; vec(·) denotes the vectorization operation; Cov(·, ·) denotes the covariance; σ(c_k) denotes the standard deviation of the channel attention weights; σ(S_k) denotes the standard deviation of the spatial attention weights about their mean; calculating the energy focusing index of each branch feature map: E_k = (1 / (H · W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F'_k(i, j)², wherein ‖·‖ denotes the norm underlying this squared sum; E_k denotes the energy focusing index of the k-th branch feature map; H denotes the height of the enhanced feature map; W denotes the width of the enhanced feature map; i denotes the horizontal pixel index of the enhanced feature map; j denotes the vertical pixel index of the enhanced feature map; F'_k(i, j) denotes the value of the enhanced feature map of the k-th branch at pixel point (i, j); calculating the channel activation index of each branch feature map: H_k = −Σ_{c=1}^{C} c_{k,c} · log c_{k,c} and A_k = 1 − H_k / log C, wherein H_k denotes the entropy of the channel attention weights of the k-th branch feature map; C denotes the total number of channels; c denotes the channel index; c_{k,c} denotes the channel attention weight of the c-th channel of the k-th branch feature map; A_k denotes the channel activation index of the k-th branch feature map; and geometrically averaging the Pearson correlation coefficient, the energy focusing index and the channel activation index of each branch feature map to generate the importance score of the enhanced feature map of each branch: s_k = (ρ_k · E_k · A_k)^{1/3}, wherein s_k denotes the importance score of the enhanced feature map of the k-th branch.
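The three indicators in claim 6 and their geometric mean can be sketched directly. The exact formulas here are reconstructions from the claim's symbol descriptions (the original equations were lost in extraction), and the example inputs are illustrative:

```python
import math

# Sketch of claim 6's importance score: geometric mean of the Pearson
# correlation, an energy-focusing index, and an entropy-based channel
# activation index. Formulas reconstructed, not quoted verbatim.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

def channel_activation(c):
    """1 minus the normalized entropy of the channel attention distribution."""
    p = [w / sum(c) for w in c]
    ent = -sum(q * math.log(q) for q in p if q > 0)
    return 1.0 - ent / math.log(len(p))

def importance(rho, energy, activation):
    return (rho * energy * activation) ** (1.0 / 3.0)

rho = pearson([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])   # near-perfect correlation
act = channel_activation([0.9, 0.05, 0.05])       # peaked -> high activation
score = importance(rho, energy=0.8, activation=act)
print(round(score, 3))
```

A peaked channel distribution yields low entropy and thus a high activation index, which matches the claim's intent of rewarding branches whose attention is decisive rather than diffuse.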
  7. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 6, wherein the fused feature map is generated by the following specific steps: normalizing the importance scores of all branches to obtain the fusion weight coefficients: w_k = s_k / Σ_{m=1}^{3} s_m, wherein w_k denotes the fusion weight coefficient of the k-th branch; and performing a weighted summation of the fusion weight coefficients and the corresponding enhanced feature maps to obtain the fused feature map corresponding to the image to be analyzed: F = Conv_{1×1}(Σ_{k=1}^{3} w_k · F'_k), wherein F denotes the fused feature map and Conv_{1×1}(·) denotes a 1×1 convolution operation that unifies the number of channels.
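The fusion of claim 7 amounts to score/sum normalization followed by a weighted mix of the branch maps (a softmax would also fit the claim's wording; plain normalization is assumed here, and the 1×1 channel-unifying convolution is omitted for brevity):

```python
# Sketch of claim 7: normalize importance scores into weights summing to 1,
# then mix the branch maps by weighted summation.

def fusion_weights(scores):
    total = sum(scores)
    return [s / total for s in scores]

def fuse(branch_maps, weights):
    """branch_maps: list of equally sized 2-D lists (channel axis omitted)."""
    h, w = len(branch_maps[0]), len(branch_maps[0][0])
    return [[sum(wt * bm[i][j] for wt, bm in zip(weights, branch_maps))
             for j in range(w)] for i in range(h)]

weights = fusion_weights([2.0, 1.0, 1.0])
fused = fuse([[[4.0]], [[0.0]], [[0.0]]], weights)
print(weights, fused)  # [0.5, 0.25, 0.25] [[2.0]]
```

Because the weights sum to 1, a branch with twice the importance score contributes twice the share of the fused response.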
  8. The crowd density estimation method based on feature-aware weighted contrastive learning according to claim 7, wherein the predicted density map is generated by the following specific steps: inputting the fused feature map into a pre-trained regression network to generate a low-resolution predicted density map, and performing bilinear upsampling on the low-resolution predicted density map to obtain a high-resolution density map, specifically: upsampling the low-resolution predicted density map 32 times by bilinear interpolation, restoring its spatial size to that of the original input image, to obtain a preliminary high-resolution density map; inputting the same low-resolution predicted density map into a lightweight convolution network consisting of a single 3×3 convolution layer and a Sigmoid activation function, which extracts the local context of the density distribution through the convolution layer and generates a spatially adaptive scaling mask with values between 0 and 1; and finally multiplying the high-resolution density map by the scaling mask element by element to generate the predicted density map.
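The refinement in claim 8 is an upsample followed by an elementwise gate. In this sketch a nearest-neighbour 2× upsample stands in for the patent's 32× bilinear interpolation, and the mask values are illustrative rather than the output of a trained network:

```python
# Sketch of claim 8: upsample the low-resolution density map, then multiply
# element by element with a (0, 1) scaling mask.

def upsample2x(m):
    """Nearest-neighbour 2x upsample (stand-in for bilinear interpolation)."""
    out = []
    for row in m:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def apply_mask(density, mask):
    return [[d * s for d, s in zip(dr, mr)] for dr, mr in zip(density, mask)]

low = [[1.0]]
high = upsample2x(low)                        # 2 x 2, all ones
mask = [[0.9, 0.9], [0.2, 0.2]]               # suppresses the bottom half
print(apply_mask(high, mask))  # [[0.9, 0.9], [0.2, 0.2]]
```

The mask lets the network locally rescale the interpolated map, sharpening regions that plain upsampling would smear.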
  9. A crowd density estimation system based on feature-aware weighted contrastive learning, characterized in that the system is configured to execute the estimation method of any one of claims 1 to 8 and comprises: an image acquisition module, configured to standardize the image to be analyzed and input the standardized image into the pre-trained feature extraction network, the feature extraction network generating a basic feature map corresponding to the image to be analyzed by analyzing spatial features and channel features; a pooling processing module, configured to input the basic feature map into the multi-level parallel dilated convolution layer, perform a convolution operation on each parallel branch with dilated convolution kernels of different scales to generate a branch feature map, and perform a pooling operation on the branch feature map output by each branch; a fusion feature acquisition module, configured to process each pooled branch feature map through the attention mechanism to generate enhanced feature maps, calculate corresponding importance scores based on the response intensities of the enhanced feature maps, normalize the importance scores of all branches to obtain fusion weight coefficients, and perform a weighted summation of the fusion weight coefficients and the corresponding enhanced feature maps to obtain a fused feature map corresponding to the image to be analyzed; a prediction density acquisition module, configured to input the fused feature map into the pre-trained regression network to generate a low-resolution predicted density map, upsample the low-resolution predicted density map to obtain a high-resolution density map, input the low-resolution predicted density map into the lightweight network to generate a scaling mask, and multiply the high-resolution density map and the scaling mask element by element to obtain a predicted density map; and a people counting module, configured to apply a non-negative constraint to the generated predicted density map and sum the crowd density values of all pixel points to obtain the crowd density estimation output for the image to be analyzed.

Description

Crowd density estimation method and system based on feature-aware weighted contrastive learning

Technical Field

The invention relates to the technical field of crowd density estimation, in particular to a crowd density estimation method and system based on feature-aware weighted contrastive learning.

Background

Existing crowd density estimation methods generally rely on a single convolutional neural network to extract features and struggle to cope with the multi-scale characteristics and spatial non-uniformity of crowd distributions in complex scenes. Meanwhile, attention is often applied along a single dimension, without modeling the synergy between dimensions, so the model perceives too little detail in dense areas, adapts poorly to scenes where sparse and dense crowds coexist, and ultimately loses precision in the density map and accuracy in the crowd count.
In the prior art, the document published as CN117351414A describes a method that builds a crowd density estimation model from a difference texture module and a multi-channel threshold replacement attention module, using the first 10 convolution layers of a VGG-16 network as the front-end network. That method, however, does not capture the multi-scale spatial features of crowd images through a multi-level parallel dilated convolution structure combined with a joint spatial-channel attention mechanism for dual enhancement of key information; does not adaptively integrate the contributions of different-scale branches through an importance-score weighted fusion strategy, which would markedly improve the model's perception and characterization of complex crowd distributions; and does not generate a high-resolution density map with richer detail and clearer boundaries that improves counting accuracy and robustness in sparse, dense and mixed scenes at low computational cost. A crowd density estimation method and system based on feature-aware weighted contrastive learning are therefore needed. The information disclosed in this background section is only for enhancement of understanding of the background of the disclosure, and may therefore include information that does not form prior art already known to a person of ordinary skill in the art.

Disclosure of Invention

The invention aims to provide a crowd density estimation method and system based on feature-aware weighted contrastive learning, so as to solve the problems identified in the background.
In order to achieve the above purpose, the present invention provides the following technical solution. The crowd density estimation method based on feature-aware weighted contrastive learning comprises the following steps: S1, standardizing an image to be analyzed and inputting the standardized image into a pre-trained feature extraction network, wherein the feature extraction network generates a basic feature map corresponding to the image to be analyzed by analyzing spatial features and channel features; S2, inputting the basic feature map into a multi-level parallel dilated convolution layer, performing a convolution operation on each parallel branch with dilated convolution kernels of different scales to generate a branch feature map, and performing a pooling operation on the branch feature map output by each branch; S3, processing each pooled branch feature map through an attention mechanism to generate enhanced feature maps, and calculating a corresponding importance score based on the response intensity of each enhanced feature map; normalizing the importance scores of all branches to obtain fusion weight coefficients, and performing a weighted summation of the fusion weight coefficients and the corresponding enhanced feature maps to obtain a fused feature map corresponding to the image to be analyzed; S4, inputting the fused feature map into a pre-trained regression network to generate a low-resolution predicted density map, upsampling the low-resolution predicted density map to obtain a high-resolution density map, inputting the low-resolution predicted density map into a lightweight network to generate a scaling mask, and multiplying the high-resolution density map and the scaling mask element by element to obtain a predicted density map; and S5, applying a non-negative constraint to the generated predicted density map and summing the crowd density values of all pixel points to obtain the crowd density estimation output for the image to be analyzed. Further, the basic feature map is obtained through the following specific steps: The method comp