CN-122024169-A - Semi-supervised crowd counting method based on asymmetric co-evolution network

Abstract

The invention discloses a semi-supervised crowd counting method based on an asymmetric co-evolution network, comprising the following steps: step 1, obtaining public crowd counting data sets; step 2, constructing a semi-supervised crowd counting model based on the asymmetric co-evolution network, the model comprising a "strong student" model and a "steady teacher" model; step 3, designing a dynamic targeting excitation mechanism that uses the prediction result of the "steady teacher" model to dynamically construct a targeted reasoning task as the input of the "strong student" model, and designing a targeting consistency loss function; and step 4, training the model constructed in step 2 on the data sets of step 1 with the training mechanism proposed in step 3 to obtain a trained semi-supervised crowd counting model. The method resolves the conflict between model learning capability and supervision stability in existing counting methods, and addresses the difficulty such models have in learning high-order scene semantics from unlabeled data and in handling complex occlusion and scale variation.

Inventors

SHI HONGYU
ZHAO ZIYI
ZHANG KAIBING
GUAN SHENGQI
MENG YALEI

Assignees

Xi'an Polytechnic University (西安工程大学)

Dates

Publication Date: 20260512
Application Date: 20260213

Claims (8)

1. A semi-supervised crowd counting method based on an asymmetric co-evolution network, characterized by comprising the following steps: step 1, acquiring public crowd counting data sets; step 2, constructing a semi-supervised crowd counting model based on an asymmetric co-evolution network, wherein the semi-supervised crowd counting model comprises a "strong student" model and a "steady teacher" model; step 3, designing a dynamic targeting excitation mechanism, dynamically constructing a targeted reasoning task by using the prediction result of the "steady teacher" model as the input of the "strong student" model, and designing a targeting consistency loss function; and step 4, training the model constructed in step 2 on the data sets of step 1 with the training mechanism proposed in step 3 to obtain a trained semi-supervised crowd counting model, thereby realizing crowd counting.
2. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 1, wherein in step 1 the public crowd counting data sets are the ShanghaiTech Part A, ShanghaiTech Part B, UCF-QNRF and JHU-Crowd++ data sets; samples are randomly drawn from the training set of each data set as labeled data according to a ratio of 1:2:8, and the remaining samples are used as unlabeled data.
3. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 2, wherein in step 2 the "steady teacher" model adopts an improved P2PNet network architecture comprising a VGG16 backbone network and an initial positioning decoding module, and the weights of the "steady teacher" model are smoothly updated by an exponential moving average (EMA) mechanism according to the following formula: $\theta^{T}_{t} = \alpha\,\theta^{T}_{t-1} + (1-\alpha)\,\theta^{S}_{t}$ (1), wherein $\theta^{T}_{t}$ denotes the weight parameters of the "steady teacher" model at the t-th batch of training, i.e. the result of the EMA smoothing update; $\theta^{T}_{t-1}$ denotes the weight parameters of the "steady teacher" model at the previous batch (t-1), which contain the accumulated smoothing information of all earlier batches; $\alpha$ denotes the smoothing coefficient; and $\theta^{S}_{t}$ denotes the current weight parameters of the "strong student" model at the t-th batch.
4. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 3, wherein in step 2 the "strong student" model comprises a VGG16 backbone network and a perception-decoding-reasoning network; the perception-decoding-reasoning network comprises, in sequence, a multi-scale semantic perception module, an initial positioning decoding module and a structured context graph reasoning module; the multi-scale semantic perception module fuses and upsamples the feature maps extracted by the VGG16 backbone network to output a multi-scale feature pyramid; the initial positioning decoding module uses a weight-shared P2PNet prediction head to decode initial candidate point positions from the multi-scale feature pyramid; and the structured context graph reasoning module uses a graph attention mechanism to perform context information interaction and feature enhancement on the candidate points, corrects the confidence of the candidate points, and obtains the final predicted point set.
5. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 4, wherein the structured context graph reasoning module is implemented as follows: 1) graph construction: taking the candidate points as graph nodes, forming each node's initial feature vector by concatenating the confidence score of the candidate point with its local visual features, and building the neighbor node set of each node with the k-NN algorithm according to the spatial Euclidean distance between nodes, thereby forming the full graph structure; 2) graph attention reasoning: inputting the graph structure into a graph attention network (GAT), and obtaining the final enhanced feature vector of each node through linear feature transformation, attention coefficient calculation, weight normalization and multi-head attention aggregation; 3) prediction correction: inputting the final enhanced feature vector into a multi-layer perceptron (MLP) to generate a context-corrected confidence score for each candidate point, and combining it with the candidate point coordinates to obtain the final predicted point set.
6. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 5, wherein in step 3 the dynamic targeting excitation mechanism is designed as follows: 1) guide signal generation: inputting an unlabeled image into the "steady teacher" model to generate a pseudo-label point set; 2) point-to-region conversion: defining a square image block of fixed size centered on each pseudo-label point, and merging the image blocks that overlap spatially to form a set Ω of high-density crowd targeting regions; 3) targeting region scrambling: dividing each targeting region into sub-blocks, randomly permuting the sub-blocks, and reassembling them to generate a scrambled image that serves as the input of the "strong student" model.
7. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 6, wherein in step 3 the targeting consistency loss function $L$ is given by the following formula (2): $L = L_{sup} + \lambda L_{u}$ (2), wherein $\lambda$ is a weight factor, $L_{sup}$ is the supervised loss, and $L_{u}$ is the unsupervised consistency loss.
8. The semi-supervised crowd counting method based on an asymmetric co-evolution network according to claim 7, wherein in step 3 the supervised loss consists of a localization loss and a classification loss, combined as given in formula (3). The localization loss acts only on the positive sample set and aims to minimize the positional deviation between the predicted points and the ground-truth points, the difference being measured by the loss given in formula (4), in which each candidate point is compared with the ground-truth point matched to it. The classification loss acts on all candidate points and consists of two cross-entropy terms, one for positive samples and one for negative samples, as given in formula (5), in which the confidence score of each candidate point is used. [Formulas (3) to (5) are not reproduced in this text.]
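
The EMA weight update recited in claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: weights are represented as plain lists of floats, the function name `ema_update` and the value of `alpha` are hypothetical, and it assumes that only parameters present in the teacher are updated (the "strong student" has additional modules, and this text does not state how its weights are mapped onto the teacher).

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def ema_update(teacher: Weights, student: Weights, alpha: float = 0.99) -> Weights:
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student.

    Assumption: only parameter names that exist in the teacher are updated.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Toy example: one shared parameter tensor, flattened to a list.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0], "graph_module.w": [0.5]}  # extra student-only weights
updated = ema_update(teacher, student, alpha=0.9)
```

A value of `alpha` close to 1 makes the teacher change slowly, which is the smoothing effect the claim describes.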
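
The graph construction step of claim 5 (nodes from candidate points, features from confidence plus local visual features, neighbors from k-NN over Euclidean distance) can be sketched as follows. This is an illustrative sketch only: the helper names `knn_neighbors` and `node_features` are hypothetical, the value of k is arbitrary, and the graph attention network and MLP stages are not shown.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def knn_neighbors(points: Sequence[Point], k: int) -> List[List[int]]:
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by Euclidean distance; ties are broken by index for determinism.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        neighbors.append(others[:k])
    return neighbors


def node_features(confidences: Sequence[float],
                  visual_feats: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate each confidence score with its local visual feature vector."""
    return [[c] + list(v) for c, v in zip(confidences, visual_feats)]


# Toy example: two tight pairs of candidate points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
neighbors = knn_neighbors(points, k=1)
features = node_features([0.9, 0.8, 0.7, 0.6],
                         [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
```

The brute-force search is quadratic in the number of candidate points; a spatial index would be a natural replacement for dense scenes.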
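
The point-to-region conversion and region scrambling of claim 6 can be sketched as below. The sketch rests on assumptions not stated in this text: overlapping squares are merged into their joint bounding box, sub-blocks are square tiles of a fixed side, and partial tiles at a region edge are left in place. All function names are hypothetical, and the image is a plain 2D list rather than a tensor.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def point_to_box(x: int, y: int, size: int, width: int, height: int) -> Box:
    """Fixed-size square centered on a pseudo-label point, clipped to the image."""
    half = size // 2
    return (max(0, x - half), max(0, y - half),
            min(width, x + half), min(height, y + half))


def overlaps(a: Box, b: Box) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_boxes(boxes: List[Box]) -> List[Box]:
    """Repeatedly fuse overlapping boxes into their joint bounding box."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def shuffle_region(image: List[List[int]], box: Box, block: int,
                   rng: random.Random) -> None:
    """Randomly permute block x block tiles inside `box`, in place."""
    x0, y0, x1, y1 = box
    origins = [(x, y)
               for y in range(y0, y1 - block + 1, block)
               for x in range(x0, x1 - block + 1, block)]
    tiles = [[row[x:x + block] for row in image[y:y + block]] for x, y in origins]
    rng.shuffle(tiles)
    for (x, y), tile in zip(origins, tiles):
        for dy, tile_row in enumerate(tile):
            image[y + dy][x:x + block] = tile_row


# Toy example: three pseudo-label points on a 12 x 12 image.
pseudo_points = [(2, 2), (3, 3), (10, 10)]
regions = merge_boxes([point_to_box(x, y, 4, 12, 12) for x, y in pseudo_points])

image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffle_region(image, (0, 0, 4, 4), block=2, rng=random.Random(0))
```

Pixels outside the targeting regions are untouched, so only the regions the teacher flagged as crowded are scrambled.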
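
The loss structure described in claims 7 and 8 can be sketched as follows. This is one plausible reading, not the patent's formulas, which are not reproduced in this text. The sketch assumes a mean squared Euclidean distance for the localization loss, an unweighted binary cross-entropy averaged over all candidates for the classification loss, an unweighted sum for the supervised loss, and a total of supervised loss plus a weighted consistency loss. All function names and the weight `lam` are hypothetical, and the unsupervised consistency loss is taken as a given number because its definition is not part of this excerpt.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def localization_loss(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared distance over matched positive pairs (assumed metric)."""
    if not pred:
        return 0.0
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)


def classification_loss(scores: Sequence[float], is_positive: Sequence[bool],
                        eps: float = 1e-7) -> float:
    """Cross-entropy over all candidates: a positive term and a negative term."""
    total = 0.0
    for s, pos in zip(scores, is_positive):
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -math.log(s) if pos else -math.log(1.0 - s)
    return total / max(len(scores), 1)


def supervised_loss(l_loc: float, l_cls: float) -> float:
    """Assumed unweighted sum of the two supervised terms."""
    return l_loc + l_cls


def total_loss(l_sup: float, l_cons: float, lam: float) -> float:
    """Supervised loss plus weighted unsupervised consistency loss."""
    return l_sup + lam * l_cons


# Toy example: one matched positive candidate and one negative candidate.
l_loc = localization_loss([(1.0, 1.0)], [(1.0, 2.0)])
l_cls = classification_loss([0.5, 0.5], [True, False])
loss = total_loss(supervised_loss(l_loc, l_cls), l_cons=2.0, lam=0.5)
```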

Description

Semi-supervised crowd counting method based on asymmetric co-evolution network

Technical Field

The invention belongs to the technical field of computer vision and relates to a semi-supervised crowd counting method based on an asymmetric co-evolution network.

Background

In the field of computer vision, crowd counting is a fundamental and critical task. It aims to estimate the number of people in an image or video and, further, to localize individuals accurately. The technology has important application value in scenarios such as public safety monitoring, traffic flow management, city planning, and large-scale event monitoring. In practical application scenarios, however, crowd counting faces a number of complex challenges, including extremely dense crowds, severe scale variation, severe mutual occlusion, and background noise interference, which together make high-precision crowd counting and localization very challenging. In particular, although point-localization-based methods can provide more accurate individual position information, they rely on a large amount of manually annotated data, and the high annotation cost severely restricts their adoption in large-scale scenes.

In recent years, with the development of deep learning, deep-learning-based crowd counting methods have been proposed continuously. Early work focused on fully supervised learning, such as MCNN (Zhang Y, Zhou D, Chen S, et al. Single-image crowd counting via multi-column convolutional neural network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2016: 589-597.) and CSRNet (Li Y, Zhang X, Chen D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1091-1100.). MCNN uses multiple columns of stacked CNN layers with different receptive fields to handle variation in head size, improving the multi-scale adaptability of the model through multiple columns or multiple network models. CSRNet uses dilated convolution to enlarge the receptive field while maintaining resolution, thereby achieving multi-scale feature fusion and improving the understanding of dense scenes.

To alleviate the dependence on large-scale annotated data, semi-supervised learning has gradually become a research hotspot in crowd counting, and semi-supervised crowd counting methods based on the Mean-Teacher framework have emerged and become mainstream. Representative works include SUA (Meng Y, Zhang H, Zhao Y, et al. Spatial uncertainty-aware semi-supervised crowd counting[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 15549-15559.), MRC-Crowd (Qian Y, Bounneuf W, Li L, et al. Semi-supervised crowd counting with contextual modeling: Facilitating holistic understanding of crowd scenes[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024.) and OT-M (Lin, Wei, and Antoni B. Chan. Optimal transport minimization: crowd localization on density maps for semi-supervised counting[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 21663-21673.). SUA was the first to apply the Mean-Teacher framework to crowd counting: the teacher model generates hard/soft uncertainty maps through a binary segmentation surrogate task to guide the learning of the student model, and a differential transformation layer enforces spatial consistency regularization between the density map and the segmentation map. MRC-Crowd guides the model to learn a holistic understanding of the scene through a mask reconstruction mechanism and introduces a fine-grained density classification task to enhance feature representation, strengthening the model's overall understanding of complex crowd scenes. OT-M was the first to combine semi-supervised learning with point localization: it converts the density map output by the teacher model into point-level hard pseudo-labels and builds a confidence-weighted semi-supervised framework.

However, despite the progress made by existing semi-supervised algorithms, significant limitations remain. The main reason is that existing methods generally adopt a symmetric teacher-student network architecture, which traps the model in a "capability-stability" dilemma. Specifically, if the feature extraction and reasoning capability of the student model is enhanced to cope with complex scenes (such as severe scale variation and severe occlusion), the instability of its training process is propagated through the Exponential Moving Average (EMA) mechanism to the teacher model of the same structure, so that pseudo-label noise generated by the teacher is accum