CN-121982390-A - Method, system, equipment and medium for counting few-sample targets
Abstract
The invention relates to the technical field of computer vision and image processing, and discloses a method, a system, a device and a medium for few-shot target counting. The method comprises: inputting a query image and reference exemplar boxes, and extracting a deep feature map by combining a deep residual network with an encoder; inputting the deep feature map into a frequency enhancement module, which adopts a parallel processing architecture of spatial-semantic guided frequency-domain excitation, decouples the spectrum with a fast Fourier transform, dynamically adjusts frequency-domain components based on spatial-domain features, and outputs an enhanced feature map via the inverse transform; inputting the enhanced feature map and the reference exemplar boxes together into a decoder, and generating reference exemplar query vectors through a cross-attention mechanism; inputting the enhanced feature map and the reference exemplar query vectors into a semantic-guided correlation module, extracting multi-scale semantics and matching correlations to generate a semantically enhanced density feature map; and upsampling that map to generate the final predicted density map and target count. The invention can accurately count targets of any category under the few-shot setting.
Inventors
- LI WEI
- LI RAN
- WANG YING
- ZHOU JIUJIAN
- WU XIAO
Assignees
- Southwest Jiaotong University (西南交通大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-20
Claims (10)
- 1. A method for few-shot target counting, comprising: feature extraction: inputting a query image and reference exemplar boxes, and extracting a deep feature map by combining a deep residual network with an encoder; frequency enhancement: inputting the deep feature map into a frequency enhancement module, adopting a parallel processing architecture of spatial-semantic guided frequency-domain excitation, decoupling the spectrum with a fast Fourier transform, dynamically adjusting frequency-domain components based on spatial-domain features, and outputting an enhanced feature map via the inverse transform; exemplar query encoding: inputting the enhanced feature map and the reference exemplar boxes together into a decoder, and generating reference exemplar query vectors through a cross-attention mechanism; and semantic-guided matching: inputting the enhanced feature map and the reference exemplar query vectors into a semantic-guided correlation module, extracting multi-scale semantics and matching correlations to generate a semantically enhanced density feature map, and upsampling the density feature map to generate a final predicted density map and target count.
- 2. The method of claim 1, wherein the frequency enhancement module, which adopts the parallel processing architecture of spatial-semantic guided frequency-domain excitation, decouples the spectrum with a fast Fourier transform, dynamically adjusts frequency-domain components based on spatial-domain features, and outputs the enhanced feature map via the inverse transform, comprises: a first branch that maps the input deep feature map into an orthogonal frequency-domain space by the fast Fourier transform and decouples it into an amplitude spectrum representing intensity and a phase spectrum representing position; a second branch that takes the input deep feature map as a guidance source, captures global spatial context semantics through global average pooling, maps the spatial semantic features into frequency-domain adjustment parameters through a multi-layer perceptron, and dynamically generates high-frequency enhancement vectors aligned with the channel dimension; and feature fusion, in which the modulated amplitude spectrum is recombined with the phase spectrum and restored to the spatial domain by the inverse fast Fourier transform to obtain a reconstructed feature, the reconstructed feature passes through a convolution layer and a Sigmoid activation function to generate a confidence map, and the confidence map is fused with the original input feature through a gated residual connection to output the enhanced feature map.
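The frequency enhancement described in claim 2 can be sketched as follows. This is a minimal, non-authoritative illustration assuming the reconstructed pipeline FFT → amplitude/phase decoupling → semantic channel excitation → inverse FFT → gated residual fusion; the MLP weights are random placeholders standing in for learned parameters, and the 1×1 convolution before the confidence map is simplified to an identity mix.

```python
import numpy as np

def frequency_enhance(feat, hidden=16, rng=None):
    """Sketch of the frequency enhancement module (claim 2).

    feat: (C, H, W) deep feature map.
    Branch 1: FFT -> amplitude spectrum (intensity) + phase spectrum (position).
    Branch 2: global average pooling -> small MLP -> per-channel
              frequency-domain excitation weights (hypothetical learned layers).
    Fusion:   modulate amplitude, inverse FFT, sigmoid confidence gate,
              gated residual connection to the original input.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    C, H, W = feat.shape

    # Branch 1: decouple the spectrum into amplitude and phase.
    spec = np.fft.fft2(feat, axes=(-2, -1))
    amp, phase = np.abs(spec), np.angle(spec)

    # Branch 2: spatial semantics -> channel-wise excitation vector.
    gap = feat.mean(axis=(1, 2))                        # (C,) global context
    w1 = rng.standard_normal((hidden, C)) * 0.1         # placeholder MLP weights
    w2 = rng.standard_normal((C, hidden)) * 0.1
    excite = 1.0 + np.tanh(w2 @ np.maximum(w1 @ gap, 0.0))  # (C,), near 1.0

    # Recombine modulated amplitude with phase; restore to the spatial domain.
    amp_mod = amp * excite[:, None, None]
    recon = np.fft.ifft2(amp_mod * np.exp(1j * phase), axes=(-2, -1)).real

    # Confidence map (sigmoid) + gated residual fusion with the input.
    conf = 1.0 / (1.0 + np.exp(-recon))
    return conf * recon + (1.0 - conf) * feat

enhanced = frequency_enhance(np.random.default_rng(1).standard_normal((4, 8, 8)))
```

Because the phase spectrum is left untouched, object positions are preserved while per-channel amplitude scaling changes only how strongly each frequency band contributes.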
- 3. The method of claim 2, wherein the second branch applies channel-wise weighted excitation to the amplitude spectrum generated by the first branch, the channel weights being decided by spatial semantics, so as to directionally amplify the edge signals of tiny targets and suppress background channels.
- 4. The method of claim 1, wherein the semantic-guided correlation module extracts multi-scale semantics and matches correlations to generate the semantically enhanced density feature map, comprising: a visual correlation branch, in which each reference exemplar query vector is treated as a dynamic convolution kernel and convolved with the enhanced feature map, and all convolution response maps are aggregated to generate a basic visual correlation map; multi-scale context extraction, in which semantic features under different receptive fields are captured by three parallel dilated convolutions, the outputs of the three dilated convolutions are concatenated along the channel dimension to form multi-scale concatenated features, a parallel 1×1 convolution simultaneously extracts basic features whose channel dimension is projected to match the multi-scale concatenated features, the two are added element-wise to form superposed features, the superposed features undergo feature fusion and channel compression through a convolution layer to generate a context feature map, and a refined query vector is obtained after linear projection, multi-layer perceptron processing and element-wise multiplication with the reference exemplar query vector; and feature modulation and semantic enhancement, in which the similarity between the refined query vector and the context feature map is computed by matrix multiplication, the similarity map is globally average-pooled to generate channel weights for feature modulation, and the basic visual correlation map is filtered with the channel weights to output the semantically enhanced density feature map.
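The correlation-and-modulation flow of claim 4 can be sketched with numpy. This is a simplified, hedged illustration: the exemplar query vectors act as 1×1 dynamic kernels (the patent convolves spatial kernels), a single random projection stands in for the three dilated convolutions plus 1×1 fusion, and all weights are placeholders for learned layers.

```python
import numpy as np

def semantic_correlate(feat, queries, rng=None):
    """Sketch of the semantic-guided correlation module (claim 4).

    feat:    (C, H, W) enhanced feature map.
    queries: (K, C) reference exemplar query vectors, each used as a
             1x1 dynamic convolution kernel (a simplification).
    Returns a channel-modulated density feature map of shape (H, W).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    C, H, W = feat.shape
    K = queries.shape[0]

    # Visual correlation branch: each query is a dynamic kernel;
    # aggregate all K response maps into a basic visual correlation map.
    responses = np.einsum('kc,chw->khw', queries, feat)     # (K, H, W)
    corr = responses.mean(axis=0)                           # (H, W)

    # Context branch surrogate: one placeholder projection stands in for
    # the three parallel dilated convolutions + 1x1 fusion of the patent.
    proj = rng.standard_normal((C, C)) * 0.1
    context = np.einsum('dc,chw->dhw', proj, feat)          # (C, H, W)

    # Refine queries, compute query/context similarity, pool to channel weights.
    refined = queries * np.tanh(queries @ proj.T)           # (K, C)
    sim = np.einsum('kc,chw->khw', refined, context)        # similarity maps
    weights = 1.0 / (1.0 + np.exp(-sim.mean(axis=(1, 2))))  # (K,) sigmoid(GAP)

    # Feature modulation: calibrate local matches with global context weights
    # (this is the filtering step that suppresses off-semantic background peaks).
    modulated = np.einsum('k,khw->hw', weights, responses) / K
    return 0.5 * corr + 0.5 * modulated

out = semantic_correlate(np.random.default_rng(1).standard_normal((6, 8, 8)),
                         np.random.default_rng(2).standard_normal((3, 6)))
```

The final weighted sum is where claim 5's behavior appears: response maps from queries whose refined semantics disagree with the global context receive low weights, damping spurious background responses.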
- 5. The method of claim 4, wherein filtering the basic visual correlation map with the channel weights in the feature modulation and semantic enhancement comprises multiplying the channel weights with the basic visual correlation map, thereby calibrating the local matching results with global context information and suppressing background high-response regions that do not fit the context semantics.
- 6. The method of claim 1, wherein end-to-end training is performed using a hierarchically supervised loss function, the total loss comprising a primary density regression loss and auxiliary supervision losses: $L_{total} = L_{main} + \lambda \sum_{i} L_{aux}^{(i)}$, wherein $L_{total}$ is the total loss, $L_{main}$ is the primary density regression loss of the final output layer, $L_{aux}^{(i)}$ is the auxiliary supervision loss of the $i$-th intermediate layer of the decoder, and $\lambda$ is a hyperparameter.
- 7. The method of claim 6, wherein the primary density regression loss is calculated based on a Euclidean distance normalized by the total number of targets, together with a deep supervision strategy: $L_{main} = \dfrac{\lVert \hat{D} - D \rVert_2^2}{N + \epsilon}$, wherein $\hat{D}$ denotes the predicted density map, $D$ denotes the ground-truth density map generated by Gaussian kernels, $N$ is the total target count in the ground-truth density map, $\epsilon$ is a small constant for numerical stability, $H$ is the height of the density map, $W$ is the width of the density map, and $\lVert \cdot \rVert_2^2$ denotes the squared Euclidean distance taken over all $H \times W$ positions.
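The loss of claims 6 and 7 can be written in a few lines. This sketch assumes the reconstructed formulas $L_{main} = \lVert \hat{D} - D \rVert_2^2 / (N + \epsilon)$ and $L_{total} = L_{main} + \lambda \sum_i L_{aux}^{(i)}$; the names `density_losses`, `aux_preds` and `lam` are illustrative, not from the patent.

```python
import numpy as np

def density_losses(pred, gt, aux_preds=(), lam=0.1, eps=1e-8):
    """Hierarchically supervised density loss (claims 6-7, reconstructed).

    pred:      (H, W) predicted density map from the final output layer.
    gt:        (H, W) ground-truth density map (Gaussian-kernel generated).
    aux_preds: density maps from intermediate decoder layers.
    lam:       hyperparameter weighting the auxiliary supervision.
    Returns (total_loss, main_loss).
    """
    n = gt.sum()                                    # total target count N
    main = np.sum((pred - gt) ** 2) / (n + eps)     # count-normalized L2 loss
    aux = sum(np.sum((p - gt) ** 2) / (n + eps) for p in aux_preds)
    return main + lam * aux, main

gt = np.zeros((4, 4)); gt[1, 1] = 1.0; gt[2, 3] = 1.0   # two targets
pred = gt.copy(); pred[0, 0] = 0.5                      # one spurious response
total, main = density_losses(pred, gt, aux_preds=[gt])  # perfect aux layer
```

Normalizing by the target count keeps the gradient scale comparable between sparse and extremely dense images, which is the motivation the claim gives for dividing by $N + \epsilon$ rather than by $H \times W$.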
- 8. A few-shot target counting system, comprising: a feature extraction module configured to extract a deep feature map from an input query image and reference exemplar boxes by combining a deep residual network with an encoder; a frequency enhancement module configured to adopt, on the input deep feature map, a parallel processing architecture of spatial-semantic guided frequency-domain excitation, decouple the spectrum with a fast Fourier transform, dynamically adjust frequency-domain components based on spatial-domain features, and output an enhanced feature map via the inverse transform; an exemplar query generation module configured to input the enhanced feature map together with the reference exemplar boxes into a decoder and generate reference exemplar query vectors through a cross-attention mechanism; and a semantic-guided correlation module configured to extract multi-scale semantics and match correlations from the enhanced feature map and the reference exemplar query vectors to generate a semantically enhanced density feature map, and to upsample the density feature map to generate a final predicted density map and target count.
- 9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the few-shot target counting method of any one of claims 1-7.
- 10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the few-shot target counting method of any one of claims 1-7.
Description
Method, system, equipment and medium for counting few-sample targets

Technical Field

The invention relates to the technical field of computer vision and image processing, in particular to a method, a system, a device and a medium for few-shot target counting.

Background

Accurately estimating the number of specific objects is key to understanding visual scenes, analyzing environmental dynamics and assisting decision making. In public safety monitoring it enables real-time early warning of abnormal crowd gathering and assessment of safety risks; in intelligent traffic control it supports more efficient traffic-flow statistics and signal-light optimization; in ecological surveys it helps build more accurate species distribution maps; and in intelligent retail and industrial production it is important for inventory and part counting. However, traditional fully supervised counting methods rely heavily on massive and expensive class-specific annotation data, and struggle to adapt to the endless new categories of the open world. With the development of deep learning, low-cost, highly generalizable few-shot counting (Few-Shot Counting, FSC) has become a research hotspot for breaking through the data bottleneck. FSC aims to identify and count objects of any class in a query image using only a very small number of user-labeled exemplars as cues. This technique is critical for building general visual perception systems, handling long-tail category distributions, and achieving rapid deployment in dynamically varying practical applications.
However, few-shot counting faces several inherent problems. First, to capture global semantics, feature extraction networks usually apply successive downsampling, so that high-frequency information critical for distinguishing individuals (such as edges and textures) is smoothed away. Second, in scenes with extremely dense and tiny targets, the features of adjacent targets easily alias and adhere to one another and are hard to separate. Furthermore, real scenes are full of complex background clutter (such as shelf textures and illumination noise); matching on partial features alone, without macroscopic semantic context guidance, easily produces spurious high responses in background regions and thus false detections. Academia and industry have proposed studies and models to address these difficulties. For example, the mainstream approach adopts an extract-then-match paradigm: a convolutional neural network extracts exemplar features, which are used as convolution kernels to interact with the query image and generate a similarity map for density regression. Subsequent research introduced the Vision Transformer (ViT) to capture long-range dependencies through self-attention, enhancing the model's perception of global information to some extent. Such methods have made progress on counting tasks with conventional distributions. Despite these advances, existing methods still fall significantly short in complex real scenes, particularly for extremely dense and small-target counting. Most encoder-decoder architectures fail to effectively address the loss of detail caused by downsampling.
In addition, although some recent studies attempt to introduce frequency-domain analysis to assist feature extraction, most existing frequency-domain counting methods use fixed high-pass filters or globally uniform enhancement strategies that lack spatial awareness, and have difficulty dynamically adjusting their frequency-domain focus according to the spatial semantics of the image. Such one-size-fits-all enhancement restores high-frequency details but also indiscriminately amplifies background noise, destroying the semantic purity of the features and increasing the false detection rate in complex scenes. At the same time, many matching mechanisms lack efficient integration of global semantic priors and multi-scale context information. Without deep semantic guidance, matching degrades significantly in scenes with heavy background noise or severe occlusion, and background interferers are easily misjudged as targets. Therefore, how to accurately count objects of any category under the few-shot setting, especially in challenging dense and occluded scenes, is a key problem to be solved in the current few-shot counting field.

Disclosure of Invention

In order to solve the above problems, the invention provides a method, a system, a device and a medium for few-shot counting, which can recover lost high-frequency details to distinguish dense individuals a