CN-122023859-A - Supervised contrastive learning rebalancing method and device for long-tail visual recognition
Abstract
The application relates to a supervised contrastive learning rebalancing method and device for long-tail visual recognition. The method comprises: performing data augmentation on the image samples in an image data set to obtain a query view and a key view; constructing a long-tail visual recognition model; performing gradient sensing and frequency sensing on the categories of the image samples to construct a rebalancing factor; processing the supervised contrastive loss of the image samples under the corresponding categories with the rebalancing factor to obtain a rebalanced contrastive loss; obtaining an exchange prediction loss from the similarity matrices between the query and key features and a global prototype vector set, together with the soft assignment matrices corresponding to the dual-branch encoder; constructing a prototype-aware loss from the exchange prediction loss, an aggregation loss and a separation loss; training the long-tail visual recognition model with the rebalanced contrastive loss and the prototype-aware loss; and classifying long-tail image data with the trained long-tail visual recognition model. The method can markedly improve image recognition performance in long-tail scenarios.
Inventors
- LI SHUOHAO
- CHENG GUANGQUAN
- LV JIAHUI
- HAO JUNJI
- YANG JIAXIN
- CHEN CHAO
- ZHANG JUN
- LEI JUN
- WU XINYANG
- HUANG KUIHUA
Assignees
- National University of Defense Technology (中国人民解放军国防科技大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2025-07-16
Claims (10)
- 1. A supervised contrastive learning rebalancing method for long-tail visual recognition, the method comprising: acquiring an image data set containing a plurality of image samples, and performing data augmentation on the image samples to obtain a query view and a key view, wherein the data in the image data set follow a long-tail distribution and are annotated with category labels; constructing a long-tail visual recognition model, wherein the long-tail visual recognition model comprises a backbone network, a dual-branch encoder and a pre-trained classifier, the dual-branch encoder comprises a query branch and a key branch used respectively to process the query view and the key view to obtain query features and key features, the query branch comprises a query encoder and an MLP, and the key branch comprises a momentum encoder and an MLP; obtaining the category of each image sample in the current training batch, performing gradient sensing and frequency sensing on each category to construct a rebalancing factor, and processing the supervised contrastive loss of the image samples under the corresponding category with the rebalancing factor to obtain a rebalanced contrastive loss, wherein the supervised contrastive loss is computed from the output of the dual-branch encoder; computing similarity matrices between the query features, the key features and a preset global prototype vector set, obtaining an exchange prediction loss from the similarity matrices and the soft assignment matrices corresponding to the dual-branch encoder, and constructing a prototype-aware loss from the exchange prediction loss, an aggregation loss and a separation loss; and constructing a total loss from the rebalanced contrastive loss and the prototype-aware loss, training the long-tail visual recognition model with the total loss to obtain a trained long-tail visual recognition model, and classifying long-tail image data with the trained long-tail visual recognition model.
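The dual-branch encoder of claim 1 pairs a gradient-trained query encoder with a momentum key encoder. The following is a minimal sketch of the MoCo-style exponential-moving-average (EMA) update that such a momentum encoder typically uses; the EMA rule and the coefficient value are assumptions, since the claim does not state them:

```python
import numpy as np

rng = np.random.default_rng(0)

def momentum_update(query_params, key_params, m=0.999):
    """EMA update of the key (momentum) encoder from the query encoder.

    Standard MoCo-style rule; the momentum value m = 0.999 is an assumption,
    not a value stated in the patent text.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

# toy "parameters" for the two branches, initially identical
q_params = [rng.normal(size=(4, 4)), rng.normal(size=4)]
k_params = [p.copy() for p in q_params]

# one training step: the query branch changes via backprop (simulated here
# by a constant shift), while the key branch follows by EMA, not gradients
q_params = [p + 0.1 for p in q_params]
k_params = momentum_update(q_params, k_params)
```

The key branch thus drifts slowly toward the query branch, which keeps the key features consistent across batches.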
- 2. The method of claim 1, wherein performing gradient sensing and frequency sensing on each category to construct the rebalancing factor comprises: computing the average gradient magnitude and the focus weight of each category respectively, and constructing the rebalancing factor ω(c) from the gradient term and the focus weight ω_f(c), wherein ω(c) is the rebalancing factor of category c and α is a hyper-parameter balancing the two terms.
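The formula combining the two terms is not reproduced in this translation, so the sketch below assumes an additive form ω(c) = ω_g(c) + α·ω_f(c), with a normalized gradient term and an inverse-frequency focus weight; both the combination and the weight definitions are assumptions for illustration only:

```python
import numpy as np

def rebalancing_factor(grad_mags, class_counts, alpha=0.5):
    """Sketch of a gradient- and frequency-aware rebalancing factor (claim 2).

    grad_mags[c]    : average gradient magnitude observed for class c
    class_counts[c] : number of training samples of class c
    The additive combination and the inverse-frequency focus weight are
    assumptions; the patent's exact formula is not given in the text.
    """
    grad_mags = np.asarray(grad_mags, dtype=float)
    counts = np.asarray(class_counts, dtype=float)
    omega_g = grad_mags / grad_mags.mean()              # gradient term, normalized
    inv = 1.0 / counts
    omega_f = inv / inv.mean()                          # focus weight: rare classes up-weighted
    return omega_g + alpha * omega_f

# long-tailed toy setup: head class has 1000 samples, tail class has 10
w = rebalancing_factor(grad_mags=[1.0, 1.0], class_counts=[1000, 10])
# the tail class receives the larger factor
```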
- 3. The method of claim 1, wherein processing the supervised contrastive loss of the image samples under the corresponding category with the rebalancing factor to obtain the rebalanced contrastive loss comprises: weighting the supervised contrastive loss of each image sample in the current training batch by the rebalancing factor of the category to which the sample belongs, to obtain the corresponding enhanced supervised contrastive loss; and normalizing the enhanced supervised contrastive loss of every image sample in the current training batch by the size of that sample's positive set, and constructing the rebalanced contrastive loss by summing and averaging.
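The weighting and normalization steps of claim 3 can be sketched as follows; the exact normalization constant and averaging convention are assumptions, since the claim gives them only in words:

```python
import numpy as np

def rebalanced_scl(per_sample_loss, labels, factors, pos_set_sizes):
    """Sketch of the rebalanced contrastive loss of claim 3.

    per_sample_loss[i] : supervised contrastive loss of sample i
    labels[i]          : class index of sample i
    factors[c]         : rebalancing factor for class c (from claim 2)
    pos_set_sizes[i]   : size of sample i's positive set
    The positive-set normalization and the sum-and-average step follow the
    claim's wording; the precise constants are assumptions.
    """
    loss = np.asarray(per_sample_loss, dtype=float)
    w = np.asarray(factors, dtype=float)[np.asarray(labels)]   # class-wise weighting
    enhanced = w * loss                                        # enhanced SCL per sample
    normalized = enhanced / np.asarray(pos_set_sizes, dtype=float)
    return normalized.mean()                                   # sum and average

# toy batch: one head-class sample (c=0) and one tail-class sample (c=1)
batch_loss = rebalanced_scl([2.0, 4.0], labels=[0, 1],
                            factors=[1.0, 2.0], pos_set_sizes=[1, 2])
```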
- 4. The method of claim 1, wherein the global prototype vector set is a trainable parameter, and its update comprises: in each training batch, updating the global prototype vector set by back-propagating the gradients of the exchange prediction loss, the aggregation loss and the separation loss, wherein the aggregation loss pulls the features of same-category samples toward the corresponding prototype vector, and the separation loss pushes different prototype vectors away from each other.
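The aggregation and separation terms of claim 4 admit several concrete forms; the sketch below uses a squared-distance pull toward the assigned prototype and a hinge-with-margin push between prototype pairs. Both functional forms and the margin value are assumptions, not the patent's stated losses:

```python
import numpy as np

def aggregation_loss(feats, prototypes, labels):
    """Pull each feature toward its class prototype (mean squared distance)."""
    diffs = feats - prototypes[labels]
    return (diffs ** 2).sum(axis=1).mean()

def separation_loss(prototypes, margin=1.0):
    """Push distinct prototypes at least `margin` apart.

    Hinge on pairwise Euclidean distance; the hinge form and the margin
    are illustrative assumptions.
    """
    k = len(prototypes)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            d = np.linalg.norm(prototypes[i] - prototypes[j])
            total += max(0.0, margin - d) ** 2
            pairs += 1
    return total / pairs

# features sitting exactly on well-separated prototypes incur zero loss
protos = np.array([[0.0, 0.0], [3.0, 4.0]])
feats = np.array([[0.0, 0.0], [3.0, 4.0]])
```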
- 5. The method of claim 1, wherein obtaining the exchange prediction loss from the similarity matrices and the soft assignment matrices corresponding to the dual-branch encoder comprises: converting the similarity matrices with a softmax function to obtain the corresponding probability distributions; and obtaining the exchange prediction loss from the probability distributions and the soft assignment matrices corresponding to the dual-branch encoder.
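The exchange (swapped) prediction of claim 5 can be sketched in the SwAV style: each branch's softmax probabilities are trained to predict the other branch's soft assignment. The cross-entropy form and the temperature are assumptions drawn from the public SwAV formulation, not values stated in the claim:

```python
import numpy as np

def softmax(x, t=0.1):
    """Row-wise temperature softmax; t is an assumed hyper-parameter."""
    e = np.exp((x - x.max(axis=1, keepdims=True)) / t)
    return e / e.sum(axis=1, keepdims=True)

def swap_prediction_loss(sim_q, sim_k, q_q, q_k, t=0.1):
    """Sketch of the exchange prediction loss (claim 5).

    sim_q, sim_k : similarity of query/key features to the prototypes
    q_q, q_k     : soft assignment matrices (e.g. from Sinkhorn, claim 6)
    Each branch predicts the OTHER branch's soft assignment (swapped targets).
    """
    p_q, p_k = softmax(sim_q, t), softmax(sim_k, t)
    ce = lambda q, p: -(q * np.log(p + 1e-12)).sum(axis=1).mean()
    return 0.5 * (ce(q_k, p_q) + ce(q_q, p_k))  # swapped cross-entropies

rng = np.random.default_rng(0)
sim_q = rng.normal(size=(4, 3))
sim_k = sim_q + 0.01 * rng.normal(size=(4, 3))  # nearly-agreeing views
loss = swap_prediction_loss(sim_q, sim_k, softmax(sim_q, 1.0), softmax(sim_k, 1.0))
```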
- 6. The method of claim 1, wherein obtaining a soft assignment matrix for the dual-branch encoder comprises: constructing an optimization model whose objective is to maximize the prototype-feature matching degree of the prototype assignment matrix, under the constraint that within a batch each prototype covers the samples uniformly, wherein Q is the prototype assignment matrix, the objective involves the feature-prototype similarity matrix, tr(·) is the trace of a matrix, H(·) is the entropy function, and ψ is the entropy regularization coefficient; and iteratively solving the optimization model with the Sinkhorn algorithm to obtain the soft assignment matrix.
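The Sinkhorn-Knopp iteration of claim 6 can be sketched as follows. It follows the public SwAV procedure: exponentiate the similarity scores by the entropy coefficient ψ, then alternately normalize prototype and sample marginals to uniform. The iteration count and the uniform-marginal constraints are assumptions about the patent's exact procedure:

```python
import numpy as np

def sinkhorn(scores, psi=0.05, n_iter=3):
    """Sinkhorn-Knopp iteration producing a soft assignment matrix (claim 6 sketch).

    scores : (batch, K) feature-prototype similarity matrix
    psi    : entropy regularization coefficient (value assumed)
    Alternately normalizes columns (each prototype covers the batch uniformly)
    and rows (each sample's assignment sums to 1), as in SwAV.
    """
    q = np.exp((scores - scores.max()) / psi).T   # (K, batch), stabilized
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iter):
        q /= q.sum(axis=1, keepdims=True); q /= K   # prototype marginals -> uniform
        q /= q.sum(axis=0, keepdims=True); q /= B   # sample marginals -> uniform
    return (q * B).T                                # each row sums to 1

rng = np.random.default_rng(0)
assign = sinkhorn(rng.normal(size=(8, 4)))
```

The uniform prototype marginal is what prevents all samples in a batch from collapsing onto a few head-class prototypes, which is why this step matters under long-tail data.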
- 7. The method of claim 1, wherein the prototype-aware loss is L_proto = λ1·L_swap + λ2·L_agg + λ3·L_sep, wherein L_proto is the prototype-aware loss, L_swap is the exchange prediction loss, L_agg is the aggregation loss, L_sep is the separation loss, and λ1, λ2 and λ3 are hyper-parameters controlling the exchange prediction loss, the aggregation loss and the separation loss, respectively.
- 8. A supervised contrastive learning rebalancing device for long-tail visual recognition, the device comprising: a sample acquisition module configured to acquire an image data set containing a plurality of image samples, and to perform data augmentation on the image samples to obtain a query view and a key view; a model construction module configured to construct a long-tail visual recognition model, wherein the long-tail visual recognition model comprises a backbone network, a dual-branch encoder and a pre-trained classifier, the dual-branch encoder comprises a query branch and a key branch, and the two branches respectively process the query view and the key view to obtain query features and key features; a rebalancing module configured to obtain the category of each image sample in the current training batch, perform gradient sensing and frequency sensing on each category to construct rebalancing factors, and process the supervised contrastive loss of the image samples under the corresponding categories with the rebalancing factors to obtain a rebalanced contrastive loss, wherein the supervised contrastive loss is computed from the output of the dual-branch encoder; a prototype-aware module configured to compute similarity matrices between the query features, the key features and a preset global prototype vector set, obtain an exchange prediction loss from the similarity matrices and the soft assignment matrices corresponding to the dual-branch encoder, and construct a prototype-aware loss from the exchange prediction loss, an aggregation loss and a separation loss; and a result output module configured to construct a total loss from the rebalanced contrastive loss and the prototype-aware loss, train the long-tail visual recognition model with the total loss to obtain a trained long-tail visual recognition model, and classify long-tail image data with the trained long-tail visual recognition model.
- 9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
Description
Supervised contrastive learning rebalancing method and device for long-tail visual recognition
Technical Field
The application relates to the technical field of image classification, and in particular to a supervised contrastive learning rebalancing method and device for long-tail visual recognition.
Background
With the development of computer vision, Convolutional Neural Networks (CNNs) have, riding the explosive growth of deep learning, made breakthrough progress in basic visual tasks such as image classification, object detection and semantic segmentation, in some cases even exceeding human performance. Advances in Neural Architecture Search (NAS) have further enhanced CNNs through automated architecture optimization, particularly their effectiveness on large-scale data sets. These achievements depend largely on the powerful computing resources of the internet age and on high-quality large-scale data sets such as ImageNet and MS COCO. Such datasets are carefully curated so that each class has sufficient and balanced training samples, which lays the foundation for training models well. Real-world data distributions, however, are rarely so ideal; they typically exhibit long-tail characteristics. In a long-tail distribution, a few head categories hold most of the data while a large number of tail categories have only limited samples, a situation common in practice, such as rare-object recognition in image classification and trending-topic detection in text classification. In a long-tail recognition task, the imbalanced data distribution markedly affects the training process and final performance of the model: recognition of head categories is strong, while sample scarcity degrades learning of tail categories and lowers overall classification accuracy.
A root cause is the systematic bias in the classifier's weight norms: larger weight norms make the model sensitive to head-class features, while smaller norms yield insufficient response to tail-class features, causing misclassification. To address this challenge, traditional methods mainly include sampling methods, cost-sensitive learning techniques, and classifier optimization strategies. These approaches aim to mitigate the classification bias toward head classes and improve recognition of tail classes, but they generally assume good separation between classes and focus on optimizing the sample distribution rather than the latent representations learned from imbalanced data; they do not adequately consider feature correlations and inter-class relationships, and over-compensating the tail classes tends to sacrifice head-class performance, leaving the overall classification effect unbalanced. Self-supervised learning can learn without label dependence and offers a new perspective on the long-tail problem. Contrastive learning, a basic paradigm of self-supervised learning, effectively models instance-level similarity relations and can extract highly discriminative feature representations. Khosla et al., by integrating label information, extended the traditional self-supervised contrastive loss to a supervised version (SCL), which succeeds markedly on balanced datasets but remains of limited effectiveness on imbalanced distributions, because standard contrastive learning frameworks rely largely on the selection of positive and negative sample pairs, are inherently biased toward majority classes, and exacerbate the challenges posed by data imbalance.
Disclosure of Invention
Based on the above, it is necessary to provide a supervised contrastive learning rebalancing method and device for long-tail visual recognition.
A supervised contrastive learning rebalancing method for long-tail visual recognition, the method comprising: acquiring an image data set containing a plurality of image samples, and performing data augmentation on the image samples to obtain a query view and a key view, wherein the data in the image data set follow a long-tail distribution and are annotated with category labels; constructing a long-tail visual recognition model, wherein the long-tail visual recognition model comprises a backbone network, a dual-branch encoder and a pre-trained classifier, the dual-branch encoder comprises a query branch and a key branch used respectively to process the query view and the key view to obtain query features and key features, the query branch comprises a query encoder and an MLP, and the key branch comprises a momentum encoder and an MLP; obtaining the category of each image sample in the current training batch, performing gradient sensing and frequency sensing on each category, constructing a rebalancing fac