CN-121999251-A - Multi-scale data set distillation method and system based on dynamic multi-order mixing loss

CN121999251A

Abstract

The invention discloses a multi-scale dataset distillation method and system based on a dynamic multi-order mixing loss, belonging to the technical field of dataset distillation and deep learning. To address the technical problems of existing multi-scale dataset distillation, including poor distillation adaptability, low synthesis efficiency, and sample degradation, the method constructs a kernel similarity measure through a feature extraction network, obtains selection weights for real samples with a learnable sampler, calculates a weighted distribution matching loss and a pairing loss, constructs a dynamically mixed total loss, alternately updates the sampler parameters and the pixels of the synthetic samples, performs staged pixel-level optimization on the updated synthetic samples, screens out an optimal subset using the gradient change rate, and applies differentiated gradient freezing. The method achieves stable generation of a multi-scale synthetic set in a single distillation, improves discriminative fidelity at high budgets while preserving structural integrity at low budgets, effectively reduces storage and computation costs, and suits deployment on resource-constrained edge devices.

Inventors

  • WU YUQUAN
  • SHU HENG
  • YANG QIMING

Assignees

  • Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)

Dates

Publication Date
2026-05-08
Application Date
2026-04-09

Claims (10)

  1. A multi-scale dataset distillation method based on a dynamic multi-order mixing loss, comprising the steps of: extracting corresponding features with a feature extraction network from the real samples and synthetic samples in a real dataset and a synthetic dataset, and constructing a kernel similarity measure between the features; obtaining a selection weight for each real sample from its feature representation using a sampler; calculating a weighted distribution matching loss using the kernel similarity measure and the selection weights, and iteratively updating the sampler parameters with the weighted distribution matching loss as the objective; calculating a multi-level pairing loss based on the feature representations of the real and synthetic samples; constructing an overall loss from the weighted distribution matching loss and the multi-level pairing loss, and iteratively updating the pixel values of the synthetic samples with minimization of the overall loss as the objective; and performing staged pixel-level optimization on the updated synthetic samples, screening out an optimal subset by calculating the gradient change rates of subsets of different scales, performing differentiated gradient freezing on the optimal subset, and outputting an optimized multi-scale synthetic dataset.
  2. The method of claim 1, wherein extracting corresponding features with a feature extraction network from the real samples and synthetic samples in the real dataset and the synthetic dataset, and constructing a kernel similarity measure between the features, comprises: mapping the real samples and the synthetic samples into a feature space with the feature extraction network and extracting their feature representations; and calculating the Euclidean norm between any two feature representations, computing the corresponding exponential terms for a plurality of kernel bandwidth parameters, and summing them to construct a multi-bandwidth Gaussian kernel similarity measure.
  3. The method of claim 1, wherein obtaining the selection weight for each real sample from its feature representation using the sampler comprises: inputting the feature representation of each real sample into the sampler and calculating an importance score for each real sample; and processing the importance scores with a normalized exponential (softmax) function to obtain the selection weights in the form of a probability distribution.
  4. The method of claim 1, wherein calculating the weighted distribution matching loss using the kernel similarity measure and the selection weights comprises: weighting the kernel similarities among the real samples by the selection weights to obtain a weighted self-similarity term of the real sample set; calculating the kernel similarities within the synthetic sample set to obtain a self-similarity term of the synthetic sample set; weighting the kernel similarities between the real samples and the synthetic samples by the selection weights to obtain a weighted cross-similarity term; and linearly combining the weighted self-similarity term, the self-similarity term, and the weighted cross-similarity term to obtain the weighted distribution matching loss.
  5. The method of claim 1, wherein iteratively updating the sampler parameters with the weighted distribution matching loss as the objective comprises: calculating the gradient with respect to the sampler parameters via the chain rule, with maximization of the weighted distribution matching loss as the objective; and, combining a weight information-entropy regularization term with a preset learning rate, iteratively updating the sampler parameters with this gradient via a gradient ascent algorithm.
  6. The method of claim 1, wherein calculating the multi-level pairing loss based on the feature representations of the real and synthetic samples comprises: calculating the feature-representation differences between the real samples and the synthetic samples in the feature space to obtain a global pairing loss; performing weighted sampling of the real samples according to the selection weights, and calculating the feature-representation differences between the sampled samples and the synthetic samples to obtain a sampling-weighted pairing loss; and computing a weighted sum of the global pairing loss and the sampling-weighted pairing loss with a preset balance coefficient to obtain the multi-level pairing loss.
  7. The method of claim 1, wherein constructing the overall loss based on the weighted distribution matching loss and the multi-level pairing loss, and iteratively updating the pixel values of the synthetic samples with minimization of the overall loss as the objective, comprises: calculating the dynamic mixing weight of the current training round according to a linear scheduling strategy; linearly combining the weighted distribution matching loss and the multi-level pairing loss with the dynamic mixing weight to construct the overall loss; and calculating the gradient of the overall loss with respect to the pixel values of the synthetic samples, and updating the synthetic samples via a gradient descent algorithm with a preset learning rate.
  8. The method of claim 1, wherein performing staged pixel-level optimization on the updated synthetic samples and screening out the optimal subset by calculating the gradient change rates of subsets of different scales comprises: calculating the pixel gradients of a batch of synthetic samples using the overall loss, and updating the full synthetic tensor through an image optimizer to construct a mother set; iteratively analyzing preset subsets of different sample sizes based on the mother set, and evaluating each subset's contribution to distribution learning by calculating the overall gradient change rate across training rounds; and, based on the contribution evaluation, identifying and screening the optimal subset of the current stage from the candidate subsets of different scales.
  9. The method of claim 8, wherein performing differentiated gradient freezing on the optimal subset and outputting the optimized multi-scale synthetic dataset comprises: when constructing a training batch, substituting non-differentiable replacement data for the synthetic data fragments whose scale is below that of the optimal subset, so as to block their gradient backpropagation paths; with gradient backpropagation thus blocked, performing high-precision refined pixel updates on the optimal subset only, guided by the loss function of the optimal subset and using an image optimizer; and projecting the updated synthetic image pixel values into the legal physical interval, performing smoothing post-processing, and outputting the optimized multi-scale synthetic dataset.
  10. A multi-scale dataset distillation system based on a dynamic multi-order mixing loss, comprising: a feature extraction network module, for extracting corresponding features with a feature extraction network from the real samples and synthetic samples in a real dataset and a synthetic dataset, and constructing a kernel similarity measure between the features; a sampler module, for obtaining a selection weight for each real sample from its feature representation; an instance feature matching module, for calculating a weighted distribution matching loss using the kernel similarity measure and the selection weights, and iteratively updating the sampler parameters with the weighted distribution matching loss as the objective; a global distribution matching module, for calculating a multi-level pairing loss based on the feature representations of the real and synthetic samples; a dynamic mixing loss module, for constructing an overall loss from the weighted distribution matching loss and the multi-level pairing loss, and iteratively updating the pixel values of the synthetic samples with minimization of the overall loss as the objective; and a pixel-level optimization module, for performing staged pixel-level optimization on the updated synthetic samples, screening out an optimal subset by calculating the gradient change rates of subsets of different scales, performing differentiated gradient freezing on the optimal subset, and outputting an optimized multi-scale synthetic dataset.
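As an illustrative reading of claims 2 through 4 only, and not the patented implementation, the multi-bandwidth Gaussian kernel and the weighted distribution matching loss might be sketched as follows in NumPy. The bandwidth values, the function names, and the exact MMD-style linear combination of the three similarity terms are assumptions; the claims do not fix these details.

```python
import numpy as np

def multi_bandwidth_gaussian_kernel(A, B, bandwidths=(1.0, 2.0, 4.0)):
    """Claim 2: sum of Gaussian kernels over several bandwidth parameters."""
    # Pairwise squared Euclidean distances between feature rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sum(np.exp(-d2 / (2.0 * s ** 2)) for s in bandwidths)

def weighted_distribution_matching_loss(F_real, F_syn, w):
    """Claim 4, under an assumed MMD^2-style combination of terms.

    F_real, F_syn: feature matrices (one row per sample).
    w: selection weights over real samples (claim 3), a probability vector.
    """
    K_rr = multi_bandwidth_gaussian_kernel(F_real, F_real)
    K_ss = multi_bandwidth_gaussian_kernel(F_syn, F_syn)
    K_rs = multi_bandwidth_gaussian_kernel(F_real, F_syn)
    # Weighted self-similarity + synthetic self-similarity
    # - 2 * weighted cross-similarity.
    return (w @ K_rr @ w) + K_ss.mean() - 2.0 * (w @ K_rs).mean()
```

With uniform weights and identical real and synthetic feature sets, this combination vanishes, which is the expected behavior for an MMD-style discrepancy.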

Description

Multi-scale data set distillation method and system based on dynamic multi-order mixing loss

Technical Field

The invention belongs to the technical field of dataset distillation and deep learning, and particularly relates to a multi-scale dataset distillation method based on a dynamic multi-order mixed loss function.

Background

In the field of data-efficient machine learning, dataset distillation refers to compressing a large-scale training set into a small synthetic dataset, such that a model trained on the synthetic dataset can approximate a model trained on the original dataset with less memory and time overhead. On this basis, multi-scale dataset distillation obtains, through a single training run, a synthetic dataset that can be partitioned into different scales, improving application flexibility. Dataset distillation has broad application value in multiple fields: in scenarios requiring frequent model retraining, such as hyperparameter search and NAS (neural architecture search), or when comparing performance across different architectures after an algorithm update, training on the synthetic dataset greatly shortens training time and improves the utilization efficiency of training resources; the distilled dataset allows fast local fine-tuning on resource-constrained edge devices; and the synthetic dataset can be shared without exposing original samples, making the method applicable to privacy-sensitive fields such as healthcare. Currently, dataset distillation techniques largely divide into feature matching at the instance level and distribution matching at the distribution level.
Feature matching compresses the information of multiple real samples into a single synthetic sample by minimizing the feature differences between the synthetic sample and its corresponding real samples at a network layer. The method is intuitive and easy to implement, and because it fits the specific feature representations of real samples at a network layer, the synthetic samples can be quickly optimized into effective data within a small number of training steps. However, this method is extremely sensitive to variation between individual samples, resulting in significant blurring of contour and structure information, which is especially apparent in vision tasks where boundary information must be preserved. Because it attends to the network-layer behavior of real samples in a specific model architecture, the cross-architecture generalization of feature matching is limited and its robustness is relatively poor. Distribution matching, by contrast, makes the synthetic samples approach the real samples at the level of overall statistics, focusing on the global structure and the commonalities among different samples, and performs well in global texture, visual quality, and inter-class diversity. However, the distribution matching method tends to average information across different synthetic samples, so that each synthetic sample contributes relatively little to the training signal, and it is therefore less accurate than feature matching at low IPC (images per class) settings. Furthermore, the computed statistics often depend on a suitable kernel function, are relatively sensitive to hyperparameters, and lack a complete theoretical explanation. When used alone, the lower-order and higher-order losses of conventional dataset distillation each have inherent drawbacks.
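For intuition only, the contrast between the two families described above can be sketched in NumPy. This is a simplified illustration, not the claimed method: both functions operate on feature matrices with one row per sample, and the one-to-one pairing of real to synthetic samples in the first function is an assumption.

```python
import numpy as np

def feature_matching_loss(F_real, F_syn):
    """Instance level: pair each synthetic sample with one real sample
    and minimise per-pair feature differences (sensitive to individual
    samples, hence the blurring described above)."""
    return ((F_real[: len(F_syn)] - F_syn) ** 2).sum(axis=1).mean()

def distribution_matching_loss(F_real, F_syn):
    """Distribution level: match only first-moment statistics of the two
    feature sets, ignoring sample-to-sample correspondence."""
    return ((F_real.mean(axis=0) - F_syn.mean(axis=0)) ** 2).sum()
```

Two sets with equal means but different individual samples yield zero distribution-matching loss while the instance-level loss stays positive, which mirrors the trade-off discussed above: distribution matching preserves global statistics but under-constrains individual synthetic samples.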
Feature matching tends to fit sample by sample, blurring the contour and structure information of the synthetic samples; distribution matching preserves the overall contour, but at low IPC settings the information content and constraining force of each synthetic sample are insufficient, making it difficult to provide adequate training signals when samples are scarce. Because of the subset-degradation problem, directly truncating the synthetic dataset produced by conventional dataset distillation causes a cliff-like drop in performance, while distilling independently for each IPC incurs repeated distillation cost and high computational expense. Existing multi-scale dataset distillation methods (e.g. "Multisize Dataset Condensation") aim to generate synthetic subsets compatible with different IPC requirements through a single distillation process. That scheme first introduced a subset loss and a staged freezing strategy, effectively alleviating the subset-degradation problem by selecting representative real samples and reusing synthetic data, improving flexibility for edge-device deployment, and reducing t