CN-120910495-B - Data value density evaluation method and device

Abstract

The invention relates to the technical field of data processing and provides a data value density evaluation method and device. The method comprises: determining index values of a plurality of evaluation indexes of a data set to be evaluated; determining an index weight of each evaluation index relative to a target task scene based on the index value of each evaluation index; and evaluating the value density of the data set to be evaluated relative to the target task scene based on the index values of the evaluation indexes and the corresponding index weights. Because the evaluation draws on the index values of a plurality of evaluation indexes and weights each index for the target task scene, the value density of the data set to be evaluated is evaluated comprehensively and accurately.

Inventors

LIU YUFAN
LI BING
HU WEIMING
WANG JIAN
ZHANG CHAO

Assignees

Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)

Dates

Publication Date: 20260505
Application Date: 20250610

Claims (8)

1. A data value density evaluation method, comprising: determining index values of a plurality of evaluation indexes of a data set to be evaluated, wherein the plurality of evaluation indexes comprise at least two of a density ratio index, an integrity rate index, a label consistency index, a noise detection index, an intra-class diversity index, an inter-class diversity index, a data statistics complexity index, an ambiguity index, a domain offset index, a leakage ratio index and a prediction accuracy index, and the data set to be evaluated comprises an image data set and a text data set; determining an index weight of each evaluation index relative to a target task scene based on the index value of each evaluation index, wherein the target task scene comprises a classification task scene and a target identification task scene, and the data set to be evaluated has different value densities for different target task scenes; and evaluating the value density of the data set to be evaluated relative to the target task scene based on the index values of the evaluation indexes and the corresponding index weights; wherein determining the index weight of each evaluation index relative to the target task scene based on the index value of each evaluation index comprises: determining a task fitness of each evaluation index based on the index value of each evaluation index and the default value of the target task scene for each evaluation index; and determining the index weight of each evaluation index relative to the target task scene based on the task fitness of each evaluation index; and wherein determining the task fitness of each evaluation index comprises calculating the task fitness $T_{Di}$ of each evaluation index according to the following Gaussian kernel function: [formula not reproduced]; wherein $\alpha_i$ represents the index value of the $i$-th evaluation index, $T_i$ represents the default value of the target task scene for the $i$-th evaluation index, $\sigma$ represents the bandwidth parameter of the Gaussian kernel, and $\exp(\cdot)$ represents the exponential function with the natural constant $e$ as its base.
2. The data value density evaluation method according to claim 1, wherein determining the index values of the plurality of evaluation indexes of the data set to be evaluated comprises: performing data distillation on the data set to be evaluated to obtain a density ratio index of the distilled data subset relative to the data set to be evaluated, wherein the density ratio index represents the ratio of the average per-category data amount in the distilled data subset to the average per-category data amount in the data set to be evaluated; and calculating the integrity rate index, the label consistency index, the noise detection index, the intra-class diversity index, the inter-class diversity index, the data statistics complexity index, the ambiguity index, the domain offset index, the leakage ratio index and the prediction accuracy index of the data set to be evaluated.
3. The data value density evaluation method according to claim 2, wherein the intra-class diversity index is calculated according to the following formula: [formula not reproduced]; wherein $x_i$ represents the sample feature of the $i$-th sample of any class in the data set to be evaluated, $\mathrm{var}(x_i)$ represents the variance of the sample features of that class, and $n$ represents the number of samples of that class; and the inter-class diversity index is calculated according to the following formula: [formula not reproduced]; wherein $c_j$ represents the class center feature of the $j$-th class in the data set to be evaluated, $\mathrm{var}(c_j)$ represents the variance among the class center features, the class center feature of the $j$-th class is determined based on the sample features of all samples in the $j$-th class, and $K$ represents the number of classes.
4. The data value density evaluation method according to claim 2, wherein the ambiguity index is calculated according to the following formula: [formula not reproduced]; wherein $x_i$ represents the sample feature of the $i$-th sample in the data set to be evaluated, the probability term represents the probability that a preset test model predicts $x_i$ as the $k$-th class, and $K$ represents the number of classes.
5. The data value density evaluation method according to claim 2, wherein the leakage ratio index is calculated according to the following formula: [formula not reproduced]; wherein $m$ represents the total number of samples of the test set in the data set to be evaluated, $n$ represents the total number of samples of the training set in the data set to be evaluated, $x_i$ represents the sample feature of the $i$-th sample in the test set, $y_j$ represents the sample feature of the $j$-th sample in the training set, $\cos(\theta(x_i, y_j))$ represents the cosine similarity of the sample features $x_i$ and $y_j$, $\theta(x_i, y_j)$ represents the angle between the two sample features, $\varepsilon$ represents a similarity threshold, and $\mathbb{I}(\cdot)$ represents the indicator function.
6. A data value density evaluation device, comprising: an index value determining module, configured to determine index values of a plurality of evaluation indexes of a data set to be evaluated, wherein the plurality of evaluation indexes comprise at least two of a density ratio index, an integrity rate index, a label consistency index, a noise detection index, an intra-class diversity index, an inter-class diversity index, a data statistics complexity index, an ambiguity index, a domain offset index, a leakage ratio index and a prediction accuracy index, and the data set to be evaluated comprises an image data set and a text data set; an index weight determining module, configured to determine an index weight of each evaluation index relative to a target task scene based on the index value of each evaluation index, wherein the target task scene comprises a classification task scene and a target identification task scene, and the data set to be evaluated has different value densities for different target task scenes; and a value density evaluation module, configured to evaluate the value density of the data set to be evaluated relative to the target task scene based on the index values of the evaluation indexes and the corresponding index weights; wherein the index weight determining module is specifically configured to determine a task fitness of each evaluation index based on the index value of each evaluation index and the default value of the target task scene for each evaluation index; and the index weight determining module is specifically configured to calculate the task fitness $T_{Di}$ of each evaluation index according to the following Gaussian kernel function: [formula not reproduced]; wherein $\alpha_i$ represents the index value of the $i$-th evaluation index, $T_i$ represents the default value of the target task scene for the $i$-th evaluation index, $\sigma$ represents the bandwidth parameter of the Gaussian kernel, and $\exp(\cdot)$ represents the exponential function with the natural constant $e$ as its base.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, characterized in that the processor implements the data value density evaluation method according to any one of claims 1 to 5 when executing the computer program.
8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the data value density evaluation method according to any one of claims 1 to 5.
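
As an illustration of the indexes named in claims 3 and 5, the following Python sketch computes a leakage ratio and intra-class and inter-class diversity from plain feature vectors. The claim formulas are not reproduced in this text, so every concrete choice here is an assumption rather than the claimed formula: the leakage ratio is taken as the fraction of all (test, train) pairs whose cosine similarity reaches the threshold, "variance" is taken as the mean squared distance to the mean vector, and the function names and the default threshold `eps=0.95` are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def leakage_ratio(test_features, train_features, eps=0.95):
    """Fraction of (test, train) pairs with cosine similarity >= eps.

    Assumption: normalizing by m * n; the claimed normalization is not
    available in the text. eps is a hypothetical default threshold.
    """
    m, n = len(test_features), len(train_features)
    if m == 0 or n == 0:
        return 0.0
    hits = sum(
        1
        for x in test_features
        for y in train_features
        if cosine_similarity(x, y) >= eps  # indicator function
    )
    return hits / (m * n)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def mean_squared_deviation(vectors):
    """Mean squared Euclidean distance to the mean vector (assumed 'variance')."""
    center = mean_vector(vectors)
    total = sum(sum((a - c) ** 2 for a, c in zip(v, center)) for v in vectors)
    return total / len(vectors)


def class_diversity(features, labels):
    """Return (per-class intra-class diversity, inter-class diversity)."""
    by_class = defaultdict(list)
    for feature, label in zip(features, labels):
        by_class[label].append(feature)
    intra = {label: mean_squared_deviation(xs) for label, xs in by_class.items()}
    centers = [mean_vector(xs) for xs in by_class.values()]  # class center features
    inter = mean_squared_deviation(centers)
    return intra, inter
```

A pairwise count is only one reading of the leakage ratio; counting test samples that have at least one near-duplicate in the training set would be an equally plausible variant and changes only the normalization.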

Description

Data value density evaluation method and device

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a data value density evaluation method and device.

Background

The value density of a data set (Dataset Value Density) quantitatively measures the amount of effective knowledge carried per unit of data volume in the data set and the suitability of the data set for a particular task.

Currently, data set value density evaluation methods mainly comprise static data quality verification, single performance index measurement, and random or simple-strategy compression. Static data quality verification focuses mainly on data integrity and repeatability, and cannot comprehensively evaluate the value density of a data set. Single performance index measurement trains and tests a specific (mature, classical) deep learning model on the data set and evaluates the test result with a single index (such as recall rate) in order to evaluate the value density of the data set; a single index can hardly characterize multiple aspects of the data set, and easily leads to underestimating or overestimating the potential information of the data set. Evaluation by random or simple-strategy compression evaluates only a subset of the data set, and likewise fails to evaluate the value density of the data set comprehensively.

The value density of data depends not only on the static quality of the data but also, closely, on how the data performs in a specific task scene. Existing evaluation approaches consider only the data set itself, so they cannot accurately evaluate the value density of the data set.

In summary, existing data set value density evaluation methods are performed only on the data set itself, and cannot evaluate the value density of a data set comprehensively and accurately.

Disclosure of Invention

The invention provides a data value density evaluation method and device, which address the problem in the prior art that the value density of a data set cannot be evaluated comprehensively and accurately.

The invention provides a data value density evaluation method, comprising: determining index values of a plurality of evaluation indexes of a data set to be evaluated; determining an index weight of each evaluation index relative to a target task scene based on the index value of each evaluation index; and evaluating the value density of the data set to be evaluated relative to the target task scene based on the index value of each evaluation index and the corresponding index weight.

According to the data value density evaluation method provided by the invention, determining the index weight of each evaluation index relative to the target task scene based on the index value of each evaluation index comprises: determining a task fitness of each evaluation index based on the index value of each evaluation index and the default value of the target task scene for each evaluation index; and determining the index weight of each evaluation index relative to the target task scene based on the task fitness of each evaluation index.
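
The three steps above (index values, task-dependent weights, weighted evaluation) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed computation: the claims name a Gaussian kernel for the task fitness but its formula is not reproduced in this text, so the standard form $\exp(-(\alpha_i - T_i)^2 / (2\sigma^2))$ is assumed; the text only says that weights are determined "based on" the task fitness, so normalizing the fitness values to sum to 1 is an assumption; and taking the value density as the weighted sum of index values is likewise an assumption. All function names, index names, and numbers are hypothetical.

```python
import math


def task_fitness(alpha, target, sigma):
    """Gaussian-kernel fitness of index value alpha against the scene default target.

    Assumption: standard kernel exp(-(alpha - target)^2 / (2 * sigma^2)).
    """
    return math.exp(-((alpha - target) ** 2) / (2.0 * sigma ** 2))


def index_weights(index_values, targets, sigma=0.2):
    """Turn per-index fitness into weights that sum to 1 (assumed normalization)."""
    fitness = {
        name: task_fitness(value, targets[name], sigma)
        for name, value in index_values.items()
    }
    total = sum(fitness.values())
    if total == 0.0:
        # Every fitness underflowed to zero: fall back to uniform weights.
        return {name: 1.0 / len(fitness) for name in fitness}
    return {name: f / total for name, f in fitness.items()}


def value_density(index_values, targets, sigma=0.2):
    """Weighted sum of index values (assumed aggregation rule)."""
    weights = index_weights(index_values, targets, sigma)
    return sum(weights[name] * value for name, value in index_values.items())


# Hypothetical index values for one data set and hypothetical scene defaults
# for a classification task scene.
values = {"integrity_rate": 0.9, "label_consistency": 0.7, "intra_class_diversity": 0.4}
classification_defaults = {
    "integrity_rate": 0.95,
    "label_consistency": 0.9,
    "intra_class_diversity": 0.5,
}
density = value_density(values, classification_defaults)
```

Passing a different set of scene defaults for the same index values yields different weights, which is how the same data set can have different value densities for different target task scenes.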

According to the data value density evaluation method provided by the invention, determining the task fitness of each evaluation index based on the index value of each evaluation index and the default value of the target task scene for each evaluation index comprises calculating the task fitness $T_{Di}$ of each evaluation index according to the following Gaussian kernel function: [formula not reproduced]; wherein $\alpha_i$ represents the index value of the $i$-th evaluation index, $T_i$ represents the default value of the target task scene for the $i$-th evaluation index, $\sigma$ represents the bandwidth parameter of the Gaussian kernel, and $\exp(\cdot)$ represents the exponential function with the natural constant $e$ as its base.

According to the data value density evaluation method provided by the invention, determining the index values of the plurality of evaluation indexes of the data set to be evaluated comprises: performing data distillation on the data set to be evaluated to obtain a density ratio index of the distilled data subset relative to the data set to be evaluated, wherein the density ratio index represents the ratio of the average per-category data amount in the distilled data subset to the average per-category data amount in the data set to be evaluated; and calculating an integrity rate index, a label consistency index, a noise detection index, an intra-class diversity index, an inter-class diversity index, a data statistics complexity index, an ambiguity index, a domain offset index, a leakage ratio index and a prediction accuracy index of the data set to be evaluated; the plurality of evaluation indexes comprise at least two of a density ratio index, an integrity rate index, a label consistency