CN-122020168-A - Data set evaluation method and device, electronic device, and storage medium
Abstract
The application provides a data set evaluation method and device, an electronic device, and a storage medium. The method comprises: preprocessing a data set to be evaluated to obtain a standard data set, wherein the preprocessing at least comprises classification labeling, which labels the data type of the data set to be evaluated; constructing a target data set according to the standard data set; performing zero-shot or few-shot inference with a reference model on the target data set to obtain an initial score of the target data set; performing intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set to obtain a target model; performing zero-shot or few-shot inference with the target model on the target data set to obtain an updated score of the target data set; calculating the performance gain rate of the reference model according to the initial score and the updated score; and evaluating the data set to be evaluated according to the performance gain rate. The application can improve the accuracy of data set quality evaluation and the scenario suitability of data set evaluation.
Inventors
- FAN TIANCHEN
- GAO YUANZI
- YAN MAN
- LI NAN
- YE WEI
- JIANG MING
- LIU BO
- WANG XIUHAI
- ZHOU KAI
- CHENG CHANG
- LIANG FENG
- LI ZHIMU
Assignees
- CITIC General Institute of Architectural Design and Research Co., Ltd. (中信建筑设计研究总院有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20260127
Claims (10)
- 1. A data set evaluation method, comprising: preprocessing a data set to be evaluated to obtain a standard data set, wherein the preprocessing at least comprises classification labeling, and the classification labeling is used to label the data type of the data set to be evaluated; constructing a target data set according to the standard data set, wherein the data type of the target data set is the same as that of the data set to be evaluated, and the data content of the target data set is isolated from the data set to be evaluated; performing zero-shot inference or few-shot inference with a reference model on the target data set to obtain an initial score of the target data set, wherein the reference model is an AI model that has not been trained on the data set to be evaluated; performing intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set to obtain a target model; performing zero-shot inference or few-shot inference with the target model on the target data set to obtain an updated score of the target data set; and calculating a performance gain rate of the reference model according to the initial score and the updated score, and evaluating the data set to be evaluated according to the performance gain rate.
- 2. The data set evaluation method according to claim 1, wherein the calculating the performance gain rate of the reference model according to the initial score and the updated score comprises: calculating the performance gain rate according to the following formula: ; wherein Gain is the performance gain rate, Score_Tuned is the updated score, and Score_Base is the initial score.
- 3. The data set evaluation method according to claim 2, wherein the evaluating the data set to be evaluated according to the performance gain rate comprises: evaluating the data set to be evaluated as a high-quality data set if the performance gain rate is greater than a preset threshold; evaluating the data set to be evaluated as a common data set if the performance gain rate is greater than or equal to 0 and less than or equal to the preset threshold; and evaluating the data set to be evaluated as a low-quality data set if the performance gain rate is less than 0.
- 4. The data set evaluation method according to claim 1, wherein the performing intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set comprises: loading the weights of the reference model and freezing the main parameters of the reference model; injecting a low-rank matrix as a bypass to a transmission layer of the reference model; inputting the standard data set into the reference model, and performing gradient updates on the parameters of the low-rank matrix according to the output of the reference model; and updating the weights according to the gradient-updated low-rank matrix.
- 5. The data set evaluation method according to claim 1, wherein the constructing a target data set according to the standard data set comprises: randomly selecting a preset number of data items to be input from the standard data set; inputting the data to be input into a preset large language model, the large language model outputting target data corresponding to the data to be input according to a preset execution instruction; and constructing the target data set according to the target data.
- 6. The data set evaluation method according to claim 5, further comprising, after the constructing the target data set according to the target data: removing the data to be input from the standard data set to obtain a final data set; wherein the performing intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set comprises: performing intervention training based on parameter-efficient fine-tuning on the reference model according to the final data set.
- 7. The data set evaluation method according to any one of claims 1 to 6, wherein the preprocessing further comprises metadata verification and denoising cleaning; the metadata verification is used to detect whether the data set to be evaluated contains data to be cleaned, namely data with missing target fields or incorrect formats; and the denoising cleaning is used to remove the data to be cleaned.
- 8. A data set evaluation device, characterized by comprising a preprocessing module, a construction module, an inference module, an intervention training module, a calculation module, and an evaluation module; the preprocessing module is configured to preprocess a data set to be evaluated to obtain a standard data set, wherein the preprocessing at least comprises classification labeling, and the classification labeling is used to label the data type of the data set to be evaluated; the construction module is configured to construct a target data set according to the standard data set, wherein the data type of the target data set is the same as that of the data set to be evaluated, and the data content of the target data set is isolated from the data set to be evaluated; the inference module is configured to perform zero-shot inference or few-shot inference with a reference model on the target data set to obtain an initial score of the target data set, wherein the reference model is an AI model that has not been trained on the data set to be evaluated; the intervention training module is configured to perform intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set to obtain a target model; the inference module is further configured to perform zero-shot inference or few-shot inference with the target model on the target data set to obtain an updated score of the target data set; the calculation module is configured to calculate the performance gain rate of the reference model according to the initial score and the updated score; and the evaluation module is configured to evaluate the data set to be evaluated according to the performance gain rate.
- 9. An electronic device comprising a processor and a memory, wherein the memory is configured to store instructions, and the processor is configured to invoke the instructions in the memory to cause the electronic device to perform the data set evaluation method of any one of claims 1 to 7.
- 10. A computer readable storage medium storing computer instructions which, when run on an electronic device, cause the electronic device to perform the data set evaluation method of any one of claims 1 to 7.
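The grading rule of claims 2 and 3 can be illustrated with a short Python sketch. The formula itself is not reproduced in the text of claim 2, so the code assumes a relative form, Gain = (Score_Tuned - Score_Base) / Score_Base, purely for illustration. That form, the function names, and the threshold value used in the example are assumptions, not part of the claims.

```python
def performance_gain_rate(score_base: float, score_tuned: float) -> float:
    """Relative change of the updated score over the initial score.

    ASSUMPTION: the claim's formula is missing from the text; the relative
    form (tuned - base) / base is used here only as a plausible stand-in.
    """
    if score_base <= 0:
        raise ValueError("score_base must be positive for a relative gain rate")
    return (score_tuned - score_base) / score_base


def grade_dataset(gain: float, threshold: float) -> str:
    """Three-way grading as described in claim 3."""
    if gain > threshold:
        return "high-quality"   # gain rate above the preset threshold
    if gain >= 0:
        return "common"         # 0 <= gain rate <= threshold
    return "low-quality"        # gain rate below 0


# Hypothetical scores and threshold, chosen only for the example.
gain = performance_gain_rate(score_base=0.50, score_tuned=0.60)
grade = grade_dataset(gain, threshold=0.05)
```

Under the assumed formula, a reference model that scores 0.50 before fine-tuning and 0.60 afterwards has a gain rate of about 0.2, which exceeds the example threshold and is therefore graded as high-quality.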
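The fine-tuning steps of claim 4 (freeze the main weights, inject a low-rank bypass, update only the bypass by gradient descent, then fold the bypass back into the weights) resemble the widely used LoRA technique. The following pure-Python sketch shows one such step on a tiny linear layer. The squared-error loss, the matrix shapes, the learning rate, and all names are assumptions made for illustration; the claims do not specify them.

```python
def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def outer(u, v):
    """Outer product of two vectors."""
    return [[x * y for y in v] for x in u]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lora_step(W, A, B, x, target, lr):
    """One gradient step on the low-rank factors A and B; W stays frozen.

    Forward pass: y = W x + B (A x), loss L = 0.5 * ||y - target||^2.
    """
    h = matvec(A, x)                                  # bypass activation, rank r
    y = [p + q for p, q in zip(matvec(W, x), matvec(B, h))]
    g = [yi - ti for yi, ti in zip(y, target)]        # dL/dy
    grad_B = outer(g, h)                              # dL/dB, shape d_out x r
    grad_A = outer(matvec(transpose(B), g), x)        # dL/dA, shape r x d_in
    new_A = [[a - lr * ga for a, ga in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * gb for b, gb in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B


def merge(W, A, B):
    """Fold the trained bypass into the weights: W' = W + B A."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen main weights (2 x 2)
A = [[0.5, 0.5]]               # low-rank factor, r x d_in with r = 1
B = [[0.0], [0.0]]             # low-rank factor, d_out x r, zero-initialised
A, B = lora_step(W, A, B, x=[1.0, 1.0], target=[2.0, 1.0], lr=0.5)
W_merged = merge(W, A, B)
```

Because B starts at zero, the bypass initially leaves the model's output unchanged, and only the small factors A and B are ever updated. After the single step in the example, the merged weights move the output for the input `[1.0, 1.0]` from `[1.0, 1.0]` towards the target `[2.0, 1.0]`.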
Description
Data set evaluation method and device, electronic device, and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data set evaluation method, a data set evaluation device, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, data has become a core element driving leaps in large-model capability. According to the High-Quality Data Set Construction Guide and the High-Quality Data Set Practice Guide (1.0), a high-quality data set is a data set that, through acquisition, processing, and similar steps, can effectively improve the performance of a model.
Existing mainstream data set evaluation techniques focus mainly on static evaluation dimensions. They check document integrity and format normalization, that is, whether the data has clear metadata (such as the id, source, and label fields specified in the high-quality data set format requirements) and whether the file structure conforms to the JSON/Parquet standard. They also use a rule engine or regular expressions to detect whether the data contains private information, garbled characters, duplicate data, or content that violates laws and regulations.
However, the prior art has obvious limitations. First, the "knowledge threshold" cannot be quantified: static indicators cannot identify data that is "correct in content but without nutritional value". For example, large amounts of repetitive filler text, or synthetic data generated by a low-level model, may be fully compliant in format and syntax yet do nothing to improve the model's logical reasoning or specialized capabilities. Second, objective verification of scenario suitability is lacking: the high-quality data set classification guideline divides data into general knowledge, industry-general knowledge, and industry-specific knowledge, but a static label alone cannot determine whether a given medical data set actually improves a model's performance in a medical diagnosis scenario.
Therefore, a new data set evaluation method is needed to solve the above problems.
Disclosure of Invention
In view of the above, the present application provides a data set evaluation method and device, an electronic device, and a storage medium, which can improve the accuracy of data set quality evaluation and the scenario suitability of data set evaluation.
A first aspect of the embodiments of the present application provides a data set evaluation method, comprising: preprocessing a data set to be evaluated to obtain a standard data set, wherein the preprocessing at least comprises classification labeling, and the classification labeling is used to label the data type of the data set to be evaluated; constructing a target data set according to the standard data set, wherein the data type of the target data set is the same as that of the data set to be evaluated and the data content of the target data set is isolated from the data set to be evaluated; performing zero-shot inference or few-shot inference with a reference model on the target data set to obtain an initial score of the target data set, wherein the reference model is an AI model that has not been trained on the data set to be evaluated; performing intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set to obtain a target model; performing zero-shot inference or few-shot inference with the target model on the target data set to obtain an updated score of the target data set; calculating a performance gain rate of the reference model according to the initial score and the updated score; and evaluating the data set to be evaluated according to the performance gain rate.
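The order of operations in the first aspect (score the reference model, fine-tune it on the data set under evaluation, then score again on the same isolated target set) can be sketched in Python as follows. The `fine_tune` and `score` callables, the toy "model" represented as a set of known topics, and the sample data are all hypothetical stand-ins; a real system would plug in an actual model, a parameter-efficient fine-tuning routine, and a benchmark scorer.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def evaluate_dataset(
    reference_model: Model,
    standard_set: Sequence[Sample],
    target_set: Sequence[Sample],
    fine_tune: Callable[[Model, Sequence[Sample]], Model],
    score: Callable[[Model, Sequence[Sample]], float],
) -> Tuple[float, float]:
    """Return (initial score, updated score) on the same target set."""
    initial = score(reference_model, target_set)            # before intervention
    target_model = fine_tune(reference_model, standard_set)  # intervention training
    updated = score(target_model, target_set)               # after intervention
    return initial, updated


# Toy stand-ins: a "model" is the set of topics it can answer questions on.
def toy_fine_tune(model: frozenset, samples) -> frozenset:
    # Returns a new model; the reference model itself is left untouched.
    return model | {topic for topic, _ in samples}


def toy_score(model: frozenset, samples) -> float:
    # Fraction of target samples whose topic the model knows.
    return sum(topic in model for topic, _ in samples) / len(samples)


reference = frozenset({"anatomy"})
standard = [("cardiology", "q1"), ("cardiology", "q2")]   # data set under evaluation
target = [("cardiology", "q9"), ("anatomy", "q10")]       # same types, different content
initial, updated = evaluate_dataset(reference, standard, target, toy_fine_tune, toy_score)
```

In this toy setup the target set shares topics with the standard set but none of its questions, which mirrors the requirement that the target data set has the same data type while its content is isolated from the data set to be evaluated. The two returned scores are the inputs to the performance gain rate.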
In one possible implementation, the calculating the performance gain rate of the reference model according to the initial score and the updated score comprises calculating the performance gain rate according to the following formula: ; wherein Gain is the performance gain rate, Score_Tuned is the updated score, and Score_Base is the initial score.
In one possible implementation, the evaluating the data set to be evaluated according to the performance gain rate comprises: evaluating the data set to be evaluated as a high-quality data set if the performance gain rate is greater than a preset threshold; evaluating the data set to be evaluated as a common data set if the performance gain rate is greater than or equal to 0 and less than or equal to the preset threshold; and evaluating the data set to be evaluated as a low-quality data set if the performance gain rate is less than 0.
In one possible implementation, the performing intervention training based on parameter-efficient fine-tuning on the reference model according to the standard data set comprises loading the weights of the reference model, freezing the main parameters of the reference model, injecting a low-rank matrix in a