CN-122019519-A - Conflict restoration method, device, equipment and medium for relational training data

Abstract

The application relates to the technical field of artificial intelligence, and in particular to a conflict repair method, device, equipment and medium for relational training data. The method performs key attribute identification on the training data to be repaired: without retraining the model, it evaluates the performance gain obtained by repairing different attribute combinations and determines which attributes to repair first. It also performs key sample identification on the training data to be repaired, determining which training samples to repair first based on how repairing a sample changes the model parameters. The repair process can therefore concentrate on the most influential fields and samples rather than comprehensively repairing all attributes and all samples, which avoids a large number of low-gain modification operations, makes the repair process targeted and selective, and reduces the cost of repairing training data.

Inventors

XU ZIHUAN
FAN WENFEI
XIE MIN
REN WEILONG
WANG YAOSHU
HAN XIAOYU

Assignees

深圳计算科学研究院

Dates

Publication Date: 20260512
Application Date: 20251230

Claims (10)

1. A conflict repair method for relational training data, the method comprising: acquiring training data to be repaired and a target model, wherein the training data to be repaired is used for training the target model and comprises a plurality of attributes and a plurality of samples; acquiring error information of the target model on the training data to be repaired, and screening the training data to be repaired according to the error information to obtain an evaluation set; determining remaining training data to be repaired according to the training data to be repaired and the evaluation set, and performing conflict repair on the remaining training data to be repaired to obtain pre-repair data; acquiring a plurality of candidate attribute subsets, calculating, for any candidate attribute subset and according to the target model, the evaluation set and the pre-repair data, an attribute-level performance gain obtained when the candidate attribute subset is repaired, traversing all candidate attribute subsets to obtain the attribute-level performance gain of each candidate attribute subset, and identifying key attributes according to the attribute-level performance gain of each candidate attribute subset to obtain attribute-level repair data; for any sample, calculating, according to the target model, the evaluation set, the remaining training data to be repaired and the pre-repair data, a sample-level performance gain obtained when the sample is repaired, and identifying key samples according to the sample-level performance gain to obtain sample-level repair data; and performing conflict repair on the attribute-level repair data and the sample-level repair data to obtain repaired data.
2. The conflict repair method according to claim 1, wherein the screening of the training data to be repaired according to the error information to obtain an evaluation set comprises: obtaining predicted categories of the training data to be repaired, and classifying the training data to be repaired according to the predicted categories to obtain category training data to be repaired corresponding to each category; for any category, acquiring attribute values of the category training data to be repaired corresponding to that category, and stratifying the category training data to be repaired according to the attribute values to obtain layered training data to be repaired corresponding to each layer; and for any layer, screening the layered training data to be repaired according to the error information to obtain a screened set of training data to be repaired for that layer, traversing all layers to obtain the screened sets of training data to be repaired for all layers, and determining the screened sets for all layers as the evaluation set.
3. The conflict repair method according to claim 2, wherein the error information includes a sample loss value, a prediction confidence and an output stability; and the screening of the layered training data to be repaired according to the error information to obtain a screened set of training data to be repaired for each layer comprises: performing initial screening on the training data to be repaired according to the prediction confidence and the output stability to obtain initial screening results; and sorting the initial screening results by sample loss value in descending order, and determining the initial screening results whose rank falls within a preset interval as the evaluation set.
4. The conflict repair method according to claim 1, wherein the acquiring of a plurality of candidate attribute subsets comprises: acquiring a plurality of initial candidate attribute subsets of a current round, calculating, for any initial candidate attribute subset and according to the target model, the evaluation set and the pre-repair data, an initial attribute-level performance gain obtained when the initial candidate attribute subset is repaired, and traversing all initial candidate attribute subsets to obtain the attribute-level performance gain of each initial candidate attribute subset; selecting the m initial candidate attribute subsets with the highest attribute-level performance gains as seed candidate attribute subsets, wherein m is an integer greater than zero; and performing crossover and mutation on the seed candidate attribute subsets to obtain new candidate attribute subsets, taking the new candidate attribute subsets as the plurality of initial candidate attribute subsets of the next round until a preset number of rounds is reached to obtain final new candidate attribute subsets, and determining the final new candidate attribute subsets as the plurality of candidate attribute subsets.
5. The conflict repair method according to claim 4, wherein the calculating, according to the target model, the evaluation set and the pre-repair data, of the initial attribute-level performance gain obtained when the initial candidate attribute subset is repaired comprises: determining a pre-repair attribute data set of the initial candidate attribute subset in the pre-repair data; acquiring a preset lightweight model, and training the lightweight model by using the pre-repair attribute data set to obtain a trained lightweight model; calculating residuals of the target model on the pre-repair attribute data set, and fitting the residuals by using the trained lightweight model to obtain a residual fitting model; adding the residual fitting model and the target model to construct an evaluation model; calculating, according to the evaluation set, a first performance gain of the target model on the evaluation set and a second performance gain of the evaluation model on the evaluation set; and calculating the difference between the second performance gain and the first performance gain to obtain the attribute-level performance gain.
6. The conflict repair method according to claim 1, wherein the calculating, according to the target model, the evaluation set, the remaining training data to be repaired and the pre-repair data, of the sample-level performance gain obtained when the sample is repaired comprises: obtaining training parameters of the target model on the training data to be repaired, and calculating a model gradient at the training parameters according to a loss function of the target model, wherein the model gradient comprises a first gradient of the loss function at the training parameters based on the remaining training data to be repaired and a second gradient of the loss function at the training parameters based on the pre-repair data; constructing an update quantity model of the model parameters according to the training parameters, the first gradient and the second gradient; constructing a virtual model according to the target model and the update quantity model; calculating, according to the evaluation set, a third performance gain of the virtual model on the evaluation set and a fourth performance gain of the target model on the evaluation set; and calculating the difference between the third performance gain and the fourth performance gain to obtain the sample-level performance gain.
7. The conflict repair method according to claim 1, wherein the identifying of key samples according to the sample-level performance gain to obtain sample-level repair data comprises: determining, in the evaluation set, the k training data items with the highest sample-level performance gains as the sample-level repair data.
8. A conflict repair device for relational training data, the conflict repair device comprising: an acquisition module, configured to acquire training data to be repaired and a target model, wherein the training data to be repaired is used for training the target model and comprises a plurality of attributes and a plurality of samples; a screening module, configured to acquire error information of the target model on the training data to be repaired, and to screen the training data to be repaired according to the error information to obtain an evaluation set; a pre-repair module, configured to determine remaining training data to be repaired according to the training data to be repaired and the evaluation set, and to perform conflict repair on the remaining training data to be repaired to obtain pre-repair data; a first identification module, configured to acquire a plurality of candidate attribute subsets, to calculate, for any candidate attribute subset and according to the target model, the evaluation set and the pre-repair data, an attribute-level performance gain obtained when the candidate attribute subset is repaired, to traverse all candidate attribute subsets to obtain the attribute-level performance gain of each candidate attribute subset, and to identify key attributes according to the attribute-level performance gain of each candidate attribute subset to obtain attribute-level repair data; a second identification module, configured to calculate, for any sample and according to the target model, the evaluation set, the remaining training data to be repaired and the pre-repair data, a sample-level performance gain obtained when the sample is repaired, and to identify key samples according to the sample-level performance gain to obtain sample-level repair data; and a repair module, configured to perform conflict repair on the attribute-level repair data and the sample-level repair data to obtain repaired data.
9. A computer device, comprising a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the conflict repair method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the conflict repair method according to any one of claims 1 to 7.
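
The layered screening described in claims 2 and 3 can be illustrated with a minimal sketch. It is not the claimed implementation: the record fields (`pred`, `stratum`, `loss`, `confidence`, `stability`), the thresholds, the direction of the initial screen (keeping confident, stable samples) and the rank interval are all assumptions made for illustration.

```python
from collections import defaultdict

def build_evaluation_set(samples, min_conf=0.6, min_stab=0.8, rank_lo=0, rank_hi=2):
    """Group by predicted category and attribute-derived layer, then keep the
    samples whose loss rank falls inside [rank_lo, rank_hi) within each layer."""
    layers = defaultdict(list)
    for s in samples:
        layers[(s["pred"], s["stratum"])].append(s)

    evaluation_set = []
    for key in sorted(layers):
        # Initial screen on prediction confidence and output stability
        # (the direction of the thresholds is an assumption).
        kept = [s for s in layers[key]
                if s["confidence"] >= min_conf and s["stability"] >= min_stab]
        # Sort by sample loss, largest first, and keep the preset rank interval.
        kept.sort(key=lambda s: s["loss"], reverse=True)
        evaluation_set.extend(kept[rank_lo:rank_hi])
    return evaluation_set

# Hypothetical per-sample error information.
samples = [
    {"id": 1, "pred": "A", "stratum": "low",  "loss": 0.9, "confidence": 0.7, "stability": 0.90},
    {"id": 2, "pred": "A", "stratum": "low",  "loss": 0.4, "confidence": 0.9, "stability": 0.90},
    {"id": 3, "pred": "A", "stratum": "low",  "loss": 0.2, "confidence": 0.9, "stability": 0.90},
    {"id": 4, "pred": "A", "stratum": "high", "loss": 0.8, "confidence": 0.3, "stability": 0.90},
    {"id": 5, "pred": "B", "stratum": "low",  "loss": 0.5, "confidence": 0.8, "stability": 0.85},
]
eval_ids = [s["id"] for s in build_evaluation_set(samples)]
```

Screening layer by layer keeps every predicted category and attribute range represented in the evaluation set, instead of letting the layers with the largest losses dominate it.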
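
The round-based search over candidate attribute subsets in claim 4 resembles a small genetic search. The sketch below assumes a uniform crossover, a single-attribute flip as the mutation, and that seeds survive into the next round; none of these details are stated in the claims. `toy_gain` and the attribute names are hypothetical stand-ins for the attribute-level performance gain.

```python
import random

def evolve_attribute_subsets(attributes, gain_fn, pop_size=6, m=2, rounds=5, seed=0):
    """Evolve candidate attribute subsets for a fixed number of rounds."""
    rng = random.Random(seed)
    # Round 0: random non-empty subsets serve as the initial candidates.
    population = [
        frozenset(rng.sample(attributes, rng.randint(1, len(attributes))))
        for _ in range(pop_size)
    ]
    for _ in range(rounds):
        # Keep the m subsets with the highest attribute-level gain as seeds.
        seeds = sorted(population, key=gain_fn, reverse=True)[:m]
        children = list(seeds)  # assumption: seeds survive into the next round
        while len(children) < pop_size:
            a, b = rng.choice(seeds), rng.choice(seeds)
            # Crossover: each attribute's membership is copied from a random parent.
            child = {x for x in attributes if x in (a if rng.random() < 0.5 else b)}
            # Mutation: flip the membership of one randomly chosen attribute.
            child ^= {rng.choice(attributes)}
            if child:  # discard empty subsets
                children.append(frozenset(child))
        population = children
    return population

# Hypothetical gain: a fixed benefit per attribute minus a flat repair cost.
BENEFIT = {"age": 0.05, "city": 0.30, "income": 0.40, "phone": 0.02}

def toy_gain(subset):
    return sum(BENEFIT[name] for name in subset) - 0.1 * len(subset)

attributes = sorted(BENEFIT)
candidates = evolve_attribute_subsets(attributes, toy_gain)
best = max(candidates, key=toy_gain)
```

In the method itself, `gain_fn` would be the attribute-level performance gain computed from the target model, the evaluation set and the pre-repair data.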
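
The attribute-level gain of claim 5 can be approximated without retraining the target model, as the following sketch suggests. Several simplifications are assumed: the lightweight model is a one-variable least-squares line fitted directly to the residuals, a row is projected onto an attribute subset by summing its values, performance is negative mean squared error, and the second performance is measured for the evaluation model. The `target` function and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope*x + intercept; stands in for the lightweight model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return slope, my - slope * mx

def performance(model, rows):
    """Negative mean squared error, so that larger is better."""
    return -sum((model(r) - r["y"]) ** 2 for r in rows) / len(rows)

def attribute_level_gain(target, pre_repair_rows, eval_rows, subset):
    def feature(r):
        # Project a row onto the candidate subset (assumed: sum of its values).
        return sum(r[name] for name in subset)

    # Residuals of the frozen target model on the pre-repair data.
    residuals = [r["y"] - target(r) for r in pre_repair_rows]
    slope, intercept = fit_line([feature(r) for r in pre_repair_rows], residuals)

    def evaluation_model(r):
        # Target model plus the residual-fitting model; no retraining involved.
        return target(r) + slope * feature(r) + intercept

    return performance(evaluation_model, eval_rows) - performance(target, eval_rows)

def target(r):
    # Hypothetical target model that learned to ignore x2 from conflicting data.
    return 2 * r["x1"]

def row(x1, x2):
    return {"x1": x1, "x2": x2, "y": 2 * x1 + 3 * x2}

pre_repair_rows = [row(0, 0), row(0, 1), row(1, 0), row(1, 1)]
eval_rows = [row(1, 2), row(2, 0)]
gain_x1 = attribute_level_gain(target, pre_repair_rows, eval_rows, ("x1",))
gain_x2 = attribute_level_gain(target, pre_repair_rows, eval_rows, ("x2",))
```

In this toy setting the subset containing `x2` yields the larger gain, because the residuals of the target model are explained by `x2` alone.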
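
Claims 6 and 7 estimate the effect of repairing a sample from gradients at the already trained parameters, then keep the top k samples. The sketch below is one possible reading, not the update quantity model of the application: it assumes a one-parameter linear model, takes the first and second gradients per sample before and after repair, and uses a single first-order step with a hypothetical learning rate `lr` as the parameter update.

```python
def train_w(rows):
    """Closed-form least squares for the one-parameter model y = w * x."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)

def grad(w, x, y):
    """Gradient of the loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def performance(w, eval_rows):
    """Negative mean squared error, so that larger is better."""
    return -sum((w * x - y) ** 2 for x, y in eval_rows) / len(eval_rows)

def sample_level_gains(dirty, repaired, eval_rows, lr=0.05):
    w = train_w(dirty)                    # parameters trained on the unrepaired data
    base = performance(w, eval_rows)      # performance of the target model
    gains = []
    for (x0, y0), (x1, y1) in zip(dirty, repaired):
        g_before = grad(w, x0, y0)        # gradient with the sample before repair
        g_after = grad(w, x1, y1)         # gradient with the sample after repair
        # Assumed update: one first-order step along the gradient difference.
        w_virtual = w - lr * (g_after - g_before)
        gains.append(performance(w_virtual, eval_rows) - base)
    return gains

def top_k(gains, k):
    """Indices of the k samples with the highest sample-level gain."""
    return sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)[:k]

# Hypothetical data: indices 2 and 3 carry conflicting labels before repair.
dirty     = [(1.0, 2.0), (2.0, 4.0), (3.0, 0.0), (1.0, 2.5)]
repaired  = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (1.0, 2.0)]
eval_rows = [(1.0, 2.0), (2.0, 4.0)]
gains = sample_level_gains(dirty, repaired, eval_rows)
key_samples = top_k(gains, k=1)
```

Samples whose repair leaves the gradient unchanged receive a gain of zero, so they are never preferred over a sample whose repair moves the virtual model toward better performance on the evaluation set.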

Description

Conflict restoration method, device, equipment and medium for relational training data

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a conflict repair method, device, equipment and medium for relational training data.

Background

In the training of machine learning models, model performance depends heavily on the accuracy and consistency of the training data. In practice, relational training data is typically sourced from multiple business systems or data collection channels, such as customer management systems, transaction systems and manually entered databases. Because the data is updated frequently, has a complex structure and comes from heterogeneous sources, conflicts arise easily during integration and maintenance, including inconsistent attribute values, violated constraints and cross-table semantic contradictions. Such conflicts interfere with parameter learning in the model training stage, so that the feature distribution learned by the model deviates from the real pattern, which in turn degrades prediction performance or leaves generalization ability insufficient.

Existing data conflict resolution (Conflict Resolution, CR) methods are mostly based on manual rules, integrity constraints or statistical detection mechanisms, and usually adopt either full-scale repair or local cleaning driven by anomaly scores. The drawback of these methods is that performing full-scale repair on large-scale multi-table data requires traversing, comparing and modifying all records, which is extremely costly. How to reduce the repair cost has therefore become an urgent problem in the process of repairing conflicting data.

Disclosure of Invention

In view of this, the embodiments of the application provide a conflict repair method, device, equipment and medium for relational training data, so as to solve the problem of high repair cost in the process of repairing conflicting data.

In a first aspect, an embodiment of the present application provides a conflict repair method for relational training data, the conflict repair method including: acquiring training data to be repaired and a target model, wherein the training data to be repaired is used for training the target model and comprises a plurality of attributes and a plurality of samples; acquiring error information of the target model on the training data to be repaired, and screening the training data to be repaired according to the error information to obtain an evaluation set; determining remaining training data to be repaired according to the training data to be repaired and the evaluation set, and performing conflict repair on the remaining training data to be repaired to obtain pre-repair data; acquiring a plurality of candidate attribute subsets, calculating, for any candidate attribute subset and according to the target model, the evaluation set and the pre-repair data, an attribute-level performance gain obtained when the candidate attribute subset is repaired, traversing all candidate attribute subsets to obtain the attribute-level performance gain of each candidate attribute subset, and identifying key attributes according to the attribute-level performance gain of each candidate attribute subset to obtain attribute-level repair data; for any sample, calculating, according to the target model, the evaluation set, the remaining training data to be repaired and the pre-repair data, a sample-level performance gain obtained when the sample is repaired, and identifying key samples according to the sample-level performance gain to obtain sample-level repair data; and performing conflict repair on the attribute-level repair data and the sample-level repair data to obtain repaired data.

In a second aspect, an embodiment of the present application provides a conflict repair device for relational training data, the conflict repair device including: The acquisition module is used for acquiring training data to be repaired and a target model, wherein the training data to be repaired is used for training the target model and comprises a plurality of attributes and a plurality of samples; The screening module is used for acquiring error information of the target model on the training data to be repaired, and screening the training data to be repaired according to the error information to obtain an evaluation set; The pre-repair module is used for determining remaining training data to be repaired according to the training data to be repaired and the evaluation set, and performing conflict repair on the remaining training data to be repaired to obtain pre-repair data; The first identification module is used for acquiri