CN-122020171-A - Quality evaluation method and system for AI training-oriented data cleaning and labeling


Abstract

The application provides a quality evaluation method and system for AI training-oriented data cleaning and labeling, relating to the technical field of information security. The method comprises: acquiring an encrypted sample set for AI training from an untrusted domain and standard reference data from a trusted domain; partitioning the encrypted sample set into a plurality of encrypted sample subsets based on the time information of each ciphertext sample, and converting the standard reference data into a standard reference vector matched with the feature dimensions of the encrypted sample subsets; obtaining a plurality of ciphertext feature descriptors in an encrypted state; calculating a distribution offset value between each ciphertext feature descriptor and the standard reference vector; and, when a distribution offset value is larger than a preset safety threshold, judging that the corresponding encrypted sample subset has an abnormal distribution and removing it from the encrypted sample set, thereby realizing quality evaluation of data cleaning and labeling. The quality and reliability of the training samples for downstream AI models can thus be improved while the privacy and security of the data are guaranteed.
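The decision rule summarized in the abstract can be illustrated with a minimal plaintext sketch. It assumes the distribution offset is the sum of squared per-dimension differences described in claim 4, and it uses ordinary floats in place of ciphertexts; no homomorphic encryption is performed, and how the threshold comparison is carried out on ciphertexts is outside the sketch. The names `squared_offset`, `split_by_threshold`, and the sample values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def squared_offset(descriptor: Sequence[float], reference: Sequence[float]) -> float:
    """Sum of squared per-dimension differences (plaintext stand-in for claim 4)."""
    if len(descriptor) != len(reference):
        raise ValueError("descriptor and reference must have the same dimension")
    return sum((d - r) ** 2 for d, r in zip(descriptor, reference))


def split_by_threshold(
    descriptors: Dict[str, Sequence[float]],
    reference: Sequence[float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Return (kept, rejected) block identifiers.

    A block is rejected when its offset is strictly greater than the threshold,
    and kept when the offset is smaller than or equal to it.
    """
    kept: List[str] = []
    rejected: List[str] = []
    for block_id, descriptor in descriptors.items():
        if squared_offset(descriptor, reference) > threshold:
            rejected.append(block_id)
        else:
            kept.append(block_id)
    return kept, rejected


# Hypothetical values: one block close to the reference, one far from it.
reference = [0.0, 1.0]
descriptors = {"block-0": [0.1, 1.1], "block-1": [2.0, 5.0]}
kept, rejected = split_by_threshold(descriptors, reference, threshold=0.5)
```

With these illustrative values, `block-0` stays in the sample set and `block-1` is removed, mirroring the "larger than a preset safety threshold" condition.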

Inventors

CAO JING
YANG YING
HUANG YIRONG

Assignees

五维要数智能科技(上海)有限公司
北京凌云光子技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260129

Claims (10)

1. A quality evaluation method for AI training-oriented data cleaning and labeling, characterized in that the method is suitable for evaluating AI training samples under privacy protection and comprises the following steps: acquiring an encrypted sample set for AI training derived from an untrusted domain and standard reference data derived from a trusted domain, wherein the encrypted sample set comprises a plurality of ciphertext samples and a plurality of ciphertext labels, and the standard reference data comprises statistical summary data of benign data corresponding to the encrypted sample set and the ciphertext labels; based on the time information of each ciphertext sample, performing block index construction on the encrypted sample set to obtain a plurality of encrypted sample subsets, and converting the standard reference data into a standard reference vector matched with the feature dimensions of the encrypted sample subsets; in an encrypted state, performing polynomial approximation calculation on each encrypted sample subset to obtain ciphertext feature information of each encrypted sample subset, and performing dimension alignment on each piece of ciphertext feature information with the standard reference vector as the reference, so as to map the position of each encrypted sample subset in a statistical distribution space and obtain a plurality of ciphertext feature descriptors; calculating a distribution offset value between each ciphertext feature descriptor and the standard reference vector; and when a distribution offset value is larger than a preset safety threshold, determining that the corresponding encrypted sample subset has a distribution abnormality and removing that subset from the encrypted sample set, so as to realize quality evaluation of data cleaning and labeling.
2. The method according to claim 1, wherein the method further comprises: taking each encrypted sample subset whose distribution offset value is smaller than or equal to the preset safety threshold as a qualified encrypted sample subset; performing a ciphertext accumulation operation on the ciphertext feature descriptors corresponding to all qualified encrypted sample subsets to obtain update statistical information; and updating the standard reference vector by using a homomorphic encryption algorithm based on the update statistical information to obtain an updated standard reference vector.
3. The method according to claim 1, wherein the method further comprises: acquiring a feature importance vector from the trusted domain, wherein the feature importance vector comprises a weight value corresponding to each feature dimension of the ciphertext feature descriptors; performing a ciphertext multiplication operation on the standard reference vector and the feature importance vector by using a homomorphic encryption algorithm to obtain a weighted standard reference vector; and performing a ciphertext multiplication operation on each feature element in the ciphertext feature descriptors and the corresponding weight value in the feature importance vector to obtain a plurality of weighted ciphertext feature descriptors; wherein the calculating a distribution offset value between each ciphertext feature descriptor and the standard reference vector comprises: calculating a distribution offset value between each weighted ciphertext feature descriptor and the weighted standard reference vector.
4. The method according to claim 3, wherein the calculating a distribution offset value between each weighted ciphertext feature descriptor and the weighted standard reference vector comprises: calculating, by using the subtraction property of homomorphic encryption, a difference vector between each weighted ciphertext feature descriptor and the weighted standard reference vector in each corresponding dimension; calculating, by using the multiplication property of homomorphic encryption, the square of each vector element in the difference vector; and accumulating, by using the addition property of homomorphic encryption, all squared terms in each difference vector to obtain the distribution offset value of each encrypted sample subset relative to the standard reference data.
5. The method of claim 1, wherein the standard reference data comprises a first-order origin moment and a second-order origin moment corresponding to each feature dimension of the encrypted sample set and the ciphertext labels; and wherein the performing block index construction on the encrypted sample set based on the time information of each ciphertext sample to obtain a plurality of encrypted sample subsets, and converting the standard reference data into a standard reference vector matched with the feature dimensions of the encrypted sample subsets, comprises: dividing the time span of the encrypted sample set into a plurality of consecutive time intervals based on a preset time window length; determining the time interval to which each ciphertext sample belongs according to its time information, and assigning a corresponding block index identifier to each ciphertext sample based on that time interval; combining ciphertext samples with the same block index identifier to obtain the plurality of encrypted sample subsets; and splicing, according to a preset feature arrangement order, the first-order origin moment and the second-order origin moment of each feature dimension in the standard reference data to obtain the standard reference vector matched with the feature dimensions of the encrypted sample subsets.
6. The method according to claim 1, wherein the performing, in an encrypted state, polynomial approximation calculation on each encrypted sample subset to obtain ciphertext feature information of each encrypted sample subset comprises: in an encrypted state, combining each ciphertext sample in each encrypted sample subset with its corresponding ciphertext label into a ciphertext vector to be processed, and calculating the square of each element in the ciphertext vector to be processed; performing ciphertext accumulation on the original values of all ciphertext vectors to be processed in each encrypted sample subset to obtain a first accumulated value, and performing ciphertext accumulation on the squared values of all ciphertext vectors to be processed in each encrypted sample subset to obtain a second accumulated value; and combining the first accumulated value with the second accumulated value to obtain the ciphertext feature information of each encrypted sample subset.
7. The method of claim 1, wherein the performing dimension alignment on each piece of ciphertext feature information with the standard reference vector as the reference, so as to map the position of each encrypted sample subset in a statistical distribution space and obtain a plurality of ciphertext feature descriptors, comprises: rearranging the first accumulated value and the second accumulated value in each piece of ciphertext feature information according to the arrangement rule of the standard reference vector to obtain rearranged ciphertext feature information; obtaining the number of samples in each encrypted sample subset, and generating a reciprocal factor for each encrypted sample subset based on that number; averaging the rearranged ciphertext feature information according to the reciprocal factor by using the multiplication property of homomorphic encryption to obtain a plurality of statistical distribution coordinates, wherein each statistical distribution coordinate is the position to which the corresponding encrypted sample subset is mapped in the statistical distribution space; and determining each statistical distribution coordinate as the ciphertext feature descriptor of the corresponding encrypted sample subset.
8. A quality evaluation system for AI training-oriented data cleaning and labeling, characterized by comprising: an acquisition module, configured to acquire an encrypted sample set for AI training derived from an untrusted domain and standard reference data derived from a trusted domain, wherein the encrypted sample set comprises a plurality of ciphertext samples and a plurality of ciphertext labels, and the standard reference data comprises statistical summary data of benign data corresponding to the encrypted sample set and the ciphertext labels; a construction module, configured to perform block index construction on the encrypted sample set based on the time information of each ciphertext sample to obtain a plurality of encrypted sample subsets, and to convert the standard reference data into a standard reference vector matched with the feature dimensions of the encrypted sample subsets; a calculation module, configured to perform, in an encrypted state, polynomial approximation calculation on each encrypted sample subset to obtain ciphertext feature information of each encrypted sample subset, and to perform dimension alignment on each piece of ciphertext feature information with the standard reference vector as the reference, so as to map the position of each encrypted sample subset in a statistical distribution space and obtain a plurality of ciphertext feature descriptors, the calculation module being further configured to calculate a distribution offset value between each ciphertext feature descriptor and the standard reference vector; and a rejection module, configured to determine, when a distribution offset value is larger than a preset safety threshold, that the corresponding encrypted sample subset has a distribution abnormality and to remove that subset from the encrypted sample set, so as to realize quality evaluation of data cleaning and labeling.
9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing, when executing the computer program, the steps of the quality evaluation method for AI training-oriented data cleaning and labeling according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the quality evaluation method for AI training-oriented data cleaning and labeling according to any one of claims 1 to 7.
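
As an illustration of the block construction and descriptor steps recited in claims 5 to 7, the following is a minimal plaintext sketch. It is not the claimed implementation: plain floats stand in for ciphertexts, and ordinary addition and multiplication stand in for the homomorphic operations. The layout of the descriptor (all first-order moments followed by all second-order moments) is only one possible "preset feature arrangement order" and is an assumption, as are the names `partition_by_window` and `block_descriptor` and the sample values.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (timestamp, vector) where the vector holds the feature values followed by the label.
Sample = Tuple[float, List[float]]


def partition_by_window(samples: Sequence[Sample], window: float) -> Dict[int, List[List[float]]]:
    """Assign each sample a block index from its timestamp and group by that index."""
    if window <= 0:
        raise ValueError("window length must be positive")
    if not samples:
        return {}
    start = min(t for t, _ in samples)
    blocks: Dict[int, List[List[float]]] = defaultdict(list)
    for t, vec in samples:
        block_index = int((t - start) // window)  # consecutive fixed-length intervals
        blocks[block_index].append(vec)
    return dict(blocks)


def block_descriptor(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Accumulate values and squares, then scale by 1/n (first and second origin moments)."""
    if not vectors:
        raise ValueError("a block must contain at least one sample")
    dim = len(vectors[0])
    first = [0.0] * dim   # first accumulated value: sum of original values
    second = [0.0] * dim  # second accumulated value: sum of squared values
    for vec in vectors:
        for j, x in enumerate(vec):
            first[j] += x
            second[j] += x * x
    inv_n = 1.0 / len(vectors)  # reciprocal factor derived from the sample count
    # Assumed layout: all first-order moments, then all second-order moments.
    return [s * inv_n for s in first] + [s * inv_n for s in second]


# Hypothetical data: two samples fall in the first 5-second window, one in the third.
samples = [(0.0, [1.0, 0.0]), (1.0, [3.0, 1.0]), (10.0, [2.0, 1.0])]
blocks = partition_by_window(samples, window=5.0)
descriptors = {idx: block_descriptor(vecs) for idx, vecs in blocks.items()}
```

Only additions and multiplications appear in `block_descriptor`, which is why the averaging is expressed as multiplication by a reciprocal factor rather than as a division.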

Description

Quality evaluation method and system for AI training-oriented data cleaning and labeling

Technical Field

The application relates to the technical field of information security, and in particular to a quality evaluation method and system for AI training-oriented data cleaning and labeling.

Background

With the wide application of artificial intelligence technology in fields such as financial management, intelligent medical treatment, and government affairs integration, high-quality training data has become a key element for guaranteeing model performance. The prior art generally adopts a centralized data quality evaluation scheme: the data of all parties must be gathered on a central server, and abnormal data and label errors are detected by calculating the statistical distribution characteristics of the samples in a plaintext state. However, in cross-institution collaboration scenarios involving privacy-sensitive data, aggregating data in plaintext faces severe privacy exposure risks and compliance challenges, so traditional plaintext-based evaluation approaches fail. If conventional processing is applied directly to encrypted data, the encryption mechanism destroys the original statistical characteristics of the data and existing distribution measurement algorithms cannot be applied directly, so it is difficult for each participant to effectively evaluate the quality of the data and labels while the data remains invisible. The prior art therefore has the technical problem that effective quality evaluation and cleaning of encrypted training samples and labels cannot be realized while the privacy and security of the data are ensured.

Disclosure of Invention

The application aims to provide a quality evaluation method and system for AI training-oriented data cleaning and labeling, which are used for solving the technical problem in the prior art that effective quality evaluation and cleaning of encrypted training samples and labels cannot be realized while the privacy and security of the data are ensured.

In a first aspect, the present application provides a quality evaluation method for AI training-oriented data cleaning and labeling, including: acquiring an encrypted sample set for AI training derived from an untrusted domain and standard reference data derived from a trusted domain, wherein the encrypted sample set comprises a plurality of ciphertext samples and a plurality of ciphertext labels, and the standard reference data comprises statistical summary data of benign data corresponding to the encrypted sample set and the ciphertext labels; based on the time information of each ciphertext sample, performing block index construction on the encrypted sample set to obtain a plurality of encrypted sample subsets, and converting the standard reference data into a standard reference vector matched with the feature dimensions of the encrypted sample subsets; in an encrypted state, performing polynomial approximation calculation on each encrypted sample subset to obtain ciphertext feature information of each encrypted sample subset, and performing dimension alignment on each piece of ciphertext feature information with the standard reference vector as the reference, so as to map the position of each encrypted sample subset in a statistical distribution space and obtain a plurality of ciphertext feature descriptors; calculating a distribution offset value between each ciphertext feature descriptor and the standard reference vector; and when a distribution offset value is larger than a preset safety threshold, determining that the corresponding encrypted sample subset has a distribution abnormality and removing that subset from the encrypted sample set, so as to realize quality evaluation of data cleaning and labeling.

Optionally, the method further comprises: taking each encrypted sample subset whose distribution offset value is smaller than or equal to the preset safety threshold as a qualified encrypted sample subset; performing a ciphertext accumulation operation on the ciphertext feature descriptors corresponding to all qualified encrypted sample subsets to obtain update statistical information; and updating the standard reference vector by using a homomorphic encryption algorithm based on the update statistical information to obtain an updated standard reference vector.

Optionally, the method further comprises: acquiring a feature importance vector from the trusted domain, wherein the feature importance vector comprises a weight value corresponding to each feature dimension of the ciphertext feature descriptors; performing a ciphertext multiplication operation on the standard reference vector and the feature importance vector by using a homomorphic encryption algorithm to obtain a weighted standard reference vector; performing a ciphertext multiplication operation on each feature element in the ciphertext feature descriptors and a corresp