CN-121980141-A - Abnormal structure type data cleaning method based on homomorphic encryption
Abstract
The invention discloses an abnormal structure type data cleaning method based on homomorphic encryption, comprising the following steps: S1, cleaning abnormal structure type data based on homomorphic encryption; S2, the SOCI toolkit and its extensions; S3, the homomorphic comparison function scheme flow; S4, homomorphic data cleaning based on the SOCI toolkit. The disclosed method requires no decryption of the encrypted data set: it performs data cleaning operations directly on ciphertext and accurately identifies and marks erroneous data elements. It thereby eliminates at the root the risk of exposing data plaintext, resolves the core contradiction in traditional data detection, namely privacy disclosure when the original data is decrypted versus the inability to detect errors in encrypted data, and achieves a balance between privacy protection and data detection requirements.
Inventors
- WANG YAN
Assignees
- WANG Yan (王妍)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-29
Claims (4)
- 1. The homomorphic encryption-based abnormal structure type data cleaning method is characterized by comprising the following steps: S1, cleaning abnormal structure type data based on homomorphic encryption; S2, the SOCI toolkit and its extensions; S3, the homomorphic comparison function scheme flow; S4, homomorphic data cleaning based on the SOCI toolkit.
- 2. The homomorphic encryption-based abnormal structure type data cleaning method according to claim 1, wherein in step S1, cleaning executed by the data owner is defined as C1 and cleaning executed by a third party is defined as C2, and the data cleaning flow converted into the trust domain is presented centrally, comprising error detection and error repair links that involve constraints known to the data owner, constraints known to the third party, and processes executed by the third party, wherein error repair requires the assistance of the data owner.
- 3. The homomorphic encryption-based abnormal structure type data cleaning method according to claim 1, wherein in step S2, the SOCI toolkit is constructed based on the Paillier cryptosystem, supports homomorphic encryption operations, and is used to implement the privacy-preserving machine learning process.
- 4. The method according to claim 1, wherein in step S3, privacy-preserving data cleaning defines the data cleaning problem, after conversion for execution in the trust domain, as a homomorphic comparison function whose input is a data set together with a constraint condition and whose output is the subset of the data set satisfying the constraint condition; when the data cleaning function is executed in the trust domain by the data owner, it is equivalent to the conventional data cleaning process in a machine learning pipeline, and the third party executes the converted privacy-preserving data cleaning function in the trust domain by means of the privacy-preserving machine learning method.
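Claim 3 states that the SOCI toolkit is constructed on the Paillier cryptosystem. The following is a minimal sketch of the additive homomorphism that such a toolkit relies on; it is not the SOCI implementation itself, uses toy key sizes that are insecure in practice, and the function names are illustrative assumptions:

```python
import math
import random

def paillier_keygen(p=101, q=113):
    """Toy Paillier key generation; real use needs primes of >= 1024 bits each."""
    n = p * q
    n2 = n * n
    g = n + 1                                  # standard simplified generator
    lam = math.lcm(p - 1, q - 1)
    L = lambda x: (x - 1) // n
    mu = pow(L(pow(g, lam, n2)), -1, n)        # modular inverse (Python 3.8+)
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                 # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(sk, c):
    lam, mu, n = sk
    n2 = n * n
    L = lambda x: (x - 1) // n
    return (L(pow(c, lam, n2)) * mu) % n

pk, sk = paillier_keygen()
c1, c2 = encrypt(pk, 37), encrypt(pk, 5)
# additive homomorphism: Enc(37) * Enc(5) mod n^2 decrypts to 42
assert decrypt(sk, (c1 * c2) % (pk[0] ** 2)) == 42
# scalar multiplication on ciphertext: Enc(37)^3 decrypts to 111
assert decrypt(sk, pow(c1, 3, pk[0] ** 2)) == 111
```

These two ciphertext operations (addition and scalar multiplication) are the building blocks from which secure comparison protocols such as those in the SOCI toolkit are composed; the comparison protocol itself involves additional interaction between parties and is not shown here.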
Description
Abnormal structure type data cleaning method based on homomorphic encryption
Technical Field
The invention belongs to the technical field of data cleaning, and in particular relates to an abnormal structure type data cleaning method based on homomorphic encryption.
Background
Big data and Machine Learning (ML) techniques have driven significant revolutions in a number of areas. Machine learning has achieved remarkable progress by virtue of large amounts of training data and high-performance computing resources. With the wide spread of machine learning applications come privacy security concerns and outsourcing requirements. Meanwhile, cloud computing and outsourcing services make it ever more convenient to build machine learning applications: a user can send data to a cloud server providing an outsourcing service and enjoy the technical advantages of machine learning while data privacy is ensured. To address the data privacy problem, researchers have proposed and applied Privacy-Preserving Machine Learning (PPML) technology, which targets various pain points in the field of data privacy and has by now received extensive attention in related fields. Privacy-preserving machine learning can protect various types of privacy in machine learning applications, such as data set privacy, training model privacy, and participant-associated privacy. Within privacy-preserving machine learning research, data cleaning plays an extremely important role: a great deal of work has improved machine learning model performance through data processing, and, conversely, has improved the data cleaning process by means of machine learning models.
In conventional data cleaning for machine learning studies, a data engineer can complete the cleaning process with simple methods, but actual operation faces a variety of complications. Furthermore, during the training and evaluation phases, it may be necessary to return to the data cleaning stage and reprocess the data. On the one hand, the data owner may lack the ability to complete the data cleaning; on the other hand, it is difficult for a third party to trust that the data set provided by the data owner is clean. Thus, the privacy-preserving data cleaning process has become a new challenge for privacy-preserving machine learning. Existing Privacy-Preserving Machine Learning (PPML) schemes all default to using clean data sets, without taking into account errors that may exist in real data sets. In real scenarios, however, factors such as manual recording errors and sensor faults often introduce deviations into collected data; in the database field, such inaccurate, incomplete, or inconsistent data are called 'dirty data', 'abnormal data', or 'coarse data'. Data cleaning, as a key preprocessing link, refers to the process of detecting and identifying damaged, erroneous, or irrelevant problem data in a record set, table, or database, so as to provide a reliable basis for subsequent handling of the dirty or coarse data. The current data cleaning process faces two core problems. The first is the diversity of error types and the dependence on collaboration in error detection. Error detection is the primary step in data cleaning, and its core difficulty is the complexity of data error types: errors in real data sets include missing values, outliers, duplicate values, erroneous labels, multi-dimensional mismatches, and the like.
Based on differences in the processing objects, error detection can be divided into two types. One is Partial Cleaning (PC), aimed mainly at basic error types: these errors exist in most data sets, the data owner clearly knows the corresponding constraint rules (such as numerical ranges and format specifications), and the detection difficulty is relatively low. The other is Complete Cleaning (CC), which focuses on special errors that can directly limit model performance: these errors are concealed and unpredictable, the constraint conditions are usually mastered by a third party, and the detection difficulty is significantly higher. In addition, the repair link after error detection carries a collaboration cost: some error repairs must rely on background information or verification feedback provided by the data owner and cannot be completed independently by a third party, which further increases the execution complexity of the cleaning process, so that the applicable range of existing data cleaning methods does not match real-scenario requirements. Most existing data cleaning methods take simple scenarios as their design premise, and the diversity of error types and the heterogeneity of data distributions in real scenarios make it difficult for traditional simple methods to achieve full coverage, resulting in poor cleaning effects. More signifi