CN-116304601-B - Design method of trusted fair data transaction platform for machine learning data

Abstract

The invention discloses a design method of a data transaction platform for machine learning data that enables a data purchaser to accurately evaluate the quality of data before purchasing it, without revealing the data provider's private data during the evaluation. The model owner (the data purchaser) samples the data provider's data, the data provider submits the encrypted data subset, and the two parties train a model cooperatively using a privacy-preserving computation method. The quality of the provider's data is then evaluated from the accuracy of the resulting model on test data. The model owner may repeat this process with several different data providers, select the higher-quality data, and proceed with the subsequent transaction. The transaction data is not falsified when the transaction is completed, i.e. the data evaluated before the transaction is indeed the data delivered by the data provider at the time of the transaction. The invention can thereby ensure the credibility and fairness of the transacted data.

Inventors

LIU ZHUOTAO
LIU XUANQI

Assignees

Tsinghua University (清华大学)

Dates

Publication Date: 20260505
Application Date: 20230228

Claims (8)

1. A design method of a trusted fair data transaction platform for machine learning data, characterized by comprising the following steps: acquiring a pre-training model of a data purchaser based on required data of a machine learning task; acquiring all data owners participating in transactions based on the required data, and training the pre-training model, using a preset privacy computation method, on a data set sampled at a preset sampling proportion from the data point digest values of all data owners, to obtain a test model; performing accuracy testing on the test model by using preset test set data, and obtaining a data quality evaluation index according to the accuracy test result; and selecting an optimal data owner according to the data quality evaluation index, and paying a price corresponding to the preset sampling proportion data points to the optimal data owner by utilizing a data transaction platform, so as to acquire the preset sampling proportion data points of the optimal data owner; wherein the pre-training model used for local model training comprises a frozen layer, and wherein performing accuracy testing on the test model by using preset test set data and obtaining a data quality evaluation index according to the accuracy test result comprises: acquiring sampled input data at the preset sampling proportion according to the data point digest values; in the case where the data owner holds only the sampled input data, performing joint model training, inputting first test set data into the frozen layer to obtain a first intermediate output result that is returned to the data owner, and obtaining a first model accuracy index by comparing the output result obtained through the forward propagation process of the jointly trained model with the correct classification labels; and in the case where the data owner holds both the sampled input data and the corresponding classification labels, performing local model training by using the sampled input data, inputting second test set data into the frozen layer after training is completed to obtain a second intermediate output result, providing the second intermediate output result to the data owner for the second-half model to obtain a prediction result, and obtaining a second model accuracy index by comparing the prediction result with the correct classification labels.
2. The method of claim 1, further comprising, after obtaining the preset sampling proportion data points of the optimal data owner: performing joint local model training by using the preset sampling proportion data points to obtain a trained local model; and evaluating an accuracy index of the trained local model on the test set data, and comparing the accuracy index with the data quality evaluation index to obtain a data validity evaluation result.
3. The method of claim 2, wherein the pre-training model further comprises a data owner trainable layer and a data purchaser trainable layer, and wherein performing joint model training in the case where the data owner holds only the sampled input data comprises: inputting the sampled input data to the frozen layer and the data owner trainable layer for training to obtain a third intermediate output result; inputting the third intermediate output result to the data purchaser trainable layer for training output so as to obtain a correct classification label; and calculating a loss function according to the correct classification label, calculating a model gradient of the data purchaser trainable layer based on the loss calculation result and a back propagation method, and propagating the back-propagated gradient to the data owner trainable layer for training, wherein the model training is completed through multiple rounds of iteration.
4. The method of claim 1, wherein the data point digest values are stored in a distributed storage system, and obtaining the preset sampling proportion data points of the optimal data owner comprises: acquiring, in the distributed storage system, the storage address of the data point corresponding to each data point digest; and obtaining the actual preset sampling proportion data points of the optimal data owner according to the storage addresses of the data points.
5. A design system of a trusted fair data transaction platform for machine learning data, comprising: a task data release module, used for acquiring required data of the machine learning task based on the pre-training model of the data purchaser; a model joint training module, used for acquiring all data owners participating in transactions based on the required data, and training the pre-training model, using a preset privacy computation method, on a data set sampled at a preset sampling proportion from the data point digest values of all data owners, to obtain a test model; a quality evaluation comparison module, used for performing accuracy testing on the test model by using preset test set data and obtaining a data quality evaluation index according to the accuracy test result; and a data transaction acquisition module, used for selecting an optimal data owner according to the data quality evaluation index, and paying a price corresponding to the preset sampling proportion data points to the optimal data owner by utilizing a data transaction platform, so as to acquire the preset sampling proportion data points of the optimal data owner; wherein the model joint training module is further used for: acquiring sampled input data at the preset sampling proportion according to the data point digest values; in the case where the data owner holds only the sampled input data, performing joint model training, inputting first test set data into the frozen layer to obtain a first intermediate output result that is returned to the data owner, and obtaining a first model accuracy index by comparing the output result obtained through the forward propagation process of the jointly trained model with the correct classification labels; and in the case where the data owner holds both the sampled input data and the corresponding classification labels, performing local model training by using the sampled input data, inputting second test set data into the frozen layer after training is completed to obtain a second intermediate output result, providing the second intermediate output result to the data owner for the second-half model to obtain a prediction result, and obtaining a second model accuracy index by comparing the prediction result with the correct classification labels.
6. The system of claim 5, further comprising a validity assessment module used for: performing joint local model training by using the preset sampling proportion data points to obtain a trained local model; and evaluating an accuracy index of the trained local model on the test set data, and comparing the accuracy index with the data quality evaluation index to obtain a data validity evaluation result.
7. The system of claim 6, wherein the pre-training model further comprises a data owner trainable layer and a data purchaser trainable layer, and the model joint training module is further configured for: inputting the sampled input data to the frozen layer and the data owner trainable layer for training to obtain a third intermediate output result; inputting the third intermediate output result to the data purchaser trainable layer for training output so as to obtain a correct classification label; and calculating a loss function according to the correct classification label, calculating a model gradient of the data purchaser trainable layer based on the loss calculation result and a back propagation method, and propagating the back-propagated gradient to the data owner trainable layer for training, wherein the model training is completed through multiple rounds of iteration.
8. The system of claim 5, wherein the data point digest values are stored in a distributed storage system, and the data transaction acquisition module is further configured for: acquiring, in the distributed storage system, the storage address of the data point corresponding to each data point digest; and obtaining the actual preset sampling proportion data points of the optimal data owner according to the storage addresses of the data points.
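
The retrieval step recited in claims 4 and 8 can be pictured with a minimal Python sketch, reading the data point digest (summary value) as a cryptographic hash. Everything concrete here is an assumption made for illustration and is not specified by the claims: SHA-256 as the hash, the in-memory `ToyStore` class standing in for the distributed storage system, the `blob-N` address format, and the helper names `sample_digests` and `fetch_verified`. The point it illustrates is that a purchaser who sampled digests before paying can check that the delivered data points match them.

```python
import hashlib
import random


def digest(data_point: bytes) -> str:
    """Hex digest of one serialized data point (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data_point).hexdigest()


class ToyStore:
    """Hypothetical in-memory stand-in for a distributed storage system."""

    def __init__(self):
        self._blobs = {}   # storage address -> data point bytes
        self._index = {}   # data point digest -> storage address

    def put(self, data_point: bytes) -> str:
        """Store a data point and return the digest the owner would publish."""
        d = digest(data_point)
        address = f"blob-{len(self._blobs)}"  # assumed address format
        self._blobs[address] = data_point
        self._index[d] = address
        return d

    def address_of(self, d: str) -> str:
        """Look up the storage address that corresponds to a digest."""
        return self._index[d]

    def get(self, address: str) -> bytes:
        return self._blobs[address]


def sample_digests(digests, proportion, seed):
    """Pick a preset proportion of the published digests (seeded for repeatability)."""
    k = max(1, int(len(digests) * proportion))
    return random.Random(seed).sample(digests, k)


def fetch_verified(store, sampled):
    """Fetch each sampled data point by address and reject any digest mismatch."""
    points = []
    for d in sampled:
        blob = store.get(store.address_of(d))
        if digest(blob) != d:
            raise ValueError(f"delivered data does not match digest {d[:12]}")
        points.append(blob)
    return points
```

In this sketch a mismatch raises `ValueError`, so data swapped after the evaluation phase is caught at delivery time. A real deployment would have to define how digests are published and how disputes are settled, which the sketch does not attempt.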

Description

Design method of trusted fair data transaction platform for machine learning data

Technical Field

The invention relates to the technical field of fair data transaction markets, and in particular to a design method of a trusted fair data transaction platform for machine learning data.

Background

As the value of data, and in particular of data for machine learning tasks, gains acceptance in industry, the concept of a data trading market is attracting growing attention. However, how to evaluate the quality of data without directly acquiring it is an unavoidable problem for such a market, since data quality directly influences the metrics of the model produced by the machine learning task, such as classification accuracy. The metrics of a model on a test data set are an important quantitative basis for evaluating the quality of its training data, so evaluating data quality before purchase requires model training and performance testing on a privacy-protected data set.

Conventional data trading markets typically rely on a centralized platform: all data offered for trading is uploaded to and displayed on the platform, and the platform evaluates the data accordingly. However, with growing attention to data privacy and the introduction of corresponding privacy protection legislation, handing plaintext data directly to a centralized platform that must be trusted unconditionally has become unacceptable.

Since joint learning is typically more time-consuming than plaintext machine learning in which the model and data are centralized, and data sets in the data trading market are typically large, performing model training and quality assessment over the entire data set would incur excessive time and transmission costs.

Homomorphic cryptography is a branch of cryptography that has developed rapidly in recent years.
The homomorphic property allows operations to be performed directly on ciphertext, with the result equal to the encryption of the result of the same operation on the plaintext. The CKKS cryptosystem supports approximate addition and multiplication of fixed-point numbers in ciphertext space, which suits machine learning systems that do not require exact computation. CKKS also supports vector operations, which leaves room for optimization through parallelization.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, the invention provides a design method of a data transaction platform for machine learning data. It adopts a decentralized transaction mode in which plaintext data need not be handed to a trusted third party, and the data purchaser (model owner) can cooperate directly with a data provider to evaluate data quality. Because quality assessment requires training on a privacy-protected data set, the method uses joint learning so that data privacy is not revealed to the assessing party during quality assessment. The invention also designs a protocol under which quality assessment is performed on only a randomly sampled portion of the data. At the same time, the invention must ensure that the data finally transacted is indeed the data set reflected by the quality assessment, so that the data provider cannot substitute counterfeit data.

Specifically, the invention applies to classification problems in machine learning and covers two data transaction modes. In one, the data owner holds both the data and the corresponding classification labels; in the other, the data owner holds the input data and the data purchaser holds the corresponding classification labels.
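
For the mode in which the data owner holds the inputs and the data purchaser holds the labels, the layer arrangement recited in claim 3 (a frozen layer, a data owner trainable layer, and a data purchaser trainable layer) can be illustrated with a small Python sketch. This is a toy under stated assumptions and not the patented implementation: the layer sizes, the `FROZEN` weights, the sigmoid output with cross-entropy loss, the learning rate, and all class and function names are hypothetical. The privacy computation method is omitted entirely, so the intermediate output and the gradient are exchanged in plaintext here.

```python
import math

# Frozen layer weights: fixed, never updated during training (toy values).
FROZEN = [[1.0, -1.0], [-1.0, 1.0]]


def frozen_layer(x):
    """ReLU(FROZEN @ x); applied before the data owner trainable layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN]


class OwnerLayer:
    """Trainable layer held by the data owner, who also holds the inputs."""

    def __init__(self):
        self.w = [0.1, -0.1]

    def forward(self, x):
        self.f = frozen_layer(x)
        # Intermediate output sent to the data purchaser.
        return sum(w * f for w, f in zip(self.w, self.f))

    def backward(self, grad_h, lr):
        # Update from the gradient propagated back by the purchaser.
        self.w = [w - lr * grad_h * f for w, f in zip(self.w, self.f)]


class PurchaserLayer:
    """Trainable layer held by the data purchaser, who also holds the labels."""

    def __init__(self):
        self.w = 0.5

    def forward(self, h):
        self.h = h
        return 1.0 / (1.0 + math.exp(-self.w * h))  # predicted probability

    def backward(self, p, y, lr):
        dz = p - y                  # cross-entropy gradient at the pre-activation
        grad_h = dz * self.w        # gradient returned to the data owner
        self.w -= lr * dz * self.h  # purchaser updates its own layer
        return grad_h


def train(inputs, labels, epochs=50, lr=0.1):
    """Run several rounds of split forward and backward passes."""
    owner, purchaser = OwnerLayer(), PurchaserLayer()
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(inputs, labels):
            p = purchaser.forward(owner.forward(x))
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            owner.backward(purchaser.backward(p, y, lr), lr)
        history.append(total / len(inputs))  # mean loss for this epoch
    return owner, purchaser, history


def accuracy(owner, purchaser, inputs, labels):
    """Fraction of test points whose thresholded prediction matches the label."""
    hits = sum(
        (purchaser.forward(owner.forward(x)) > 0.5) == bool(y)
        for x, y in zip(inputs, labels)
    )
    return hits / len(inputs)
```

Calling `train` on a handful of labelled points and then `accuracy` on held-out test points gives the kind of accuracy figure that the method uses as a data quality evaluation index. For brevity the sketch runs the test inputs through the same forward path as the training inputs.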
Another object of the present invention is to provide a design system of a trusted fair data transaction platform for machine learning data.

To achieve the above objects, in one aspect the present invention provides a design method of a trusted fair data transaction platform for machine learning data, comprising: acquiring a pre-training model of a data purchaser based on required data of a machine learning task; acquiring all data owners participating in transactions based on the required data, and training the pre-training model, using a preset privacy computation method, on a data set sampled at a preset sampling proportion from the data point digest values of all data owners, to obtain a test model; performing accuracy testing on the test model by using preset test set data, and obtaining a data quality evaluation index according to the accuracy test result; and selecting an optimal data owner according to the data quality evaluation index, and paying a price corresponding to the preset sampling proportion data points to the optimal data owner by utilizing a data transaction platform so as