CN-121509443-B - Data annotation crowdsourcing method and system based on blockchain and crowd wisdom


Abstract

The invention discloses a data annotation crowdsourcing method and system based on blockchain and crowd wisdom. The method comprises: publishing an annotation task and locking the reward; having trusted execution environment (TEE) nodes and annotators each register and stake a deposit; screening an annotator cluster based on multi-dimensional capability modeling; selecting a designated number of nodes from the TEE nodes through a random selection algorithm to form the executing TEE cluster; encrypting the annotation results of the annotator cluster and sending them to the selected TEE cluster; having each TEE node execute a predefined aggregation algorithm and revenue calculation in its secure environment and report the result, with candidate nodes taking over the execution if a node fails; and comparing the execution results, distributing revenue if they are consistent and re-executing if they are inconsistent until a consistent result is obtained. The invention screens out the annotator cluster with the greatest crowd wisdom and passes the annotation results through redundant aggregation and clearing by multiple randomly selected TEE nodes, so as to achieve high-quality, highly credible crowdsourced annotation.

Inventors

TAN HAIBO
FAN HAIDONG
ZHAO HE
XU JINLIN
LI XIAOFENG
NIU ZIXUAN
CHENG HAOTIAN
ZHOU TONG
SHENG NIANZU

Assignees

中国科学院合肥物质科学研究院
安徽中科晶格技术有限公司

Dates

Publication Date: 20260505
Application Date: 20260112

Claims (9)

1. A data annotation crowdsourcing method based on blockchain and crowd wisdom, characterized by comprising the following steps: S1, task release and preparation, wherein a party requiring annotated data publishes an annotation task and locks the reward; S2, group construction, wherein, based on multi-dimensional capability modeling, an annotator cluster whose capability reaches a threshold and whose crowd-wisdom potential is the greatest is screened from the participating annotators, and the annotation results of the annotator cluster are collected, the specific steps comprising: S21, the task publisher extracts part of a gold standard data set with known true labels to serve as a test set and sets an accuracy threshold on the test set, and each annotator first performs test annotation; S22, dimension reduction and standardization are performed on the high-dimensional embedded features to obtain a low-dimensional feature vector x for each instance; S23, for users whose annotation results reach the set accuracy threshold, a probabilistic graphical model is constructed based on the annotation results L and the low-dimensional feature vectors x in combination with signal detection theory, and an optimized annotator capability weight vector is obtained through alternating optimization of maximum a posteriori estimation and gradient ascent; S24, a group diversity index D is calculated based on the annotator capability weight vectors, and the annotator group with the largest diversity index D is screened out using a greedy algorithm; S3, node screening, wherein a designated number of nodes are selected from the registered TEE nodes through a random selection algorithm to form the TEE cluster that executes the current round of aggregation and clearing tasks; S4, trusted execution and candidate step, wherein the annotation results of the annotator cluster are encrypted and then sent to the TEE cluster selected in S3; S5, judging and settlement, wherein the execution results of all TEE nodes are compared; if the execution results are consistent, the final revenue distribution is carried out according to the results, and if they are inconsistent, a new execution round is triggered until a consistent result is obtained.
2. The method according to claim 1, wherein the signal detection theory in step S23 assumes that the label decision L of an annotator depends on whether a signal U exceeds a decision threshold τ, the signal U being composed of the capability weight vector and the feature vector x, together with noise n.
3. The method according to claim 1, wherein the diversity index D in step S24 is calculated from the sum of the covariances of the capability weight vector representing each annotator and the group capability mean vector.
4. The data annotation crowdsourcing method based on blockchain and crowd wisdom according to claim 1, wherein the random selection algorithm of the node screening step in step S3 is: defining a function that receives a block hash value seed, the total number M of TEE nodes, and the number ks of nodes to be selected; and cyclically computing hash values and applying a modulo operation, adding non-repeated node numbers to a set until ks nodes have been selected.
5. The data annotation crowdsourcing method based on blockchain and crowd wisdom according to claim 1, wherein in step S4 the annotation results are encrypted with the public key of the selected TEE node and forwarded by a relay contract, and the TEE node decrypts them, performs the computation in the secure environment, and then sends the result and a digital signature to a verification contract; wherein, for each sample j, the following data are obtained: the labels given by multiple annotators, and the capability weight vector of each annotator; for each candidate label y, a weighted score is calculated, and the final output label is the candidate label with the largest score among all candidate labels of sample j; wherein j is the index of the sample to be annotated and i is the annotator index, and the quantities involved are the annotation result of annotator i on sample j, the capability weight vector of annotator i, the norm of the capability weight vector of annotator i, the candidate labels of sample j, the score obtained by candidate label y for sample j, and the final label of sample j.
6. The data annotation crowdsourcing method based on blockchain and crowd wisdom according to claim 5, wherein in the trusted execution and candidate step of step S4, when an executing node is detected to have stopped serving or to have failed to report before the timeout, the same number of nodes are randomly drawn from the candidate node pool to take over the task.
7. The method according to claim 6, wherein for TEE nodes that fail to reconnect before the timeout or fail to send their results on time, a proportional part of their staked deposit is deducted, and the task publisher need not pay the computation service fee of those nodes.
8. The data annotation crowdsourcing method based on blockchain and crowd wisdom according to claim 1, wherein in step S5, if the execution results of the TEE nodes are inconsistent, the node screening step of S3 and the trusted execution and candidate step of S4 are re-entered until a consistent calculation result is obtained.
9. An annotation system employing the data annotation crowdsourcing method based on blockchain and crowd wisdom according to any one of claims 1 to 8, comprising: a group construction module, used for screening out, based on multi-dimensional capability modeling, the annotator cluster with the greatest crowd-wisdom potential from the registered annotators; a task publishing and registration module, used for the party requiring annotated data to publish annotation requirements and lock the reward, and for the TEE nodes and annotators to register and stake a deposit; a node selection module, used for selecting, from the registered TEE nodes through a random selection algorithm, the TEE node cluster that executes the current round of aggregation and clearing tasks; a data relay module, used for receiving the encrypted annotation results and forwarding them to the selected TEE nodes; a trusted execution module, deployed on the TEE nodes and used for decrypting the annotation results and executing the aggregation algorithm and revenue calculation in the secure environment; a verification and arbitration module, used for receiving and comparing the execution results of all TEE nodes, triggering the candidate mechanism, and arbitrating the final valid result; and a clearing and settlement module, used for automatically distributing the revenue of the annotators and the TEE nodes according to the final consistent aggregation and clearing result.
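The random selection algorithm of claim 4 can be illustrated with a minimal Python sketch. The claim fixes only the inputs (a block hash value seed, the total node number M, and the number ks of nodes to select) and the general procedure (cyclic hashing, modulo, no repeats). The choice of SHA-256, the re-hashing of the previous digest in each iteration, the 0-based node numbering, and the function name `select_tee_nodes` are assumptions made here for illustration and are not stated in the source.

```python
import hashlib
from typing import List


def select_tee_nodes(seed: bytes, m: int, ks: int) -> List[int]:
    """Pick ks distinct node numbers in [0, m) from a block-hash seed.

    Assumptions (not from the patent text): SHA-256 as the hash,
    chaining the digest between iterations, and 0-based numbering.
    """
    if not 0 < ks <= m:
        raise ValueError("ks must satisfy 0 < ks <= m")

    selected: List[int] = []
    seen = set()
    digest = seed
    while len(selected) < ks:
        # Hash the previous digest so every iteration yields a new value.
        digest = hashlib.sha256(digest).digest()
        # The modulo operation maps the hash value onto a node number.
        node = int.from_bytes(digest, "big") % m
        # Only non-repeated node numbers are added to the set.
        if node not in seen:
            seen.add(node)
            selected.append(node)
    return selected


if __name__ == "__main__":
    print(select_tee_nodes(b"example-block-hash", m=10, ks=4))
```

Because the output depends only on the seed, M and ks, any party that knows the block hash can recompute the selection and check it, which is presumably why a block hash is used as the seed.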
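The aggregation of claim 5 can be sketched in the same spirit. The claim names the ingredients (the label of annotator i on sample j, the norm of the capability weight vector of annotator i, a score per candidate label y, and a final label taken as the highest-scoring candidate), but the score formula itself did not survive in this text. The sketch below assumes the score of a candidate label is the sum of the weight-vector norms of the annotators who chose it. That formula, the dictionary layout, the alphabetical tie-break, and the name `aggregate_labels` are illustrative assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def aggregate_labels(
    labels: Dict[str, Dict[str, str]],
    weights: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """Weighted vote per sample.

    labels:  {sample j: {annotator i: label chosen by i for j}}
    weights: {annotator i: capability weight vector of i}
    Assumed score (not confirmed by the source): for each candidate
    label, the sum of the Euclidean norms of the weight vectors of
    the annotators who chose that label.
    """
    norms = {i: math.sqrt(sum(c * c for c in w)) for i, w in weights.items()}

    final: Dict[str, str] = {}
    for j, votes in labels.items():
        scores = defaultdict(float)
        for i, label in votes.items():
            scores[label] += norms[i]
        # Iterating over sorted labels makes ties resolve the same way
        # on every node, so redundant executions can be compared.
        final[j] = max(sorted(scores), key=scores.__getitem__)
    return final


if __name__ == "__main__":
    labels = {"s1": {"a": "cat", "b": "dog", "c": "dog"}}
    weights = {"a": [3.0, 4.0], "b": [3.0, 0.0], "c": [0.0, 4.0]}
    print(aggregate_labels(labels, weights))
```

A deterministic tie-break matters in this setting because step S5 compares the outputs of several TEE nodes; an aggregation that could resolve ties differently on different nodes would trigger needless re-execution rounds.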

Description

Data annotation crowdsourcing method and system based on blockchain and crowd wisdom

Technical Field

The invention relates to the technical field of data annotation, in particular to a data annotation crowdsourcing method and system based on blockchain and crowd wisdom.

Background

With the growing use of artificial intelligence technology, high-quality data annotation has become a bottleneck for model training. Supervised learning models require a large amount of labeled data before training, and these labels often have to be produced manually. Traditional manual annotation is costly, and the problem of low annotation quality still needs to be solved. Crowdsourcing platforms have gradually become an effective alternative for increasing annotation speed and reducing cost, but they bring new problems: annotation quality is uneven because the skills of annotators are uneven, and platforms and task publishers often cannot promptly disclose which annotated samples were judged correct or incorrect, so the rights and interests of annotators are easily infringed and data quality control becomes harder. Traditional quality control methods include using gold standard data sets to screen high-quality annotators, but such methods not only add extra annotation cost, the screening itself is also expensive. Crowd wisdom theory offers a new approach to the annotation quality problem, namely improving the accuracy of the overall judgment by aggregating diverse individuals, but it currently remains largely theoretical and is not widely used in practical scenarios; in particular, concrete technical means of implementation and reliable technical safeguards are lacking.

Trusted Execution Environment (TEE) technology provides a secure computing environment that isolates the execution of data and operations, thereby ensuring the security of the computing process and the reliability of the results. Although TEE technology performs well in terms of security, it still faces a series of challenges in practice, especially side-channel attacks at the physical level and vulnerability exploits at the software level. These security risks limit the adoption of TEE technology in scenarios with high security requirements, and a more secure and reliable technical scheme is needed to meet practical requirements. In summary, the prior art has obvious shortcomings in cost and quality control in the crowdsourced annotation process; crowd wisdom theory and TEE technology can in theory solve part of these problems, but effective means of implementation and safeguards are still lacking in practice.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art. To this end, a data annotation crowdsourcing method and system based on blockchain and crowd wisdom is adopted to solve the problems described in the background.
A data annotation crowdsourcing method based on blockchain and crowd wisdom comprises the following steps: S1, task release and preparation, wherein a party requiring annotated data publishes an annotation task and locks the reward; S2, group construction, wherein, based on multi-dimensional capability modeling, an annotator cluster whose capability reaches a threshold and whose crowd-wisdom potential is the greatest is screened from the registered annotators; S3, node screening, wherein a designated number of nodes are selected from the registered TEE nodes through a random selection algorithm to form the TEE cluster that executes the current round of aggregation and clearing tasks; S4, trusted execution and candidate step, wherein the annotation results of the annotator cluster are encrypted and then sent to the TEE cluster selected in S3; S5, judging and settlement, wherein the execution results of all TEE nodes are compared; if the execution results are consistent, the final revenue distribution is carried out according to the results, and if they are inconsistent, a new execution round is triggered until a consistent result is obtained. As a further scheme of the invention, the specific steps of step S2 comprise: S21, the task publisher extracts part of a gold standard data set with known true labels to serve as a test set and sets an accuracy threshold on the test set, and each annotator first performs test annotation; S22, dimension reduction and standardization are performed on the high-dimensional embedded features to obtain a low-dimensional feature vector x for each instance; S23, for users whose annotation results reach the set accuracy threshold, constructing a probabilistic graphical model based on the annotation results L and the low-dimensional feature vectors x in combination with signal detection theory, and obtaining an optimized labeling per