CN-122020292-A - Data labeling method and device

CN122020292A

Abstract

Embodiments of this specification disclose a data labeling method and device. First, any target sample in a first sample set to be labeled is processed by n first large models to obtain n corresponding labeling results, wherein each labeling result comprises a labeled category label and a corresponding labeling rationale. Next, the n labeling results are processed by a second large model to obtain a ranking of m candidate labels related to the n labeling results. Then, m confidence levels corresponding to the m candidate labels are determined, wherein any i-th confidence level is positively correlated with the ranking weight of the i-th candidate label determined based on the ranking, and negatively correlated with the perplexity with which the second large model generated the i-th candidate label. Finally, when the maximum of the m confidence levels is greater than a preset threshold, the candidate label corresponding to the maximum is determined as the final label of the target sample.
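As a rough illustrative sketch (not part of the patent text), the confidence computation described above can be mimicked as follows. The rank weights, per-token log-probabilities, and threshold are all invented placeholder values; the perplexity of each candidate label is derived from the second model's hypothetical per-token log-probabilities:

```python
import math

def confidence_scores(rank_weights, token_logprobs_per_label):
    # Perplexity of each candidate label: exp of the negative mean
    # per-token log-probability (standard definition).
    perplexities = [
        math.exp(-sum(lps) / len(lps)) for lps in token_logprobs_per_label
    ]
    # Raw score: positively related to ranking weight,
    # negatively related to perplexity.
    raw = [w / p for w, p in zip(rank_weights, perplexities)]
    total = sum(raw)
    return [r / total for r in raw]  # normalize to confidence levels

# m = 3 candidate labels as ranked by the second model
weights = [3, 2, 1]  # earlier ranking position -> higher weight
logprobs = [[-0.1, -0.2], [-0.5, -0.7], [-1.2, -1.0]]  # invented values
conf = confidence_scores(weights, logprobs)
best = max(range(len(conf)), key=conf.__getitem__)
threshold = 0.5  # the "preset threshold"
accept = conf[best] > threshold
```

Here the top-ranked, lowest-perplexity label dominates, so its normalized confidence clears the threshold and would be taken as the final label.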

Inventors

  • Ren Yanan
  • Mu Jiaming
  • Li Fengting

Assignees

  • 支付宝(杭州)数字服务技术有限公司 (Alipay (Hangzhou) Digital Services Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (11)

  1. A data labeling method, comprising: processing any target sample in a first sample set to be labeled by n first large models respectively, to obtain n corresponding labeling results, wherein each labeling result comprises a labeled category label and a corresponding labeling rationale; processing the n labeling results by a second large model to obtain a ranking of m candidate labels related to the n labeling results; determining m confidence levels corresponding to the m candidate labels, wherein any i-th confidence level is positively correlated with a ranking weight of the i-th candidate label determined based on the ranking, and negatively correlated with a perplexity with which the second large model generated the i-th candidate label; and when the maximum of the m confidence levels is greater than a preset threshold, determining the candidate label corresponding to the maximum as a final label of the target sample.
  2. The method of claim 1, wherein processing the target sample by the n first large models to obtain the n corresponding labeling results comprises: inputting a first prompt into each of the n first large models respectively, to obtain the n labeling results; wherein the first prompt comprises raw data of the target sample, a preset classification task description, and a plurality of preset category labels.
  3. The method of claim 1, wherein determining the m confidence levels corresponding to the m candidate labels comprises: determining m ranking weights corresponding to the m candidate labels, wherein the earlier the position of the i-th candidate label in the ranking, the higher the i-th ranking weight corresponding to the i-th candidate label; determining m perplexities corresponding to the m candidate labels, wherein the i-th perplexity is negatively correlated with the generation probability of the second large model for each token in the i-th candidate label; and determining the m confidence levels based on the m ranking weights and the m perplexities.
  4. The method of claim 3, wherein determining the m confidence levels based on the m ranking weights and the m perplexities comprises: determining an i-th confidence score based on the i-th ranking weight and the i-th perplexity; and normalizing the m confidence scores to obtain the m confidence levels.
  5. The method of claim 1, wherein determination of the first sample set comprises: classifying each sample in a second sample set by a third model to obtain a plurality of corresponding pre-classification labels, wherein the plurality of pre-classification labels belong to a plurality of preset category labels; classifying each sample into a plurality of sample subsets corresponding to the plurality of preset category labels based on the plurality of pre-classification labels; and performing stratified sampling on the plurality of sample subsets to obtain the first sample set.
  6. The method of claim 5, wherein classifying each sample in the second sample set by the third model to obtain the plurality of corresponding pre-classification labels comprises: for each sample, classifying the sample by the third model to obtain a plurality of classification probabilities corresponding to the plurality of preset category labels; and taking the preset category labels whose classification probabilities are greater than a probability threshold as the plurality of pre-classification labels.
  7. The method of claim 5, wherein performing stratified sampling on the plurality of sample subsets to obtain the first sample set comprises: performing the stratified sampling based on a desired data distribution, resulting in the first sample set.
  8. The method of claim 7, wherein after determining the candidate label corresponding to the maximum as the final label of the target sample, the method further comprises: constructing a corresponding labeled sample from the raw data of the target sample and the final label, and adding the labeled sample to a labeled sample set; counting a data distribution of the labeled sample set, and performing incremental sampling based on the plurality of sample subsets when the data distribution does not match the desired data distribution; and labeling samples based on the incremental sampling to supplement the labeled sample set.
  9. A data labeling device, comprising: a multi-model labeling module configured to process any target sample in a first sample set to be labeled by n first large models to obtain n corresponding labeling results, wherein each labeling result comprises a labeled category label and a corresponding labeling rationale; a comprehensive analysis module configured to process the n labeling results by a second large model to obtain a ranking of m candidate labels related to the n labeling results; a confidence determining module configured to determine m confidence levels corresponding to the m candidate labels, wherein any i-th confidence level is positively correlated with a ranking weight of the i-th candidate label determined based on the ranking, and negatively correlated with a perplexity with which the second large model generated the i-th candidate label; and a confidence diversion module configured to, when the maximum of the m confidence levels is greater than a preset threshold, determine the candidate label corresponding to the maximum as a final label of the target sample.
  10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-8.
  11. A computing device comprising a memory and a processor, wherein the memory stores executable code which, when executed by the processor, implements the method of any one of claims 1-8.
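To make the end-to-end flow of the claims above concrete, here is a minimal, self-contained sketch (not part of the patent text). The "first large models" and "second large model" are replaced by invented stub functions, and the perplexity term is omitted so the example stays runnable; every label, rationale, and number is hypothetical:

```python
from collections import Counter

# Hypothetical stand-ins for the n "first large models": each labeling
# result is a (category label, labeling rationale) pair.
def stub_first_models(sample):
    return [("spam", "contains a promotional link"),
            ("spam", "urgent, pushy tone"),
            ("ham", "resembles a known sender")]

# Hypothetical stand-in for the "second large model": here it simply
# ranks candidate labels by vote count among the n labeling results.
def stub_second_model(results):
    counts = Counter(label for label, _ in results)
    return [label for label, _ in counts.most_common()]

def label_sample(sample, threshold=0.5):
    results = stub_first_models(sample)    # n labeling results
    ranking = stub_second_model(results)   # m candidate labels, ranked
    m = len(ranking)
    weights = [m - i for i in range(m)]    # earlier position -> higher weight
    total = sum(weights)
    conf = [w / total for w in weights]    # normalized; perplexity term omitted
    best = conf.index(max(conf))
    # Accept only when the maximum confidence clears the preset threshold;
    # otherwise the sample would be diverted (e.g. to manual review).
    return ranking[best] if conf[best] > threshold else None

result = label_sample("Act now to claim your prize!!!")
```

In this toy run, "spam" is ranked first with confidence 2/3, which exceeds the 0.5 threshold, so it becomes the final label.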

Description

Data labeling method and device

Technical Field

One or more embodiments of the present specification relate to the field of computer technology, and more particularly to a data labeling method and apparatus, a computer-readable storage medium, and a computing device.

Background

With the rapid development of artificial intelligence technology, and in particular the wide application of large models across many fields, the need for high-quality training and optimization data is increasingly urgent. In the model training stage, data quality directly determines the upper limit of the model's learning effect, and high-quality annotation data can markedly improve the performance of a large model on specific tasks. After the model is deployed online, labeling and feedback on the model's outputs become key links in continuous iterative optimization. By constructing a closed-loop output-labeling-feedback optimization mechanism, the generation results of the large model can be evaluated and corrected, so that model parameters and generation strategies are continuously adjusted and output quality gradually improves. Therefore, both in the initial training stage and in the subsequent continuous optimization process, high-quality data labeling is a necessary condition for the large model to reach its full performance, and its importance grows as large model technology develops. However, conventional data labeling schemes face many limitations in practice. A new labeling scheme is therefore urgently needed, one that can meet higher practical requirements, for example guaranteeing labeling efficiency and labeling quality simultaneously.

Disclosure of Invention

Embodiments of this specification describe a data labeling method and device that can solve the above technical problems. According to a first aspect, a data labeling method is provided.
The method comprises: processing any target sample in a first sample set to be labeled by n first large models respectively, to obtain n corresponding labeling results, wherein each labeling result comprises a labeled category label and a corresponding labeling rationale; processing the n labeling results by a second large model to obtain a ranking of m candidate labels related to the n labeling results; determining m confidence levels corresponding to the m candidate labels, wherein any i-th confidence level is positively correlated with the ranking weight of the i-th candidate label determined based on the ranking, and negatively correlated with the perplexity with which the second large model generated the i-th candidate label; and, when the maximum of the m confidence levels is greater than a preset threshold, determining the candidate label corresponding to the maximum as the final label of the target sample.

In one embodiment, processing the target sample by the n first large models to obtain the n corresponding labeling results comprises inputting a first prompt into each of the n first large models respectively, to obtain the n labeling results, wherein the first prompt comprises raw data of the target sample, a preset classification task description, and a plurality of preset category labels.

In one embodiment, determining the m confidence levels corresponding to the m candidate labels comprises: determining m ranking weights corresponding to the m candidate labels, wherein the earlier the position of the i-th candidate label in the ranking, the higher the i-th ranking weight corresponding to the i-th candidate label; determining m perplexities corresponding to the m candidate labels, wherein the i-th perplexity is negatively correlated with the generation probability of the second large model for each token in the i-th candidate label; and determining the m confidence levels based on the m ranking weights and the m perplexities. Further, in a specific embodiment, determining the m confidence levels based on the m ranking weights and the m perplexities comprises determining an i-th confidence score based on the i-th ranking weight and the i-th perplexity, and normalizing the m confidence scores to obtain the m confidence levels.

In one embodiment, determination of the first sample set comprises: classifying each sample in a second sample set by a third model to obtain a plurality of corresponding pre-classification labels, wherein the plurality of pre-classification labels belong to a plurality of preset category labels; classifying each sample into a plurality of sample subsets corresponding to the plurality of preset category labels based on the plurality of pre-classification labels; and performing stratified sampling on the plurality of sample subsets to obtain the first sample set.
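The pre-classification and stratified sampling steps used to build the first sample set can be sketched as follows (an illustrative assumption, not the patent's implementation). The labels, probabilities, probability threshold, and desired distribution are all invented:

```python
import random

# Probability-threshold pre-classification: keep every preset category
# label whose classification probability exceeds the threshold, so a
# sample may receive several pre-classification labels.
def pre_classify(probs, prob_threshold=0.3):
    return [label for label, p in probs.items() if p > prob_threshold]

# Stratified sampling over per-label sample subsets, driven by a
# desired data distribution over the preset category labels.
def stratified_sample(subsets, desired_dist, total):
    rng = random.Random(0)  # fixed seed for a reproducible sketch
    first_set = []
    for label, frac in desired_dist.items():
        k = min(round(frac * total), len(subsets[label]))
        first_set.extend(rng.sample(subsets[label], k))
    return first_set

labels = pre_classify({"news": 0.55, "sports": 0.35, "finance": 0.10})
subsets = {"news": list(range(100)), "sports": list(range(100, 150))}
first_set = stratified_sample(subsets, {"news": 0.6, "sports": 0.4}, total=50)
```

With a 60/40 desired distribution over 50 samples, the sketch draws 30 "news" and 20 "sports" samples, yielding a first sample set whose composition matches the target distribution.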