CN-122020148-A - Man-machine collaborative data labeling method based on large model
Abstract
The invention discloses a man-machine collaborative data labeling method based on a large model. The method selects representative and diverse samples using an active learning strategy, performs zero-shot or few-shot labeling of the selected samples with a large language model to generate an initial labeling database, and continuously refines the labeling results through multiple rounds of iteration that combine automatic labeling by the large language model with manual revision. The method starts efficiently from only a small amount of initial data, significantly reducing the manual labeling workload while improving the accuracy and consistency of the labeling results.
Inventors
- WU WEIZE
- SHI LEI
- YANG YI
- CHEN ZIHAN
- LIU XUENING
Assignees
- 杭州新才智脑科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-09
Claims (8)
- 1. A man-machine collaborative data labeling method based on a large model, characterized by comprising the following steps: Step 1: acquire a target data set and, using an active learning strategy, select part of its data to form an initial sample set; perform zero-shot initial-round labeling of the initial sample set with a large language model to obtain initial labeling results for the initial samples; then, based on a RAG retrieval-augmentation algorithm, iteratively refine the initial labeling results using samples from the initial sample set as few-shot examples. Step 2: taking the current initial sample set as the retrieval source, for each piece of data to be processed among the remaining data of the target data set, rank the samples in the initial sample set by similarity using the RAG retrieval-augmentation algorithm, and construct a few-shot example set from the K samples with the highest similarity scores; input the few-shot example set as a prompt to the large language model for in-context learning, re-label the data to be processed with the large language model after in-context learning, revise the re-labeling result manually, and add the revised data to the initial sample set, thereby completing one round of man-machine joint labeling. Step 3: repeat Step 2 until the iteration termination condition is met, yielding the final labeling database.
- 2. The man-machine collaborative data labeling method based on a large model according to claim 1, wherein the active learning strategy selects samples with high information content and large differences from the already-selected samples by computing each sample's representativeness and coverage.
- 3. The man-machine collaborative data labeling method based on a large model according to claim 1, wherein before the initial sample set is constructed, the target data set is batched and prioritized as follows: compute a representativeness index and a diversity index for each sample using the active learning algorithm, and generate a priority order from these two indices; based on the priority order, assign samples that are adjacent in the order to the same batch, forming a batch sequence; after batching is complete, batches with higher priority are iteratively processed first.
- 4. The man-machine collaborative data labeling method based on the large model according to claim 3, wherein the active learning algorithm is implemented as follows: let the complete data set be D, the labeled subset be L, and the unlabeled subset be U; the active learning algorithm evaluates the labeling value of each candidate sample x in the unlabeled subset U to determine its contribution to expanding the labeling database; it then moves the sample with the highest labeling value from the unlabeled subset U to the labeled subset L; this process is repeated until the labeled subset reaches the expected size.
- 5. The man-machine collaborative data labeling method based on the large model according to claim 1, wherein the RAG retrieval-augmentation algorithm is implemented as follows: let the labeled initial sample set be S = {(x_i, y_i)} and the input data be x; the retrieval process scores the input data x against each initial sample x_i using a similarity function sim(x, x_i) and sorts the results from high to low by score; the top K samples in the sorted results constitute the few-shot example set.
- 6. The man-machine collaborative data labeling method based on a large model according to claim 5, wherein the similarity function includes a sparse-vector-based representation and a dense semantic-vector-based representation.
- 7. The man-machine collaborative data labeling method based on a large model according to claim 1, wherein the termination condition comprises: a. the number of manual revisions in the current round is insufficient to support a further round of valid updates, i.e., falls below the set minimum revision threshold; b. when the current round count reaches the upper limit defined by an exponential decay function, the iteration is forcibly terminated.
- 8. The man-machine collaborative data labeling method based on a large model according to claim 1 or 7, wherein the termination condition is expressed as: r_t < θ, or t ≥ T_min + (T_0 − T_min)·e^(−λb); wherein t represents the round count within the current batch; r_t represents the number of manual revisions in round t; θ is the minimum revision threshold; (T_0 − T_min) represents the difference between the initial number of rounds and the minimum number of rounds; λ is the exponential decay constant; b represents the batch index; and T_min represents the minimum number of rounds.
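Claims 2-4 describe selecting initial samples by balancing representativeness against diversity. A minimal greedy-selection sketch in Python follows; the equal 0.5 weighting, the use of cosine similarity, and the function name are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def select_initial_samples(embeddings, budget):
    """Greedy active-learning selection sketch.

    Representativeness is a sample's mean similarity to the whole pool;
    diversity penalizes similarity to already-selected samples. The 0.5/0.5
    weighting is an assumption, not specified by the patent."""
    # Cosine-normalize embeddings so dot products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                          # pairwise similarity matrix
    representativeness = sim.mean(axis=1)  # mean similarity to the pool
    selected = []
    for _ in range(budget):
        if selected:
            # Distance to the closest already-selected sample.
            diversity = 1.0 - sim[:, selected].max(axis=1)
        else:
            diversity = np.ones(len(X))
        score = 0.5 * representativeness + 0.5 * diversity
        score[selected] = -np.inf          # never re-select a sample
        selected.append(int(score.argmax()))
    return selected
```

Repeating the selection until the budget is exhausted mirrors claim 4's loop of moving the highest-value sample from the unlabeled subset U to the labeled subset L.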
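Claims 5-6 rank the labeled initial samples against each input with a similarity function and take the top K as few-shot examples. A dense-vector sketch (cosine similarity; a sparse BM25-style scorer would slot in the same way, and the prompt format and helper names are assumed):

```python
import numpy as np

def top_k_examples(query_vec, sample_vecs, samples, k=3):
    """Score the input against each labeled initial sample sim(x, x_i),
    sort from high to low, and return the top-K (x_i, y_i) pairs."""
    q = query_vec / np.linalg.norm(query_vec)
    S = sample_vecs / np.linalg.norm(sample_vecs, axis=1, keepdims=True)
    scores = S @ q                          # cosine similarity per sample
    order = np.argsort(scores)[::-1][:k]    # highest similarity first
    return [samples[i] for i in order]

def build_prompt(task, examples, query_text):
    """Assemble a few-shot prompt for the labeling LLM (format assumed)."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{task}\n{shots}\nInput: {query_text}\nLabel:"
```

The returned prompt carries the retrieved examples as in-context demonstrations, matching Step 2's few-shot re-labeling of each piece of data to be processed.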
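Claims 7-8 stop the iteration when manual revisions fall below a threshold or when the round count reaches an exponentially decaying per-batch cap. A sketch of that check, with symbol names assumed to match the claim-8 description (t, r_t, θ, T_0, T_min, λ, b):

```python
import math

def should_terminate(t, revisions_t, theta, t0, t_min, lam, b):
    """Termination check sketch: stop when manual revisions r_t in round t
    fall below threshold theta, or when t reaches the per-batch cap
    T_min + (T_0 - T_min) * exp(-lam * b). Names are assumptions."""
    cap = t_min + (t0 - t_min) * math.exp(-lam * b)
    return revisions_t < theta or t >= cap
```

Because the cap decays with the batch index b, later (lower-priority) batches are allotted fewer rounds, consistent with claim 3's priority-ordered batch processing.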
Description
Man-machine collaborative data labeling method based on large model
Technical Field
The invention belongs to the technical field of artificial intelligence and data annotation, and particularly relates to a man-machine collaborative data annotation method based on a large model.
Background
In recent years, artificial intelligence technology has developed rapidly, and intelligent algorithms have been widely applied across many fields. The large language model (Large Language Model, LLM), with its huge parameter scale and powerful natural language understanding and generation capability, has become a prominent representative of artificial intelligence research and application. Such models can serve a variety of downstream tasks, including information extraction, text classification, question answering, dialogue generation, and data annotation, and show excellent performance on complex natural language processing tasks. In the field of data annotation, large models are increasingly applied to assist or replace manual annotation, significantly reducing labor costs and accelerating the application of data in downstream tasks. However, labeling that relies only on large language models still has limitations. On the one hand, the model is prone to errors when processing technical terms, whose meaning it cannot correctly understand without explanation; on the other hand, the large model's ability to follow complex labeling rules is limited, so labeling accuracy can be low.
Patent document CN120319413A discloses a data labeling system based on AI collaboration, comprising a data preprocessing module, a multi-model assisted labeling module, an AI server, a man-machine collaboration module, a quality evaluation feedback module, and a scheduling module. The data preprocessing module acquires basic sample data, processes it, and outputs it to the multi-model assisted labeling module to generate a pre-labeling result; the scheduling module, according to the sample data volume, schedules the AI server to precisely label the pre-labeling result using the man-machine collaboration module; and the quality evaluation feedback module acquires the precise labeling result and outputs it after manual spot-checking. Patent document CN120086754A discloses a man-machine collaborative data labeling and cleaning method based on a multi-modal large language model: text prompt words of a data labeling instruction are obtained; text features of the prompt words are extracted by a text encoder; data features associated with the labeling objects in the instruction are extracted by a feature extraction module and mapped to the text space to obtain data feature text; the text features and the data feature text are input into the large language model to understand the data to be labeled and generate the corresponding data labeling information; the data labeling module completes the labeling of the data to be labeled to obtain algorithm labeling results; these results are displayed to workers, accuracy judgments of the algorithm labeling results are obtained, and the algorithm labeling results are cleaned based on those judgments to obtain the final data labeling results.
Disclosure of Invention
The invention aims to provide a man-machine collaborative data labeling method based on a large model that can start efficiently from only a small amount of initial data, significantly reduce the manual labeling workload, and improve the accuracy and consistency of labeling results. To achieve this purpose, the man-machine collaborative data labeling method based on the large model comprises the following steps: Step 1: acquire a target data set and, using an active learning strategy, select part of its data to form an initial sample set; perform zero-shot initial-round labeling of the initial sample set with a large language model to obtain initial labeling results for the initial samples; then, based on a RAG retrieval-augmentation algorithm, iteratively refine the initial labeling results using samples from the initial sample set as few-shot examples, so as to improve the labeling consistency of the initial sample set. Step 2: taking the current initial sample set as the retrieval source, for each piece of data to be processed among the remaining data of the target data set, rank the samples in the initial sample set by similarity using the RAG retrieval-augmentation algorithm, and construct a few-shot example set from the K samples with the highest similarity scores
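The iterative loop of Steps 1-3 described above can be sketched end-to-end as follows; every callable here (LLM labeling, manual revision, active-learning selection, RAG retrieval) is a hypothetical stand-in, and the per-round batch size of 2 is an arbitrary assumption for illustration:

```python
def human_machine_labeling(data, select_fn, retrieve_fn, llm_label,
                           human_revise, min_revisions=1, max_rounds=5):
    """Sketch of the man-machine joint labeling loop (helpers assumed)."""
    # Step 1: active-learning selection, then zero-shot initial labeling.
    labeled = {i: llm_label(data[i], shots=[]) for i in select_fn(data)}
    pending = [i for i in range(len(data)) if i not in labeled]
    rounds = 0
    while pending and rounds < max_rounds:
        batch, pending = pending[:2], pending[2:]   # batch size assumed
        revised = 0
        for i in batch:
            shots = retrieve_fn(data[i], labeled)   # Step 2: RAG few-shot
            label = llm_label(data[i], shots=shots) # in-context re-labeling
            final = human_revise(data[i], label)    # manual revision
            revised += (final != label)
            labeled[i] = final                      # grow retrieval source
        rounds += 1
        if revised < min_revisions:                 # Step 3: termination
            break
    return labeled
```

Updating `labeled` inside the loop is what makes each round's retrieval source richer than the last, which is the core of the multi-round man-machine iteration.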