CN-121997042-A - Data labeling system, method, medium and terminal based on large language model

Abstract

The application provides a data labeling system, method, medium and terminal based on a large language model. The system comprises a data access layer, an intelligent labeling engine layer, large language models and a quality control layer. The data access layer collects data to be labeled and preprocesses it to convert it into a unified data format. The intelligent labeling engine layer retrieves vertical-domain context information for the received data to be labeled in the unified data format, splices the retrieved context information with the data to be labeled, and inputs the spliced result into a plurality of pre-trained large language models. The large language models generate labeling results for the data to be labeled. The quality control layer monitors the quality of the labeling results generated by the large language models and performs feedback analysis according to the monitored labeling quality problems. The application can automate the full process from acquiring the data to be labeled to generating the labeling results, integrates vertical-domain knowledge, strictly controls the quality of the labeling results, and continuously optimizes the process through feedback.
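
The flow described in the abstract (normalize, retrieve vertical-domain context, splice it into a prompt, query several models) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the application: the `Record` type, the `PROMPT_TEMPLATE` string, the keyword-matching retrieval, and the models represented as plain callables.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt template: {context} and {data} mark the designated
# positions where the spliced inputs are placed.
PROMPT_TEMPLATE = "Domain context:\n{context}\n\nData to label:\n{data}\n\nLabel:"


@dataclass
class Record:
    """Assumed unified data format: a source name plus cleaned text."""
    source: str
    text: str


def normalize(raw: dict) -> Record:
    """Data access layer: convert a raw record into the unified format."""
    return Record(source=str(raw.get("source", "unknown")),
                  text=str(raw.get("text", "")).strip())


def retrieve_context(record: Record, knowledge_base: dict[str, str]) -> str:
    """Toy retrieval: return knowledge-base entries whose key occurs in the text."""
    text = record.text.lower()
    hits = [entry for key, entry in knowledge_base.items() if key.lower() in text]
    return "\n".join(hits)


def build_prompt(record: Record, context: str) -> str:
    """Splice the retrieved context and the data into the preset prompt."""
    return PROMPT_TEMPLATE.format(context=context, data=record.text)


def label(raw: dict, knowledge_base: dict[str, str],
          models: list[Callable[[str], str]]) -> list[str]:
    """Run one record through normalization, retrieval and every model."""
    record = normalize(raw)
    prompt = build_prompt(record, retrieve_context(record, knowledge_base))
    return [model(prompt) for model in models]
```

In a real deployment each callable would wrap a call to an actual model, and retrieval would query a domain knowledge base rather than match keywords; the sketch only shows how the pieces connect.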

Inventors

Cai Meijie
Deng Chengjing

Assignees

北京点富科技有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2025-12-15

Claims (10)

1. A data labeling system based on a large language model, comprising: a data access layer configured to collect data to be labeled and preprocess the data to be labeled so as to convert it into a unified data format; an intelligent labeling engine layer configured to retrieve vertical-domain context information for the received data to be labeled in the unified data format, splice the retrieved context information with the data to be labeled, and input the spliced result into a plurality of pre-trained large language models; and a quality control layer configured to monitor the quality of the labeling results generated by the plurality of large language models and to perform feedback analysis according to the monitored labeling quality problems.
2. The data labeling system based on a large language model of claim 1, wherein the data access layer comprises: a multimodal data acquisition unit configured to acquire data to be labeled from a plurality of heterogeneous data sources based on a unified data access interface; a data preprocessing unit configured to handle missing values, outliers and noisy data in the acquired data to be labeled, and to fuse and align the data to be labeled across modalities; and a data normalization unit configured to convert the preprocessed data to be labeled into a unified data format so as to construct normalized data.
3. The data labeling system based on a large language model of claim 1, wherein the intelligent labeling engine layer comprises: a context information retrieval unit configured to retrieve context information related to the data to be labeled from a vertical-domain knowledge base according to configured business rules, and to splice the retrieved context information with the data to be labeled; a model input data construction unit configured to add the spliced context information and data to be labeled at a designated position in a preset prompt, so as to form model input data for input into a plurality of large language models; and a labeling generation unit in which the large language models, based on the Transformer architecture, understand the semantic associations among the multimodal data to be labeled and generate labeling results for the data to be labeled according to the model input data.
4. The data labeling system based on a large language model of claim 3, wherein the intelligent labeling engine layer further comprises: an interactive labeling interface configured to provide tools for multiple labeling types and to display the labeling results generated by the large language models in real time; a labeling rule engine unit configured to preset labeling rules for multiple scenarios so as to form a labeling rule template library; and an active learning unit configured to identify data samples for which model prediction uncertainty is high and to route them preferentially to human annotators for labeling.
5. The data labeling system based on a large language model of claim 1, wherein the quality control layer comprises: a multi-stage quality inspection unit configured to inspect the generated labeling results layer by layer according to the preset quality inspection rules of each layer, based on a multi-stage layered inspection mechanism; a consistency checking unit configured to compare the labeling results generated by the plurality of large language models and to identify the differences between them; an anomaly detection unit configured to detect outliers in the generated labeling results so as to identify anomalous labeling results; and a quality feedback unit configured to feed back to the large language models the layer-by-layer inspection results output by the multi-stage quality inspection unit, the labeling differences output by the consistency checking unit and the anomalous labeling results output by the anomaly detection unit, so as to fine-tune the large language models.
6. The data labeling system based on a large language model of claim 1, further comprising a system monitoring layer comprising: a performance monitoring unit configured to track the operating metrics of the full link in real time; a log management unit configured to record the events occurring in each operation of the full link and to classify and store all recorded events by log level; a security audit unit configured to perform access control based on identity authentication information and to set secondary verification rules for sensitive operations; and an alarm notification unit configured to raise alarms based on preset multi-level alarm rules and to notify operation and maintenance personnel of the alarms through multiple channels.
7. The data labeling system based on a large language model of claim 1, further comprising a workflow management layer comprising: a task allocation unit configured to allocate tasks according to task difficulty, data type and urgency, and to assign tasks by priority based on a task priority queue; a progress tracking unit configured to display the progress of each labeling task in real time and to visualize task progress in the form of statistical reports and charts; a resource scheduling unit configured to dynamically allocate computing resources according to the length and priority of the labeling tasks; and a version control unit configured to create a version number for each item of data to be labeled and each labeling result, and to record version change data.
8. A data labeling method based on a large language model, comprising: collecting data to be labeled, and preprocessing the data to be labeled so as to convert it into a unified data format; retrieving vertical-domain context information for the received data to be labeled in the unified data format, splicing the retrieved context information with the data to be labeled, and inputting the spliced result into a plurality of pre-trained large language models, wherein the large language models are used to generate labeling results for the data to be labeled; and monitoring the quality of the labeling results generated by the plurality of large language models, and performing feedback analysis according to the monitored labeling quality problems.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method as claimed in claim 8.
10. An electronic terminal comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method as claimed in claim 8.
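
The consistency check of claim 5 and the uncertainty-based routing of claim 4 can be sketched together, under assumptions that are not taken from the application: disagreement among the models is measured with the Shannon entropy of their votes, the majority label is kept as the provisional result, and the `entropy_threshold` value of 0.9 bits is an arbitrary illustrative choice.

```python
import math
from collections import Counter


def vote_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of the label distribution across models."""
    total = len(labels)
    return sum((count / total) * math.log2(total / count)
               for count in Counter(labels).values())


def check_consistency(labels: list[str], entropy_threshold: float = 0.9) -> dict:
    """Compare the labels produced by several models for one sample.

    Returns the majority label, the share of models that agree with it,
    the vote entropy, and a flag that routes the sample to a human
    annotator when the models disagree too much.
    """
    if not labels:
        raise ValueError("at least one model label is required")
    majority, votes = Counter(labels).most_common(1)[0]
    entropy = vote_entropy(labels)
    return {
        "label": majority,
        "agreement": votes / len(labels),
        "entropy": entropy,
        "needs_human": entropy > entropy_threshold,
    }
```

With three models, a unanimous vote has zero entropy and is accepted automatically, while a two-to-one split has an entropy of about 0.918 bits and is flagged for manual labeling under this threshold. Other uncertainty measures, such as model-reported confidence, could be substituted without changing the structure.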

Description

Data labeling system, method, medium and terminal based on large language model

Technical Field

The application relates to the technical field of data labeling, and in particular to a data labeling system, method, medium and terminal based on a large language model.

Background

With the rapid development of artificial intelligence technology, data labeling has become core infrastructure for the AI industry. Through operations such as feature extraction, classification, annotation and tagging, data labeling converts human knowledge and reasoning into a form that computers can recognize, and it is a key step in building high-quality datasets for artificial intelligence. The industry is currently developing along the lines of "technology-driven, quality-first, deep vertical specialization", marking a full shift from the traditional mode toward intelligent, specialized and large-scale operation. However, the traditional mode of manual labeling managed in Excel, on which the industry has long relied, copes poorly with the massive, highly complex labeling demands of large model training: labeling efficiency is low, labor costs are high, and labeling quality is uneven because annotators interpret the labeling rules differently.
To overcome these shortcomings of manual labeling, AI-assisted labeling technology has emerged, but existing automatic labeling technology still has the following limitations:
(1) Limited automation. The accuracy of current AI labeling follows a "smile curve": it can reach 99.5% in closed scenarios such as face bounding boxes and license plates, but drops sharply to below 70% in open-world settings. Conventional automatic tools can handle only simple tasks and provide insufficient support for requirements such as complex semantic understanding and cross-modal association labeling.
(2) Lack of vertical-industry specificity. General-purpose AI labeling tools struggle to meet the professional requirements of vertical industries. Labeling in fields such as healthcare and finance requires understanding both domain expertise and algorithm logic, and existing tools lack the ability to integrate domain knowledge, so feature extraction accuracy is insufficient.
(3) Imperfect quality control mechanism. Existing automatic labeling systems lack an effective quality assurance system and cannot trace a surface anomaly intelligently to its underlying root cause. When a labeling quality problem is detected, it is difficult to locate the root cause quickly and to propose optimization suggestions.
Therefore, it is necessary to provide a data labeling system, method, medium and terminal based on a large language model to solve the above problems in the prior art.
Disclosure of Invention

In view of the above drawbacks of the prior art, the present application aims to provide a data labeling system, method, medium and terminal based on a large language model, to solve the technical problems in the prior art of limited automation, lack of vertical-industry specificity and an imperfect quality control mechanism.

To achieve the above and other related objects, a first aspect of the present application provides a data labeling system based on a large language model, which includes: a data access layer configured to collect data to be labeled and preprocess the data to be labeled so as to convert it into a unified data format; an intelligent labeling engine layer configured to retrieve vertical-domain context information for the received data to be labeled in the unified data format, splice the retrieved context information with the data to be labeled, and input the spliced result into a plurality of pre-trained large language models, wherein the large language models are configured to generate labeling results for the data to be labeled; and a quality control layer configured to monitor the quality of the labeling results generated by the plurality of large language models and perform feedback analysis according to the monitored labeling quality problems.

In some embodiments of the first aspect of the present application, the data access layer includes: a multimodal data acquisition unit configured to acquire data to be labeled from a plurality of heterogeneous data sources based on a unified data access interface; a data preprocessing unit configured to handle missing values, outliers and noisy data in the acquired data to be labeled, and to fuse and align the data to be labeled across modalities; and a data normalization unit configured to convert the preprocessed data to be labeled into a unified data format, so as to co