CN-116049351-B - Data labeling method, device, system and storage medium


Abstract

The invention provides a data labeling method, device, system, and storage medium in the field of data labeling. The method comprises: S1, importing original training data and labeled data; S2, constructing an original labeling model and training it on the original training data and the labeled data to obtain a first labeling model; S3, importing unlabeled training data and predicting it with the first labeling model to obtain prediction data; and S4, analyzing the first labeling model according to the unlabeled training data and the prediction data to obtain a second labeling model. The method allows a pre-labeling model for a target field to be trained without a large number of manually labeled text samples, which greatly reduces the manual labeling workload, lowers the cost of data labeling, and improves the accuracy of the model.

Inventors

LU JINLONG
TAO YONG
ZHOU JIN
TANG MINQIN
JIANG TAI
QIN ZIXIN

Assignees

广西瀚特信息产业股份有限公司

Dates

Publication Date: 20260508
Application Date: 20221228

Claims (8)

1. A data labeling method, characterized by comprising the following steps:
S1, importing a plurality of original training data and labeled data corresponding one-to-one to the original training data, wherein the labeled data is labeled sample text;
S2, constructing an original labeling model, and training the original labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a first labeling model;
S3, importing a plurality of unlabeled training data, and predicting each unlabeled training data according to the first labeling model to obtain predicted data of each unlabeled training data, wherein the unlabeled training data is unlabeled sample text;
S4, analyzing the first labeling model according to the plurality of unlabeled training data and the predicted data of the plurality of unlabeled training data to obtain a second labeling model;
S5, performing model analysis on the second labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a third labeling model;
S6, importing data to be tested, and labeling the data to be tested according to the third labeling model to obtain a data labeling result;
wherein step S4 comprises:
S41, acquiring an initial quantity of unlabeled training data, and counting all the unlabeled training data to obtain the quantity of unlabeled training data;
S42, summing the initial quantity of unlabeled training data and the quantity of unlabeled training data to obtain the total quantity of unlabeled training data;
S43, judging whether the total quantity of unlabeled training data is greater than or equal to a preset first quantity; if not, executing S44; if so, taking the first labeling model as the second labeling model; and
S44, training the first labeling model according to the plurality of unlabeled training data and the predicted data of the plurality of unlabeled training data to obtain a fourth labeling model, taking the total quantity of unlabeled training data as the new initial quantity of unlabeled training data, taking the fourth labeling model as the new first labeling model, and returning to S3.
2. The data labeling method according to claim 1, wherein step S2 comprises: constructing a text convolutional neural network, and training the text convolutional neural network according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain the first labeling model.
3. The data labeling method according to claim 1, wherein step S5 comprises: labeling each original training data with the second labeling model to obtain pre-labeled data of each original training data; verifying whether the pre-labeled data of each original training data is the same as the labeled data corresponding to that original training data, and counting the successful verifications to obtain a total number of successful verifications; and judging whether the total number of successful verifications is greater than a preset second total number; if not, taking the second labeling model as a new original labeling model and returning to step S1; if so, taking the second labeling model as the third labeling model.
4. A data labeling device, comprising:
a data importing module, used for importing a plurality of original training data and labeled data corresponding one-to-one to the original training data, wherein the labeled data is labeled sample text;
a model training module, used for constructing an original labeling model, and training the original labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a first labeling model;
a prediction module, used for importing a plurality of unlabeled training data, and predicting each unlabeled training data according to the first labeling model to obtain predicted data of each unlabeled training data, wherein the unlabeled training data is unlabeled sample text;
an analysis module, used for analyzing the first labeling model according to the plurality of unlabeled training data and the predicted data of the plurality of unlabeled training data to obtain a second labeling model;
a model analysis module, used for performing model analysis on the second labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a third labeling model;
a data labeling result obtaining module, used for importing data to be tested, and labeling the data to be tested according to the third labeling model to obtain a data labeling result;
wherein the model analysis module is specifically used for:
labeling each original training data with the second labeling model to obtain pre-labeled data of each original training data;
verifying whether the pre-labeled data of each original training data is the same as the labeled data corresponding to that original training data, and counting the successful verifications to obtain a total number of successful verifications; and
judging whether the total number of successful verifications is greater than a preset second total number; if not, taking the second labeling model as a new original labeling model and returning the second labeling model to the data importing module; if so, taking the second labeling model as the third labeling model.
5. The data labeling device according to claim 4, wherein the model training module is specifically used for: constructing a text convolutional neural network, and training the text convolutional neural network according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain the first labeling model.
6. The data labeling device according to claim 4, wherein the analysis module is specifically used for: acquiring an initial quantity of unlabeled training data, and counting all the unlabeled training data to obtain the quantity of unlabeled training data; summing the initial quantity of unlabeled training data and the quantity of unlabeled training data to obtain the total quantity of unlabeled training data; and judging whether the total quantity of unlabeled training data is greater than or equal to a preset first total number; if not, training the first labeling model according to the plurality of unlabeled training data and the predicted data of the plurality of unlabeled training data to obtain a fourth labeling model, taking the total quantity of unlabeled training data as the new initial quantity of unlabeled training data, taking the fourth labeling model as the new first labeling model, and returning to the prediction module; if so, taking the first labeling model as the second labeling model.
7. A data labeling system comprising a memory, a processor, and a computer program stored in said memory and executable on said processor, wherein the data labeling method according to any of claims 1 to 3 is implemented when said computer program is executed by said processor.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data labeling method according to any of claims 1 to 3.
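
The control flow recited in claims 1 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. `ToyModel` is a hypothetical word-vote classifier standing in for the text convolutional neural network of claim 2. The names `self_train`, `verify`, `next_batch`, `first_quantity`, and `second_total`, the fallback label "unknown", and the empty-batch guard are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict


class ToyModel:
    """Hypothetical stand-in for the text CNN of claim 2 (word/label vote counts)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, texts, labels):
        # Accumulate how often each word appears under each label.
        for text, label in zip(texts, labels):
            for word in text.split():
                self.counts[word][label] += 1
        return self

    def predict(self, texts):
        results = []
        for text in texts:
            votes = Counter()
            for word in text.split():
                votes.update(self.counts.get(word, {}))
            # "unknown" is an assumed fallback label for text with no known words.
            results.append(votes.most_common(1)[0][0] if votes else "unknown")
        return results


def self_train(model, next_batch, first_quantity):
    """Steps S3 and S41 to S44: returns the second labeling model."""
    initial = 0                               # S41: initial quantity
    while True:
        batch = next_batch()                  # S3: import unlabeled training data
        if not batch:                         # guard added here, not in the claims
            return model
        predicted = model.predict(batch)      # S3: predicted data
        total = initial + len(batch)          # S41/S42: count and sum
        if total >= first_quantity:           # S43: threshold reached
            return model                      # first model becomes second model
        model = model.fit(batch, predicted)   # S44: fourth model -> new first model
        initial = total                       # S44: total -> new initial quantity


def verify(model, texts, labels, second_total):
    """Claim 3: True if the second model may be taken as the third model."""
    pre_labels = model.predict(texts)
    successes = sum(p == y for p, y in zip(pre_labels, labels))
    return successes > second_total


texts, labels = ["good movie", "bad movie"], ["pos", "neg"]
first = ToyModel().fit(texts, labels)                       # S1, S2
batches = iter([["good film"], ["bad plot"], ["good plot"]])
second = self_train(first, lambda: next(batches, []), first_quantity=3)
print(verify(second, texts, labels, second_total=1))
```

When `verify` returns False, claim 3 takes the second labeling model as the new original labeling model and returns to step S1. The sketch leaves that outer loop to the caller.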

Description

Data labeling method, device, system and storage medium

Technical Field

The invention relates to the technical field of data labeling, and in particular to a data labeling method, device, system, and storage medium.

Background

In recent years, driven by both policy and technology, the artificial intelligence industry has grown rapidly. Data volumes keep increasing, which has raised demand for large amounts of high-precision, scenario-specific data and fueled explosive growth in the data labeling industry. Text data labeling is widely used in new retail, healthcare, customer service, advertising and marketing, social surveys and statistical analysis, and everyday life and entertainment. As artificial intelligence moves further into specialized domains, algorithm research requires large amounts of labeled data in each domain for model training, which incurs significant time and labor costs in data acquisition and manual labeling. Reducing the cost of data labeling is therefore an important step toward the rapid adoption of artificial intelligence across industries. One effective way to reduce this cost is to pre-label the data, which makes data labeling semi-automatic and greatly reduces the manual labeling workload. In recent years, many studies have focused on deep-learning-based entity extraction (Named Entity Recognition, NER for short), and much work has been done on text pre-labeling.
However, existing research is usually based on general scenarios, and the resulting prediction models have several shortcomings. Because the targeted scenarios are broad, training the prediction model requires large numbers of unlabeled and labeled data samples, so labor costs are high. When applied to a specialized field, the models are not accurate enough and the pre-labeling effect is unsatisfactory. The training effect of the pre-labeling model is also limited by the number of manually labeled data samples.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data labeling method, device, system, and storage medium that address the above shortcomings of the prior art.

The technical solution to this problem is as follows. A data labeling method comprises the following steps:
S1, importing a plurality of original training data and labeled data corresponding one-to-one to the original training data;
S2, constructing an original labeling model, and training the original labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a first labeling model;
S3, importing a plurality of unlabeled training data, and predicting each unlabeled training data according to the first labeling model to obtain predicted data of each unlabeled training data;
S4, analyzing the first labeling model according to the plurality of unlabeled training data and the predicted data of the plurality of unlabeled training data to obtain a second labeling model;
S5, performing model analysis on the second labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a third labeling model; and
S6, importing data to be tested, and labeling the data to be tested according to the third labeling model to obtain a data labeling result.

Another technical solution to this problem is as follows. A data labeling device comprises:
a data importing module, used for importing a plurality of original training data and labeled data corresponding one-to-one to the original training data;
a model training module, used for constructing an original labeling model, and training the original labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a first labeling model;
a prediction module, used for importing a plurality of unlabeled training data, and predicting each unlabeled training data according to the first labeling model to obtain predicted data of each unlabeled training data;
an analysis module, used for analyzing the first labeling model according to the plurality of unlabeled training data and the predicted data of the plurality of unlabeled training data to obtain a second labeling model;
a model analysis module, used for performing model analysis on the second labeling model according to the plurality of original training data and the labeled data corresponding to the plurality of original training data to obtain a third labeling model;
a data labeling result obtaining module, used for importing data to be tested, labeling the data to be tested according to the third labeling model, a