CN-121980034-A - Text classification model training method, system and product based on mixed sampling
Abstract
The application provides a text classification model training method, system and product based on mixed sampling, and relates to the technical field of deep learning. The method comprises: predicting, with a text classification model, the probability that each item in an unlabeled text data set belongs to each text category, to obtain a prediction result; determining the labeling progress; in the initial stage of labeling, sampling a subset with a class balance sampling strategy based on the prediction result and labeling it to obtain a labeled subset; in the middle stage of labeling, sampling a subset with an uncertainty and diversity mixed sampling strategy based on the prediction result and labeling it to obtain a labeled subset; in the later stage of labeling, sampling a subset with an edge sampling strategy or the uncertainty and diversity mixed sampling strategy based on the prediction result and labeling it to obtain a labeled subset; and, each time a labeled subset is obtained, constructing a corresponding augmented training set and training the text classification model once with the constructed augmented training set. The method aims to improve the efficiency of sample labeling and model training.
Inventors
- GUO PENGFEI
- SUN XIAO
- PENG BAOYUN
- NIE XIAONING
- YANG PEIYING
Assignees
- 北京大数据先进技术研究院
Dates
- Publication Date
- 20260505
- Application Date
- 20260121
Claims (10)
- 1. A method for training a text classification model based on mixed sampling, the method comprising: predicting, through a text classification model, the probability that each unlabeled text data item in an unlabeled text data set belongs to each text category, to obtain category prediction results for all unlabeled text data; determining the current progress of the labeling task; in the case that the current progress of the labeling task is an initial stage, sampling a subset from the unlabeled text data set for labeling through a class balance sampling strategy based on the category prediction results, to obtain a labeled data subset; in the case that the current progress of the labeling task is a middle stage, sampling a subset from the unlabeled text data set for labeling through an uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; in the case that the current progress of the labeling task is a later stage, sampling a subset from the unlabeled text data set for labeling through an edge sampling strategy or the uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; and each time a labeled data subset is obtained, constructing a corresponding augmented training data set and training the text classification model once using the constructed augmented training data set.
- 2. The method for training a text classification model based on mixed sampling according to claim 1, wherein, in the case that the current progress of the labeling task is a later stage, sampling a subset from the unlabeled text data set for labeling through an edge sampling strategy or the uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset, comprises: in the case that the current progress of the labeling task is a later stage, determining, from a preset number of the most recent consecutive F1 scores, the improvement rate between each pair of adjacent F1 scores; determining the current average improvement rate of the text classification model from the obtained improvement rates; in the case that the average improvement rate is smaller than a first threshold, sampling a subset from the unlabeled text data set for labeling through the edge sampling strategy based on the category prediction results, to obtain a labeled data subset; and in the case that the average improvement rate is greater than or equal to the first threshold, sampling a subset from the unlabeled text data set for labeling through the uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset.
- 3. The method for training a text classification model based on mixed sampling according to claim 1, wherein, in the case that the current progress of the labeling task is a middle stage, sampling a subset from the unlabeled text data set for labeling through an uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset, comprises: in the case that the current progress of the labeling task is a middle stage, determining an uncertainty score of each unlabeled text data item according to its category prediction result; screening a candidate data set from the unlabeled text data set based on the uncertainty score of each unlabeled text data item in the unlabeled text data set; determining the distances between an unlabeled text data item in the candidate data set and each sample in a selected sample set, and taking the minimum of these distances as the diversity score of the unlabeled text data item, wherein the selected sample set is the labeled text data set, or a set consisting of a preset number of labeled text data items randomly selected from the labeled text data set; and sampling a subset from the candidate data set for labeling based on the diversity score of each unlabeled text data item in the candidate data set, to obtain a labeled data subset.
- 4. The method for training a text classification model based on mixed sampling according to claim 1, wherein determining the current progress of the labeling task comprises: in the iterative training process of the text classification model, acquiring a performance evaluation value of the text classification model in the current iteration round; and calculating the current progress of the labeling task through a progress algorithm according to the performance evaluation value of the current iteration round, the initial performance evaluation value of the initial iteration round, and a preset target performance evaluation value; the progress algorithm is as follows: [formula omitted in the source text]; wherein p represents the current progress of the labeling task, and the remaining symbols (likewise omitted in the source text) represent, respectively, the performance evaluation value of the current iteration round, the initial performance evaluation value, and the target performance evaluation value.
- 5. The method for training a text classification model based on mixed sampling according to claim 1, wherein, before labeling the selected subset, the method further comprises: sampling a reference data set from the labeled text data set; for any unlabeled text data item in the selected subset, calculating the cosine distance between the semantic vector of the unlabeled text data item and the reference vector of each reference data item in the reference data set; and when any of the cosine distances corresponding to the unlabeled text data item is smaller than a second threshold, filtering the unlabeled text data item out of the subset; and wherein, after labeling the selected subset, the method further comprises: updating the labeled text data set with the labeled subset, for use in reference vector filtering of the next batch.
- 6. The method for training a text classification model based on mixed sampling according to claim 1, further comprising: generating a corresponding recommendation reason for each unlabeled text data item in a subset according to the sampling strategy adopted when selecting the subset; and, when outputting the selected subset, outputting the recommendation reason of each unlabeled text data item in the subset.
- 7. The method for training a text classification model based on mixed sampling according to any of claims 1-6, wherein, in the case that the text classification model is an entity type extraction model, a trained entity type extraction model is obtained after multiple rounds of training, the trained entity type extraction model being used for determining an entity type for each entity in a target text; in the case that the text classification model is a relation type extraction model, a trained relation type extraction model is obtained after multiple rounds of training, the trained relation type extraction model being used for determining a relation type for each entity pair in the target text; and a knowledge graph is constructed according to the entity types of the entities in the target text and the relation types of the entity pairs in the target text.
- 8. A text classification model training system based on mixed sampling, the system comprising: a category prediction module, used for predicting, through a text classification model, the probability that each unlabeled text data item in an unlabeled text data set belongs to each text category, to obtain category prediction results for all unlabeled text data; a labeling progress determining module, used for determining the current progress of the labeling task; a first labeling module, used for, in the case that the current progress of the labeling task is an initial stage, sampling a subset from the unlabeled text data set for labeling through a class balance sampling strategy based on the category prediction results, to obtain a labeled data subset; a second labeling module, used for, in the case that the current progress of the labeling task is a middle stage, sampling a subset from the unlabeled text data set for labeling through an uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; a third labeling module, used for, in the case that the current progress of the labeling task is a later stage, sampling a subset from the unlabeled text data set for labeling through an edge sampling strategy or the uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; and a model training module, used for, each time a labeled data subset is obtained, constructing a corresponding augmented training data set and training the text classification model once using the constructed augmented training data set.
- 9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for training a text classification model based on mixed sampling according to any of claims 1 to 7.
- 10. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of a method for training a text classification model based on mixed sampling according to any of claims 1 to 7.
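The middle-stage sampling of claim 3 and the redundancy filter of claim 5 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The following are assumptions made only for illustration: Shannon entropy as the uncertainty score, Euclidean distance for the diversity score, top-k selection at both the screening and the final sampling step, and the hypothetical names `hybrid_sample`, `filter_redundant`, `candidate_size` and `batch_size`. The claims do not fix any of these choices.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one predicted class distribution (assumed uncertainty score)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_sample(probs: Sequence[Sequence[float]], embeddings: Sequence[Vector],
                  selected: Sequence[Vector], candidate_size: int,
                  batch_size: int) -> List[int]:
    """Return indices of unlabeled items chosen by uncertainty, then diversity.

    `selected` holds vectors of the selected sample set and must be non-empty.
    """
    # Uncertainty score for every unlabeled item.
    scores = [entropy(p) for p in probs]
    # Screen the candidate set: keep the most uncertain items (assumed top-k rule).
    order = sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)
    candidates = order[:candidate_size]
    # Diversity score: minimum distance to any sample in the selected sample set.
    diversity = {i: min(euclidean(embeddings[i], s) for s in selected)
                 for i in candidates}
    # Sample the subset: keep the most diverse candidates (assumed top-k rule).
    return sorted(candidates, key=lambda i: diversity[i], reverse=True)[:batch_size]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def filter_redundant(indices: Sequence[int], embeddings: Sequence[Vector],
                     reference_vectors: Sequence[Vector],
                     second_threshold: float) -> List[int]:
    """Drop any item whose cosine distance to some reference vector is below the threshold."""
    return [i for i in indices
            if all(cosine_distance(embeddings[i], r) >= second_threshold
                   for r in reference_vectors)]
```

In this sketch the filter runs on the output of the sampler, so an item that is uncertain and diverse can still be dropped before labeling if it nearly duplicates a reference item.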
Description
Text classification model training method, system and product based on mixed sampling
Technical Field
The application relates to the technical field of deep learning, and in particular to a text classification model training method, system and product based on mixed sampling.
Background
With the rapid evolution of artificial intelligence technology, high-quality labeled data has become key to improving model performance. In cutting-edge fields such as knowledge graph construction, Natural Language Processing (NLP) and Computer Vision (CV), a large-scale, high-quality labeled data set is a precondition for training and optimizing a complex model.
However, the currently mainstream manual labeling approach has several bottlenecks. First, fully labeling a massive data set requires a huge investment of manpower and time, so the cost is high. Second, traditional random sampling or sequential labeling often spends labor on large numbers of redundant or low-value samples, consuming valuable labeling resources and lowering both labeling efficiency and model training efficiency. Third, without an effective sample selection strategy, the labeling process cannot systematically ensure that the samples are representative and diverse; the final labeled data set may therefore be biased, which harms model generalization and causes quality to fluctuate.
Disclosure of Invention
In view of the above, the application provides a text classification model training method, system and product based on mixed sampling, which aim to solve, or at least partially solve, the problems described in the background.
A first aspect of the present application provides a text classification model training method based on mixed sampling, the method comprising: predicting, through a text classification model, the probability that each unlabeled text data item in an unlabeled text data set belongs to each text category, to obtain category prediction results for all unlabeled text data; determining the current progress of the labeling task; in the case that the current progress of the labeling task is an initial stage, sampling a subset from the unlabeled text data set for labeling through a class balance sampling strategy based on the category prediction results, to obtain a labeled data subset; in the case that the current progress of the labeling task is a middle stage, sampling a subset from the unlabeled text data set for labeling through an uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; in the case that the current progress of the labeling task is a later stage, sampling a subset from the unlabeled text data set for labeling through an edge sampling strategy or the uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; and each time a labeled data subset is obtained, constructing a corresponding augmented training data set and training the text classification model once using the constructed augmented training data set.
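The stage-dependent choice of sampling strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation. The progress value is assumed to be a linear normalization of the current performance evaluation value between the initial and target values, since the progress formula itself is not reproduced in this text. The stage boundaries `early_end` and `mid_end`, the default `first_threshold`, the relative definition of the improvement rate, and all function names are hypothetical.

```python
from typing import Sequence


def labeling_progress(current: float, initial: float, target: float) -> float:
    """Assumed progress measure: linear position of `current` between
    `initial` and `target`, clamped to the range [0, 1]."""
    if target <= initial:
        raise ValueError("target must be greater than initial")
    return min(1.0, max(0.0, (current - initial) / (target - initial)))


def average_improvement_rate(f1_scores: Sequence[float]) -> float:
    """Mean relative improvement between adjacent F1 scores (assumed definition).

    Needs at least two scores, all greater than zero.
    """
    if len(f1_scores) < 2:
        raise ValueError("at least two F1 scores are required")
    rates = [(later - earlier) / earlier
             for earlier, later in zip(f1_scores, f1_scores[1:])]
    return sum(rates) / len(rates)


def choose_strategy(progress: float, recent_f1: Sequence[float],
                    early_end: float = 0.3, mid_end: float = 0.7,
                    first_threshold: float = 0.01) -> str:
    """Pick a sampling strategy name for the current labeling round."""
    if progress < early_end:
        # Initial stage: class balance sampling.
        return "class_balance"
    if progress < mid_end:
        # Middle stage: uncertainty and diversity mixed sampling.
        return "uncertainty_diversity"
    # Later stage: switch to edge sampling once the model stops improving.
    if average_improvement_rate(recent_f1) < first_threshold:
        return "edge"
    return "uncertainty_diversity"
```

For example, with an initial F1 of 0.5, a target of 0.9 and a current F1 of 0.7, the assumed progress is about 0.5, which falls in the middle stage under the hypothetical boundaries used here.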
A second aspect of the present application provides a text classification model training system based on mixed sampling, the system comprising: a category prediction module, used for predicting, through a text classification model, the probability that each unlabeled text data item in an unlabeled text data set belongs to each text category, to obtain category prediction results for all unlabeled text data; a labeling progress determining module, used for determining the current progress of the labeling task; a first labeling module, used for, in the case that the current progress of the labeling task is an initial stage, sampling a subset from the unlabeled text data set for labeling through a class balance sampling strategy based on the category prediction results, to obtain a labeled data subset; a second labeling module, used for, in the case that the current progress of the labeling task is a middle stage, sampling a subset from the unlabeled text data set for labeling through an uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; a third labeling module, used for, in the case that the current progress of the labeling task is a later stage, sampling a subset from the unlabeled text data set for labeling through an edge sampling strategy or the uncertainty and diversity mixed sampling strategy based on the category prediction results, to obtain a labeled data subset; and a model training module, used for, each time a labeled data subset is obtained, constructing a corresponding augmented training data set, and training the text classification model