CN-122019774-A - Method and device for generating text classification model
Abstract
An embodiment of the invention discloses a method and a device for generating a text classification model. According to the method, a plurality of unlabeled texts is obtained and clustered to generate an initial positive sample set and an initial negative sample set; the positive sample set is input into a pre-fine-tuned large language model for labeling, generating a pseudo-label positive sample set and a pseudo-label negative sample set; a training sample set is constructed according to the pseudo-label positive sample set, the pseudo-label negative sample set, the initial negative sample set and a pre-labeled artificial sample set; and an initial text classification model is trained by weighting the samples in the training sample set to generate a target text classification model. With this method, a target text classification model capable of classifying text efficiently and accurately can be generated even when manually annotated data is limited.
Inventors
- ZENG JINGHUA
- FAN HANGYU
- YANG RUI
Assignees
- Alibaba (China) Co., Ltd. (阿里巴巴(中国)有限公司)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-06
Claims (12)
- 1. A method of generating a text classification model, the method comprising: acquiring a plurality of unlabeled texts; clustering the plurality of unlabeled texts to generate an initial positive sample set and an initial negative sample set, wherein the positive sample set and the negative sample set respectively comprise a plurality of unlabeled texts and corresponding labels; inputting the positive sample set into a pre-fine-tuned large language model for labeling, and generating a labeled pseudo-tag positive sample set and a pseudo-tag negative sample set, wherein the pseudo-tag positive sample set comprises a plurality of positive samples carrying positive sample tags, and the pseudo-tag negative sample set comprises a plurality of negative samples carrying negative sample tags; constructing a training sample set according to the pseudo-tag positive sample set, the pseudo-tag negative sample set, the initial negative sample set and a pre-labeled artificial sample set, wherein the pre-labeled artificial sample set comprises a pre-labeled artificial positive sample set and a pre-labeled artificial negative sample set; and training an initial text classification model by weighting the samples in the training sample set to generate a target text classification model.
- 2. The method according to claim 1, wherein the acquiring a plurality of unlabeled texts specifically includes: acquiring a plurality of initial unlabeled texts; and performing standardized preprocessing on the plurality of initial unlabeled texts to generate the plurality of unlabeled texts.
- 3. The method according to claim 1, wherein the clustering the plurality of unlabeled texts to generate an initial positive sample set and an initial negative sample set specifically includes: clustering the plurality of unlabeled texts to generate a plurality of clusters, wherein each cluster comprises a plurality of unlabeled texts with similar semantics; comparing each cluster with manually labeled data to determine a label for each cluster, wherein the manually labeled data comprise manually labeled positive sample data and manually labeled negative sample data, and the labels comprise positive sample labels and negative sample labels; and determining the initial positive sample set and the initial negative sample set according to the label of each cluster.
- 4. The method of claim 3, wherein the clustering the plurality of unlabeled texts to generate a plurality of clusters specifically comprises: extracting text features of the plurality of unlabeled texts; and clustering the plurality of unlabeled texts according to the text features to generate the plurality of clusters.
- 5. The method according to claim 3, wherein the determining the initial positive sample set and the initial negative sample set according to the label of each cluster specifically comprises: determining the plurality of clusters with positive sample labels as the initial positive sample set; and determining a plurality of high-confidence negative samples in the plurality of clusters with negative sample labels, and determining the initial negative sample set from the plurality of high-confidence negative samples.
- 6. The method according to claim 1, wherein the inputting the positive sample set into a pre-fine-tuned large language model for labeling, and generating a labeled pseudo-tag positive sample set and a pseudo-tag negative sample set specifically comprises: inputting the positive sample set into the pre-fine-tuned large language model for labeling, and generating a plurality of samples carrying pseudo tags; and filtering the plurality of samples carrying pseudo tags by category confidence to generate the pseudo-tag positive sample set and the pseudo-tag negative sample set.
- 7. The method according to claim 1, wherein the training the initial text classification model by weighting the samples in the training sample set to generate the target text classification model specifically comprises: obtaining a first set number of artificial samples and a second set number of pseudo-tag samples from the training sample set, wherein the first set number of artificial samples are extracted from the pre-labeled artificial sample set, and the second set number of pseudo-tag samples are extracted from the pseudo-tag positive sample set, the pseudo-tag negative sample set and the initial negative sample set; determining initial weights corresponding to the first set number of artificial samples and the second set number of pseudo-tag samples respectively; training the initial text classification model according to the initial weights, and determining a gradient-feedback weight; updating the weights in the initial text classification model according to the gradient-feedback weight to generate an updated initial text classification model; and continuing to obtain the first set number of artificial samples and the second set number of pseudo-tag samples from the training sample set until the initial text classification model converges, to generate the target text classification model.
- 8. The method according to claim 1, wherein the training the initial text classification model by weighting the samples in the training sample set to generate the target text classification model further comprises: obtaining a first set number of artificial samples and a second set number of pseudo-tag samples from the training sample set, wherein the first set number of artificial samples are extracted from the pre-labeled artificial sample set, and the second set number of pseudo-tag samples are extracted from the pseudo-tag positive sample set, the pseudo-tag negative sample set and the initial negative sample set; determining initial weights of the first set number of artificial samples and the second set number of pseudo-tag samples corresponding to a first initial text classification model and a second initial text classification model respectively, wherein the first initial text classification model and the second initial text classification model are the same initial text classification model; training the first initial text classification model and the second initial text classification model according to the different initial weights, and determining a first gradient-feedback weight corresponding to the first initial text classification model and a second gradient-feedback weight corresponding to the second initial text classification model; updating the weights in the first initial text classification model according to the second gradient-feedback weight, and updating the weights in the second initial text classification model according to the first gradient-feedback weight; and continuing to obtain the first set number of artificial samples and the second set number of pseudo-tag samples from the training sample set until the first initial text classification model and the second initial text classification model converge, and determining the converged first initial text classification model or second initial text classification model as the target text classification model.
- 9. The method according to claim 1, further comprising: performing data enhancement on sample data in the pseudo-tag positive sample set, the pseudo-tag negative sample set and the initial negative sample set, to expand the quantity of sample data in the pseudo-tag positive sample set, the pseudo-tag negative sample set and the initial negative sample set.
- 10. An apparatus for generating a text classification model, the apparatus comprising: an obtaining unit, configured to obtain a plurality of unlabeled texts; a clustering unit, configured to cluster the plurality of unlabeled texts to generate an initial positive sample set and an initial negative sample set, wherein the positive sample set and the negative sample set respectively comprise a plurality of unlabeled texts and corresponding labels; a labeling unit, configured to input the positive sample set into a pre-fine-tuned large language model for labeling, and generate a labeled pseudo-tag positive sample set and a pseudo-tag negative sample set; a construction unit, configured to construct a training sample set according to the pseudo-tag positive sample set, the pseudo-tag negative sample set, the initial negative sample set and a pre-labeled artificial sample set; and a training unit, configured to train an initial text classification model by weighting the samples in the training sample set to generate a target text classification model.
- 11. An electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method of any one of claims 1-9.
- 12. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
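The confidence filtering of claim 6 splits the LLM's pseudo-labeled samples into positive and negative sets while discarding low-confidence labels. The sketch below is a minimal, hypothetical illustration of that step; the tuple format, label strings, and threshold value are assumptions, not details fixed by the patent:

```python
def filter_by_confidence(pseudo_samples, threshold=0.8):
    """Split LLM-labeled samples into pseudo-tag positive and negative
    sets, keeping only those whose category confidence clears a threshold
    (the 0.8 threshold is an assumed value for illustration)."""
    pos, neg = [], []
    for text, label, confidence in pseudo_samples:
        if confidence < threshold:
            continue  # discard low-confidence pseudo labels
        (pos if label == "positive" else neg).append((text, label))
    return pos, neg

samples = [
    ("spam ad text", "positive", 0.95),
    ("normal chat", "negative", 0.90),
    ("ambiguous post", "positive", 0.55),  # dropped: below threshold
]
pos, neg = filter_by_confidence(samples)
print(len(pos), len(neg))  # 1 1
```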
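The weighted training of claim 7 scales each sample's contribution to the gradient by a per-sample weight, so scarce manually labeled samples can outweigh noisier pseudo-labeled ones. The following toy logistic-regression sketch illustrates only that idea; the feature encoding, weight values, and all names are assumptions, not the patented procedure:

```python
import math

def train_weighted(samples, weights, epochs=200, lr=0.5):
    """Train a tiny logistic-regression classifier where each sample's
    gradient contribution is scaled by its per-sample weight."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for (x, y), sw in zip(samples, weights):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = (p - y) * sw  # the sample weight scales the gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Manually labeled samples get a higher weight than pseudo-labeled ones
# (the 1.0 vs. 0.3 split is an assumed weighting scheme).
manual = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
pseudo = [([0.9, 0.1], 1), ([0.1, 0.9], 0)]
samples = manual + pseudo
weights = [1.0, 1.0, 0.3, 0.3]
w, b = train_weighted(samples, weights)
print(predict(w, b, [0.8, 0.2]))
```

Claim 8's co-training variant would run two copies of such a model with different initial weights and exchange their gradient-feedback weights each step; the sketch above covers only the single-model case of claim 7.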
Description
Method and device for generating text classification model

Technical Field

The present invention relates to the field of computer technology, and more particularly, to a method and apparatus for generating a text classification model.

Background

With the rapid development of the internet, the amount of text content has grown explosively, especially on social media platforms, e-commerce platforms, online forums and the like, where users publish massive amounts of text every day. This text often contains a large amount of content that is unsuitable or violates platform rules, such as abusive comments, meaningless repeated text, extreme views and advertising text. Such content not only degrades the user experience but may also cause legal disputes, public-opinion crises and the like, so content that does not meet the rules must be identified within the massive volume of text. In the prior art, a traditional text classification approach trains a text classification model on a large-scale manually labeled dataset. In practice, high-quality manual labeling is costly, and manually labeled data are especially limited when facing emerging semantic changes, dialect expressions, variations of internet slang and the like, so existing text classification models perform poorly. In summary, how to classify text efficiently and accurately when manually annotated data is limited is a problem to be solved.

Disclosure of Invention

In view of the above, an embodiment of the invention provides a method and a device for generating a text classification model, which can generate a target text classification model capable of classifying text efficiently and accurately even when manually annotated data is limited.
According to the method, a plurality of unlabeled texts is obtained; the unlabeled texts are clustered to generate an initial positive sample set and an initial negative sample set, where the positive sample set and the negative sample set respectively comprise a plurality of unlabeled texts and corresponding labels; the positive sample set is input into a pre-fine-tuned large language model for labeling, generating a pseudo-tag positive sample set and a pseudo-tag negative sample set, where the pseudo-tag positive sample set comprises a plurality of positive samples carrying positive sample tags and the pseudo-tag negative sample set comprises a plurality of negative samples carrying negative sample tags; a training sample set is constructed according to the pseudo-tag positive sample set, the pseudo-tag negative sample set, the initial negative sample set and a pre-labeled artificial sample set, where the pre-labeled artificial sample set comprises a pre-labeled artificial positive sample set and a pre-labeled artificial negative sample set; and an initial text classification model is trained by weighting the samples in the training sample set to generate a target text classification model. Optionally, the obtaining a plurality of unlabeled texts specifically includes obtaining a plurality of initial unlabeled texts, and performing standardized preprocessing on them to generate the plurality of unlabeled texts.
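The standardized preprocessing mentioned above is not specified further in the text. A plausible minimal sketch, assuming common normalization steps (URL stripping, whitespace collapsing, lowercasing) that the patent does not itself name:

```python
import re

def normalize_text(text):
    """Hypothetical standardized preprocessing for raw unlabeled text:
    strip URLs, collapse whitespace, and lowercase. These steps are
    assumptions; the source does not fix the exact preprocessing."""
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text.lower()

raw = "Visit   https://example.com  NOW!!   Great   deal"
print(normalize_text(raw))  # visit now!! great deal
```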
Optionally, the clustering the plurality of unlabeled texts to generate an initial positive sample set and an initial negative sample set specifically includes clustering the plurality of unlabeled texts to generate a plurality of clusters, where each cluster comprises a plurality of unlabeled texts with similar semantics; comparing each cluster with manually labeled data to determine a label for each cluster, where the manually labeled data comprise manually labeled positive sample data and manually labeled negative sample data, and the labels comprise positive sample labels and negative sample labels; and determining the initial positive sample set and the initial negative sample set according to the label of each cluster. Optionally, the clustering the plurality of unlabeled texts to generate a plurality of clusters specifically includes extracting text features of the plurality of unlabeled texts, and clustering the plurality of unlabeled texts according to the text features to generate the plurality of clusters. Optionally, the determining the initial positive sample set and the initial negative sample set according to the label of each cluster specifically includes determining the plurality of clusters with positive sample labels as the initial positive sample set, determining a plurality of high-confidence negative samples in the plurality of clusters with negative sample labels, and determining the initial negative sample set according to the plurality of high-confidence negative samples. Optionally, the posit
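The cluster-then-label step described above can be illustrated with a toy example: cluster dense feature vectors, assign each cluster a label by comparing its centroid with manually labeled seed points, and keep the negative-cluster members nearest their centroid as high-confidence negatives. Everything below (the mini k-means, the seed points, and the distance-based confidence criterion) is an illustrative assumption, not the patented algorithm:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20):
    """Toy k-means over dense feature vectors (a stand-in for extracted
    text features; deterministic init on the first k points)."""
    centers = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: dist(v, centers[c]))
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Toy feature vectors forming two obvious groups.
vectors = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9],
           [0.15, 0.05], [0.95, 0.95]]
assign, centers = kmeans(vectors, k=2)

# Manually labeled seed points (assumed): one positive, one negative region.
seed_pos, seed_neg = [1.0, 1.0], [0.0, 0.0]
labels = ["pos" if dist(c, seed_pos) < dist(c, seed_neg) else "neg"
          for c in centers]

# High-confidence negatives: negative-cluster members nearest their centroid.
neg_cluster = labels.index("neg")
neg_members = [v for v, a in zip(vectors, assign) if a == neg_cluster]
high_conf_neg = sorted(neg_members,
                       key=lambda v: dist(v, centers[neg_cluster]))[:2]
print(labels, high_conf_neg)
```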