CN-116401565-B - Training method of text clustering model and text clustering method based on training method
Abstract
The invention provides a training method for a text clustering model, comprising: S1, obtaining a training set and an initial model, wherein the initial model comprises a pre-trained BERT model, a pre-trained deep coding model and a clustering module, the training set comprises text data of a plurality of categories, and the pre-trained deep coding model comprises an encoder, a hidden layer and a decoder; S2, training the initial model obtained in step S1 with the training set obtained in step S1 over multiple iterations until convergence, wherein the pre-trained BERT model of the converged initial model, the encoder of its pre-trained deep coding model, and its clustering module form the final text clustering model. The method solves the embedding/clustering separation problem of traditional deep clustering methods and improves the clustering effect of the text clustering model.
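For orientation, here is a minimal sketch of the architecture the abstract describes: a pre-trained BERT model producing text vectors that feed an autoencoder-style deep coding model (encoder, hidden layer, decoder). The model name bert-base-chinese, the layer sizes, and the latent dimension are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DeepCodingModel(nn.Module):
    """Autoencoder-style deep coding model: encoder -> hidden layer -> decoder."""
    def __init__(self, in_dim=768, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 500), nn.ReLU(),
            nn.Linear(500, latent_dim),      # hidden (latent) layer
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 500), nn.ReLU(),
            nn.Linear(500, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                  # dimension-reduced text vector
        return z, self.decoder(z)            # reconstruction used for training

def bert_encode(texts, bert, tokenizer):
    """Encode raw texts into fixed-size vectors with a frozen BERT."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state[:, 0]       # [CLS] vectors, shape (B, 768)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
autoencoder = DeepCodingModel()
z, x_hat = autoencoder(bert_encode(["示例专利文本"], bert, tokenizer))
```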
Inventors
- LIAO LIEFA
- YAO XIU
Assignees
- Jiangxi University of Science and Technology (江西理工大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-04-11
Claims (7)
- 1. A method for training a text clustering model, the method comprising: S1, acquiring a training set and an initial model, wherein the initial model comprises a pre-trained BERT model, a pre-trained deep coding model and a clustering module, the training set comprises text data of a plurality of categories, and the pre-trained deep coding model comprises an encoder, a hidden layer and a decoder; S2, training the initial model obtained in step S1 with the training set obtained in step S1 over multiple iterations until convergence, wherein the pre-trained BERT model of the converged initial model, the encoder of its pre-trained deep coding model and its clustering module form the final text clustering model; wherein the clustering module consists of a single hidden layer and a clustering submodule with self-encoder properties, and adopts a likelihood function of the following form (see the loss sketch after the claims):
  $L = \sum_{i=1}^{n}\|x_i-\hat{x}_i\|^2 + \sum_{i=1}^{n}\sum_{k=1}^{K}\gamma_{ik}\,\mu_k^{T}\mu_k + \sum_{k=1}^{K}(\alpha_k-1)\log\gamma_k$
  wherein the first term is the reconstruction loss of the deep coding model, n is the total number of input vectors, x_i denotes the i-th input vector, and x̂_i denotes the vector reconstructed from the i-th input vector; the second term is a sparsification term, a polynomial formed by formula transformation and combination, which prevents the cluster centroids from deviating, wherein γ_ik denotes the posterior probability, i.e. the probability that the i-th data point comes from the k-th cluster, μ_k denotes the center vector of the k-th cluster, μ_k^T denotes the transpose of the k-th cluster center vector, and K is the number of clusters finally generated by the text clustering; the third term uses a Dirichlet prior to effectively balance the allocation of clusters and represents the average attraction degree of the K clusters, wherein α_k is the Dirichlet prior parameter of the k-th cluster, the logarithm of γ_k is taken to balance the calculation, and k denotes the k-th cluster generated by clustering.
- 2. The method according to claim 1, wherein in step S1, the pre-trained BERT model is used for encoding the text data in the training set into corresponding text vectors; the encoder in the pre-trained deep coding model is used for performing dimension reduction on the text vectors output by the pre-trained BERT model to obtain dimension-reduced text vectors; the clustering module is used for performing self-encoding clustering on the dimension-reduced text vectors output by the pre-trained deep coding model; and the hidden layer and the decoder reconstruct the dimension-reduced text vectors output by the encoder.
- 3. The method according to claim 2, wherein in each iteration of training the initial model, a total loss is calculated and the parameters of the initial model are updated based on the total loss (see the loss sketch after the claims):
  $L = L_{DAE} + \beta L_{CM}$
  wherein L denotes the total loss of the initial model and β is the reconstruction-loss weight parameter of the encoder, with initial value β = 0.5; the reconstruction loss of the deep autoencoder DAE is
  $L_{DAE} = \sum_{i=1}^{n}\|x_i-\hat{x}_i\|^2$
  wherein ‖x_i − x̂_i‖² is the reconstruction loss of the i-th vector of the input text, n is the number of input vectors corresponding to the input text in one iteration of training, x_i denotes the i-th input vector of the iteration, and x̂_i denotes the vector reconstructed from the i-th input vector; the reconstruction loss of the clustering module CM is
  $L_{CM} = \sum_{i=1}^{n}\|z_i-\hat{z}_i\|^2 + \sum_{i=1}^{n}\sum_{k=1}^{K}\gamma_{ik}\,\mu_k^{T}\mu_k + \sum_{k=1}^{K}(\alpha_k-1)\log\gamma_k + \lambda\,\|M^{T}M-I\|^2$
  wherein z_i denotes the i-th input vector of the clustering module CM and ẑ_i denotes the vector reconstructed from it; γ_ik, μ_k and α_k are the parameters used to sparsify the text clustering model; γ_ik denotes the posterior probability, i.e. the probability that the i-th data point comes from the k-th cluster; K is the number of clusters finally generated by the text clustering and k denotes the k-th cluster generated by clustering; the Dirichlet prior balances the distribution of clusters and represents the average attraction degree of the K clusters; and the last term is a Lagrangian orthogonality constraint that guarantees the sparsity of the parameters in the formula, wherein λ denotes the Lagrangian coefficient, M denotes the cluster-center matrix, M^T denotes its transposed matrix, and I denotes an identity matrix of order K.
- 4. The method of claim 1, wherein the clustering module comprises a hidden layer and a Gaussian mixture model.
- 5. A text clustering method, characterized in that the text clustering method comprises: T1, acquiring text data to be clustered; and T2, clustering the text data to be clustered obtained in step T1 with the final text clustering model obtained by the training method according to any one of claims 1-4, so as to obtain a clustering result.
- 6. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
- 7. An electronic device, comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to perform the steps of the method of any one of claims 1-5.
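The claims define their losses verbally; the closed forms given above are reconstructions from those textual definitions. The sketch below implements the reconstructed total loss of claim 3 (which subsumes the claim 1 likelihood terms) in PyTorch. The sign of the Dirichlet term, the orientation of the cluster-center matrix M, and all tensor shapes are assumptions made for illustration, not verbatim patent formulas.

```python
import torch

def total_loss(x, x_hat, z, z_hat, gamma, mu, alpha, beta=0.5, lam=1.0):
    """
    x, x_hat : (n, d)  deep-coding-model input and reconstruction
    z, z_hat : (n, m)  clustering-module input and reconstruction
    gamma    : (n, K)  posterior probabilities gamma_ik
    mu       : (K, m)  cluster-center matrix M (row k is mu_k)
    alpha    : (K,)    Dirichlet prior parameters alpha_k
    """
    # L_DAE: reconstruction loss of the deep coding model
    l_dae = ((x - x_hat) ** 2).sum(dim=1).mean()
    # Clustering-module reconstruction loss
    l_rec = ((z - z_hat) ** 2).sum(dim=1).mean()
    # Sparsification term keeping centroids from drifting:
    # sum_i sum_k gamma_ik * mu_k^T mu_k
    l_sparse = (gamma * (mu * mu).sum(dim=1)).sum()
    # Dirichlet prior balancing cluster allocation (negated so that
    # maximizing the prior likelihood minimizes the loss):
    l_dir = -((alpha - 1) * gamma.mean(dim=0).clamp_min(1e-8).log()).sum()
    # Lagrangian orthogonality constraint lambda * ||M M^T - I_K||^2
    K = mu.shape[0]
    l_orth = lam * ((mu @ mu.t() - torch.eye(K)) ** 2).sum()
    return l_dae + beta * (l_rec + l_sparse + l_dir + l_orth)

# Toy usage with random tensors of plausible shapes:
n, d, m, K = 32, 768, 10, 5
gamma = torch.softmax(torch.randn(n, K), dim=1)
loss = total_loss(torch.randn(n, d), torch.randn(n, d),
                  torch.randn(n, m), torch.randn(n, m),
                  gamma, torch.randn(K, m), torch.ones(K))
```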
Description
Training method of a text clustering model and text clustering method based on the training method

Technical Field

The invention relates to the field of machine learning, in particular to the field of text clustering, and more particularly to a training method for a text clustering model and a text clustering method based on the training method.

Background

Text samples often contain polysemous words; examples include patent texts and microblog texts, and patent texts are especially difficult to cluster. Document [1] records that patent texts contain innovative information and advanced technology studied by a large number of scholars, and that the state of the art and development of related patents can be obtained by analyzing patent texts. Patent text analysis may use existing text classification, clustering and other methods, where the essence of clustering is grouping, with the criteria that similarity between samples within a group be as high as possible and similarity between samples of different groups be as low as possible. Text samples such as patent texts are analyzed for information hidden in them that is not easily obtained by direct statistics; for patent texts, the latest dynamic changes of patents can thus be effectively tracked.

Document [2] discloses that patent texts contain field-specific words, technical terms, synonyms and antonyms, and that traditional vectorization methods cannot solve the polysemy problem in patent texts well, i.e. it is difficult to accurately extract the complete semantic information, which directly affects the effect of subsequent clustering. Document [3] describes that conventional deep clustering methods tend to separate clustering from embedding: because no clustering-induced embedding is used, the clustering cannot adapt to the embedding. For the vectorized representation of patent text, the technical solution in Document [4] uses the TF-IDF (term frequency-inverse document frequency) method for vector representation, but when the patent corpus is large, the vector-space dimension generated by TF-IDF grows with the corpus and its application consumes a lot of time. Document [5] proposes using word2vec for patent vector representation, but word2vec is a static word-embedding model and cannot solve the polysemy problem in patent text.

To solve the polysemy problem in text samples, ELMo (embeddings from language models) in Document [6] is pre-trained with an unsupervised bidirectional language model and dynamically adjusts a word's embedding according to its context, effectively distinguishing different senses of the same word in different contexts; however, ELMo adopts a long short-term memory (LSTM) feature extractor, whose feature-extraction capability is weak. BERT (bidirectional encoder representations from transformers) in Document [7] adopts a Transformer feature extractor; since the outputs of its 12 corresponding Transformer layers differ for the same word in different contexts, BERT uses the multi-head attention mechanism and fuses the layer features to obtain a more complete characterization of the word, so the polysemy problem is solved more effectively.
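A minimal sketch of the layer fusion described above, using the Hugging Face transformers API. Averaging the 12 Transformer layers is one common fusion choice, assumed here for illustration rather than prescribed by the document; the model name bert-base-chinese is likewise an assumption.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese",
                                  output_hidden_states=True)

batch = tokenizer(["专利文本示例"], return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# hidden_states: tuple of 13 tensors (embedding layer + 12 Transformer layers)
layers = torch.stack(out.hidden_states[1:])   # (12, B, seq_len, 768)
fused = layers.mean(dim=0)                    # fuse the 12 layer outputs
sentence_vec = fused.mean(dim=1)              # mean-pool tokens -> (B, 768)
```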
To obtain a better clustering effect on text samples, the prior art uses deep clustering methods that combine classical clustering algorithms with deep learning. For example, Document [8] uses the temporal memory capability of LSTM and the nonlinear feature-extraction capability of an autoencoder to perform automatic feature extraction and nonlinear dimension reduction, and then applies the k-means algorithm for cluster analysis. However, this sequential mode makes the clustering adapt to whatever features were extracted and depends heavily on the quality of feature extraction. Document [9] improves on this: DEC (deep embedded clustering) uses a deep neural network to learn the feature representation and the cluster assignment simultaneously, which effectively avoids the defects of the sequential mode. Document [10] improves on DEC and proposes the IDEC algorithm, which performs joint clustering, learns embedded features suitable for clustering, and combines an autoencoder to maintain the local structure. Specifically, IDEC (improved deep embedded clustering) is an improvement of the DEC algorithm aimed at the problem that DEC discards the decoding layer in the training stage, so that the encoding layer distorts the embedded space; by keeping the decoder, the original data's features are better preserved in the extracted features.
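For contrast, here is the two-stage ("sequential") pipeline criticized above: first reduce dimensionality with an autoencoder, then run k-means on the frozen features. It is shown only to make the weakness that DEC/IDEC address concrete; layer sizes, epoch count, and cluster count are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def train_autoencoder(x, latent_dim=10, epochs=50):
    enc = nn.Sequential(nn.Linear(x.shape[1], 500), nn.ReLU(),
                        nn.Linear(500, latent_dim))
    dec = nn.Sequential(nn.Linear(latent_dim, 500), nn.ReLU(),
                        nn.Linear(500, x.shape[1]))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((dec(enc(x)) - x) ** 2).mean()  # reconstruction only:
        loss.backward()                         # no clustering signal
        opt.step()
    return enc

x = torch.randn(200, 768)                # stand-in for BERT text vectors
encoder = train_autoencoder(x)
with torch.no_grad():
    features = encoder(x).numpy()        # features frozen before clustering
labels = KMeans(n_clusters=5, n_init=10).fit_predict(features)
# Clustering quality now hinges entirely on features learned without any
# cluster-oriented objective -- the defect the joint DEC/IDEC methods avoid.
```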