CN-116483999-B - Deep text clustering method and device based on adaptive structure learning
Abstract
The invention discloses a deep text clustering method based on adaptive structure learning. The method comprises the following steps: first, constructing a K-neighbor graph; second, generating an adaptive-structure graph through an adaptive threshold strategy; third, using a threshold decay strategy to let the graph convolution kernel dynamically adjust its topological range, so as to learn an adaptive structural semantic representation of the text; fourth, learning the text's own semantic representation with an autoencoder, merging it layer by layer into the structural semantic representation learned in the third step, and learning a fusion-enhanced semantic representation; and fifth, continuously back-propagating to adjust and optimize the encoder parameters, yielding the final text clustering result. By fusing structural information into the text semantic representation and using structural and semantic information to jointly supervise the clustering process, the method effectively addresses the problems of divergent text representations and insufficient supervision in unsupervised text clustering, improves the accuracy of the clustering results, and produces clusters better suited to downstream tasks.
Inventors
- REN LINA
- HUANG RUIZHANG
- PAN WEI
Assignees
- 贵州轻工职业技术学院
- 贵州大学
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-04-14
Claims (6)
- 1. A deep text clustering method based on adaptive structure learning, characterized by comprising the following steps:
  step one, constructing a K-neighbor graph from the original text data by a K-neighbor method;
  step two, filtering out neighbor points with low similarity through an adaptive threshold strategy to generate a graph with an adaptive structure;
  step three, inputting the adaptive-structure graph into an adaptive-topology neural network, using a threshold decay strategy to let the graph convolution kernel dynamically adjust its topological range, and learning an adaptive structural semantic representation of the text;
  step four, inputting the text's own information into an autoencoder, learning the text's own semantic representation, merging it layer by layer into the adaptive structural semantic representation learned in step three, and learning a fusion-enhanced semantic representation carrying both the text's own semantics and the structural semantics;
  step five, performing cluster assignment on the fusion-enhanced semantic representation obtained in step four with softmax, computing the losses on the adaptive structural semantic representation and the text's own semantic representation with a dual self-supervision mechanism, and continuously back-propagating to adjust and optimize the encoder parameters, obtaining the final semantic representation and the text clustering result;
  the adaptive structural semantic representation of the text is calculated by the formula $Z^{(l+1)} = \sigma\big(\sum_{c=1}^{M} G_c^{(l)} Z^{(l)} + b\,\mathbf{1}_{N_l}\big)$, where $G_c^{(l)} = \sum_{k} g_{c,k}^{(l)} A^{k}$ is the layer-$l$ polynomial convolution kernel, $g_{c,k}^{(l)}$ are the layer-$l$ polynomial parameters, $l$ indexes the layers of the neural network, $M$ is the number of convolution kernels, $b$ is a learnable bias matrix, $\mathbf{1}_{N_l}$ is the all-ones vector whose dimension $N_l$ is the number of layer-$l$ vertices, $A$ is the similarity matrix of the automatically adjusted graph structure obtained in step two, $Z^{(l)}$ is the layer-$l$ adaptive structural semantic representation of the text, and $\sigma$ is an activation function;
  the class-cluster distribution R is obtained from the softmax assignment; the text data distribution Q is obtained from the text semantic representation learned by the autoencoder using a Student's-t distribution, and a distribution P is computed from Q; the divergence loss between P and Q and the divergence loss between P and R are calculated with the KL divergence formula and minimized, so that a high-confidence distribution is learned and the model parameters are fine-tuned; combined with the reconstruction loss of the autoencoder, these jointly form the model loss function, and the clustering result is obtained from R;
  the loss between the distributions P and Q is calculated as $L_{ae} = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$, where subscript $i$ denotes the $i$-th sample, subscript $j$ denotes the $j$-th cluster center, $p_{ij}$ is a value in P, $q_{ij}$ is a value in Q, and $L_{ae}$ is the loss between the P and Q distributions;
  the loss between the distributions P and R is calculated as $L_{tagcn} = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{r_{ij}}$, where subscript $i$ denotes the $i$-th sample, subscript $j$ denotes the $j$-th cluster center, $p_{ij}$ is a value in P, $r_{ij}$ is a value in R, and $L_{tagcn}$ is the loss between the P and R distributions.
- 2. The deep text clustering method based on adaptive structure learning of claim 1, characterized in that, in step one, the text data are first preprocessed using at least one of a bag-of-words model, TF-IDF, or Word2Vec, and the preprocessed text data are then formed into a K-neighbor graph in which each text has K neighbors, via the KNN algorithm, following the principle that a text shares similar characteristics with its neighbors.
- 3. The deep text clustering method based on adaptive structure learning of claim 1, wherein the adaptive threshold strategy sets, for each text of the K-neighbor graph obtained in step one, a threshold that automatically adjusts its number of neighbors.
- 4. The deep text clustering method based on adaptive structure learning of claim 3, wherein the threshold for automatically adjusting the number of neighbors is set as follows: the threshold of the 1st filtering is initialized to 0.006, and the threshold $t_m$ of the m-th filtering is computed from the threshold $t_{m-1}$ of the (m-1)-th filtering.
- 5. The deep text clustering method based on adaptive structure learning of claim 1, wherein the fusion-enhanced semantic representation is iteratively calculated by the formulas $\tilde{Z}^{(l)} = (1-\varepsilon)\,Z^{(l)} + \varepsilon\,H^{(l)}$ and $Z^{(l+1)} = \sigma\big(\sum_{c=1}^{M} G_c^{(l)} \tilde{Z}^{(l)} + b\,\mathbf{1}_{N_l}\big)$, where $H^{(l)}$ is the layer-$l$ text's own semantic representation, $\tilde{Z}^{(l)}$ is the layer-$l$ fusion-enhanced semantic representation, $Z^{(l)}$ is the layer-$l$ adaptive structural semantic representation of the text, $\sigma$ is the activation function, $b$ is the learnable bias matrix, $\mathbf{1}_{N_l}$ is the all-ones vector whose dimension $N_l$ is the number of layer-$l$ vertices, and $G_c^{(l)}$ is the layer-$l$ polynomial convolution kernel.
- 6. A deep text clustering device based on adaptive structure learning, comprising a processor and a memory, wherein the memory stores computer program instructions adapted to be executed by the processor, and the instructions, when executed by the processor, cause the processor to perform the deep text clustering method based on adaptive structure learning of any one of claims 1-5.
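A minimal sketch of steps one and two of claim 1, assuming TF-IDF features and cosine similarity (illustrative choices; the claims do not fix the similarity measure). The threshold-update formula of claim 4 is not reproduced above, so the filter thresholds are passed in explicitly:

```python
# Sketch of claim 1, steps one and two: K-neighbor graph construction
# followed by adaptive threshold filtering. TF-IDF features, cosine
# similarity, and all function names are assumptions for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def knn_graph(texts, k=10):
    """Step one: return (similarity matrix S, adjacency A) of a K-neighbor graph."""
    x = TfidfVectorizer().fit_transform(texts)   # preprocessing, per claim 2
    s = cosine_similarity(x)                     # pairwise text similarity
    np.fill_diagonal(s, 0.0)                     # no self-loops
    a = np.zeros_like(s)
    idx = np.argsort(-s, axis=1)[:, :k]          # k most similar neighbors per text
    rows = np.repeat(np.arange(s.shape[0]), k)
    a[rows, idx.ravel()] = 1.0
    return s, a

def adaptive_filter(s, a, thresholds=(0.006, 0.012)):
    """Step two: drop edges whose similarity falls below a per-pass threshold.
    Claim 4 initializes the first threshold at 0.006 and derives each later
    threshold from its predecessor; that update rule is not reproduced in the
    source, so the sequence is supplied by the caller (values here are examples)."""
    for t in thresholds:
        a = np.where(s >= t, a, 0.0)
    return a
```

A symmetrized result (e.g. `np.maximum(a, a.T)`) would then serve as the similarity matrix $A$ fed to the convolution of step three.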
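A minimal PyTorch sketch of the step-three convolution and the claim-5 fusion. The polynomial kernel follows the TAGCN pattern implied by the loss name $L_{tagcn}$, but each kernel is collapsed to a single scalar-weighted power of $A$ for brevity, and the balance coefficient `eps` is an assumption; the claims state only that the two representations are merged layer by layer:

```python
# Sketch of the adaptive-topology graph convolution (claim 1, step three)
# and the layer-wise fusion (claim 5). `a` is assumed to be a normalized
# similarity matrix produced by the adaptive filtering of step two.
import torch
import torch.nn as nn

class AdaptiveTopologyConv(nn.Module):
    def __init__(self, in_dim, out_dim, num_kernels=3):
        super().__init__()
        self.m = num_kernels                                     # M: number of convolution kernels
        self.g = nn.Parameter(torch.randn(num_kernels) * 0.01)   # polynomial parameters
        self.lin = nn.Linear(in_dim, out_dim)                    # feature transform
        self.b = nn.Parameter(torch.zeros(out_dim))              # learnable bias, broadcast over N_l rows

    def forward(self, z, a):
        # Simplification: each kernel is one scalar-weighted power of A,
        # so A^c aggregates c-hop neighbors (the multi-hop effect discussed
        # in the description). Threshold decay over hops is omitted here.
        out, az = 0.0, z
        for c in range(self.m):
            az = a @ az                                          # A^{c+1} Z
            out = out + self.g[c] * az
        return torch.relu(self.lin(out) + self.b)                # sigma(sum_c G_c Z + b 1_{N_l})

def fuse(z, h, eps=0.5):
    """Claim 5: merge the autoencoder representation H^{(l)} into the structural
    representation Z^{(l)} before the next convolution. eps = 0.5 is an assumed
    balance coefficient, not stated in the claims."""
    return (1.0 - eps) * z + eps * h
```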
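A sketch of the step-five dual self-supervision, assuming the standard DEC-style Student's-t assignment $Q$ and sharpened target $P$; the claims name these distributions but not their closed forms, so those constructions are assumptions:

```python
# Sketch of claim 1, step five: soft assignment Q, high-confidence target P,
# and the two KL losses L_ae = KL(P||Q) and L_tagcn = KL(P||R).
import torch
import torch.nn.functional as F

def student_t_q(h, centers):
    """q_ij from the autoencoder embedding h_i and cluster centers mu_j."""
    d2 = torch.cdist(h, centers) ** 2
    q = 1.0 / (1.0 + d2)
    return q / q.sum(dim=1, keepdim=True)

def target_p(q):
    """Assumed DEC construction: p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'),
    with soft cluster frequency f_j = sum_i q_ij."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def dual_self_supervision(p, q, r):
    """KL(P||Q) supervises the autoencoder branch, KL(P||R) the structural
    branch; R is the softmax cluster distribution of the fused representation."""
    l_ae = F.kl_div(q.log(), p, reduction="batchmean")
    l_tagcn = F.kl_div(r.log(), p, reduction="batchmean")
    return l_ae, l_tagcn
```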
Description
Deep text clustering method and device based on adaptive structure learning

Technical Field

The invention relates to the field of information extraction and text processing, and in particular to a deep text clustering method and device based on adaptive structure learning; it belongs to the technical fields of data mining and natural language processing.

Background

With the development of deep neural networks, deep text clustering has become a popular research field in recent years. Deep text clustering is the task of learning text semantic representations with a neural network and, on the basis of those representations, grouping similar text documents into the same class; it can be applied in fields such as text analysis, business applications, web search, recommendation systems, and biomedicine. Existing semantic representation learning for deep text clustering falls mainly into two kinds: learning the semantic representation of the text's own content, and learning the structural semantic representation of the text data set. Because the two representations are not independent but mutually associated and complementary, a deep text clustering model that fuses the semantic representation of the text's own content with the structural semantic representation of the data set usually clusters better than a model that learns only a single representation.

At present, models that fuse the two representations use two kinds of structural features of the initial text data set: (1) a K-neighbor graph, formed by computing the K nearest neighbor samples of each text document with a K-neighbor algorithm; and (2) the data set's own graph structure, formed by relationships such as co-authorship and citation. However, in current deep text clustering models the K of the K-neighbor graph is fixed for all texts: if K is too small, the structural information is too sparse, and if K is too large, redundant noisy data is introduced; either case can severely harm the final clustering. The data set's own graph structure also contains noise, so two documents connected by an edge do not necessarily belong to the same class; for example, 26.45% of the connected nodes in the citation graph of the real text data set Citeseer do not belong to the same class. How to learn an adaptive structure for a text data set is therefore worth investigating.

In addition, the learning of structural semantic representations in current fusion models is mainly of two kinds: (1) methods based on graph convolutional neural networks, whose idea is to learn the local features of each sample and its neighbors; and (2) methods based on graph attention neural networks, whose idea is to use an attention mechanism to learn the different contribution of each neighbor when computing the structural representation of a sample.
However, these methods do not consider the role of multi-hop neighbors in the graph structure, that is, the fact that the neighbors of neighbors also contribute to learning a sample's representation. In practical applications, similar texts are not necessarily in the same cluster, while some texts far apart in the graph may belong to the same class; this leads to poor clustering in the prior art.

Disclosure of the Invention

The invention provides a deep text clustering method and device based on adaptive structure learning to overcome the defects of the prior art. The technical scheme of the invention is a deep text clustering method based on adaptive structure learning comprising the following steps: step one, constructing a K-neighbor graph from the original text data by a K-neighbor method; step two, filtering out neighbor points with low similarity through an adaptive threshold strategy to generate a graph with an adaptive structure; step three, inputting the adaptive-structure graph into an adaptive-topology neural network, using a threshold decay strategy to let the graph convolution kernel dynamically adjust its topological range, and learning an adaptive structural semantic representation of the text; step four, inputting the text's own information into an autoencoder, learning the text's own semantic representation, and merging it layer by layer into the adaptive structural semantic representation learned in step three.
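Tying the pieces together, a hedged sketch of one training step under the scheme described above, reusing the helpers sketched after the claims; the optimizer, the loss weights `alpha`/`beta`, and the mirrored layer widths of the two branches (required for the fusion to be shape-compatible) are illustrative assumptions:

```python
# One training step combining reconstruction loss with the two KL losses,
# as described in step five. Assumes `fuse`, `student_t_q`, `target_p`, and
# `dual_self_supervision` from the sketches following the claims.
import torch.nn.functional as F

def train_step(encoder, decoder, gcn_layers, centers, x, a, opt,
               alpha=0.1, beta=0.01):
    opt.zero_grad()
    hs, h = [], x
    for layer in encoder:                     # autoencoder branch: H^{(l)} per layer
        h = layer(h)
        hs.append(h)
    z = gcn_layers[0](x, a)                   # structural branch starts on raw features
    for layer, h_l in zip(gcn_layers[1:], hs):
        z = layer(fuse(z, h_l), a)            # claim-5 fusion before each convolution
    r = F.softmax(z, dim=1)                   # cluster distribution R (step five)
    q = student_t_q(h, centers)               # soft assignment from the AE embedding
    p = target_p(q).detach()                  # fixed high-confidence target P
    l_ae, l_tagcn = dual_self_supervision(p, q, r)
    x_hat = decoder(h)                        # autoencoder reconstruction
    loss = F.mse_loss(x_hat, x) + alpha * l_ae + beta * l_tagcn
    loss.backward()                           # back-propagate to fine-tune parameters
    opt.step()
    return loss.item(), r.argmax(dim=1)       # hard clustering result read off R
```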