CN-122021560-A - Long text abstract generation method based on hierarchical graph comparison subject
Abstract
The invention discloses a long text abstract generation method based on hierarchical graph comparison subjects, which comprises the steps of: 1, preprocessing an original document, dividing it into sentence sequences, and obtaining globally context-aware sentence and document representations through a hierarchical encoder network; 2, inferring document-level and sentence-level topic distributions with a neural topic model; and 3, constructing a supervision graph based on the standard abstract and performing graph contrastive learning, so as to pull together the topic representations of the document and its key sentences and push apart redundant information. The invention can effectively capture the deep semantic structure of a long document, thereby improving the topic consistency and information coverage of the abstract and reducing redundancy.
Inventors
- HE JIN
- WANG WEI
- GU LICHUAN
- JIANG TINGTING
- YANG SHUAI
- LI TONGGE
Assignees
- 安徽农业大学
Dates
- Publication Date
- 20260512
- Application Date
- 20260130
Claims (8)
- 1. A long text abstract generation method based on hierarchical graph comparison subjects, characterized by comprising the following steps: Step 1, obtaining original document data to be processed, and performing stop-word removal and word segmentation to obtain a text dataset D = {D_1, ..., D_i, ..., D_M}, wherein D_i is the i-th document in D and M is the total number of documents; dividing D_i into a sequence of J consecutive sentences S_i = {S_{i,1}, ..., S_{i,j}, ..., S_{i,J}}, wherein S_{i,j} is the j-th sentence in D_i, J is the total number of sentences, and S_{i,j} corresponds to a real classification label y_{i,j}; constructing a bag-of-words feature vector F_i of D_i; Step 2, constructing a hierarchical encoder network, comprising a block-level Transformer encoder, an HDBSCAN clustering module and a document-level Transformer encoder, and processing S_i to obtain a globally context-aware sentence representation matrix of D_i and a document representation of D_i; Step 3, inferring the document topic distribution of D_i and the sentence topic distributions of S_i through a neural topic model, based on the sentence representation matrix and the document representation; Step 4, constructing an evidence lower bound loss L_ELBO based on the document topic distribution and the sentence topic distributions; Step 5, constructing a supervision graph G_i of D_i, and constructing an adjacency matrix A_i of D_i based on the document topic distribution of D_i and the sentence topic distributions of S_i; Step 6, performing graph contrastive learning on the supervision graph G_i and its adjacency matrix A_i, so as to construct a graph contrastive loss function L_con; Step 7, constructing, by formula (14), a total loss function L of the text abstract network formed by the hierarchical encoder network, the neural topic model and the graph neural network, and performing end-to-end joint updating of all trainable parameters of the text abstract network with a gradient descent optimization algorithm, with the aim of minimizing the total loss L, until L converges, so as to obtain a trained text abstract model for generating a corresponding long text abstract for input document data; (14) in formula (14), η is a hyper-parameter. (Illustrative sketches of steps 2-7 are given after the claims.)
- 2. The method for generating the long text abstract based on hierarchical graph comparison subjects according to claim 1, wherein step 2 is performed as follows (see the encoder sketch after the claims): Step 2.1, the block-level Transformer encoder processes S_{i,j} to obtain a context representation of S_{i,j}, thereby obtaining the context representation sequence H_i of D_i and, further, the context representation set H of D; Step 2.2, the HDBSCAN clustering module performs semantic clustering on H to obtain clusters and assigns each cluster a unique discrete identifier as its cluster label; a clustering feature vector is then generated for each sentence according to the cluster label of the cluster to which that sentence in H belongs, and each sentence's clustering feature vector is fused with the context representation of the corresponding sentence to obtain an enhanced context representation of that sentence, thereby obtaining the enhanced context representations of all sentences in D_i and forming the enhanced sentence representation sequence of D_i; Step 2.3, the document-level Transformer encoder models the global context dependencies among all sentences in the enhanced sentence representation sequence to obtain the context-aware sentence representation matrix of D_i, and pools it to obtain the document representation of D_i.
- 3. The method for generating the long text abstract based on hierarchical graph comparison subjects according to claim 2, wherein step 3 is performed as follows (see the topic-inference sketch after the claims): Step 3.1, obtaining a context-hidden representation of document D_i using equation (1), in which R is a feed-forward neural network; Step 3.2, calculating the mean and covariance of the document topic distribution of D_i using equations (2) and (3), respectively, in which one feed-forward neural network computes the document topic distribution mean parameter, another feed-forward neural network computes the document topic distribution covariance parameter, and diag denotes constructing a diagonal matrix with the input vector as its diagonal elements; Step 3.3, obtaining the document topic distribution of D_i using equation (4), which involves the sampling noise variable of D_i and an activation function; Step 3.4, calculating the mean and covariance of the sentence topic distribution of S_{i,j} using equations (5) and (6), respectively, in which one feed-forward neural network computes the sentence topic distribution mean parameter, another feed-forward neural network computes the sentence topic distribution covariance parameter, and the context-aware representation of the j-th sentence is taken as input; Step 3.5, obtaining the sentence topic distribution of S_{i,j} using equation (7), which involves the noise variable of S_{i,j}.
- 4. The method for generating a long text abstract based on hierarchical graph comparison subjects according to claim 3, wherein step 4 is performed as follows (see the ELBO sketch after the claims): Step 4.1, predicting the probability of generating the sentence label of S_{i,j} using equation (8), which uses a feed-forward neural network with a sigmoid activation function; Step 4.2, obtaining the variational approximate posterior distribution of the sentence topics of S_{i,j} using equation (9), the distribution being a multivariate normal distribution; Step 4.3, obtaining the variational approximate posterior distribution of the document topics of D_i using equation (10); Step 4.4, constructing a topic prior distribution of D_i based on pre-set hyper-parameters; Step 4.5, constructing the evidence lower bound loss L_ELBO using equation (11), in which the expectation is taken under the variational posterior and D_KL(·||·) denotes the KL-divergence between two probability distributions.
- 5. The method for generating a long text abstract based on hierarchical graph comparison subjects as claimed in claim 4, wherein step 5 is performed as follows (see the supervision-graph sketch after the claims): Step 5.1, building a supervision graph G_i = {V_i, E_i} of D_i, wherein V_i denotes the node set of G_i, comprising the document node of D_i and the sentence nodes, and E_i denotes the edge set of G_i, including the edges between the document node and all sentence nodes; Step 5.2, using a greedy search algorithm that maximizes the ROUGE-2 score to select, from the sentence nodes, those with the greatest semantic similarity to the standard abstract, so as to obtain the key sentence node set of D_i; Step 5.3, if the j-th sentence node belongs to the key sentence node set, setting the j-th element value A_{i,j} of the adjacency matrix A_i to 1, and otherwise setting A_{i,j} to 0.
- 6. The method for generating a long text abstract based on hierarchical graph comparison subjects as claimed in claim 5, wherein step 6 is performed as follows (see the graph-contrast sketch after the claims): Step 6.1, taking the document topic distribution and the sentence topic distributions as the initial features of the document node and the sentence nodes of the graph neural network, respectively, thereby constructing the initial node feature set of the graph neural network; Step 6.2, inputting the initial node feature set and A_i into the graph neural network, and aggregating the topic distribution information of neighbouring nodes using equation (12) to obtain the enhanced topic embedding matrix of D_i; in equation (12), ReLU is the rectified linear unit activation function, the adjacency matrix includes self-connections, the degree matrix is a diagonal matrix whose diagonal values are the number of neighbours connected to each node, and W is the weight matrix of the graph neural network to be learned; Step 6.3, calculating the graph contrastive loss function L_con using equation (13); equation (13) involves the number of nodes, the indicator function 1(·), the cosine similarity cos, the enhanced sentence topic embedding vector of S_{i,j}, and the enhanced document topic embedding vector of D_i, the latter two being taken from the enhanced topic embedding matrix.
- 7. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the long text abstract generation method based on hierarchical graph comparison subjects according to any one of claims 1-6.
- 8. A computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the long text abstract generation method based on hierarchical graph comparison subjects according to any one of claims 1-6.
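Claim 2 gives no executable details, so the following is a minimal, hypothetical PyTorch sketch of the hierarchical encoder of step 2. The dimensions (d_model, nhead, layer counts), the mean pooling, the concatenation-plus-linear fusion, and the HDBSCAN settings are illustrative assumptions; clustering is also shown per document for brevity, whereas claim 2 clusters the context representations of the whole corpus D.

```python
# Hypothetical sketch of the hierarchical encoder in claim 2 (step 2). All names and
# hyper-parameters are illustrative assumptions, not the patent's implementation.
import numpy as np
import torch
import torch.nn as nn
import hdbscan

d_model, nhead = 256, 4

block_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
block_encoder = nn.TransformerEncoder(block_layer, num_layers=2)   # block-level (sentence) encoder
doc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
doc_encoder = nn.TransformerEncoder(doc_layer, num_layers=2)       # document-level encoder
fuse = nn.Linear(2 * d_model, d_model)                             # fuse context + cluster features

def encode_document(sentence_token_embs):
    """sentence_token_embs: list of (num_tokens, d_model) tensors, one per sentence S_{i,j}."""
    # Step 2.1: block-level Transformer gives one context vector per sentence (mean over tokens).
    ctx = torch.stack([block_encoder(t.unsqueeze(0)).mean(dim=1).squeeze(0)
                       for t in sentence_token_embs])               # (J, d_model)

    # Step 2.2: HDBSCAN clusters the sentence vectors; each sentence gets a discrete cluster label.
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(ctx.detach().numpy())
    cluster_emb = nn.Embedding(labels.max() + 2, d_model)           # +2 so noise label -1 maps to index 0
    cluster_feats = cluster_emb(torch.tensor(labels + 1))           # (J, d_model)
    enhanced = fuse(torch.cat([ctx, cluster_feats], dim=-1))        # enhanced sentence representations

    # Step 2.3: document-level Transformer models cross-sentence dependencies, then pooling.
    H = doc_encoder(enhanced.unsqueeze(0)).squeeze(0)               # context-aware sentence matrix of D_i
    d_repr = H.mean(dim=0)                                          # pooled document representation of D_i
    return H, d_repr

# toy usage: a document with 5 sentences of 12 tokens each
H, d = encode_document([torch.randn(12, d_model) for _ in range(5)])
```

The returned pair corresponds to the context-aware sentence representation matrix and the pooled document representation of D_i that steps 2.3 and 3 refer to.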
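A minimal sketch of the topic inference of claim 3 (step 3), assuming diagonal-Gaussian reparameterization with a softmax over topics, which is standard for neural topic models. The feed-forward networks, the tanh activation and the topic count are illustrative; the patent's exact equations (1)-(7) are not reproduced in the extracted text and may differ.

```python
# Hypothetical sketch of the neural-topic-model inference in claim 3 (step 3).
import torch
import torch.nn as nn

d_model, n_topics = 256, 50

R = nn.Linear(d_model, d_model)           # step 3.1: context-hidden representation of D_i
f_mu_d = nn.Linear(d_model, n_topics)     # step 3.2: document topic distribution mean
f_sigma_d = nn.Linear(d_model, n_topics)  # step 3.2: document topic distribution (log-)covariance diagonal
f_mu_s = nn.Linear(d_model, n_topics)     # step 3.4: sentence topic distribution mean
f_sigma_s = nn.Linear(d_model, n_topics)  # step 3.4: sentence topic distribution (log-)covariance diagonal

def infer_topics(doc_repr, sent_reprs):
    """doc_repr: (d_model,) document representation; sent_reprs: (J, d_model) sentence matrix."""
    h = torch.tanh(R(doc_repr))                                              # eq. (1), activation assumed
    mu_d, logvar_d = f_mu_d(h), f_sigma_d(h)                                 # eqs. (2)-(3)
    eps = torch.randn_like(mu_d)                                             # sampling noise variable of D_i
    theta_d = torch.softmax(mu_d + torch.exp(0.5 * logvar_d) * eps, dim=-1)  # eq. (4)

    mu_s, logvar_s = f_mu_s(sent_reprs), f_sigma_s(sent_reprs)               # eqs. (5)-(6)
    eps_s = torch.randn_like(mu_s)                                           # noise variables of S_{i,j}
    theta_s = torch.softmax(mu_s + torch.exp(0.5 * logvar_s) * eps_s, dim=-1)  # eq. (7)
    return theta_d, theta_s, (mu_d, logvar_d), (mu_s, logvar_s)

theta_d, theta_s, d_stats, s_stats = infer_topics(torch.randn(d_model), torch.randn(5, d_model))
```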
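A hedged sketch of the evidence lower bound loss of claim 4 (step 4). The standard-normal prior (step 4.4 only says the prior is built from pre-set hyper-parameters), the binary-cross-entropy form of the sentence-label likelihood, and the summed KL terms are assumptions of this sketch; equations (8)-(11) themselves are not available in the text.

```python
# Hypothetical sketch of the ELBO loss in claim 4 (step 4), under diagonal-Gaussian posteriors.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

n_topics = 50
label_net = nn.Sequential(nn.Linear(n_topics, 1), nn.Sigmoid())   # eq. (8): p(y_{i,j} | sentence topics)

def elbo_loss(theta_s, y, mu_d, logvar_d, mu_s, logvar_s):
    # expected log-likelihood of the real sentence labels y_{i,j}
    p_y = label_net(theta_s).squeeze(-1)
    log_lik = -nn.functional.binary_cross_entropy(p_y, y, reduction="sum")

    # KL terms for the sentence-level (eq. 9) and document-level (eq. 10) variational posteriors
    prior = Normal(torch.zeros(n_topics), torch.ones(n_topics))    # step 4.4: assumed standard-normal prior
    kl_d = kl_divergence(Normal(mu_d, torch.exp(0.5 * logvar_d)), prior).sum()
    kl_s = kl_divergence(Normal(mu_s, torch.exp(0.5 * logvar_s)), prior).sum()

    return -(log_lik - kl_d - kl_s)                                # minimize the negative ELBO (eq. 11)

# toy usage with shapes matching the topic-inference sketch above
loss = elbo_loss(torch.softmax(torch.randn(5, n_topics), -1), torch.tensor([1., 0., 1., 0., 0.]),
                 torch.randn(n_topics), torch.randn(n_topics),
                 torch.randn(5, n_topics), torch.randn(5, n_topics))
```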
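An illustrative implementation of the supervision-graph construction of claim 5 (step 5). The greedy ROUGE-2 selection follows the claim, but the bigram-recall approximation of ROUGE-2 and the fixed selection budget are assumptions; only the key-sentence selection and the 0/1 adjacency vector of step 5.3 are shown.

```python
# Hypothetical sketch of step 5: greedy key-sentence selection against the standard abstract,
# then a 0/1 adjacency vector A_i marking the key sentence nodes.

def bigrams(text):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def rouge2(candidate, reference):
    """Simple bigram-recall approximation of ROUGE-2, not an official scorer."""
    ref = bigrams(reference)
    if not ref:
        return 0.0
    return len(bigrams(candidate) & ref) / len(ref)

def build_supervision_graph(sentences, reference_abstract, max_sentences=3):
    selected, score = [], 0.0
    while len(selected) < max_sentences:
        best_j, best_score = None, score
        for j, s in enumerate(sentences):
            if j in selected:
                continue
            cand = " ".join(sentences[k] for k in selected + [j])
            r = rouge2(cand, reference_abstract)
            if r > best_score:
                best_j, best_score = j, r
        if best_j is None:              # no remaining sentence improves the ROUGE-2 score
            break
        selected.append(best_j)
        score = best_score
    # step 5.3: A_i[j] = 1 if sentence j is a key sentence node, else 0
    A = [1 if j in selected else 0 for j in range(len(sentences))]
    return selected, A

sents = ["the model encodes each sentence", "topic distributions are inferred",
         "graph contrastive learning removes redundancy", "unrelated filler sentence"]
ref = "sentence encoding and topic distributions feed graph contrastive learning"
print(build_supervision_graph(sents, ref))
```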
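A hedged sketch of step 6 (claim 6): one graph-convolution layer aggregates neighbouring topic distributions, and a contrastive loss pulls the enhanced embeddings of key sentence nodes toward the document node while pushing the others away. The symmetric degree normalization, the InfoNCE-style loss with temperature tau, and the star adjacency in the toy example are assumptions; equations (12)-(13) are not reproduced in the text.

```python
# Hypothetical sketch of the graph aggregation (eq. 12) and graph contrastive loss (eq. 13).
import torch
import torch.nn as nn
import torch.nn.functional as F

def gcn_layer(X, A, W):
    """X: (N, k) node topic features; A: (N, N) adjacency with self-loops already added."""
    deg = A.sum(dim=1)                                          # diagonal degree values (neighbour counts)
    D_inv_sqrt = torch.diag(deg.clamp(min=1).pow(-0.5))
    return torch.relu(D_inv_sqrt @ A @ D_inv_sqrt @ X @ W)      # symmetric normalization assumed

def graph_contrast_loss(Z, key_mask, tau=0.5):
    """Z: (N, d) enhanced embeddings; node 0 is the document node, nodes 1..N-1 the sentences.
    key_mask: (N-1,) booleans marking key sentence nodes (from the adjacency vector A_i)."""
    doc, sents = Z[0], Z[1:]
    sims = F.cosine_similarity(sents, doc.unsqueeze(0), dim=-1) / tau
    # InfoNCE-style: key sentences are positives of the document node, the rest are negatives
    pos = torch.logsumexp(sims[key_mask], dim=0)
    all_ = torch.logsumexp(sims, dim=0)
    return -(pos - all_)

# toy usage: 1 document node + 4 sentence nodes, 50 topics, 32-dim embeddings
N, k, d = 5, 50, 32
A = torch.eye(N)
A[0, 1:] = A[1:, 0] = 1.0                                       # document node linked to every sentence node
W = nn.Parameter(torch.randn(k, d) * 0.1)
Z = gcn_layer(torch.softmax(torch.randn(N, k), -1), A, W)
loss = graph_contrast_loss(Z, torch.tensor([True, True, False, False]))
```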
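The body of formula (14) appears only as an image in the source text. A plausible reconstruction, consistent with claim 1's description of a total loss that combines the evidence lower bound loss L_ELBO and the graph contrastive loss L_con with a single hyper-parameter η, is sketched below; the exact form in the patent may differ.

```latex
% Hedged reconstruction of formula (14); the plain weighted sum is an assumption
% based on the claim wording (total loss = ELBO loss + graph contrastive loss, weight eta).
\[
  L \;=\; L_{\mathrm{ELBO}} \;+\; \eta\, L_{\mathrm{con}}
\]
```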
Description
Long text abstract generation method based on hierarchical graph comparison subject
Technical Field
The invention belongs to the field of text abstract generation, and particularly relates to a long text abstract generation method based on hierarchical graph comparison subjects.
Background
Automatic text summarization techniques aim to automatically compress lengthy raw text into a brief, fluent summary of its core information. Currently, pre-trained language models represented by BERT, RoBERTa, etc. have become the mainstream technical framework in this field. However, due to the inherent limitations of the attention mechanism and of the context length, when these models process documents such as book chapters, long reports and academic papers, the long-distance semantic dependencies between sentences and paragraphs are difficult to model effectively, so that the overall themes and logical structure of the full text cannot be accurately captured. This directly results in generated abstracts that are prone to key problems such as insufficient topic coverage, broken logical continuity, and repeated, redundant information. To overcome the above limitation, some prior art introduces a neural topic model in order to characterize the global topic distribution of a document from the perspective of probabilistic generation and to incorporate it as auxiliary information into the summary generation process. However, this combined scheme still has obvious defects. First, the model fuses local context features and global topic features by simple concatenation or shallow interaction, and cannot achieve effective synergy and balance between them at a deep semantic level, so the abstract struggles to account for both local details and the overall topic. Second, such methods are mostly limited to topic analysis within a single document and lack explicit modeling of cross-document semantic association. When a multi-document summarization task is processed, or when contradictory content needs to be corrected with external knowledge, the model is easily misled by high-frequency words or one-sided topics in the document and cannot select truly key, non-redundant information from a wider semantic view.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a long text abstract generation method based on hierarchical graph comparison subjects, so that the topic consistency and information coverage of long text abstracts can be effectively improved and redundancy reduced, thereby providing reliable technical support for efficient information acquisition and accurate extraction of key content, and assisting knowledge management, information analysis and intelligent decision-making in practical scenarios such as news aggregation, academic literature analysis, and the interpretation of long enterprise reports.
In order to achieve the aim of the invention, the invention adopts the following technical scheme. The invention discloses a long text abstract generation method based on hierarchical graph comparison subjects, which comprises the following steps: Step 1, obtaining original document data to be processed, and performing stop-word removal and word segmentation to obtain a text dataset D = {D_1, ..., D_i, ..., D_M}, wherein D_i is the i-th document in D and M is the total number of documents; dividing D_i into a sequence of J consecutive sentences S_i = {S_{i,1}, ..., S_{i,j}, ..., S_{i,J}}, wherein S_{i,j} is the j-th sentence in D_i, J is the total number of sentences, and S_{i,j} corresponds to a real classification label y_{i,j}; constructing a bag-of-words feature vector F_i of D_i; Step 2, constructing a hierarchical encoder network, comprising a block-level Transformer encoder, an HDBSCAN clustering module and a document-level Transformer encoder, and processing S_i to obtain a globally context-aware sentence representation matrix and a document representation of D_i; Step 3, inferring the document topic distribution of D_i and the sentence topic distributions of S_i through a neural topic model, based on the sentence representation matrix and the document representation; Step 4, constructing an evidence lower bound loss L_ELBO based on the document topic distribution and the sentence topic distributions; Step 5, constructing a supervision graph G_i of D_i, and constructing an adjacency matrix A_i of D_i based on the document topic distribution of D_i and the sentence topic distributions of S_i; Step 6, performing graph contrastive learning on the supervision graph G_i and its adjacency matrix A_i, so as to construct a graph contrastive loss function L_con; and Step 7, constructing, by formula (14), a total loss function L of the text abstract network formed by the hierarchical encoder network, the neural topic model and the graph neural network, and performing end-to-end joint updating of all trainable parameters of the text abstract network with a gradient descent optimization algorithm, with the aim of minimizing the total loss L, until L converges, so as to obtain a trained text abstract model for generating a corresponding long text abstract for input document data; in formula (14), η is a hyper-parameter.
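To make the end-to-end joint update of step 7 concrete, here is a minimal, self-contained training-step sketch. The additive combination of the two losses weighted by η, and the choice of the Adam optimizer and learning rate, are assumptions, since formula (14) and the training details are not given in the extracted text.

```python
# Minimal sketch of the joint objective of step 7, assuming L = L_ELBO + eta * L_con.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                              # stand-in for the full text-abstract network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
eta = 0.5                                            # hyper-parameter eta, illustrative value

def training_step(L_elbo: torch.Tensor, L_con: torch.Tensor) -> float:
    loss = L_elbo + eta * L_con                      # assumed form of formula (14)
    optimizer.zero_grad()
    loss.backward()                                  # end-to-end joint update of all parameters
    optimizer.step()
    return loss.item()

# toy usage with dummy differentiable losses
x = model(torch.randn(4, 8))
print(training_step(x.pow(2).mean(), x.abs().mean()))
```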