CN-115186089-B - Automatic abstract generation method based on core sampling

Abstract

The invention discloses an automatic abstract generation method based on core sampling. The method uses the UniLM model, so that the semantic information of the context can be combined when the original text is read, which gives a better semantic understanding effect. When the abstract is generated, it is generated unidirectionally through a Mask mechanism, which matches the logic of text continuation. Because the UniLM model is pre-trained on multiple tasks, the method has stronger generalization capability. When UniLM is used for decoding, a core sampling function is adopted and a Mask matrix is constructed according to that function, so that each word to be generated is randomly sampled within a limited range. This mainly solves the problem of repeated text in the generated abstract.

Inventors

  • XU WENBO
  • SUN JINGZHE
  • HE XIXIU
  • LI JIAN
  • LIU BOWEN
  • HU JIALI

Assignees

  • University of Electronic Science and Technology of China

Dates

Publication Date
20260421
Application Date
20220711
Priority Date
20220711

Claims (3)

  1. An automatic abstract generation method based on core sampling, characterized by comprising the following steps:
     step 1, using a microblog abstract data set, performing data cleaning based on the UniLM pre-training model, and dividing the data into a training set and a test set;
     step 2, constructing a Mask matrix suitable for Seq2Seq;
     step 3, fine-tuning the parameters of the UniLM language model (Fine-Tuning);
     step 4, core sampling decoding:
       a) first, according to the input temperature parameter t, reshaping the original probability distribution of the Token by a formula;
       b) next, based on the input threshold p, defining the Top-p vocabulary as the minimum vocabulary set that meets the threshold condition;
       c) then, rescaling the initial conditional probability distribution to a new distribution, which takes the rescaled value when the Token belongs to the Top-p vocabulary and 0 otherwise, so as to construct the core sampling function;
       d) finally, sorting the Tokens by their probability under the new distribution from large to small and accumulating them one by one to form a candidate Token set, and randomly sampling from the candidate Token set to obtain the predicted Token;
     step 5, for the abstract part that needs to be predicted, taking each Token as a unit and cyclically performing core sampling decoding to form an abstract generation model;
     step 6, inputting the test set after data cleaning into the abstract generation model to obtain the abstract result.
  2. The method of claim 1, wherein the trapezoidal Mask matrix in step 2 is obtained by splicing a lower triangular Mask matrix and another Mask matrix, so that the model can see the bidirectional context information when reading the original text and generates unidirectionally from front to back when generating the abstract.
  3. The method of claim 1, wherein in step 5 the entire text data to be processed is input into the model and, during the cyclic prediction process, the model uses each generated Token, through the Mask mechanism of Seq2Seq, as the input basis for the subsequent decoding, until a complete abstract text is generated.
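
As an illustration of step 4 in claim 1, the following Python sketch walks through sub-steps a) to d). The formulas of the claim are not reproduced in this text, so the sketch assumes the usual formulation of temperature-scaled top-p sampling: a softmax of the scores divided by t, followed by renormalization of the retained probabilities by their sum. The function names `core_distribution` and `core_sample`, the use of raw scores (logits) as input, and the default values of p and t are assumptions made for illustration and do not come from the patent.

```python
import math
import random


def core_distribution(logits, p=0.9, t=1.0):
    """Return {token_index: rescaled_probability} for the top-p candidate set.

    Assumed formulas (standard top-p sampling, not quoted from the patent):
      P_t(x) = exp(u_x / t) / sum_j exp(u_j / t)
      P'(x)  = P_t(x) / sum_{y in V_p} P_t(y)  if x is in V_p, else 0
    """
    if not 0.0 < p <= 1.0 or t <= 0.0:
        raise ValueError("need 0 < p <= 1 and t > 0")

    # a) Reshape the distribution with the temperature parameter t.
    scaled = [u / t for u in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # b) Sort from large to small and accumulate one by one,
    #    stopping once the running sum exceeds the threshold p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    candidates, mass = [], 0.0
    for i in order:
        candidates.append(i)
        mass += probs[i]
        if mass > p:
            break

    # c) Rescale inside the candidate set; every other token implicitly gets 0.
    return {i: probs[i] / mass for i in candidates}


def core_sample(logits, p=0.9, t=1.0, rng=random):
    """d) Randomly draw one token index from the candidate set."""
    dist = core_distribution(logits, p, t)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[i] for i in tokens], k=1)[0]
```

In a decoding loop such as the one in step 5, a function like `core_sample` would be called once per generated Token, with the scores the model produces for the next position.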

Description

Automatic abstract generation method based on core sampling

Technical Field

The invention relates to the field of automatic abstract generation, in particular to a method and system for automatically generating Chinese abstracts based on NLP technology.

Background

Text summarization refers to extracting, summarizing or refining the key information of a text or a set of texts through various techniques, so as to summarize and present the main content or meaning of the original text. Text summarization, and generative text summarization in particular, has long been a difficult research problem in natural language processing because it requires deep natural language capabilities (natural language understanding, natural language generation, etc.). The text summarization techniques in common use can be divided into two types: extractive text summarization and generative text summarization.

Extractive text summarization, as the name implies, extracts one or several sentences from a document or document set to form the abstract. A score is calculated for each sentence in the document to represent its importance, with a higher score indicating a more important sentence. Several high-scoring sentences are then selected in order to form the abstract, whose length depends on the compression rate. This approach is simple and practical, and the abstract does not depart entirely from the document itself. Despite these advantages, it can produce incoherent abstracts, offers poor control over word count, and may leave the target sentences ambiguous; it can even be said that the quality of the abstract is determined by the original text.
Generative text summarization does not have these problems. It does not simply assemble the abstract from words or phrases of the original document; instead, it captures the main ideas of the original document and then expresses them in different wording. To convey the main points of the original document, a generative abstract may reuse phrases and sentences from it, but on the whole the abstract has to be expressed in the summarizer's own words. Generative summarization analyzes the grammar and semantics of the original document with natural language understanding technology, fuses the information, and then generates a new text abstract with natural language generation technology.

Disclosure of Invention

The main aim of the invention is to provide an automatic abstract generation method based on core sampling, so as to solve the problem of repeated text in the generated abstract.
In order to achieve the above object, according to one aspect of the present invention, there is provided an automatic abstract generation method based on core sampling, including the following steps:
step 1, using a microblog abstract data set, performing data cleaning based on the UniLM pre-training model, and dividing the data into a training set and a test set;
step 2, constructing a Mask matrix suitable for Seq2Seq;
step 3, fine-tuning the parameters of the UniLM language model (Fine-Tuning);
step 4, core sampling decoding:
  a) constructing a core sampling function according to an input threshold p and a temperature parameter t;
  b) constructing a core sampling Mask matrix according to the core sampling function;
  c) according to the Mask matrix, sorting the probabilities of the Tokens from large to small and accumulating them one by one until the accumulated probability is greater than the threshold p, then stopping the accumulation to form a candidate Token set;
  d) randomly sampling from the candidate Token set to obtain the predicted Token;
step 5, for the abstract part that needs to be predicted, taking each Token as a unit and cyclically performing core sampling decoding to form an abstract generation model;
step 6, inputting the test set after data cleaning into the abstract generation model to obtain the abstract result.

The Mask matrix of step 2 is specifically as follows: a lower triangular Mask matrix and another Mask matrix are spliced to obtain a trapezoidal Mask matrix. With only this one Mask mechanism, and without changing the basic BERT architecture, the trapezoidal Mask matrix gives BERT both reading-comprehension capability and generation capability.
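
The trapezoidal Mask matrix can be pictured with a small Python sketch. This section does not spell out how the two matrices are spliced, so the layout below follows the commonly described UniLM-style Seq2Seq attention mask and should be read as an assumption: every token may attend to the whole original text, and each abstract token may additionally attend to itself and to earlier abstract tokens. The function name `seq2seq_mask` and the 0/1 list-of-lists representation are hypothetical choices for illustration.

```python
def seq2seq_mask(source_len, target_len):
    """Build an n x n 0/1 matrix, n = source_len + target_len.

    mask[i][j] == 1 means the token at position i may attend to position j.
    Positions 0..source_len-1 hold the original text; the rest hold the abstract.
    """
    n = source_len + target_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < source_len:
                # Every token sees the whole original text (bidirectional part).
                mask[i][j] = 1
            elif i >= source_len and j <= i:
                # Abstract tokens see themselves and earlier abstract tokens only
                # (lower triangular part, front-to-back generation).
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Two source tokens followed by two abstract tokens.
    for row in seq2seq_mask(2, 2):
        print(row)
```

Read row by row, the ones widen from the source width to the full width, which gives the trapezoid shape: the source rows never see the abstract, while each abstract row sees one more position than the row above it.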
Unlike a purely generative GPT-type model, the method lets the model see the bidirectional context information when reading the original text, which gives it stronger comprehension and summarizing ability than GPT, while the abstract is still generated unidirectionally from front to back, so the basic logic of generation is not broken. Step 4 is specifically