CN-115841114-B - Short text abstract extraction method based on words
Abstract
The invention belongs to the technical field of text summary extraction in the NLP field and discloses a word-based short text summary extraction method comprising an extraction model and a word-order model. The word-order model comprises the following steps: first, data labelling; S1.1, selecting 500,000 short dialogue texts from a telemarketing scenario and manually checking and correcting the word order of each sentence; S1.2, segmenting with jieba, then enumerating all word combinations of each sentence and labelling them 0 (the combination in the original word order is labelled 1), and finally manually checking and correcting all labelled data; second, preprocessing the data. In the extraction model's network structure, the invention uses a dilated convolutional neural network (Dilated Convolution Neural Network, DCNN) with dilation rates set in the order 1, 2, 4, 1, so that the model captures as much textual information as possible; the added word-order model improves the validity and coherence of the summaries and solves the problem of incoherent summaries.
Inventors
- SHAO ZHUQUAN
Assignees
- 上海绘话智能科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20221207
Claims (4)
- 1. A word-based short text summary extraction method, comprising an extraction model and a word-order model, characterized in that the word-order model comprises the following steps: Step 1, data labelling; S1.1, selecting 500,000 short dialogue texts from a telemarketing scenario and manually checking and correcting the word order of each sentence; S1.2, segmenting each sentence with jieba and enumerating all word combinations of the sentence, labelling the combination in the original word order 1 and all other combinations 0, and finally manually checking and correcting all labelled data; Step 2, performing the first data preprocessing; Step 3, building model I, which reduces the high-dimensional data to 2 dimensions; Step 4, evaluating model I; S4.1, during training, running the validation set once for every 100 batches of the training set and computing the F1-score and loss between the predicted and true labels of the validation set; S4.2, when the F1-score has not improved after 10 batches, ending training early, taking the last saved model as the optimal model, running the test set through it, and taking the resulting F1-score as the model's score; the extraction model comprises the following steps: Step 1, data labelling, namely selecting 200,000 short dialogue texts from a telemarketing scenario and manually writing a concise summary of each original text; Step 2, performing the second data preprocessing; Step 3, building model II, which captures data information with a dilated convolutional neural network whose dilation rates are set in the order 1, 2, 4, 1; Step 4, evaluating model II; S4.1, setting a threshold to judge whether the current word belongs to the extracted summary: if its score exceeds the threshold, the word is extracted, otherwise it is discarded; S4.2, feeding all words above the threshold, in all exhaustively enumerated combinations, into the word-order model and concatenating the highest-scoring word sequence as the summary; S4.3, computing ROUGE scores between the summary and the original text and evaluating the current model by the average of all ROUGE scores, where the closer the score is to 1 the better; model I has the following network structure: first layer, Embedding; second layer, BiLSTM(); third layer, Linear(); where Embedding is a vector representation layer, the second layer is a bidirectional long short-term memory network (BiLSTM), and Linear() is a linear layer (a hedged sketch of this network follows the claims).
- 2. The word-based short text summary extraction method of claim 1, wherein the first data preprocessing comprises the following specific steps: Step 1, building a dictionary, taking all segmented and de-duplicated words of the training set as the dictionary; Step 2, unifying the text dimension, padding all texts to a uniform max_length with the <PAD> symbol, where max_length is the maximum number of words in a training-set sentence; Step 3, representing the training data as indices according to the dictionary, iterating over each sentence of text and looking each word up in the dictionary: if present, its dictionary index is used, otherwise the index of <UNK>, which represents all out-of-vocabulary words; Step 4, generating a data iterator that splits the data into batches of 64 and feeds them to the model for training, moving the data onto the GPU while the iterator is generated (a hedged sketch follows the claims).
- 3. The word-based short text summary extraction method of claim 1, wherein the second data preprocessing comprises the following steps: Step 1, finding the indices of the manually written summary words in the original text: first segmenting the summary and the original text separately with jieba, then locating the central position of the summary in the original text with a sliding-window method, matching the summary outward from that centre to both sides, labelling a word 1 on a successful match and 0 otherwise, and finally producing a summary label for each sentence of the original text (a hedged sketch follows the claims); Step 2, building the dictionary and Embedding: the first 200,000 words of the Tencent word-vector dictionary and the de-duplicated words of the training set together form the dictionary; if a training-set word appears in the Tencent word-vector dictionary, its word vector is used, otherwise it is randomly initialized as a 200-dimensional word vector; Step 3, unifying the text dimension: after segmentation, padding all texts to a uniform max_length with the <PAD> symbol, where max_length is the maximum number of words in a training-set sentence; Step 4, generating a data iterator: for deep learning, the data are split into batches of 32 and fed to the model for training, and the data are moved onto the GPU while the iterator is generated.
- 4. The word-based short text summary extraction method of claim 1, wherein model II has the following network structure: first layer, Embedding; second layer, Linear(); third layer, Dropout(0.5); fourth layer, DCNN(dilation_rate=1); fifth layer, Dropout(0.5); sixth layer, DCNN(dilation_rate=2); seventh layer, Dropout(0.5); eighth layer, DCNN(dilation_rate=4); ninth layer, Dropout(0.5); tenth layer, DCNN(dilation_rate=1); eleventh layer, Dropout(0.5); twelfth layer, Linear(); thirteenth layer, Sigmoid(); where Embedding is a vector representation layer, Linear() is a fully connected layer, Dropout is a dropout layer that randomly deactivates 50% of the network nodes, each DCNN layer is followed by a dropout layer, DCNN is a dilated convolutional neural network, and dilation_rate is the dilation rate (a hedged sketch follows the claims).
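A minimal sketch of the claim-1 word-order network (model I: Embedding, BiLSTM, Linear, reducing to 2 output dimensions). This assumes PyTorch; vocab_size, embed_dim and hidden_dim are illustrative values the patent does not specify.

```python
# Hedged sketch of model I: Embedding -> BiLSTM -> Linear, mapping a
# candidate word sequence to 2 classes (in original order or not).
import torch
import torch.nn as nn

class WordOrderModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden_dim, 2)  # 2 classes: ordered / not

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq, embed_dim)
        _, (hidden, _) = self.bilstm(embedded)      # final state per direction
        features = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.linear(features)                # (batch, 2) logits
```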
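The first data preprocessing of claim 2 (dictionary, <PAD>/<UNK> handling, batches of 64 on the GPU) could be sketched as below. jieba is named in the patent; the helper names and the "cuda" device string are illustrative assumptions.

```python
# Hedged sketch of the first data preprocessing in claim 2.
import jieba
import torch

def build_vocab(train_texts):
    # All segmented, de-duplicated training-set words form the dictionary.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for text in train_texts:
        for word in jieba.lcut(text):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length):
    # Look each word up; unknown words map to <UNK>, then pad to max_length.
    ids = [vocab.get(w, vocab["<UNK>"]) for w in jieba.lcut(text)][:max_length]
    return ids + [vocab["<PAD>"]] * (max_length - len(ids))

def iterate_batches(texts, vocab, max_length, batch_size=64, device="cuda"):
    # Yield batches of 64, already moved onto the GPU.
    for start in range(0, len(texts), batch_size):
        batch = [encode(t, vocab, max_length)
                 for t in texts[start:start + batch_size]]
        yield torch.tensor(batch, device=device)
```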
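The sliding-window label generation of claim 3 might look like the following sketch. The patent does not spell out the window heuristic, so the overlap-based centre search here is an assumption standing in for its "central position".

```python
# Hedged sketch of the claim-3 label generation: segment summary and source
# with jieba, locate the summary's centre in the source with a sliding
# window, then scan outward from that centre labelling matched words 1.
import jieba

def make_labels(source, summary):
    src_words = jieba.lcut(source)
    sum_words = set(jieba.lcut(summary))
    window = max(1, len(sum_words))

    # Sliding window: the start whose window overlaps the summary most
    # stands in for the patent's "central position" (assumption).
    n_starts = max(1, len(src_words) - window + 1)
    best = max(range(n_starts),
               key=lambda i: sum(w in sum_words for w in src_words[i:i + window]))

    labels = [0] * len(src_words)
    for i in range(best, len(src_words)):       # match rightward from the centre
        if src_words[i] in sum_words:
            labels[i] = 1
    for i in range(best - 1, -1, -1):           # match leftward from the centre
        if src_words[i] in sum_words:
            labels[i] = 1
    return src_words, labels
```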
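The claim-4 network (model II) maps word indices to a per-word extraction score through four dilated convolutions. A hedged PyTorch sketch follows; channel sizes and the kernel width are illustrative assumptions, and "same" padding keeps the sequence length so every word keeps a score.

```python
# Hedged sketch of model II from claim 4: Embedding, Linear, then four
# dilated 1-D convolutions (dilation 1, 2, 4, 1) each followed by
# Dropout(0.5), then Linear and Sigmoid for a per-word extraction score.
import torch.nn as nn

class DCNNExtractor(nn.Module):
    def __init__(self, vocab_size, embed_dim=200, channels=128, kernel=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.proj = nn.Linear(embed_dim, channels)
        self.dropout = nn.Dropout(0.5)      # randomly deactivates 50% of nodes
        self.convs = nn.ModuleList([
            # "Same" padding d * (kernel - 1) // 2 keeps the sequence length.
            nn.Conv1d(channels, channels, kernel,
                      dilation=d, padding=d * (kernel - 1) // 2)
            for d in (1, 2, 4, 1)           # the claimed dilation schedule
        ])
        self.out = nn.Linear(channels, 1)

    def forward(self, token_ids):
        x = self.dropout(self.proj(self.embedding(token_ids)))
        x = x.transpose(1, 2)               # (batch, channels, seq) for Conv1d
        for conv in self.convs:
            x = self.dropout(conv(x).relu())
        x = x.transpose(1, 2)               # back to (batch, seq, channels)
        return self.out(x).sigmoid().squeeze(-1)  # per-word score in (0, 1)
```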
Description
Short text abstract extraction method based on words

Technical Field
The invention belongs to the technical field of text summary extraction in the NLP field, and particularly relates to a word-based short text summary extraction method.

Background
The goal of a text summary is to compress, generalize, and condense a long text into a short text that conveys its gist. By the number of source documents, the summarization task can be divided into single-document and multi-document summarization; by method, it can be divided into two main types, extractive (Extractive) and abstractive (Abstractive). An extractive summary is taken directly from the original text, whereas an abstractive summary is generated word by word; compared with the extractive method, the abstractive method, owing to its inherent characteristics, cannot summarize the content of the original text both concisely and coherently.

Disclosure of Invention
The invention aims to provide a word-based short text summary extraction method that solves the problems identified in the background above. To achieve this, the invention provides a word-based short text summary extraction method comprising an extraction model and a word-order model, wherein the word-order model comprises the following steps: Step 1, data labelling; S1.1, selecting 500,000 short dialogue texts from a telemarketing scenario and manually checking and correcting the word order of each sentence; S1.2, segmenting each sentence with jieba and enumerating all word combinations of the sentence, labelling the combination in the original word order 1 and all other combinations 0, and finally manually checking and correcting all labelled data; Step 2, performing the first data preprocessing; Step 3, building model I, which reduces the high-dimensional data to 2 dimensions; Step 4, evaluating model I; S4.1, during training, running the validation set once for every 100 batches of the training set and computing the F1-score and loss between the predicted and true labels of the validation set; S4.2, when the F1-score has not improved after 10 batches, ending training early, taking the last saved model as the optimal model, running the test set through it, and taking the resulting F1-score as the model's score. The extraction model comprises the following steps: Step 1, data labelling, namely selecting 200,000 short dialogue texts from a telemarketing scenario and manually writing a concise summary of each original text; Step 2, performing the second data preprocessing; Step 3, building model II, which captures data information with a dilated convolutional neural network whose dilation rates are set in the order 1, 2, 4, 1; Step 4, evaluating model II; S4.1, setting a threshold to judge whether the current word belongs to the extracted summary: if its score exceeds the threshold, the word is extracted, otherwise it is discarded; S4.2, feeding all words above the threshold, in all exhaustively enumerated combinations, into the word-order model and concatenating the highest-scoring word sequence as the summary (sketched below); S4.3, computing ROUGE scores between the summary and the original text and evaluating the current model by the average of all ROUGE scores, where the closer the score is to 1 the better.
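Below is a hedged sketch of the model-II inference just described: threshold the per-word scores, exhaust the combinations of the surviving words, rerank them with the word-order model, and keep the best sequence. The 0.5 threshold, the vocab/<UNK> handling, and the use of itertools.permutations to "exhaust all combinations" are assumptions about details the patent leaves open.

```python
# Hedged sketch of the inference in model evaluation II.
from itertools import permutations
import torch

def extract_summary(words, scores, order_model, vocab, threshold=0.5):
    kept = [w for w, s in zip(words, scores) if s > threshold]
    if not kept:
        return ""
    best, best_score = list(kept), float("-inf")
    # Exponential in len(kept); workable only because summaries are short.
    for cand in permutations(kept):
        ids = torch.tensor([[vocab.get(w, vocab["<UNK>"]) for w in cand]])
        with torch.no_grad():
            score = order_model(ids).softmax(-1)[0, 1].item()  # P(natural order)
        if score > best_score:
            best, best_score = list(cand), score
    return "".join(best)                   # concatenate the best word sequence
```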
Preferably, the first data preprocessing comprises the following specific steps: Step 1, building a dictionary, taking all segmented and de-duplicated words of the training set as the dictionary; Step 2, unifying the text dimension, padding all texts to a uniform max_length with the <PAD> symbol, where max_length is the maximum number of words in a training-set sentence; Step 3, representing the training data as indices according to the dictionary, iterating over each sentence of text and looking each word up in the dictionary: if present, its dictionary index is used, otherwise the index of <UNK>, which represents all out-of-vocabulary words; Step 4, generating a data iterator that splits the data into batches of 64 and feeds them to the model for training, moving the data onto the GPU while the iterator is generated.

Preferably, model I has the following network structure: first layer, Embedding; second layer, BiLSTM(); third layer, Linear(); where Embedding is a vector representation layer, the second layer is a bidirectional long short-term memory network (BiLSTM), and Linear() is a linear layer.

Preferably, the second data preprocessing comprises the following specific steps: