CN-115730587-B - Text feature extraction method based on NGU language model
Abstract
The invention relates to the technical field of natural language processing, in particular to a text feature extraction method based on an NGU language model. The patent aims to optimize and improve the GRU model, which is based on a recurrent neural network, and proposes a new text feature extraction model, the NGU language model: a normalization mechanism is introduced into the gating unit of the GRU, the hyperbolic tangent function, which has a saturation region, is replaced with a layer normalization operation, and the feedforward neural network layer of the Transformer is fused into the iteration unit to improve the semantic representation capacity of the model, i.e., its capacity to fit data.
Inventors
- Cao Xiaopan
- Ma Guozu
Assignees
- 中电万维信息技术有限责任公司
Dates
- Publication Date
- 20260508
- Application Date
- 20221215
Claims (1)
- 1. A text feature extraction method based on an NGU language model, characterized by comprising the following steps: S1, constructing a training data set, namely collecting and arranging training data sets related to the task and putting them into train.txt; the maximum text length input to the NGU language model is 1000, and a text whose length is less than 1000 is padded to the maximum length 1000 with [PAD]; S2, constructing a mapping from characters to IDs, namely counting the characters in train.txt from the training set in S1, recorded as token_list, and then establishing a dictionary Dict_token from the characters in token_list, wherein each key of Dict_token is a character index number, each value of Dict_token is a specific single character, and [PAD] is the completion character used when a text does not reach the maximum length; S3, adapting the training data to the model, wherein a text sample in the training data set obtained in step S1 that does not reach the maximum length 1000 is padded to the maximum length 1000 with [PAD], then mapped through the dictionary Dict_token into a list of index numbers and converted into the model input tensor X; the batch size batch_size input to the NGU language model is 128, so the size of X is [128, 1000]; S4, extracting text features with the NGU language model, wherein the iteration formula of the original GRU network model is h_t = f(h_{t-1}, x_t), where f is the equivalent formula of the GRU gated recurrent unit, given in detail by z_t = σ(W_z x_t + U_z h_{t-1}), r_t = σ(W_r x_t + U_r h_{t-1}), h~_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})), h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h~_t; the proposed NGU iteration formula is z_t = σ(LayerNorm(W_z x_t + U_z h_{t-1})), r_t = σ(LayerNorm(W_r x_t + U_r h_{t-1})), h~_t = LayerNorm(W_h x_t + U_h (r_t ⊙ h_{t-1})), u_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h~_t, h_t = LayerNorm(u_t + FFN(u_t)); the sigmoid function has saturation regions when its argument x is far from 0, so information passed through the fully connected layer and then directly through a sigmoid function is lost; the layer normalization operation is therefore introduced, normalization is carried out along the embedded representation dimension, and the sigmoid function is then applied, so the text representation information is effectively retained; when the pre-activation value is far from 0 and reaches the saturation region of tanh, the output tends to a constant value and much semantic information is lost, so the layer normalization operation is adopted to replace the hyperbolic tangent function tanh in the GRU; the layernorm layer normalization operates as follows, namely normalization is carried out along the semantic representation dimension d_model: if the current word is represented by a matrix T of size [1, d_model] whose values along the second dimension are t_1, t_2, …, t_{d_model}, then the mean is μ = (1/d_model) Σ_i t_i, the variance is σ² = (1/d_model) Σ_i (t_i - μ)², and each output of the layer normalization is t̂_i = (t_i - μ)/√(σ² + ε); the layer normalization operation only translates and scales the data, so no semantic information is lost, and at the same time it normalizes the embedded representation dimension to be near 0, making model training more stable; the feedforward neural network layer of the Transformer is migrated into the NGU model, the feedforward network comprising two linear transformations and one nonlinear GELU activation function, followed by a residual connection and a layer normalization operation; the word embedding dimension is 256, and the hidden layer dimension of the feedforward network is 2048; for the model input of size [128, 1000] from step S3, each of the 1000 words in the text data is embedded through token_embedding, the matrix shape of each word's embedded representation being [128, 256], where 128 is the batch size in training and 256 is the word embedding dimension; each embedded word of the text data X is then input in sequence into the NGU loop iteration unit, and the outputs h_t of all time steps are spliced together into H of dimension [128, 1000, 256]; text features are thus extracted by the NGU language model, and each word of each sentence of data in one batch is represented in 256 dimensions; and S5, applying the NGU language model, wherein, since the NGU language model is not a pre-trained language model, parameter training is carried out according to the specific natural language processing task; the text representation tensor obtained in S4 has size [128, 1000, 256], and this tensor is connected to a neural network for subsequent text classification, relation extraction, text generation, or entity recognition training.
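Steps S1-S3 of the claim can be sketched in a few lines of Python. The helper names and the tiny in-memory corpus below are illustrative stand-ins (the patent fixes the maximum length at 1000 and the batch size at 128, both reduced here); per the claim, Dict_token maps index numbers to characters, so its inverse is built for encoding:

```python
# Sketch of steps S1-S3: build the Dict_token dictionary, pad each text
# to the maximum length with [PAD], and map it to a list of index numbers.
PAD = "[PAD]"
MAX_LEN = 10  # the patent uses a maximum length of 1000; shortened here

def build_dict(texts):
    # S2: collect the distinct characters (token_list); per the claim,
    # Dict_token maps a character index number to a specific single
    # character, with the [PAD] completion character at index 0.
    token_list = sorted({ch for text in texts for ch in text})
    return {i: tok for i, tok in enumerate([PAD] + token_list)}

def encode(text, char_to_id):
    # S3: truncate/pad to MAX_LEN with [PAD], then map characters to IDs.
    ids = [char_to_id[ch] for ch in text[:MAX_LEN]]
    ids += [char_to_id[PAD]] * (MAX_LEN - len(ids))
    return ids

corpus = ["hello", "world"]  # stands in for the contents of train.txt
dict_token = build_dict(corpus)
char_to_id = {ch: i for i, ch in dict_token.items()}  # inverse mapping
X = [encode(t, char_to_id) for t in corpus]  # input of size [batch, MAX_LEN]
print(len(X), len(X[0]))  # prints "2 10"
```

At full scale the same procedure yields the [128, 1000] integer tensor X that step S4 consumes.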
Description
Text feature extraction method based on NGU language model Technical Field The invention relates to the technical field of natural language processing, in particular to a text feature extraction method based on an NGU language model. Background RNN, Transformer, and related architectures, as important basic units in the field of natural language processing, play a significant role in natural language processing tasks. Among them, the GRU model, belonging to the RNN family, plays an important role in entity recognition, relation extraction, and text generation. In the text generation task, the GRU unit allows generation to proceed in an iterative streaming fashion, so text generation is faster than with the Transformer and GPT series of models. In entity recognition and relation extraction tasks, the GRU model also performs well, because the recurrent iteration mechanism of the GRU incorporates information from adjacent text content. However, the Transformer has strong semantic characterization capability owing to its multi-head self-attention mechanism with global interactive perception, so current natural language processing tasks increasingly adopt large models based on Transformer variants; trained on large data sets, such models achieve excellent results, but they consume enormous computing resources. In order to integrate the advantages of the Transformer and the GRU, so that the GRU gains the semantic expression capability of the Transformer while retaining its own advantages, this patent provides an NGU language model based on an improvement of the GRU model.
Disclosure of Invention The patent aims at optimizing and improving the GRU model, which is based on a recurrent neural network, and provides a new text feature extraction model, the NGU language model: a normalization mechanism is introduced into the gating unit of the GRU, the hyperbolic tangent function, which has a saturation region, is replaced with a layer normalization operation, and the feedforward neural network layer of the Transformer is fused into the iteration unit to improve the semantic representation capacity of the model, i.e., its capacity to fit data; the resulting model is defined as the NGU language model. To solve the problems in the prior art and combine the respective advantages of the GRU and the Transformer, a text feature extraction method based on an NGU language model is provided, comprising the following steps: S1, constructing a training data set, namely collecting and arranging training data sets related to the task and putting them into train.txt; the maximum text length input to the NGU language model is 1000, and a text whose length is less than 1000 is padded to the maximum length 1000 with [PAD]; S2, constructing a mapping from characters to IDs, namely counting the characters in the training set train.txt from S1, recorded as token_list, and then establishing a dictionary Dict_token from the characters in token_list, wherein each key of Dict_token is a character index number, each value of Dict_token is a specific single character, and [PAD] is the completion character used when a text does not reach the maximum length; S3, adapting the training data to the model, wherein a text sample in the training data set obtained in step S1 that does not reach the maximum length 1000 is padded to the maximum length 1000 with [PAD], then mapped through the dictionary Dict_token into a list of index numbers and converted into the model input tensor X; the batch size batch_size input to the NGU language model is 128, so the size of X is [128, 1000]; S4, extracting text features with the NGU language model, wherein the iteration formula of the original GRU network model is h_t = f(h_{t-1}, x_t), where f is the equivalent formula of the GRU gated recurrent unit, given in detail by z_t = σ(W_z x_t + U_z h_{t-1}), r_t = σ(W_r x_t + U_r h_{t-1}), h~_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})), h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h~_t; the proposed NGU iteration formula is z_t = σ(LayerNorm(W_z x_t + U_z h_{t-1})), r_t = σ(LayerNorm(W_r x_t + U_r h_{t-1})), h~_t = LayerNorm(W_h x_t + U_h (r_t ⊙ h_{t-1})), u_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h~_t, h_t = LayerNorm(u_t + FFN(u_t)); the sigmoid function has saturation regions when its argument x is far from 0, so information passed through the fully connected layer and then directly through a sigmoid function is lost; the layer normalization operation is therefore introduced, normalization is carried out along the embedded representation dimension, and the sigmoid function is then applied, so the text representation information is effectively retained; when the pre-activation value is far from 0 and reaches the saturation region of tanh, the output tends to a constant value and much semantic information is lost; the layer normalization operation is adopted to replace the hyperbolic tangent function tanh in the GRU, and the layernorm layer normalization operates as follows, namely normalization is carried out along the semantic representation dimension d_model: if the current word is represented by a matrix T of size [1, d_model] whose values along the second dimension are t_1, t_2, …, t_{d_model}, then the mean is μ = (1/d_model) Σ_i t_i, the variance is σ² = (1/d_model) Σ_i (t_i - μ)², and each output of the layer normalization is t̂_i = (t_i - μ)/√(σ² + ε)
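The NGU iteration described above (gate pre-activations layer-normalized before the sigmoid, tanh replaced by layer normalization, and a Transformer-style feedforward sublayer with GELU, residual connection, and layer normalization) can be sketched in NumPy. This is one possible reading of the patent's prose, not a definitive implementation; the weight names, the small dimensions, and the epsilon value are assumptions for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize along the last (semantic representation, d_model) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ngu_step(x_t, h_prev, p):
    """One NGU time step: GRU gates with layer-normalized pre-activations,
    tanh replaced by layer normalization, plus a feedforward sublayer."""
    z = sigmoid(layer_norm(x_t @ p["Wz"] + h_prev @ p["Uz"]))      # update gate
    r = sigmoid(layer_norm(x_t @ p["Wr"] + h_prev @ p["Ur"]))      # reset gate
    h_tilde = layer_norm(x_t @ p["Wh"] + (r * h_prev) @ p["Uh"])   # candidate state
    u = (1.0 - z) * h_prev + z * h_tilde                           # gated update
    ffn = gelu(u @ p["W1"]) @ p["W2"]                              # two linear maps + GELU
    return layer_norm(u + ffn)                                     # residual + layer norm

def ngu_encode(X_emb, p):
    # X_emb: [batch, seq_len, d_model] embedded input; returns H of the same shape.
    batch, seq_len, d_model = X_emb.shape
    h = np.zeros((batch, d_model))
    outputs = []
    for t in range(seq_len):
        h = ngu_step(X_emb[:, t, :], h, p)
        outputs.append(h)
    return np.stack(outputs, axis=1)  # splice every step's h_t into H

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # the patent uses 256 and 2048; kept small here
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "Wz": (d_model, d_model), "Uz": (d_model, d_model),
    "Wr": (d_model, d_model), "Ur": (d_model, d_model),
    "Wh": (d_model, d_model), "Uh": (d_model, d_model),
    "W1": (d_model, d_ff), "W2": (d_ff, d_model)}.items()}
H = ngu_encode(rng.normal(size=(2, 5, d_model)), p)
print(H.shape)  # (2, 5, 8)
```

At the patent's scale the same loop maps the embedded input of shape [128, 1000, 256] to the text representation tensor H of size [128, 1000, 256] used in step S5.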