
CN-116522961-B - Training method, device and storage medium of machine translation model


Abstract

The invention discloses a training method, a device, and a storage medium for a machine translation model. The method comprises: obtaining the attention score of the i-th word with respect to the t-th word; carrying out weighted summation with the attention score and the word vector of the i-th word to obtain a hidden-layer vector; calculating the distance between the hidden-layer vector and each sub-attribute value of the corresponding discrete hidden variable, the nearest sub-attribute value being taken as the original attribute value of the discrete hidden variable; carrying out weighted summation with the attention score and the original attribute value to obtain an attribute vector; calculating the distance between the attribute vector and each sub-attribute value of the discrete hidden variable of the i-th word to obtain a new attribute value; constraining the original attribute value and the new attribute value with a loss function to obtain a loss L_C; merging the new attribute value into the hidden-layer vector to obtain a fusion vector corresponding to the i-th word; calculating a machine translation loss L_nmt using the fusion vector; carrying out weighted summation of the loss L_C and the loss L_nmt to obtain the final loss function; and training the machine translation model using the final loss function.
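The pipeline summarized above (attention-weighted hidden vector, nearest-sub-attribute quantisation, and attribute vector) can be sketched as follows. This is a rough, non-authoritative illustration with toy tensors; the Euclidean distance metric, the two-value gender attribute, and all names such as `sub_attrs` and `nearest` are assumptions for illustration, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy hidden size

# Two sub-attribute values of one discrete hidden variable
# (e.g. a positive "male" value and a negative "female" value)
sub_attrs = rng.normal(size=(2, dim))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nearest(vec, values):
    """Distances from vec to each sub-attribute value, and the closest value."""
    d = np.linalg.norm(values - vec, axis=-1)  # Euclidean distance is an assumption
    return d, values[d.argmin()]

# Attention scores a_it of word i over earlier words t < i, and their word vectors
a_it = softmax(rng.normal(size=3))
e_t = rng.normal(size=(3, dim))

# Hidden-layer vector x_i: attention-weighted sum over the word vectors
x_i = (a_it[:, None] * e_t).sum(axis=0)

# Original attribute value l_i: sub-attribute value nearest to x_i
d_orig, l_i = nearest(x_i, sub_attrs)

# Attribute vector attr_i: attention-weighted sum of the earlier words'
# attribute values, re-quantised to the new attribute value l2_i
l_t = np.stack([nearest(e, sub_attrs)[1] for e in e_t])
attr_i = (a_it[:, None] * l_t).sum(axis=0)
d_new, l2_i = nearest(attr_i, sub_attrs)
```

Here each earlier word's attribute value is quantised to its nearest sub-attribute value before the attention-weighted sum; the patent does not fix the distance metric, so the Euclidean choice is only one plausible reading.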

Inventors

  • HUANG SHUJIAN
  • LIU ZIHAN
  • DAI XINYU
  • ZHANG JIANBING
  • CHEN JIAJUN

Assignees

  • Nanjing University

Dates

Publication Date
20260508
Application Date
20230314

Claims (6)

  1. A training method of a machine translation model, characterized in that a discrete hidden variable is modeled to represent the discrete attribute of the word at each position in a source end and a target end, and the discrete hidden variable is given one sub-attribute value for each sub-attribute of the discrete attribute, the method specifically comprising the following steps: S1, converting the i-th word and the t-th word of the source end and the target end into a word vector e_i and a word vector e_t respectively, inputting the word vector e_i and the word vector e_t into a neural network translation model, and obtaining the attention score a_it of the i-th word with respect to the t-th word, wherein t < i; S2, carrying out weighted summation with the attention score a_it and the word vector e_i to obtain a hidden-layer vector x_i corresponding to the word at the i-th position; S3, respectively calculating the distance between the hidden-layer vector x_i and each sub-attribute value of the discrete hidden variable of the i-th word, and taking the sub-attribute value closest to x_i as the original attribute value l_i of the discrete hidden variable of the i-th word; S4, carrying out weighted summation over the discrete hidden variables of the t-th word according to the attention score a_it and the original attribute value l_i to obtain an attribute vector attr_i corresponding to the i-th word, respectively calculating the distance between the attribute vector attr_i and each sub-attribute value of the discrete hidden variable of the i-th word, and taking the sub-attribute value closest to attr_i as the new attribute value l2_i of the discrete hidden variable of the i-th word; S5, constraining the original attribute value l_i and the new attribute value l2_i with a loss function to obtain a loss L_C, merging the new attribute value l2_i into the hidden-layer vector x_i to obtain a fusion vector h_i corresponding to the i-th word, and calculating the machine translation loss L_nmt using the fusion vector h_i; S6, obtaining the final loss function L of machine translation by adding the loss L_C and the loss L_nmt, and training the machine translation model using the final loss function L; wherein, when the discrete hidden variable represents the discrete attribute of gender, the discrete hidden variable assigns a positive value to the male sub-attribute and a negative value to the female sub-attribute of the gender attribute; the loss function in S5 adopts the KL divergence, and the loss L_C is calculated as: L_C = kl_div(p_1, p_2); p_1 = softmax([d_0, d_1]); p_2 = softmax([d2_0, d2_1]); where kl_div denotes the KL-divergence loss, p_1 and p_2 denote probability distributions, d_0 denotes the distance between the hidden-layer vector x_i of the i-th word and the positive value of the discrete hidden variable of the i-th word, d_1 denotes the distance between x_i and the negative value, d2_0 denotes the distance between the attribute vector attr_i and the positive value, d2_1 denotes the distance between attr_i and the negative value, and softmax is the normalized exponential function.
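The constraint loss L_C of claim 1 can be illustrated numerically as follows. The distance values d_0, d_1, d2_0, d2_1 are invented for the example; the claim writes kl_div(p_1, p_2), which is read here as the textbook KL(p_1 ‖ p_2), and p_1 and p_2 are softmaxed over the raw distances exactly as the claim states.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def kl_div(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Invented example distances (not from the patent):
d0, d1 = 0.8, 1.9      # x_i to the positive / negative sub-attribute value
d2_0, d2_1 = 0.6, 2.1  # attr_i to the positive / negative sub-attribute value

p1 = softmax(np.array([d0, d1]))      # p_1 = softmax([d_0, d_1])
p2 = softmax(np.array([d2_0, d2_1]))  # p_2 = softmax([d2_0, d2_1])
L_c = kl_div(p1, p2)  # constraint loss; argument order as written in the claim
```

The loss is zero exactly when the two distance-induced distributions agree, which is the consistency the claim is enforcing between the original and new attribute values.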
  2. The training method of a machine translation model according to claim 1, wherein the attribute vector attr_i in S4 is calculated as: attr_i = Σ_{t<i} a_it × l_t, where l_t denotes the original attribute value of the discrete hidden variable of the t-th word.
  3. The training method of a machine translation model according to claim 1, wherein in S5 the new attribute value l2_i is fused into the hidden-layer vector x_i by a gating mechanism, and the fusion vector h_i is calculated as: h_i = g × x_i + (1 − g) × l2_i; g = σ(concat(x_i, l2_i) @ W); where concat denotes the vector concatenation operation, @ denotes matrix multiplication, W denotes a model parameter matrix with W ∈ R^{2d×d}, σ denotes the sigmoid function, and g denotes the gating unit.
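The gating fusion of claim 3 can be sketched as below, with toy dimensions and random values; since W ∈ R^{2d×d}, the gate g is read as an element-wise vector in (0, 1)^d, which is an interpretive assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_i = rng.normal(size=d)         # hidden-layer vector of word i
l2_i = rng.normal(size=d)        # new attribute value
W = rng.normal(size=(2 * d, d))  # parameter matrix, W ∈ R^{2d×d} as in the claim

# g = σ(concat(x_i, l2_i) @ W): an element-wise gate in (0, 1)
g = sigmoid(np.concatenate([x_i, l2_i]) @ W)

# h_i = g × x_i + (1 − g) × l2_i: element-wise convex combination
h_i = g * x_i + (1 - g) * l2_i
```

Because each component of g lies strictly in (0, 1), every component of h_i lies between the corresponding components of x_i and l2_i, so the gate interpolates between the hidden vector and the attribute value.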
  4. A training device of a machine translation model, characterized in that a discrete hidden variable is modeled to represent the discrete attribute of the word at each position in a source end and a target end, and the discrete hidden variable is given one sub-attribute value for each sub-attribute of the discrete attribute, the training device comprising: an attention score acquisition module, configured to respectively convert the i-th word and the t-th word of the source end and the target end into a word vector e_i and a word vector e_t, input the word vector e_i and the word vector e_t into a neural network translation model, and obtain the attention score a_it of the i-th word with respect to the t-th word, wherein t < i; a hidden-layer vector calculation module, configured to carry out weighted summation with the attention score a_it and the word vector e_i to obtain a hidden-layer vector x_i corresponding to the word at the i-th position; an original attribute value calculation module of the discrete hidden variable, configured to respectively calculate the distance between the hidden-layer vector x_i and each sub-attribute value of the discrete hidden variable of the i-th word, and take the sub-attribute value closest to x_i as the original attribute value l_i of the discrete hidden variable of the i-th word; a new attribute value calculation module of the discrete hidden variable, configured to carry out weighted summation over the discrete hidden variables of the t-th word according to the attention score a_it and the original attribute value l_i to obtain an attribute vector attr_i corresponding to the i-th word, respectively calculate the distance between the attribute vector attr_i and each sub-attribute value of the discrete hidden variable of the i-th word, and take the sub-attribute value closest to attr_i as the new attribute value l2_i of the discrete hidden variable of the i-th word; a constraint loss calculation module, configured to constrain the original attribute value l_i and the new attribute value l2_i with a loss function to obtain a loss L_C; a machine translation loss calculation module, configured to merge the new attribute value l2_i into the hidden-layer vector x_i to obtain a fusion vector h_i, and calculate the machine translation loss L_nmt using the fusion vector h_i; and a machine translation model training module, configured to obtain the final loss function L of machine translation by adding the loss L_C and the loss L_nmt, and train the machine translation model using the final loss function L; wherein, when the discrete hidden variable represents the discrete attribute of gender, a positive value is assigned to the male sub-attribute and a negative value to the female sub-attribute of the gender attribute; the loss function adopts the KL divergence, and the loss L_C is calculated as: L_C = kl_div(p_1, p_2); p_1 = softmax([d_0, d_1]); p_2 = softmax([d2_0, d2_1]); where kl_div denotes the KL-divergence loss, p_1 and p_2 denote probability distributions, d_0 denotes the distance between the hidden-layer vector x_i of the i-th word and the positive value of the discrete hidden variable of the i-th word, d_1 denotes the distance between x_i and the negative value, d2_0 denotes the distance between the attribute vector attr_i and the positive value, d2_1 denotes the distance between attr_i and the negative value, and softmax is the normalized exponential function.
  5. The training device of a machine translation model according to claim 4, wherein the machine translation loss calculation module merges the new attribute value l2_i into the hidden-layer vector x_i by a gating mechanism, and the fusion vector h_i is calculated as: h_i = g × x_i + (1 − g) × l2_i; g = σ(concat(x_i, l2_i) @ W); where concat denotes the vector concatenation operation, @ denotes matrix multiplication, W denotes a model parameter matrix with W ∈ R^{2d×d}, σ denotes the sigmoid function, and g denotes the gating unit.
  6. A computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the steps of the training method of a machine translation model according to any one of claims 1 to 3.

Description

Training method, device and storage medium of machine translation model

Technical Field

The invention relates to the technical field of machine translation, and in particular to a training method, a training device and a storage medium for a machine translation model.

Background

Machine translation refers to the technique of automatically translating text in one natural language into text in another natural language using a computer program. With the development of computer technology and deep learning, neural machine translation (NMT) models based on deep neural networks have come to dominate machine translation research. The most prevalent model structure in NMT is the Transformer, whose core is the attention mechanism. When the model generates a translated word, the attention mechanism lets the word at the current generation position attend to all the words in the input sequence, and computes a corresponding attention score (normalized to between 0 and 1) for each word. The higher this score, the more important that word is for generating the word at the current position. This mechanism of dynamically computing the contribution of each word makes the model very effective.

Most existing translation techniques concern sentence-level translation, i.e., translating a sentence from a source language to a target language. In an actual translation scenario, we need to translate a document made up of multiple sentences. This requires considering contextual relations between sentences to ensure consistency of the translation, such as tense consistency and gender consistency, thereby generating a more natural and fluent translation result. To capture contextual information, existing work typically uses variants of attention mechanisms so that the word currently being generated can attend to words across sentences.
However, these variants do not explicitly model the consistency of interest, and thus fall short in generating consistent translations. To address the consistency problem, a prior technical scheme adopts attention regularization. This method requires manually annotated data indicating which word pairs need to be consistent; for example, the male pronoun "he" needs to be consistent with the name "Bob". With such labeled data, the attention regularization method guides the word "he" to attend to the word "Bob" during training, giving it a higher attention score. The specific approach is to use the KL divergence as a loss function to concentrate the model's attention distribution on "Bob", increasing its attention score and reducing the attention paid to other words. However, attention regularization merely tells the model which words are more important; it does not explicitly model what such consistency means, i.e., whether the implied consistency is tense consistency or gender consistency, and therefore still falls short in translation quality.

Disclosure of Invention

To overcome the defect of the attention regularization approach in the background art, namely that it only tells the model which words are more important without explicitly modeling the meaning of the consistency (the implied tense or gender consistency), so that the translation quality still falls short, the invention provides a training method of a machine translation model.
To achieve the above purpose, the invention adopts the following technical scheme. In a first aspect of the invention, a training method of a machine translation model is provided, where a discrete hidden variable is modeled to represent the discrete attribute of the word at each position in a source end and a target end, and the discrete hidden variable assigns one sub-attribute value for each sub-attribute of the discrete attribute, the method specifically comprising the following steps: S1, converting the i-th word and the t-th word of the source end and the target end into a word vector e_i and a word vector e_t respectively, inputting the word vector e_i and the word vector e_t into a neural network translation model, and obtaining the attention score a_it of the i-th word with respect to the t-th word, wherein t < i; S2, carrying out weighted summation with the attention score a_it and the word vector e_i to obtain a hidden-layer vector x_i corresponding to the word at the i-th position; S3, respectively calculating the distance between the hidden-layer vector x_i and each sub-attribute value of the discrete hidden variable of the i-th word, and taking the sub-attribute value closest to x_i as the original attribute value l_i of the discrete hidden variable of the i-th word; S4, carrying out weighted summation over the discrete hidden variables of the t-th word according to the attention score a_it and the original attribute value l_i to obtain an attribute vector attr_i corresponding to the i-th word, respectively calculating the distance between the attribute vector attr_i