
US-12626050-B2 - Semi-autoregressive text editing


Abstract

Provided are improved machine learning-based text editing models. Specifically, example implementations include a flexible semi-autoregressive text-editing approach for generation, designed to derive the maximum benefit from non-autoregressive text editing and autoregressive decoding. In contrast to conventional sequence-to-sequence (seq2seq) models, the proposed approach is fast at inference time while remaining capable of modeling flexible input-output transformations.
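
To make the latency contrast concrete: a conventional seq2seq decoder must generate every target token autoregressively, whereas the text-editing decoder generates only the inserted tokens. The back-of-the-envelope count below assumes one decoder step per generated token and uses a hypothetical <span_0> position-token convention (neither assumption is specified by the abstract).

```python
# Hypothetical step counts for editing "she go to school yesterday"
# into "she went to school yesterday" (one decoder step per token assumed).
target = "she went to school yesterday".split()
inserted = ["<span_0>", "went"]      # position token plus the inserted word

seq2seq_steps = len(target)          # 5: a seq2seq decoder regenerates everything
semi_ar_steps = len(inserted)        # 2: kept tokens come from one parallel encoder pass
print(seq2seq_steps, semi_ar_steps)  # -> 5 2
```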

Inventors

  • Jonathan Stephen Mallinson
  • Aliaksei Severyn
  • Eric Emil Malmi
  • Jakub Dominik Adamek

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-12
Application Date
2022-05-23

Claims (20)

  1. A computer system that performs text editing, the computer system comprising: one or more processors; a machine-learned text editing model configured to receive and process a source text string to generate an output text string, the output text string comprising an edited version of the source text string, the machine-learned text editing model comprising: an encoder model configured to process the source text string in a non-autoregressive manner to generate an intermediate text representation, wherein the source text string comprises a plurality of source tokens, and wherein the intermediate text representation indicates: (1) a subset of the plurality of source tokens to be maintained for the output text string and (2) an ordering of the subset of the plurality of source tokens to be maintained for the output text string; and a decoder model configured to process the intermediate text representation in an autoregressive manner to select one or more additional tokens to insert into the subset of the plurality of source tokens to generate the output text string, wherein the decoder model is configured to first predict a position token indicating a position at which one or more of the additional tokens should be inserted into the intermediate text representation and then to second predict the one or more of the additional tokens to be inserted into the intermediate text representation at the position; wherein the encoder model is configured to process the source text string in the non-autoregressive manner to generate the intermediate text representation without indicating the position at which one or more of the additional tokens should be inserted into the intermediate text representation; and one or more non-transitory computer readable media that store instructions that, when executed by the one or more processors, cause the computer system to perform operations, the operations comprising: obtaining the source text string; processing the source text string with the machine-learned text editing model to generate the output text string; and providing the output text string as an output.
  2. The computer system of claim 1, wherein the encoder model comprises: a text embedding model configured to process the source text string to generate a hidden representation; a tagging model configured to assign a respective tag to each of the source tokens in the source text string, the respective tag for each source token indicating whether or not such source token is included in the subset of the plurality of source tokens to be maintained for the output text string; and a pointer model configured to generate the ordering of the subset of the plurality of source tokens to be maintained for the output text string based at least in part on the hidden representation and the respective tag assigned to each of the source tokens.
  3. The computer system of claim 1, wherein each of the encoder model and the decoder model comprises a transformer neural network.
  4. The computer system of claim 3, wherein at least the transformer neural network in the decoder model comprises a T5 pre-trained transformer neural network that has been pre-trained to insert missing spans.
  5. The computer system of claim 1, wherein the decoder model has been pre-trained with a denoising objective.
  6. The computer system of claim 1, wherein the encoder model comprises one or more Sinkhorn layers that normalize over both rows and columns of an intra-pointer attention.
  7. The computer system of claim 1, wherein the decoder model being configured to first predict the position token comprises repurposing one or more special span_i tokens to indicate the position token indicating the position at which one or more of the additional tokens should be inserted into the intermediate text representation.
  8. A computer-implemented method to train a text editing model, the method comprising: obtaining, by a computing system comprising one or more computing devices, a training example comprising a source text string and target text string; processing, by the computing system, the source text string with the text editing model to generate an output text string, wherein processing the source text string with the text editing model to generate the output text string comprises: processing, by the computing system, the source text string in a non-autoregressive manner with an encoder model of the text editing model to generate an intermediate text representation, wherein the source text string comprises a plurality of source tokens, and wherein the intermediate text representation indicates: (1) a subset of the plurality of source tokens to be maintained for the output text string and (2) an ordering of the subset of the plurality of source tokens to be maintained for the output text string; and processing, by the computing system, the intermediate text representation in an autoregressive manner with a decoder model of the text editing model to select one or more additional tokens to insert into the subset of the plurality of source tokens to generate the output text string; wherein the encoder model is configured to process the source text string in the non-autoregressive manner to generate the intermediate text representation without indicating a position at which one or more of the additional tokens should be inserted into the intermediate text representation; evaluating, by the computing system, a combined loss function that respectively compares (i) each of: a set of ground truth labels for the source tokens to be maintained, a ground truth ordering, and the target text string respectively to (ii) each of: the subset of the plurality of source tokens, the ordering of the subset of the plurality of source tokens, and the output text string; and modifying, by the computing system, one or more parameters of the text editing model based on the combined loss function.
  9. The computer-implemented method of claim 8, wherein the encoder model comprises: a text embedding model configured to process the source text string to generate a hidden representation; a tagging model configured to assign a respective tag to each of the source tokens in the source text string, the respective tag for each source token indicating whether or not such source token is included in the subset of the plurality of source tokens to be maintained for the output text string; and a pointer model configured to generate the ordering of the subset of the plurality of source tokens to be maintained for the output text string based at least in part on the hidden representation and the respective tag assigned to each of the source tokens.
  10. The computer-implemented method of claim 8, wherein the combined loss function comprises a tagging loss term that evaluates a probability of the encoder model outputting the set of ground truth labels for the source tokens to be maintained for the output text string.
  11. The computer-implemented method of claim 8, wherein the combined loss function comprises a pointing loss term that compares the ordering with the ground truth ordering.
  12. The computer-implemented method of claim 8, wherein the combined loss function comprises an insertion loss term that evaluates a probability of the decoder model outputting a set of ground truth tokens to be included in the output text string.
  13. The computer-implemented method of claim 8, wherein each of the encoder model and the decoder model comprises a transformer neural network.
  14. The computer-implemented method of claim 8, wherein the encoder model comprises one or more Sinkhorn layers that normalize over both rows and columns of an intra-pointer attention.
  15. One or more non-transitory computer readable media that store: a machine-learned text editing model configured to receive and process a source text string to generate an output text string, the output text string comprising an edited version of the source text string, the machine-learned text editing model comprising: an encoder model configured to process the source text string in a non-autoregressive manner to generate an intermediate text representation, wherein the source text string comprises a plurality of source tokens, and wherein the intermediate text representation indicates: (1) a subset of the plurality of source tokens to be maintained for the output text string and (2) an ordering of the subset of the plurality of source tokens to be maintained for the output text string; and a decoder model configured to process the intermediate text representation in an autoregressive manner to first predict a position token indicating a position at which one or more additional tokens should be inserted into the intermediate text representation and then to second select the one or more additional tokens to insert into the subset of the plurality of source tokens at the position to generate the output text string; wherein the encoder model is configured to process the source text string in the non-autoregressive manner to generate the intermediate text representation without indicating the position at which one or more of the additional tokens should be inserted into the intermediate text representation.
  16. The one or more non-transitory computer readable media of claim 15, wherein the encoder model comprises: a text embedding model configured to process the source text string to generate a hidden representation; a tagging model configured to assign a respective tag to each of the source tokens in the source text string, the respective tag for each source token indicating whether or not such source token is included in the subset of the plurality of source tokens to be maintained for the output text string; and a pointer model configured to generate the ordering of the subset of the plurality of source tokens to be maintained for the output text string based at least in part on the hidden representation and the respective tag assigned to each of the source tokens.
  17. The one or more non-transitory computer readable media of claim 15, wherein each of the encoder model and the decoder model comprises a transformer neural network.
  18. The one or more non-transitory computer readable media of claim 17, wherein at least the transformer neural network in the decoder model comprises a T5 pre-trained transformer neural network that has been pre-trained to insert missing spans.
  19. The one or more non-transitory computer readable media of claim 18, wherein the decoder model comprises a single T5 decoder transformer layer.
  20. The one or more non-transitory computer-readable media of claim 15, wherein the encoder model comprises one or more Sinkhorn layers that normalize over both rows and columns of an intra-pointer attention.
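
Claims 1 and 15 recite the overall semi-autoregressive pipeline, and claim 7 recites the repurposed span_i position tokens. The following is a minimal, self-contained Python sketch of that pipeline; the function names, the hard-coded tag, pointer, and insertion predictions, and the exact <span_i> token format are illustrative assumptions standing in for what trained model heads would emit, not an implementation disclosed by the patent.

```python
# Illustrative sketch of the claimed semi-autoregressive edit pipeline.
# All "predictions" below are hard-coded stand-ins for trained model heads.
from typing import Dict, List

def encode(source: List[str]) -> List[str]:
    """Non-autoregressive encoder pass: one parallel step that (1) tags each
    source token keep/delete and (2) orders the kept tokens. It deliberately
    says nothing about where new tokens will be inserted."""
    # Stand-in predictions for the source "she go to school yesterday":
    keep = [True, False, True, True, True]   # tagging head: drop "go"
    order = [0, 1, 2, 3]                     # pointer head: keep source order
    kept = [tok for tok, k in zip(source, keep) if k]
    return [kept[i] for i in order]          # intermediate text representation

def decode(intermediate: List[str]) -> List[str]:
    """Autoregressive decoder pass: alternately predict a position token
    (a repurposed span_i sentinel, per claim 7) and the tokens to insert."""
    generated = ["<span_0>", "went"]         # stand-in decoder output
    inserts: Dict[int, List[str]] = {}
    pos = -1
    for tok in generated:
        if tok.startswith("<span_"):
            pos = int(tok[len("<span_"):-1])
            inserts.setdefault(pos, [])
        else:
            inserts[pos].append(tok)
    out: List[str] = []
    for i, tok in enumerate(intermediate):
        out.append(tok)
        out.extend(inserts.get(i, []))       # insert after kept token i
    return out

source = "she go to school yesterday".split()
intermediate = encode(source)                # ['she', 'to', 'school', 'yesterday']
print(" ".join(decode(intermediate)))        # -> she went to school yesterday
```

Note that the encoder output (the intermediate text representation) is silent about insertion positions: the decoder supplies them autoregressively, first a position token and then the tokens to insert there, which is exactly the division of labor the independent claims recite.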

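Claims 6, 14, and 20 recite one or more Sinkhorn layers that normalize the intra-pointer attention over both rows and columns. The sketch below is a hedged illustration of the standard log-space Sinkhorn iteration, not the patent's implementation; the iteration count and the pure-NumPy helper are assumptions. Alternating row and column renormalization drives the attention matrix toward a doubly stochastic matrix, i.e. a soft permutation, which is what lets the pointer mechanism represent an ordering of the kept tokens.

```python
# Hedged sketch of Sinkhorn normalization over pointer-attention logits.
import numpy as np

def _logsumexp(x: np.ndarray, axis: int) -> np.ndarray:
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def sinkhorn(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Alternate row/column normalization in log space; the result is
    approximately doubly stochastic, i.e. a soft permutation matrix."""
    log_p = logits.astype(np.float64)
    for _ in range(n_iters):
        log_p -= _logsumexp(log_p, axis=1)  # rows: each kept token sums to 1
        log_p -= _logsumexp(log_p, axis=0)  # columns: each output slot sums to 1
    return np.exp(log_p)

rng = np.random.default_rng(0)
attn = sinkhorn(rng.normal(size=(4, 4)))
print(attn.sum(axis=0).round(3), attn.sum(axis=1).round(3))  # both ~[1 1 1 1]
order = attn.argmax(axis=1)  # hard reordering read off the soft permutation
```
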
Description

FIELD

The present disclosure relates generally to improved machine learning-based text editing models. More particularly, example aspects of the present disclosure relate to sequence-to-sequence models that combine a text editing task with a text generation task.

BACKGROUND

A number of machine learning-based solutions to the task of text-to-text transduction have been proposed in the art. As one example, T5 (Raffel, Colin et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." ArXiv abs/1910.10683 (2020)) is a sequence-to-sequence ("seq2seq") model pre-trained on span in-filling. Other pre-trained seq2seq models, such as BART (Lewis, Mike et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL (2020)) and MASS (Song, Kaitao et al. "MASS: Masked Sequence to Sequence Pre-training for Language Generation." ICML (2019)), represent the current standard for text-to-text transduction.

However, while seq2seq frameworks offer a generic tool for modeling almost any kind of text-to-text transduction, there are still many real-world tasks where generating target texts completely from scratch, as is done with seq2seq approaches, is wasteful and leads to unnecessarily high latency at inference time. This is especially true for monolingual settings where input and output texts have relatively high degrees of overlap. In such cases, an alternative approach is to cast conditional text generation as a text-editing task, where the model learns to reconstruct target texts by applying a set of edit operations to the inputs. Typically, the set of edit operations is fixed and pre-defined ahead of time, which leads to higher sample efficiency because the limited set of allowed operations significantly reduces the search space. However, choosing from only a limited set of edit operations limits the flexibility of the model to reconstruct arbitrary output texts from their inputs, often reducing the quality of the resulting outputs.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

One example aspect of the present disclosure is directed to a computer system that performs text editing. The computer system includes one or more processors and a machine-learned text editing model configured to receive and process a source text string to generate an output text string, the output text string comprising an edited version of the source text string. The machine-learned text editing model comprises an encoder model configured to process the source text string to generate an intermediate text representation. The source text string comprises a plurality of source tokens, and the intermediate text representation indicates: (1) a subset of the plurality of source tokens to be maintained for the output text string and (2) an ordering of the subset of the plurality of source tokens to be maintained for the output text string. The machine-learned text editing model further comprises a decoder model configured to process the intermediate text representation to select one or more additional tokens to insert into the subset of the plurality of source tokens to generate the output text string.
The computer system includes one or more non-transitory computer readable media that store instructions that, when executed by the one or more processors, cause the computer system to perform operations. The operations comprise: obtaining the source text string; processing the source text string with the machine-learned text editing model to generate the output text string; and providing the output text string as an output.

Another example aspect of the present disclosure is directed to a computer-implemented method to train a text editing model. The method includes obtaining, by a computing system comprising one or more computing devices, a training example comprising a source text string and a target text string. The method includes processing, by the computing system, the source text string with the text editing model to generate an output text string. Processing the source text string with the text editing model to generate the output text string comprises: processing, by the computing system, the source text string with an encoder model of the text editing model to generate an intermediate text representation, wherein the source text string comprises a plurality of source tokens, and wherein the intermediate text representation indicates: (1) a subset of the plurality of source tokens to be maintained for the output text string and (2) an ordering of the subset of the plurality of source tokens to be maintained for the output text string. Processing the source text string with the text editing model to generate the output text string further comprises processing, by the computing system, the intermediate text representation with a decoder model of the text editing model to select one or more additional tokens to insert into the subset of the plurality of source tokens to generate the output text string.
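
The combined loss function in the training method above (broken out term by term in claims 10 through 12) can be read as the sum of a tagging term, a pointing term, and an insertion term, each a negative log-likelihood of the corresponding ground truth. The sketch below illustrates that reading; the unweighted sum, the helper names, and the toy shapes are assumptions, since the claims say only that the terms are combined.

```python
# Hedged sketch of the combined training loss (claims 10-12).
import numpy as np

def nll(probs: np.ndarray, targets: np.ndarray) -> float:
    """Mean negative log-likelihood of integer targets under row-wise
    probability distributions `probs` (shape [n, n_classes])."""
    rows = np.arange(len(targets))
    return float(-np.log(probs[rows, targets] + 1e-9).mean())

def combined_loss(tag_probs, tag_targets,            # keep/delete labels (claim 10)
                  point_probs, point_targets,        # ground-truth ordering (claim 11)
                  ins_probs, ins_targets) -> float:  # inserted tokens (claim 12)
    # Unweighted sum is an assumption; per-term weights are equally plausible.
    return (nll(tag_probs, tag_targets)
            + nll(point_probs, point_targets)
            + nll(ins_probs, ins_targets))

# Toy example: 3 source tokens, 3 pointer slots, 2 insertion steps.
tag_p = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
point_p = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
ins_p = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])
print(combined_loss(tag_p, np.array([0, 1, 0]),
                    point_p, np.array([0, 1, 2]),
                    ins_p, np.array([0, 1])))
```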