US-12626072-B2 - Multilingual grammatical error correction

US 12626072 B2

Abstract

A method of training a text-generating model for grammatical error correction (GEC) includes obtaining a multilingual set of text samples where each text sample includes a monolingual textual representation of a respective sentence. The method also includes, for each text sample of the multilingual set of text samples, generating a corrupted synthetic version of the respective text sample where the corrupted synthetic version of the respective text sample includes a grammatical change to the monolingual textual representation of the respective sentence associated with the respective text sample. The method further includes training the text-generating model using a training set of sample pairs. Each sample pair in the training set of sample pairs includes one of the respective text samples of the multilingual set of text samples and the corresponding corrupted synthetic version of the one of the respective text samples of the multilingual set of text samples.

Inventors

  • Sebastian Krause
  • Sascha Rothe
  • Jonathan Mallinson
  • Eric Malmi
  • Aliaksei Severyn

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-12
Application Date
2021-06-16

Claims (20)

  1. A computer-implemented method of training a text-generating model for grammatical error correction (GEC), the method when executed by data processing hardware causes the data processing hardware to perform operations comprising: obtaining a multilingual set of grammatically correct text samples, each grammatically correct text sample comprising a monolingual textual representation of a respective sentence; for each grammatically correct text sample of the multilingual set of grammatically correct text samples, generating a corrupted synthetic ungrammatical text version of the respective grammatically correct text sample by making a grammatical change to the monolingual textual representation of the respective sentence associated with the respective grammatically correct text sample, wherein the grammatical change is made to the monolingual textual representation of the respective sentence associated with the respective grammatically correct text sample without using a dictionary that is specific to a language of the monolingual textual representation of the respective sentence such that the grammatical change made to the monolingual textual representation of the respective sentence is agnostic to the language of the monolingual textual representation of the respective sentence; and training the text-generating model using a training set of sample pairs, each sample pair in the training set of sample pairs comprising: one of the respective grammatically correct text samples of the multilingual set of grammatically correct text samples; and the corresponding corrupted synthetic ungrammatical text version of the one of the respective grammatically correct text samples of the multilingual set of grammatically correct text samples.
  2. The method of claim 1, wherein the operations further comprise, after training the text-generating model, fine-tuning the trained text-generating model using supervised training data, the supervised training data comprising non-synthetic text pairs, each non-synthetic text pair comprising an ungrammatical text sample and a grammatical text version of the ungrammatical text sample.
  3. The method of claim 1, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises removing more than one character from the respective sentence associated with the respective grammatically correct text sample.
  4. The method of claim 1, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises replacing a first set of characters from the respective sentence associated with the respective grammatically correct text sample with a second set of characters different from the first set of characters.
  5. The method of claim 1, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises inserting one or more characters into the respective sentence associated with the respective grammatically correct text sample.
  6. The method of claim 1, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises changing a character-case for a character of a word of the respective sentence associated with the respective grammatically correct text sample.
  7. The method of claim 1, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises randomly applying a corruption operation to the respective sentence associated with the respective grammatically correct text sample, the corruption operation comprising at least one of: removing more than one character from the respective sentence associated with the respective grammatically correct text sample; replacing a first set of characters from the respective sentence associated with the respective grammatically correct text sample with a second set of characters different from the first set of characters; inserting one or more characters into the respective sentence associated with the respective grammatically correct text sample; or changing a character-case of a word of the respective sentence associated with the respective grammatically correct text sample, wherein each corrupted synthetic ungrammatical text version is unique with respect to the other corrupted synthetic ungrammatical text versions of the text samples.
  8. The method of claim 1, wherein the text-generating model comprises a transformer encoder-decoder architecture.
  9. The method of claim 1, wherein the operations further comprise pre-training the text-generating model with a multilingual training corpus based on a masked-language objective.
  10. The method of claim 1, wherein after training the text-generating model for GEC, the trained text-generating model is configured to: receive, as input, a first input text in a first language that includes grammatical errors and generate, as output, a first output text in the first language that corrects the grammatical errors; and receive, as input, a second input text in a different second language that includes grammatical errors and generate, as output from the trained text-generating model, a second output text in the second language that corrects the grammatical errors.
  11. A system for training a text-generating model for grammatical error correction (GEC), the system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a multilingual set of grammatically correct text samples, each grammatically correct text sample comprising a monolingual textual representation of a respective sentence; for each grammatically correct text sample of the multilingual set of grammatically correct text samples, generating a corrupted synthetic ungrammatical text version of the respective grammatically correct text sample by making a grammatical change to the monolingual textual representation of the respective sentence associated with the respective grammatically correct text sample, wherein the grammatical change is made to the monolingual textual representation of the respective sentence associated with the respective grammatically correct text sample without using a dictionary that is specific to a language of the monolingual textual representation of the respective sentence such that the grammatical change made to the monolingual textual representation of the respective sentence is agnostic to the language of the monolingual textual representation of the respective sentence; and training the text-generating model using a training set of sample pairs, each sample pair in the training set of sample pairs comprising: one of the respective grammatically correct text samples of the multilingual set of grammatically correct text samples; and the corresponding corrupted synthetic ungrammatical text version of the one of the respective grammatically correct text samples of the multilingual set of grammatically correct text samples.
  12. The system of claim 11, wherein the operations further comprise, after training the text-generating model, fine-tuning the trained text-generating model using supervised training data, the supervised training data comprising non-synthetic text pairs, each non-synthetic text pair comprising an ungrammatical text sample and a grammatical text version of the ungrammatical text sample.
  13. The system of claim 11, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises removing more than one character from the respective sentence associated with the respective grammatically correct text sample.
  14. The system of claim 11, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises replacing a first set of characters from the respective sentence associated with the respective grammatically correct text sample with a second set of characters different from the first set of characters.
  15. The system of claim 11, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises inserting one or more characters into the respective sentence associated with the respective grammatically correct text sample.
  16. The system of claim 11, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises changing a character-case for a character of a word of the respective sentence associated with the respective grammatically correct text sample.
  17. The system of claim 11, wherein generating the corrupted synthetic ungrammatical text version of the respective grammatically correct text sample comprises randomly applying a corruption operation to the respective sentence associated with the respective grammatically correct text sample, the corruption operation comprising at least one of: removing more than one character from the respective sentence associated with the respective grammatically correct text sample; replacing a first set of characters from the respective sentence associated with the respective grammatically correct text sample with a second set of characters different from the first set of characters; inserting one or more characters into the respective sentence associated with the respective grammatically correct text sample; or changing a character-case of a word of the respective sentence associated with the respective grammatically correct text sample, wherein each corrupted synthetic ungrammatical text version is unique with respect to the other corrupted synthetic ungrammatical text versions of the text samples.
  18. The system of claim 11, wherein the text-generating model comprises a transformer encoder-decoder architecture.
  19. The system of claim 11, wherein the operations further comprise pre-training the text-generating model with a multilingual training corpus based on a masked-language objective.
  20. The system of claim 11, wherein after training the text-generating model for GEC, the trained text-generating model is configured to: receive, as input, a first input text in a first language that includes grammatical errors and generate, as output, a first output text in the first language that corrects the grammatical errors; and receive, as input, a second input text in a different second language that includes grammatical errors and generate, as output from the trained text-generating model, a second output text in the second language that corrects the grammatical errors.
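The corruption operations recited in claims 3 through 7 act on raw characters rather than on any language-specific dictionary, which is what makes them language-agnostic. A minimal sketch of how such operations might look (all function names and parameters here are illustrative, not taken from the patent):

```python
import random

def delete_chars(s: str, rng: random.Random, n: int = 2) -> str:
    """Remove more than one character at random positions (cf. claim 3)."""
    chars = list(s)
    for _ in range(min(n, len(chars) - 1)):
        chars.pop(rng.randrange(len(chars)))
    return "".join(chars)

def replace_chars(s: str, rng: random.Random) -> str:
    """Replace a character with a different one drawn from the sentence
    itself, so no language-specific alphabet is needed (cf. claim 4)."""
    chars = list(s)
    i = rng.randrange(len(chars))
    candidates = [c for c in s if c != chars[i]]
    if candidates:
        chars[i] = rng.choice(candidates)
    return "".join(chars)

def insert_char(s: str, rng: random.Random) -> str:
    """Insert one character by duplicating a neighbour (cf. claim 5)."""
    chars = list(s)
    i = rng.randrange(len(chars))
    chars.insert(i, chars[i])
    return "".join(chars)

def flip_case(s: str, rng: random.Random) -> str:
    """Change the character-case of one cased character (cf. claim 6)."""
    chars = list(s)
    cased = [i for i, c in enumerate(chars) if c.swapcase() != c]
    if cased:
        i = rng.choice(cased)
        chars[i] = chars[i].swapcase()
    return "".join(chars)

def corrupt(s: str, rng: random.Random) -> str:
    """Randomly apply one corruption operation (cf. claim 7)."""
    op = rng.choice([delete_chars, replace_chars, insert_char, flip_case])
    return op(s, rng)
```

Because every operation works on code points, the same code corrupts an English, German, or Japanese sentence without per-language resources.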

Description

TECHNICAL FIELD

This disclosure relates to multilingual grammatical error correction.

BACKGROUND

As user-generated text continues to play a significant role in human-computer interaction and human-to-human interaction using a computing device, the ability of a Natural Language Generation (NLG) system to ensure that the user-generated text is grammatically accurate can be an important aspect of communication. For instance, grammatically accurate text enables readability and may prevent potential miscommunication or misunderstanding. That is, grammatical errors may change the meaning of a communication or lead to some degree of confusion as to the meaning of the text. Although conventional grammatical error correction techniques attempt to address grammar problems in text, such techniques often suffer from issues with training data (e.g., scarcity of training data, label accuracy of training data, and/or a lack of bias in error distributions for training data), causing grammatical error correction models to be limited in their capabilities.

SUMMARY

One aspect of the disclosure provides a computer-implemented method of training a text-generating model for grammatical error correction (GEC). The method when executed by data processing hardware causes the data processing hardware to perform operations. The operations include obtaining a multilingual set of text samples where each text sample includes a monolingual textual representation of a respective sentence. The operations also include, for each text sample of the multilingual set of text samples, generating a corrupted synthetic version of the respective text sample where the corrupted synthetic version of the respective text sample includes a grammatical change to the monolingual textual representation of the respective sentence associated with the respective text sample. The operations further include training the text-generating model using a training set of sample pairs.
Each sample pair in the training set of sample pairs includes one of the respective text samples of the multilingual set of text samples and the corresponding corrupted synthetic version of the one of the respective text samples of the multilingual set of text samples. Another aspect of the disclosure provides a system for training a text-generating model for grammatical error correction (GEC). The system includes data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that when executed on the data processing hardware cause the data processing hardware to perform operations. The operations include obtaining a multilingual set of text samples where each text sample includes a monolingual textual representation of a respective sentence. The operations also include, for each text sample of the multilingual set of text samples, generating a corrupted synthetic version of the respective text sample where the corrupted synthetic version of the respective text sample includes a grammatical change to the monolingual textual representation of the respective sentence associated with the respective text sample. The operations further include training the text-generating model using a training set of sample pairs. Each sample pair in the training set of sample pairs includes one of the respective text samples of the multilingual set of text samples and the corresponding corrupted synthetic version of the one of the respective text samples of the multilingual set of text samples. Implementations of the method or the system of the disclosure may include one or more of the following optional features.
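The training set of sample pairs described above, in which each corrupted synthetic version serves as the model's input and its clean original as the target, could be assembled along the following lines. This is a sketch only; `toy_corrupt` is a stand-in for the disclosure's corruption operations, and the function names are not from the patent.

```python
import random
from typing import Callable, List, Tuple

def toy_corrupt(sentence: str, rng: random.Random) -> str:
    # Stand-in corruption: flip the case of one randomly chosen
    # character (one of the operations the disclosure describes).
    i = rng.randrange(len(sentence))
    return sentence[:i] + sentence[i].swapcase() + sentence[i + 1:]

def build_training_pairs(
    samples: List[str],
    corrupt_fn: Callable[[str, random.Random], str] = toy_corrupt,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Pair each grammatically correct sample with a corrupted
    synthetic version: (ungrammatical source, grammatical target)."""
    rng = random.Random(seed)
    return [(corrupt_fn(s, rng), s) for s in samples]

# The multilingual set can mix languages freely, since the corruption
# never consults a language-specific dictionary.
pairs = build_training_pairs(["Sie geht heute.", "They walk home."])
```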
In some implementations, the operations further include, after training the text-generating model, fine-tuning the trained text-generating model using supervised training data where the supervised training data includes non-synthetic text pairs with each non-synthetic text pair including an ungrammatical text sample and a grammatical text version of the ungrammatical text sample. In some examples, generating the corrupted synthetic version of the respective text sample includes removing more than one character from the respective sentence associated with the respective text sample. In some configurations, generating the corrupted synthetic version of the respective text sample includes replacing a first set of characters from the respective sentence associated with the respective text sample with a second set of characters different from the first set of characters. In some implementations, generating the corrupted synthetic version of the respective text sample includes inserting one or more characters into the respective sentence associated with the respective text sample. Optionally, generating the corrupted synthetic version of the respective text sample includes changing a character-case for a character of a word of the respective sentence associated with the respective text sample. The text-generating model may include a transformer encoder-decoder architecture.
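The overall schedule implied by the claims, training on the large language-agnostic synthetic set first and then fine-tuning on the smaller set of non-synthetic, human-labelled pairs, can be expressed as a simple two-stage loop. Here `train_step` is a hypothetical callback standing in for one optimizer update on the text-generating model (e.g., a transformer encoder-decoder); none of these names come from the patent.

```python
from typing import Callable, List, Tuple

# A pair is (ungrammatical source, grammatical target).
Pair = Tuple[str, str]

def train_gec_model(
    train_step: Callable[[Pair], None],
    synthetic_pairs: List[Pair],
    supervised_pairs: List[Pair],
) -> None:
    """Two-stage schedule: train on synthetic pairs, then fine-tune
    on non-synthetic supervised pairs (cf. claims 1 and 2)."""
    for pair in synthetic_pairs:   # stage 1: synthetic training
        train_step(pair)
    for pair in supervised_pairs:  # stage 2: supervised fine-tuning
        train_step(pair)
```

In practice each stage would run for many epochs with shuffling and early stopping; the sketch only fixes the ordering of the two data sources.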