
US-20260127502-A1 - CONTRASTIVE SEQUENCE-TO-SEQUENCE DATA SELECTOR

US 20260127502 A1

Abstract

A method includes generating a base model by training with a first dataset of data pairs and generating an adapted model by training the base model on a second dataset of data pairs. The method also includes determining a contrastive score for each data pair of a third dataset of data pairs using the base model and the adapted model. The contrastive score is indicative of a probability of quality of the respective data pair. The method also includes training a target model using the data pairs of the third dataset and the contrastive scores.
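The two-stage setup described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`build_selector_models`, `train_fn`, `finetune_fn`) and the assumption that the adapted model is obtained by fine-tuning a copy of the base model are hypothetical.

```python
def build_selector_models(train_fn, finetune_fn, first_dataset, second_dataset):
    # Hypothetical two-stage setup: train a base model on the first
    # dataset, then fine-tune it on the second (typically cleaner)
    # dataset to obtain the adapted model. The two models together
    # later serve as the contrastive data selector.
    base = train_fn(first_dataset)
    adapted = finetune_fn(base, second_dataset)
    return base, adapted
```

With stub training functions, the sketch simply threads the base model into the adaptation step, mirroring the abstract's "training the base model on a second dataset" language.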

Inventors

  • Wei Wang
  • Bowen Liang
  • Macduff Hughes
  • Taro Watanabe
  • Tetsuji Nakagawa
  • Alexander Rudnick

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-07
Application Date
2025-12-30

Claims (20)

  1. A computer-implemented method executed by data processing hardware that causes the data processing hardware to perform operations comprising: obtaining a first model trained on a first dataset; obtaining a second model trained on a second dataset; obtaining a third dataset comprising a plurality of training data pairs; for each respective training data pair, determining, using the first model and the second model, a contrastive score for the respective training data pair, the contrastive score indicating a quality of the respective training data pair; selecting a subset of training data pairs from the third dataset based on the contrastive score determined for each respective training data pair; and training a third model on the selected subset of data pairs from the third dataset.
  2. The method of claim 1, wherein the contrastive score comprises a Kullback-Leibler (KL) divergence between a first probability distribution associated with the first model and a second probability distribution associated with the second model.
  3. The method of claim 1, wherein the plurality of training data pairs of the third dataset comprises sentence data pairs each comprising a first sentence in a first language and a second sentence in a second language.
  4. The method of claim 1, wherein the first model, the second model, and the third model each comprise a respective sequence-to-sequence model.
  5. The method of claim 1, wherein training the third model comprises training parameters of the third model while parameters of the first model and the second model are frozen.
  6. The method of claim 1, wherein the first model, the second model, and the third model share a same model architecture and have a same model size.
  7. The method of claim 1, wherein the second model is an adapted model trained to shift probability mass from noisy data to clean data relative to the first model.
  8. The method of claim 1, wherein the contrastive score for each respective training data pair is determined using a unified metric for data quality representing a probability of cleanness of the respective training data pair.
  9. The method of claim 1, wherein the first model is trained on the first dataset until convergence prior to obtaining the second model.
  10. The method of claim 1, wherein the third model is trained on lower-quality data pairs from the third dataset at a beginning of training and on higher-quality data pairs from the third dataset towards an end of training.
  11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a first model trained on a first dataset; obtaining a second model trained on a second dataset; obtaining a third dataset comprising a plurality of training data pairs; for each respective training data pair, determining, using the first model and the second model, a contrastive score for the respective training data pair, the contrastive score indicating a quality of the respective training data pair; selecting a subset of training data pairs from the third dataset based on the contrastive score determined for each respective training data pair; and training a third model on the selected subset of data pairs from the third dataset.
  12. The system of claim 11, wherein the contrastive score comprises a Kullback-Leibler (KL) divergence between a first probability distribution associated with the first model and a second probability distribution associated with the second model.
  13. The system of claim 11, wherein the plurality of training data pairs of the third dataset comprises sentence data pairs each comprising a first sentence in a first language and a second sentence in a second language.
  14. The system of claim 11, wherein the first model, the second model, and the third model each comprise a respective sequence-to-sequence model.
  15. The system of claim 11, wherein training the third model comprises training parameters of the third model while parameters of the first model and the second model are frozen.
  16. The system of claim 11, wherein the first model, the second model, and the third model share a same model architecture and have a same model size.
  17. The system of claim 11, wherein the second model is an adapted model trained to shift probability mass from noisy data to clean data relative to the first model.
  18. The system of claim 11, wherein the contrastive score for each respective training data pair is determined using a unified metric for data quality representing a probability of cleanness of the respective training data pair.
  19. The system of claim 11, wherein the first model is trained on the first dataset until convergence prior to obtaining the second model.
  20. The system of claim 11, wherein the third model is trained on lower-quality data pairs from the third dataset at a beginning of training and on higher-quality data pairs from the third dataset towards an end of training.
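The KL-divergence contrastive score of claims 2 and 12 can be illustrated per data pair. A common instantiation (an assumption here, not language from the claims) scores each pair by the length-normalized difference in log-likelihood under the adapted and base models; a positive score means the adapted model prefers the pair more than the base model does, suggesting the pair is clean.

```python
def contrastive_score(base_logprob, adapted_logprob, target_len):
    # Hypothetical scoring rule: length-normalized difference of
    # log-likelihoods log P_adapted(y|x) - log P_base(y|x). Higher
    # scores indicate pairs the adapted (cleaner) model prefers
    # relative to the base (noisier) model.
    return (adapted_logprob - base_logprob) / max(target_len, 1)
```

In expectation over the data, this per-pair quantity relates to the KL divergence between the two models' output distributions, which is how the claimed score can rank an entire dataset with a single unified metric.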

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This U.S. patent application is a continuation of, and claims priority under 35 U.S.C. § 120 from, U.S. patent application Ser. No. 18/351,397, filed on Jul. 12, 2023, which is a continuation of Ser. No. 16/376,254, now U.S. Pat. No. 11,734,600, filed on Apr. 5, 2019, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 62/668,650, filed on May 8, 2018. The disclosures of these prior applications are considered part of the disclosure of this application and are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to contrastive sequence-to-sequence data selectors for training neural translation models on noisy data.

BACKGROUND

A neural translation model learns to distribute probability mass over translations. A model trainer typically trains the model with parallel data such that more plausible translations receive higher probabilities than less plausible ones. When trained on very noisy parallel data, the learned distribution is inaccurate, which in turn produces less precise translations. However, large-scale, high-quality data that is clean and matches the test domain is rare. Automatic data miners typically produce parallel data, and a sentence aligner processes that parallel data; this processing may introduce severe noise. Trainers typically address this issue as a classification problem, training a convolutional network with a small amount of clean (or in-domain) data to classify data as good or bad. The trainer then uses the selected data to train a system having a different architecture from the selector. Thus, what the selector identifies as good data may not necessarily be good data for the final model.

SUMMARY

One aspect of the disclosure provides a method for training target models.
The method includes generating, by data processing hardware, a base model by training with a first dataset of data pairs, and generating, by the data processing hardware, an adapted model by training the base model on a second dataset of data pairs. The method also includes determining, by the data processing hardware, a contrastive score for each data pair of a third dataset of data pairs using the base model and the adapted model. The contrastive score is indicative of a probability of quality of the respective data pair. The method also includes training, by the data processing hardware, a target model using data pairs of the third dataset and the contrastive scores.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, training the target model further includes using data pairs of the third dataset that satisfy a threshold contrastive score. In some examples, the method further includes: determining, by the data processing hardware, that the target model is a same size as the base model; replacing, by the data processing hardware, the base model with the adapted model; replacing, by the data processing hardware, the adapted model with the target model; determining, by the data processing hardware, the contrastive score for each data pair of a fourth dataset of data pairs using the base model and the replaced adapted model; and training, by the data processing hardware, a subsequent target model using the data pairs of the fourth dataset and the contrastive scores. In other examples, the target model is larger than the base model. The first dataset may include random data. Here, when the first dataset includes random data, the second dataset may include data that is cleaner than the random data of the first dataset. Additionally or alternatively, the contrastive score may include a Kullback-Leibler (KL) divergence and/or each dataset may include sentence language pairs.
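The threshold-based selection and the model-replacement round described above can be sketched in a few lines. Everything here is a hypothetical illustration under assumed names (`select_pairs`, `rotate_models`, `score_fn`); the disclosure does not prescribe this code.

```python
def select_pairs(dataset, score_fn, threshold=0.0):
    # Keep only the data pairs whose contrastive score satisfies the
    # threshold; the target model is then trained on the survivors.
    return [pair for pair in dataset if score_fn(pair) >= threshold]

def rotate_models(base, adapted, target, target_same_size_as_base):
    # When the trained target matches the base model's size, promote
    # the adapted model to base and the target to adapted, so the
    # selector itself improves for the next dataset (e.g. a fourth
    # dataset of data pairs).
    if target_same_size_as_base:
        return adapted, target
    return base, adapted
```

The rotation step is what makes the scheme self-reinforcing: each round's cleaner-trained model becomes part of the selector that scores the next round's data.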
In some implementations, the method further includes sorting, by the data processing hardware, the data pairs of the third dataset based on the respective contrastive scores. In these examples, training the target model may further include generating a plurality of data batches and using each data batch to train the target model. Here, each data batch includes at least one data pair, a probability that a select data pair is included in a select data batch is based on the respective contrastive score of the select data pair, and the probability increases as the respective contrastive score increases. Furthermore, in these examples, generating the plurality of data batches may include: determining a selection ratio for each data batch; determining a batch size for each data batch based on the selection ratio and a number of data pairs in the third dataset; selecting a number of data pairs from the third dataset that corresponds with the determined batch size; sorting the selected data pairs based on the respective contrastive scores; and removing, from the data batch, a removal ratio of the selected data pairs with the lowest respective contrastive scores.
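The batch-generation steps above can be sketched as one helper. This is a minimal sketch under assumptions: the name `make_batch` is invented, and the removal step is assumed to drop the lowest-scored fraction of the sampled pool, consistent with higher-scored pairs being more likely to survive into a batch.

```python
import random

def make_batch(pairs_with_scores, selection_ratio, removal_ratio, rng=random):
    # Hypothetical batch construction: sample a candidate pool whose
    # size is set by the selection ratio, sort the pool by contrastive
    # score, then drop the lowest-scored removal_ratio fraction so the
    # surviving batch is biased toward cleaner pairs.
    n = len(pairs_with_scores)
    pool_size = max(1, int(selection_ratio * n))
    pool = rng.sample(pairs_with_scores, pool_size)
    pool.sort(key=lambda item: item[1], reverse=True)
    keep = pool_size - int(removal_ratio * pool_size)
    return [pair for pair, _score in pool[:keep]]
```

Annealing the removal ratio upward over training would reproduce the curriculum of claims 10 and 20: early batches retain lower-quality pairs, while later batches keep only the highest-scored ones.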