
US-12619835-B2 - Adapters for zero-shot multilingual neural machine translation

US12619835B2

Abstract

Multilingual neural machine translation systems having monolingual adapter layers and bilingual adapter layers for zero-shot translation include an encoder configured for encoding an input sentence in a source language into an encoder representation and a decoder configured for processing output of the encoder adapter layer to generate a decoder representation. The encoder includes an encoder adapter selector for selecting, from a plurality of encoder adapter layers, an encoder adapter layer for the source language to process the encoder representation. The decoder includes a decoder adapter selector for selecting, from a plurality of decoder adapter layers, a decoder adapter layer for a target language for generating a translated sentence of the input sentence in the target language from the decoder representation.

Inventors

  • Matthias Galle
  • Alexandre Berard
  • Laurent Besacier
  • Jerin Philip

Assignees

  • NAVER CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2021-11-08

Claims (19)

  1. A multilingual neural machine translation system comprising at least one processor for translating an input sequence from a source language to a target language, comprising: an encoder configured for encoding the input sequence in the source language into an encoder representation, wherein the encoder comprises an encoder adapter selector for selecting, from a plurality of encoder adapter layers, an encoder adapter layer corresponding to the source language for processing the encoder representation; and a decoder configured for processing output of the encoder adapter layer to generate a decoder representation, wherein the decoder comprises a decoder adapter selector for selecting, from a plurality of decoder adapter layers, a decoder adapter layer corresponding to the target language for generating a translation of the input sequence in the target language from the decoder representation; wherein (i) at least one of each encoder adapter layer and each decoder adapter layer corresponding to a language in a set of languages is trained with parallel data of at least one other language in the set of languages, and (ii) at least another one of an encoder adapter layer and a decoder adapter layer corresponding to a language in the set of languages is not trained with parallel data of at least one other language in the set of languages; and wherein the multilingual neural machine translation system is configured to perform zero-shot translation using the encoder representation and the decoder representation that are produced by the encoder adapter layer and the decoder adapter layer, respectively, for the language in the set of languages that is not trained with parallel data of at least one other language in the set of languages.
  2. The multilingual neural machine translation system of claim 1, wherein the at least one of each encoder adapter layer and each decoder adapter layer is a monolingual adapter layer implemented by the processor and trained using parallel data for the set of languages.
  3. The multilingual neural machine translation system of claim 1, wherein the encoder comprises a plurality of transformer encoder layers forming an encoder pipeline, wherein each transformer encoder layer comprises a respective encoder adapter layer for the source language.
  4. The multilingual neural machine translation system of claim 3, wherein the decoder comprises a plurality of transformer decoder layers forming a decoder pipeline, wherein each transformer decoder layer comprises a respective decoder adapter layer for the target language.
  5. The multilingual neural machine translation system of claim 1, wherein the encoder and the decoder comprise transformers, and wherein the encoder adapter layer and the decoder adapter layer are adapter layers comprising a feed-forward network with a bottleneck layer.
  6. The multilingual neural machine translation system of claim 5, wherein each adapter layer of the plurality of encoder adapter layers and the plurality of decoder adapter layers comprises a residual connection between input of each adapter layer and output of each adapter layer.
  7. The multilingual neural machine translation system of claim 1, further comprising: a source pre-processing unit, implemented by the processor, with an initial source embedding layer trained on the plurality of languages and one or more language-specific source embedding layers that are each trained on languages that are not one of the plurality of languages; one of the initial source embedding layer and the one or more language-specific source embedding layers being configured to pre-process the input sequence to generate representations for input to the encoder; and a target pre-processing unit, implemented by the processor, with an initial target embedding layer trained on the plurality of languages and one or more language-specific target embedding layers that are each trained on the languages that are not one of the plurality of languages; one of the initial target embedding layer and the one or more language-specific target embedding layers being configured to pre-process the input sequence to generate representations for input to the decoder.
  8. The multilingual neural machine translation system of claim 7, wherein the encoder and the decoder are configured with language-specific parameters that correspond to the one or more language-specific embedding layers, independent of the parameters that correspond to the plurality of languages.
  9. The multilingual neural machine translation system of claim 8, wherein the source pre-processing unit is configured with language codes that are associated with the one or more language-specific target embedding layers, independent of the initial embedding layers that are associated with the plurality of languages.
  10. A multilingual neural machine translation method for translating an input sequence from a source language to a target language, comprising: storing in a memory an encoder having a plurality of encoder adapter layers and a decoder having a plurality of decoder adapter layers; selecting, from the plurality of encoder adapter layers, an encoder adapter layer for the source language; processing, using the selected encoder adapter layer corresponding to the source language, the input sequence in the source language to generate an encoder representation; selecting, from the plurality of decoder adapter layers, a decoder adapter layer for the target language; and processing, using the selected decoder adapter layer corresponding to the target language, the encoder representation to generate a translation of the input sequence in the target language; wherein (i) the encoder adapter layers and the decoder adapter layers are trained using parallel data for a set of languages, (ii) at least one of each encoder adapter layer and each decoder adapter layer corresponding to a language in the set of languages is trained with parallel data of at least one other language in the set of languages, and (iii) at least another one of an encoder adapter layer and a decoder adapter layer corresponding to a language in the set of languages is not trained with parallel data of at least one other language in the set of languages; and wherein the multilingual neural machine translation system is configured to perform zero-shot translation using the encoder representation and the decoder representation that are produced by the encoder adapter layer and the decoder adapter layer, respectively, for the language in the set of languages that is not trained with parallel data of at least one other language in the set of languages.
  11. The multilingual neural machine translation method of claim 10, wherein the at least one of each encoder adapter layer and each decoder adapter layer is a monolingual adapter layer trained using parallel data for the set of languages.
  12. The multilingual neural machine translation method of claim 10, wherein the at least one of each encoder adapter layer and each decoder adapter layer is a bilingual adapter layer trained using parallel data for the set of languages.
  13. The multilingual neural machine translation method of claim 10, wherein said storing the encoder further comprises storing a plurality of transformer encoder layers forming an encoder pipeline, wherein each transformer encoder layer comprises a respective encoder adapter layer for the source language.
  14. The multilingual neural machine translation method of claim 13, wherein said storing the decoder further comprises storing a plurality of transformer decoder layers forming a decoder pipeline, wherein each transformer decoder layer comprises a respective decoder adapter layer for the target language.
  15. The multilingual neural machine translation method of claim 10, wherein the encoder and the decoder stored in said memory are transformers.
  16. The multilingual neural machine translation method of claim 15, wherein the encoder adapter layer and the decoder adapter layer stored in said memory are adapter layers comprising a feed-forward network with a bottleneck layer.
  17. The multilingual neural machine translation method of claim 16, wherein each adapter layer of the plurality of encoder adapter layers and the plurality of decoder adapter layers stored in said memory has a residual connection between input of each adapter layer and output of each adapter layer.
  18. The multilingual neural machine translation method of claim 10, further comprising: storing in the memory a source pre-processing unit with an initial source embedding layer trained on the plurality of languages and one or more language-specific source embedding layers that are each trained on languages that are not one of the plurality of languages; selecting one from the initial source embedding layer and the one or more language-specific source embedding layers to pre-process the input sequence in the source language to generate representations for input to the encoder; storing in the memory a target pre-processing unit with an initial target embedding layer trained on the plurality of languages and one or more language-specific target embedding layers that are each trained on the languages that are not one of the plurality of languages; and selecting one from the initial target embedding layer and the one or more language-specific target embedding layers to pre-process the input sequence to generate representations for input to the decoder.
  19. The multilingual neural machine translation method of claim 18, further comprising: storing in the memory language-specific parameters for the encoder or the decoder that correspond to the one or more language-specific embedding layers, independent of parameters stored in the memory that correspond to the plurality of languages, wherein said selecting the source embedding layer for the source language selects the language-specific parameters for the encoder or decoder when the source language is not one of the plurality of languages; and storing in the memory language-specific parameters for the encoder or the decoder that correspond to the one or more language-specific embedding layers, independent of parameters stored in the memory that correspond to the plurality of languages, wherein said selecting the target embedding layer for the target language selects the language-specific parameters for the encoder or the decoder when the target language is not one of the plurality of languages.
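Claims 5 and 6 describe each adapter layer as a feed-forward network with a bottleneck layer and a residual connection between the adapter's input and output. The following minimal sketch illustrates that structure; the ReLU nonlinearity, the weight initialization, and the dimensions are illustrative assumptions, as the claims specify none of them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Illustrative feed-forward adapter with a bottleneck (claim 5)
    and a residual connection (claim 6). Not taken from the patent's
    own implementation; dimensions and activation are assumptions."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection into the bottleneck, then up-projection back
        # to the model dimension.
        self.w_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, x):
        # Residual connection between adapter input and adapter output.
        return x + relu(x @ self.w_down) @ self.w_up
```

Because the bottleneck dimension is small relative to the model dimension, one such adapter per language adds only a modest number of parameters on top of the frozen parent model.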

Description

PRIORITY INFORMATION

The present application claims priority, under 35 USC § 119(e), from U.S. Provisional Patent Application Ser. No. 63/111,863, filed on Nov. 10, 2020, the entire content of which is hereby incorporated by reference. The present application also claims priority, under 35 USC § 119(e), from U.S. Provisional Patent Application Ser. No. 63/253,698, filed on Oct. 8, 2021, the entire content of which is hereby incorporated by reference.

FIELD

The present disclosure relates to multilingual neural machine translation.

BACKGROUND

Multilingual neural machine translation (MNMT), which generates translations across a large number of languages, has progressed considerably. A key feature of MNMT methods is that translation quality can be improved for languages that lack sufficient training data. In the extreme case of zero-shot translation, MNMT systems translate between a language pair that has not been seen at training time. While performance in the low-resource setting has improved over the past years, the zero-shot performance of known MNMT systems remains low. For known MNMT systems, it has been observed that zero-shot performance increases with the number of languages for which the system is trained. However, as the number of considered languages grows, these systems increasingly suffer from insufficient modelling capacity and generate artifacts such as off-target translation. Some solutions that address this problem propose language-aware normalization; others propose back-translation to improve the quality of zero-shot translation.
Adapting a conventional artificial neural network to a new task, such as translation involving a new language that was not used in previous training, requires retraining the whole network, yielding another set of parameters that must be stored. To address this growth in model size in multi-task settings, adapter modules have been proposed, in which lightweight adapter layers are inserted between the layers of a pre-trained parent artificial neural network. In this approach, the parameters of the parent MNMT system remain fixed, so the final multilingual model is only insignificantly larger than the parent MNMT model. In addition, in this approach, adapter layers are trained pair-wise for translation from a particular source language to a particular target language. Such an MNMT system has been shown to mitigate the performance drop in higher-resource languages. In other solutions, plug-and-play encoders and decoders have been proposed, but they require considerably larger model sizes. There therefore continues to be a need for an improved MNMT system that addresses these and other problems.

SUMMARY

The present disclosure sets forth a parameter-efficient artificial neural network for multilingual neural machine translation (MNMT) that allows translation from any source language to any target language seen in the training data, regardless of whether the system has been trained for the specific language direction. The present disclosure also provides a method for adding new source or target languages without having to retrain on the initial set of languages used to train the MNMT system.
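The selection mechanism described in the disclosure can be sketched as follows: the encoder's adapter is chosen by source language and the decoder's adapter by target language, so any source/target pairing is possible, including directions never seen together in training (zero-shot). All names here (`AdapterSelector`, `translate`) are hypothetical illustrations, not identifiers from the patent.

```python
class AdapterSelector:
    """Maps a language code to its monolingual adapter layer."""

    def __init__(self):
        self.adapters = {}  # language code -> adapter callable

    def register(self, lang, adapter):
        self.adapters[lang] = adapter

    def select(self, lang):
        # Pick the single-language adapter for this language.
        return self.adapters[lang]

def translate(encoder, decoder, enc_sel, dec_sel, src_lang, tgt_lang, tokens):
    # Encoder adapter chosen by source language; decoder adapter chosen
    # by target language. The frozen encoder/decoder are shared across
    # all languages, so pairing "de" with "fr" works even if no de-fr
    # parallel data trained these two adapters together.
    enc_out = enc_sel.select(src_lang)(encoder(tokens))
    return decoder(dec_sel.select(tgt_lang)(enc_out))
```

With identity stand-ins for the shared encoder and decoder, registering a toy "de" adapter that adds 1 and a toy "fr" adapter that doubles shows the routing: `translate(..., "de", "fr", [1, 2])` applies the source-side adapter first and the target-side adapter second.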
In a feature, a multilingual neural machine translation system for translating an input sequence from a source language to a target language includes: an encoder configured for encoding the input sequence in the source language into an encoder representation, wherein the encoder includes an encoder adapter selector for selecting, from a plurality of encoder adapter layers, an encoder adapter layer corresponding to the source language for processing the encoder representation; and a decoder configured for processing output of the encoder adapter layer to generate a decoder representation, wherein the decoder includes a decoder adapter selector for selecting, from a plurality of decoder adapter layers, a decoder adapter layer corresponding to the target language for generating a translation of the input sequence in the target language from the decoder representation; wherein the adapter layers are monolingual adapter layers (i.e., single-language adapter layers) trained using parallel data for a set of languages. In further features, the multilingual neural machine translation system is configured, where (i) each adapter layer corresponding to a language in the set of languages is trained with parallel data of at least one other language in the set of languages, and (ii) at least one adapter layer corresponding to a language in the set of languages is not trained with parallel data of at least one other language in the set of languages, to perform zero-shot translation using the encoder representation and the decoder representation that