
EP-4742086-A1 - SERVER DEVICE FOR TRAINING TRANSLATION MODEL, ELECTRONIC DEVICE USING TRAINED TRANSLATION MODEL, AND METHODS THEREFOR

EP 4742086 A1

Abstract

A server device for training a translation model, an electronic device using the trained translation model, and methods therefor are disclosed. The electronic device comprises: a memory storing an actual speech uttered in a first language, a first text corresponding to the actual speech, a second text in a target language corresponding to the first text, and a translation model; a communication unit; and a processor. The processor generates a synthesized speech corresponding to the first text, and can train the translation model by using the actual speech, the first text, the synthesized speech, and the second text.

Inventors

  • JUNG, JUNGHO

Assignees

  • Samsung Electronics Co., Ltd.

Dates

Publication Date
2026-05-13
Application Date
2024-08-20

Claims (15)

  1. A server device comprising: memory storing an actual speech uttered in a first language, a first text corresponding to the actual speech, a second text which is in a second language and corresponds to the first text, and a translation model; and a processor, wherein the processor is configured to: generate a synthesized speech corresponding to the first text, and train the translation model by using the actual speech, the first text, the synthesized speech, and the second text.
  2. The server device of claim 1, wherein the translation model comprises a vector quantization (VQ) codebook including feature information of speeches, and the processor is configured to: update the VQ codebook by generating feature information adjusted such that a difference between features of the actual speech and the synthesized speech becomes smaller than a predetermined threshold value, based on feature information corresponding to the actual speech and feature information corresponding to the synthesized speech.
  3. The server device of claim 2, wherein the memory stores a discriminator module for distinguishing a difference between the actual speech and the synthesized speech, and the processor is configured to: input a first output value of the translation model that received input of the actual speech, and a second output value of the translation model that received input of the synthesized speech into the discriminator module, and reupdate the VQ codebook included in the translation model by comparing an output value of the discriminator module and the threshold value until the output value of the discriminator module becomes smaller than the threshold value.
  4. The server device of claim 3, wherein the memory stores a regularizer module for learning the feature information by dividing the information for each text of the first language, and the processor is configured to: input the first output value and the second output value obtained for each text of the first language into the regularizer module, and update the VQ codebook such that each text has different feature information based on an output value of the regularizer module.
  5. The server device of claim 4, wherein the memory stores: a first subsampler configured to sample an actual speech signal; a first encoder configured to encode the actual speech sampled in the first subsampler; a second subsampler configured to sample a synthesized speech signal obtained by converting the first text by using a TTS module; and a second encoder configured to encode the synthesized speech sampled in the second subsampler, and the processor is configured to: repeatedly train the translation model until similarity between an output value of the translation model for the actual speech encoded in the first encoder and an output value of the translation model for the synthesized speech encoded in the second encoder becomes greater than or equal to the predetermined threshold value.
  6. The server device of claim 5, wherein the translation model further comprises: a shared encoder configured to encode a speech based on the feature information included in the VQ codebook; and a decoder configured to extract a text in the second language by decoding a feature vector output from the shared encoder based on dictionary data, and the processor is configured to: train the translation model by using the actual speech and the synthesized speech in a state wherein update of the VQ codebook has been completed.
  7. The server device of claim 6, further comprising: a communicator, wherein the processor is configured to: based on receiving index information from at least one electronic apparatus wherein the translation model including the VQ codebook was installed through the communicator, extract feature information corresponding to the index information from the VQ codebook stored in the memory, and generate a text in the second language corresponding to the extracted feature information by using the shared encoder and the decoder, and transmit the generated text to the at least one electronic apparatus through the communicator.
  8. An electronic apparatus comprising: a microphone; a communicator; a display; memory storing a VQ codebook trained based on actual speeches and synthesized speeches; and a processor, wherein the processor is configured to: based on receiving input of a speech signal in a first language through the microphone, extract index information of feature information corresponding to the speech signal among feature information recorded in the VQ codebook, and transmit the index information to a server device through the communicator, and based on information about a text in a second language corresponding to the index information being transmitted from the server device, control the display to display the text in the second language.
  9. A method, performed by a server device, for training a translation model, the method comprising: generating a synthesized speech based on a first text corresponding to an actual speech uttered in a first language; and training a translation model by using the actual speech, the first text, the synthesized speech, and a second text which is in a target language corresponding to the first text.
  10. The training method of claim 9, wherein the translation model comprises a vector quantization (VQ) codebook including feature information of speeches, and the training comprises: updating the VQ codebook by generating feature information that was adjusted such that a difference in features of the actual speech and the synthesized speech becomes smaller than a predetermined threshold value based on feature information corresponding to the actual speech and feature information corresponding to the synthesized speech.
  11. The training method of claim 10, wherein the training comprises: obtaining each of a first output value of the translation model that received input of the actual speech, and a second output value of the translation model that received input of the synthesized speech; inputting the first output value and the second output value into a discriminator module for distinguishing a difference between the actual speech and the synthesized speech; and reupdating the VQ codebook included in the translation model by comparing an output value of the discriminator module and the threshold value until an output value of the discriminator module becomes smaller than the threshold value.
  12. The training method of claim 11, wherein the training further comprises: inputting the first output value and the second output value obtained for each text of the first language into a regularizer module, and updating the VQ codebook such that each text has different feature information based on an output value of the regularizer module.
  13. The training method of claim 12, wherein the training comprises: sampling an actual speech signal; encoding the sampled actual speech by using a first encoder; sampling a synthesized speech signal obtained by converting the first text by using a TTS module; encoding the sampled synthesized speech by using a second encoder; and repeatedly training the translation model until similarity between an output value of the translation model for the actual speech encoded by the first encoder and an output value of the translation model for the synthesized speech encoded by the second encoder becomes greater than or equal to the predetermined threshold value.
  14. The training method of claim 13, wherein the translation model further comprises: a shared encoder configured to encode a speech based on the feature information included in the VQ codebook; and a decoder configured to extract a text in the second language by decoding a feature vector output from the shared encoder based on dictionary data, and the training comprises: training the translation model by using the actual speech and the synthesized speech in a state wherein update of the VQ codebook has been completed.
  15. A non-transitory computer-readable recording medium storing a program for executing a method, performed by a server device, for training a translation model, wherein the method comprises: generating a synthesized speech based on a first text corresponding to an actual speech uttered in a first language; and training a translation model by using the actual speech, the first text, the synthesized speech, and a second text which is in a target language corresponding to the first text.
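The codebook mechanism recited in claims 2, 3, 10, and 11 can be sketched in miniature: speech features are quantized to nearest-codebook indices, and the codebook is iteratively adjusted until the gap between quantized actual and synthesized features drops below a threshold. This is a hypothetical toy illustration only — the `nearest` and `update_codebook` names, the midpoint-update rule, and the mean-distance stand-in for the discriminator module are all assumptions, not the patent's actual training procedure.

```python
import numpy as np

def nearest(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Index of the closest codebook row for each feature vector: (T, D) -> (T,)."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=-1)

def update_codebook(actual, synth, codebook, threshold=1e-3, lr=0.5, max_steps=200):
    """Toy stand-in for claims 2-3: adjust codebook entries until the gap
    between quantized actual and synthesized features falls below threshold."""
    codebook = codebook.copy()
    gap = float("inf")
    for _ in range(max_steps):
        actual_q = codebook[nearest(actual, codebook)]
        synth_q = codebook[nearest(synth, codebook)]
        # Stand-in "discriminator": mean distance between the two quantizations.
        gap = float(np.mean(np.linalg.norm(actual_q - synth_q, axis=-1)))
        if gap < threshold:
            break
        # Pull the assigned entries toward the midpoint of each paired utterance
        # (assumes actual and synth are frame-aligned pairs).
        midpoint = (actual + synth) / 2.0
        for i, t in zip(nearest(actual, codebook), midpoint):
            codebook[i] += lr * (t - codebook[i])
        for i, t in zip(nearest(synth, codebook), midpoint):
            codebook[i] += lr * (t - codebook[i])
    return codebook, gap
```

With, say, an actual feature `[[0, 0]]`, a synthesized feature `[[2, 2]]`, and an initial codebook `[[0, 0], [2, 2]]`, both entries migrate toward the shared midpoint until the gap falls below the threshold, mirroring the "reupdate until the discriminator output becomes smaller than the threshold" loop of claim 3.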

Description

[Technical Field] The disclosure relates to a server device that trains a translation model performing translation between different languages, an electronic apparatus using the translation model, and methods therefor.
[Background Art] For communication among users of various languages, use of translation models that perform translation between different languages is increasing. In particular, there is growing interest in multi-language models that perform translation for two or more languages. Conventional systems used a step-by-step (cascaded) method, in which two translation models are chained, producing problems of error propagation and latency. Accordingly, a need arose for an end-to-end model to resolve these problems of translation errors and accuracy. However, even when an end-to-end model is used, paired data of actually uttered speech and source-language text is scarce, which makes training a translation model difficult; a solution to this problem is therefore needed.
[Disclosure of Invention] [Solution to Problem] A server device according to at least one embodiment of the disclosure includes a processor and memory. The processor may generate a synthesized speech corresponding to a first text of an actual speech uttered in a first language, and train a translation model by using the actual speech, the first text, the synthesized speech, and a second text which is in a target language corresponding to the first text. An electronic apparatus according to another embodiment of the disclosure includes a microphone, a communicator, a display, memory storing a VQ codebook trained based on actual speeches and synthesized speeches, and a processor.
The processor may, based on receiving input of a speech signal in a first language through the microphone, extract index information of feature information corresponding to the speech signal among feature information recorded in the VQ codebook, and transmit the index information to a server device through the communicator, and based on information about a text in a second language corresponding to the index information being transmitted from the server device, control the display to display the text in the second language. A method for training a translation model of a server device according to still another embodiment of the disclosure may include the steps of generating a synthesized speech based on a first text corresponding to an actual speech uttered in a first language, and training a translation model by using the actual speech, the first text, the synthesized speech, and a second text which is in a target language corresponding to the first text.
[Brief Description of Drawings]
FIG. 1 is a diagram illustrating an operation of translating by using a translation model in at least one electronic apparatus according to the disclosure;
FIG. 2 is a block diagram illustrating a configuration of a server device according to at least one embodiment of the disclosure;
FIG. 3 is a diagram illustrating an example of a method for training a translation model in a server device;
FIG. 4 is a block diagram illustrating a configuration of an electronic apparatus according to at least one embodiment of the disclosure;
FIG. 5 is a flow chart illustrating a method for training a translation model of a server device according to at least one embodiment of the disclosure;
FIG. 6 is a flow chart illustrating a process of translating in a server device according to at least one embodiment of the disclosure; and
FIG. 7 is a flow chart illustrating a process of translating in an electronic apparatus according to at least one embodiment of the disclosure.
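The device–server split described above (claims 7 and 8) can be illustrated with a toy round trip: the device quantizes speech features against the shared VQ codebook and transmits only the resulting indices, and the server recovers the feature vectors from the same codebook and emits second-language text. The codebook values and the dictionary "translator" below are invented placeholders standing in for the trained shared encoder and decoder; nothing here reflects the patent's actual model.

```python
import numpy as np

# Shared VQ codebook (the claims assume the same codebook on device and server).
CODEBOOK = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])

# Placeholder for the server's shared encoder + decoder: maps index tuples
# directly to second-language text instead of running a neural model.
TRANSLATOR = {(1, 1): "hello", (2,): "world"}

def device_encode(features: np.ndarray) -> tuple:
    """On-device step (claim 8): speech features -> nearest codebook indices."""
    dists = np.linalg.norm(features[:, None, :] - CODEBOOK[None, :, :], axis=-1)
    return tuple(int(i) for i in np.argmin(dists, axis=-1))

def server_translate(indices: tuple) -> str:
    """Server step (claim 7): indices -> recovered features -> target text."""
    _features = CODEBOOK[list(indices)]  # would be fed to the shared encoder
    return TRANSLATOR.get(indices, "<unk>")
```

For example, two noisy frames near the codebook entry `[1, 1]` quantize to the index tuple `(1, 1)`, which is all the device needs to send; the server then returns the corresponding second-language text, matching the flows of FIG. 6 and FIG. 7.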
[Mode for Invention] Various modifications may be made to the embodiments of the disclosure, and there may be various types of embodiments. Accordingly, specific embodiments will be illustrated in drawings, and the embodiments will be described in detail in the detailed description of the disclosure. However, it should be noted that the various embodiments are not for limiting the scope of the disclosure to a specific embodiment, but they should be interpreted to include various modifications, equivalents, and/or alternatives of the embodiments of the disclosure. In addition, with respect to the detailed description of the drawings, similar components may be designated by similar reference numerals. Also, in describing the disclosure, in case it is determined that detailed explanation of related known functions or features may unnecessarily confuse the gist of the disclosure, the detailed explanation will be omitted. In addition, the embodiments below may be modified in various different forms, and the scope of the technical id