CN-122021665-A - Speech translation method, server, storage medium, and program product


Abstract

The application provides a speech translation method, a server, a storage medium, and a program product, and relates to the field of artificial intelligence. The method acquires original speech in a source language to be translated and the target language to translate into; searches a translation knowledge base for translation knowledge matching the original speech; inputs the original speech and the matching translation knowledge into a speech translation model; and, through the speech translation model, translates the source-language speech of the translation knowledge that appears in the original speech into the corresponding target-language text and generates translated text in the target language. Because external knowledge intervention is applied to the speech translation model through a cross-modal translation knowledge base, translation errors on customized vocabulary are effectively avoided without retraining the parameters of the speech translation model. This improves the model's translation accuracy for customized vocabulary such as proper nouns, technical terms, and new words, and thereby improves the accuracy of the speech translation result.

Inventors

TANG JIALONG
ZHANG PEI
GAO RUIZE
YANG BAOSONG

Assignees

Alibaba (China) Co., Ltd. (阿里巴巴（中国）有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2024-11-12

Claims (16)

1. A speech translation method, comprising: acquiring original speech in a source language to be translated and a target language to translate into; searching a translation knowledge base for translation knowledge matching the original speech, wherein the translation knowledge base comprises translation knowledge of at least one customized vocabulary, and the translation knowledge of the customized vocabulary comprises source-language speech and target-language text of the customized vocabulary; and inputting the original speech and the translation knowledge matching the original speech into a speech translation model, translating, through the speech translation model, the source-language speech appearing in the original speech into the corresponding target-language text, and generating translated text in the target language.
2. The method of claim 1, wherein the translation knowledge further comprises a speech vector representation of the customized vocabulary; and searching the translation knowledge base for translation knowledge matching the original speech comprises: converting the original speech into a speech vector representation; and determining the translation knowledge matching the original speech according to the similarity between the speech vector representation of the original speech and the speech vector representations contained in the translation knowledge base.
3. The method of claim 2, wherein converting the original speech into a speech vector representation comprises: inputting the original speech into a pre-trained speech representation model to obtain the speech vector representation of the original speech.
4. The method of claim 1, wherein inputting the original speech and the translation knowledge matching the original speech into the speech translation model, translating the source-language speech appearing in the original speech into the corresponding target-language text, and generating translated text in the target language comprises: inserting the original speech and the translation knowledge matching the original speech into a prompt to generate prompt information, wherein the prompt information prompts the speech translation model to generate translated text in the target language based on the translation knowledge matching the original speech, and requests the speech translation model to translate the source-language speech appearing in the original speech into the corresponding target-language text; and inputting the prompt information into the speech translation model, and generating, through the speech translation model, translated text in the target language based on the prompt information.
5. The method of any of claims 1-4, wherein the training process of the speech translation model comprises: acquiring cross-modal speech translation data, wherein the cross-modal speech translation data comprises a speech sample, a source text in the source language corresponding to the speech sample, and a target text in the target language; constructing training data from the cross-modal speech translation data, wherein the training data comprises the speech sample, the target text of the speech sample, and translation knowledge of the customized vocabulary contained in the speech sample; and fine-tuning a pre-trained speech translation model with the training data to obtain the trained speech translation model.
6. The method of claim 5, wherein constructing training data from the cross-modal speech translation data comprises: extracting customized vocabulary text pairs from the source text and the target text corresponding to the speech sample by using a large model, wherein each customized vocabulary text pair comprises the source-language text and the target-language text of a customized vocabulary; converting the source-language text of the customized vocabulary into speech to obtain the source-language speech of the customized vocabulary; and constructing first training data from the speech sample, the target text in the target language corresponding to the speech sample, and the source-language speech and target-language text of the customized vocabulary contained in the speech sample, wherein the first training data comprises the speech sample, the target text of the speech sample, and the source-language speech and target-language text of the customized vocabulary contained in the speech sample.
7. The method of claim 6, further comprising: performing speech recognition on the source-language speech of the customized vocabulary to obtain corresponding recognized text; and if the recognized text is inconsistent with the source-language text of the customized vocabulary, deleting the first training data that contains the source-language speech of that customized vocabulary.
8. The method of claim 6, wherein constructing training data from the cross-modal speech translation data further comprises: constructing second training data from the speech sample, the target text in the target language corresponding to the speech sample, and the source-language text and target-language text of the customized vocabulary contained in the speech sample, wherein the second training data comprises the speech sample, the target text of the speech sample, and the source-language text and target-language text of the customized vocabulary contained in the speech sample.
9. The method of claim 5, wherein fine-tuning the pre-trained speech translation model with the training data to obtain the trained speech translation model comprises: inputting the training data into the pre-trained speech translation model, and generating, through the pre-trained speech translation model, a prediction of the target text in the training data; calculating a language modeling loss from the prediction of the target text; and adjusting the parameters of the pre-trained speech translation model according to the language modeling loss to obtain the trained speech translation model.
10. The method of any one of claims 1-4, further comprising: obtaining a customized vocabulary to be added to the translation knowledge base, together with the source-language text and target-language text of the customized vocabulary; converting the source-language text of the customized vocabulary into speech to obtain the source-language speech of the customized vocabulary; converting the source-language speech of the customized vocabulary into a speech vector representation to obtain the speech vector representation of the customized vocabulary; generating translation knowledge corresponding to the customized vocabulary from the source-language speech, the target-language text, and the speech vector representation of the customized vocabulary; and constructing a translation knowledge base containing the translation knowledge corresponding to the customized vocabulary.
11. The method of claim 10, wherein, before converting the source-language speech of the customized vocabulary into a speech vector representation to obtain the speech vector representation of the customized vocabulary, the method further comprises: performing speech recognition on the source-language speech of the customized vocabulary to obtain corresponding recognized text, and determining that the recognized text is consistent with the source-language text of the customized vocabulary.
12. The method of claim 10, wherein obtaining the customized vocabulary to be added to the translation knowledge base comprises: acquiring customized vocabulary entered through a front-end interface; and/or receiving a customized vocabulary file uploaded through the front-end interface, and reading the customized vocabulary recorded in the customized vocabulary file.
13. A speech translation method, comprising: receiving conference speech in a source language and a target language to translate into; searching a translation knowledge base for translation knowledge matching the conference speech, wherein the translation knowledge base comprises translation knowledge of at least one customized vocabulary, and the translation knowledge of the customized vocabulary comprises source-language speech and target-language text; inputting the conference speech and the translation knowledge matching the conference speech into a speech translation model, translating, through the speech translation model, the source-language speech appearing in the conference speech into the corresponding target-language text, and generating translated text in the target language; and outputting the translated text in the target language.
14. A server, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the server to perform the method of any of claims 1-13.
15. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any of claims 1-13.
16. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-13.
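For illustration only, the knowledge-base construction recited in claims 10 and 11 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the callables `synthesize` (text to speech), `recognize` (speech recognition), and `embed` (speech representation model) are hypothetical placeholders supplied by the caller, and the case- and whitespace-insensitive comparison in `normalize` is an assumed reading of "consistent", which the claims do not define.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class KnowledgeEntry:
    """One item of translation knowledge for a customized vocabulary."""
    source_speech: bytes        # synthesized source-language audio of the term
    target_text: str            # target-language text of the term
    speech_vector: List[float]  # vector representation of the source speech


def normalize(text: str) -> str:
    # Assumption: "consistent" is checked ignoring case and extra whitespace.
    return " ".join(text.lower().split())


def build_knowledge_base(
    terms: Iterable[Tuple[str, str]],
    synthesize: Callable[[str], bytes],         # hypothetical TTS callable
    recognize: Callable[[bytes], str],          # hypothetical ASR callable
    embed: Callable[[bytes], Sequence[float]],  # hypothetical speech encoder
) -> List[KnowledgeEntry]:
    """Build entries from (source_text, target_text) pairs.

    A term is kept only if recognizing its synthesized audio reproduces the
    source text, mirroring the consistency check of claim 11.
    """
    entries: List[KnowledgeEntry] = []
    for source_text, target_text in terms:
        audio = synthesize(source_text)
        if normalize(recognize(audio)) != normalize(source_text):
            continue  # synthesized audio is unreliable; skip this term
        entries.append(KnowledgeEntry(audio, target_text, list(embed(audio))))
    return entries
```

Passing the three models in as callables keeps the sketch independent of any particular speech synthesis, recognition, or representation library; the same round-trip check could also serve the training-data filter of claim 7.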

Description

Speech translation method, server, storage medium, and program product

Technical Field

The present application relates to artificial intelligence technology, and in particular to a speech translation method, a server, a storage medium, and a program product.

Background

With the continuous development of artificial intelligence technology, speech translation models are being applied ever more widely. A speech translation model converts speech in one language (the source language) into translated text in another language (the target language). At present, end-to-end speech translation models do not adequately learn domain-specific proper nouns, new words, rare words, and the like, so they are prone to knowledge hallucination, and the accuracy of the speech translation result is low.

Disclosure of Invention

The application provides a speech translation method, a server, a storage medium, and a program product, which address the low accuracy of the speech translation results produced by speech translation models.
In a first aspect, the present application provides a speech translation method, including: acquiring original speech in a source language to be translated and a target language to translate into; searching a translation knowledge base for translation knowledge matching the original speech, wherein the translation knowledge base comprises translation knowledge of at least one customized vocabulary, and the translation knowledge of the customized vocabulary comprises source-language speech and target-language text of the customized vocabulary; and inputting the original speech and the translation knowledge matching the original speech into a speech translation model, translating, through the speech translation model, the source-language speech appearing in the original speech into the corresponding target-language text, and generating translated text in the target language.
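The retrieval and prompting steps of the first aspect can be sketched roughly as below. This is a hedged illustration rather than the method as filed: cosine similarity, the `threshold` and `top_k` parameters, the `<audio:...>` placeholder tokens, and the prompt wording are all assumptions introduced here, since this section only requires that matching be based on "similarity" between speech vector representations and that the matched knowledge be inserted into a prompt. The sketch also compares a single query vector against each entry; how the original speech is segmented before comparison is not specified in this section.

```python
import math
from typing import List, Sequence, Tuple

# Each knowledge-base row: (term_id, target_text, speech_vector).
KnowledgeRow = Tuple[str, str, Sequence[float]]
Hit = Tuple[float, str, str]  # (similarity, term_id, target_text)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vector: Sequence[float],
    knowledge_base: List[KnowledgeRow],
    threshold: float = 0.8,  # assumed cut-off, not taken from the source
    top_k: int = 3,          # assumed limit, not taken from the source
) -> List[Hit]:
    """Return the most similar entries at or above the threshold, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), term_id, target_text)
        for term_id, target_text, vector in knowledge_base
    ]
    hits = [hit for hit in scored if hit[0] >= threshold]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_k]


def build_prompt(target_language: str, hits: List[Hit]) -> str:
    """Assemble textual prompt information around the matched knowledge.

    The <audio:...> tokens stand in for the positions where the term's
    source-language audio would be supplied to a speech translation model.
    """
    lines = [f"Translate the input speech into {target_language}."]
    if hits:
        lines.append(
            "If a reference clip below is heard in the input, "
            "translate it as the paired text:"
        )
        for _, term_id, target_text in hits:
            lines.append(f"- <audio:{term_id}> => {target_text}")
    return "\n".join(lines)
```

Because the knowledge is supplied at inference time through the prompt, adding or changing a term only requires updating the knowledge base, which is consistent with the stated goal of avoiding retraining of the model parameters.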
In a second aspect, the present application provides a speech translation method, including: receiving conference speech in a source language and a target language to translate into; searching a translation knowledge base for translation knowledge matching the conference speech, wherein the translation knowledge base comprises translation knowledge of at least one customized vocabulary, and the translation knowledge of the customized vocabulary comprises source-language speech and target-language text of the customized vocabulary; inputting the conference speech and the translation knowledge matching the conference speech into a speech translation model, translating, through the speech translation model, the source-language speech appearing in the conference speech into the corresponding target-language text, and generating translated text in the target language; and outputting the translated text in the target language.
In a third aspect, the present application provides a server comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the server to perform the method provided in any of the preceding aspects.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method provided in any of the preceding aspects.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method provided in any of the preceding aspects.
In the speech translation method, server, storage medium, and program product provided by the application, original speech in a source language to be translated and a target language to translate into are acquired; a translation knowledge base is searched for translation knowledge matching the original speech; the original speech and the matching translation knowledge are input into a speech translation model; and, through the speech translation model, the source-language speech of the translation knowledge that appears in the original speech is translated into the corresponding target-language text and translated text in the target language is generated. Because external knowledge intervention is applied to the speech translation model through a cross-modal translation knowledge base, translation errors on customized vocabulary are effectively avoided without retraining the parameters of the speech translation model. This improves the model's translation accuracy for customized vocabulary such as proper nouns, technical terms, and new words, and thereby improves the accuracy of the speech translation result.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consiste