CN-121997950-A - Speech translation method, device, storage medium, and program product
Abstract
The application discloses a speech translation method, device, storage medium, and program product in the field of artificial intelligence. The method searches a multimodal term library for terms related to a source language speech segment to be translated. If a target term related to the segment is found, the segment is translated into a target language text segment based on each modality's information for the target term and pre-configured weights for the different modalities, where the speech modality is weighted more heavily than the non-speech modalities. By translating the source language speech segment using the modality information of the related target term together with the pre-configured modality weights, the translation process gains a better understanding of the term, which improves the quality of term translation.
Inventors
- DU CHENGYUAN
- KONG CHANGQING
- SONG YANAN
- XIONG SHIFU
Assignees
- 科大讯飞股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260127
Claims (11)
- 1. A speech translation method, comprising: obtaining a source language speech segment to be translated; retrieving terms related to the source language speech segment from a multimodal term library, wherein the multimodal information of each term in the multimodal term library comprises source language text, target language text, and source language speech; and, if a target term related to the source language speech segment is retrieved, translating the source language speech segment based on each modality's information for the target term and pre-configured weights for different modalities to obtain a target language text segment, wherein the weight of the speech modality is greater than the weights of the non-speech modalities.
- 2. The method of claim 1, wherein retrieving terms related to the source language speech segment from a multimodal term library comprises: retrieving the terms related to the source language speech segment from the multimodal term library based at least on the source language speech segment and its transcribed text.
- 3. The method of claim 2, wherein retrieving terms related to the source language speech segment from a multimodal term library based on the source language speech segment and its transcribed text comprises: for each term in the multimodal term library, taking the length of the term's source language speech as a target length and determining sub-speech segments of that target length within the source language speech segment; and, for each sub-speech segment, judging whether the term is related to the source language speech segment by any one of the following methods: calculating a first similarity between the sub-speech segment and the term's source language speech, and performing a weighted fusion of the first similarity and a second similarity between the sub-speech segment's transcribed text and the term's source language text to obtain a comprehensive similarity between the sub-speech segment and the term; or calculating the first similarity between the sub-speech segment and the term's source language speech, and determining the term with the greatest similarity to the sub-speech segment, where that similarity exceeds a first threshold, as the target term related to the source language speech segment, and otherwise calculating the second similarity between the sub-speech segment's transcribed text and the term's source language text, and determining the term with the greatest similarity to the sub-speech segment, where that similarity exceeds a second threshold, as the target term related to the source language speech segment; or obtaining a first multimodal feature determined based at least on the sub-speech segment and its transcribed text and a second multimodal feature determined based at least on the term's source language speech and source language text, calculating the similarity between the first multimodal feature and the second multimodal feature, and determining the term with the greatest similarity to the first multimodal feature, where that similarity exceeds a fifth threshold, as the target term related to the source language speech segment.
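The first retrieval method in claim 3 (weighted fusion of speech and text similarities) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the feature vectors, cosine metric, weight values, and threshold are all assumed placeholders, and the speech weight is simply set higher than the text weight as claim 1 requires.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fused_similarity(sub_speech_vec, sub_text_vec, term_speech_vec, term_text_vec,
                     w_speech=0.7, w_text=0.3):
    # Weighted fusion of the speech-modality similarity (first similarity)
    # and the text-modality similarity (second similarity). The weights are
    # illustrative; the speech modality is weighted more heavily.
    s1 = cosine(sub_speech_vec, term_speech_vec)   # first similarity
    s2 = cosine(sub_text_vec, term_text_vec)       # second similarity
    return w_speech * s1 + w_text * s2

def retrieve_target_term(sub_segments, terms, threshold=0.8):
    # Return the term with the greatest fused similarity to any sub-speech
    # segment, provided that similarity exceeds the (assumed) threshold.
    best_term, best_score = None, threshold
    for seg in sub_segments:
        for term in terms:
            score = fused_similarity(seg["speech"], seg["text"],
                                     term["speech"], term["text"])
            if score > best_score:
                best_term, best_score = term, score
    return best_term
```

In practice the speech and text vectors would come from learned encoders; plain lists stand in for them here.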
- 4. The method of claim 1, wherein translating the source language speech segment based on each modality's information for the target term and pre-configured weights for different modalities comprises: encoding the source language speech segment to obtain an encoded feature sequence; encoding each modality's information for the target term separately to obtain encoded features for each modality of the target term; performing a weighted fusion of the encoded features of all modalities of the target term to obtain a multimodal feature of the target term; and decoding the encoded feature sequence based on the multimodal feature of the target term to obtain the target language text segment.
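The weighted-fusion step of claim 4 can be sketched as a weighted sum of per-modality encoded features. The weight values below are illustrative assumptions, chosen only to respect the ordering the claims state (speech above non-speech modalities per claim 1, text above image per claim 5); the real encoders and decoder are out of scope here.

```python
def fuse_modalities(features, weights):
    # Weighted fusion of per-modality encoded features into a single
    # multimodal term feature: an element-wise weighted sum, normalized
    # by the total weight of the modalities actually present.
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    total = 0.0
    for modality, vec in features.items():
        w = weights[modality]
        total += w
        for i, v in enumerate(vec):
            fused[i] += w * v
    return [v / total for v in fused]

# Illustrative weights: speech > each text modality > image.
MODALITY_WEIGHTS = {"speech": 0.5, "source_text": 0.2,
                    "target_text": 0.2, "image": 0.1}
```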
- 5. The method of claim 1, wherein the multimodal information of each term further includes an image related to the term, and the text modality is weighted more heavily than the image modality.
- 6. The method of claim 1, wherein translating the source language speech segment based on each modality's information for the target term and pre-configured weights for different modalities comprises: replacing the sub-speech segments related to the target term within the source language speech segment with the target term's source language speech to obtain an updated source language speech segment; and translating the updated source language speech segment based on the modality information of the target term and the pre-configured weights for different modalities to obtain a target language text segment.
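The replacement step of claim 6 amounts to splicing the term's canonical source language speech into the input in place of the matched sub-segment. A minimal sketch, assuming the audio is already available as sample sequences and the sub-segment boundaries are known from retrieval:

```python
def replace_sub_segment(speech, start, end, term_speech):
    # Splice the term's source language speech into the source language
    # speech segment in place of the matched sub-segment at [start, end).
    # `speech` and `term_speech` are sequences of audio samples (or frames);
    # the result may differ in length from the original.
    return list(speech[:start]) + list(term_speech) + list(speech[end:])
```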
- 7. The method of claim 1, wherein retrieving terms related to the source language speech segment from a multimodal term library and translating the source language speech segment are performed by a large model trained as follows: pre-training the large model on multiple tasks to obtain a pre-trained large model, wherein the multiple tasks comprise a text translation task, a term speech recognition task, a speech-image matching task, and a text-image matching task; and fine-tuning the pre-trained large model on a speech translation task to obtain the trained large model.
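The two-stage training of claim 7 can be outlined as a multi-task pre-training loop followed by single-task fine-tuning. The uniform task sampling, the `train_step` interface, and the batch iterators are all assumptions for illustration; the patent does not specify a sampling strategy.

```python
import random

# The four pre-training tasks named in claim 7.
PRETRAIN_TASKS = ["text_translation", "term_speech_recognition",
                  "speech_image_matching", "text_image_matching"]

def pretrain(model, batches_per_task, steps):
    # Multi-task pre-training: each step samples one of the four tasks
    # (uniformly here, as an assumption) and updates the model on a
    # batch drawn from that task's iterator.
    for _ in range(steps):
        task = random.choice(PRETRAIN_TASKS)
        model.train_step(task, next(batches_per_task[task]))

def finetune(model, speech_translation_batches, steps):
    # Fine-tuning: the pre-trained model is updated on the speech
    # translation task only.
    for _ in range(steps):
        model.train_step("speech_translation", next(speech_translation_batches))
```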
- 8. The method of claim 7, wherein fine-tuning the pre-trained large model on the speech translation task comprises: taking a source language speech sample and the multimodal information of terms related to the sample as the input of the large model, and updating the parameters of the large model with the objective of bringing the target language text output by the large model close to the target language text label of the source language speech sample, wherein, in the target language text label of the source language speech sample, a preset label is added before each term so that the term is marked by the preset label.
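Preparing the labels of claim 8 means inserting a preset marker before each term occurrence in the target language text label. A minimal sketch; the marker string `<term>` and the plain substring matching are illustrative assumptions, not the patent's format:

```python
def add_term_labels(target_text, terms, label="<term>"):
    # Insert a preset label before each occurrence of a term in the
    # target language text label, so the term is marked for training.
    for term in terms:
        target_text = target_text.replace(term, label + term)
    return target_text
```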
- 9. An electronic device comprising at least one processor and a memory coupled to the processor, wherein: the memory is configured to store a computer program; and the processor is configured to execute the computer program to cause the electronic device to implement the speech translation method of any one of claims 1 to 8.
- 10. A computer program product comprising computer-readable instructions which, when run on an electronic device, cause the electronic device to implement the speech translation method of any one of claims 1 to 8.
- 11. A computer storage medium carrying one or more computer programs which, when executed by an electronic device, cause the electronic device to implement the speech translation method of any one of claims 1 to 8.
Description
Speech translation method, device, storage medium, and program product
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to a speech translation method, device, storage medium, and program product.
Background
Speech translation refers to translating speech in a source language into text in a target language. Current speech translation methods have low translation quality for technical terms.
Disclosure of Invention
In view of the foregoing, the present application provides a speech translation method, device, storage medium, and program product to improve the translation quality of technical terms. The specific scheme is as follows.
A first aspect of the present application provides a speech translation method, the method comprising: obtaining a source language speech segment to be translated; retrieving terms related to the source language speech segment from a multimodal term library, wherein the multimodal information of each term in the multimodal term library comprises source language text, target language text, and source language speech; and, if a target term related to the source language speech segment is retrieved, translating the source language speech segment based on each modality's information for the target term and pre-configured weights for different modalities to obtain a target language text segment, wherein the weight of the speech modality is greater than the weights of the non-speech modalities.
In one possible implementation, retrieving terms related to the source language speech segment from a multimodal term library comprises: retrieving the terms related to the source language speech segment from the multimodal term library based at least on the source language speech segment and its transcribed text.
In one possible implementation, retrieving terms related to the source language speech segment from a multimodal term library based on the source language speech segment and its transcribed text comprises: for each term in the multimodal term library, taking the length of the term's source language speech as a target length and determining sub-speech segments of that target length within the source language speech segment; and, for each sub-speech segment, judging whether the term is related to the source language speech segment by any one of the following methods: calculating a first similarity between the sub-speech segment and the term's source language speech, and performing a weighted fusion of the first similarity and a second similarity between the sub-speech segment's transcribed text and the term's source language text to obtain a comprehensive similarity between the sub-speech segment and the term; or calculating the first similarity between the sub-speech segment and the term's source language speech, and determining the term with the greatest similarity to the sub-speech segment, where that similarity exceeds a first threshold, as the target term related to the source language speech segment, and otherwise calculating the second similarity between the sub-speech segment's transcribed text and the term's source language text, and determining the term with the greatest similarity to the sub-speech segment, where that similarity exceeds a second threshold, as the target term related to the source language speech segment; or obtaining a first multimodal feature determined based at least on the sub-speech segment and its transcribed text and a second multimodal feature determined based at least on the term's source language speech and source language text, calculating the similarity between the first multimodal feature and the second multimodal feature, and determining the term with the greatest similarity to the first multimodal feature, where that similarity exceeds a fifth threshold, as the target term related to the source language speech segment.
In one possible implementation, translating the source language speech segment based on each modality's information for the target term and pre-configured weights for different modalities comprises: encoding the source language speech segment to obtain an encoded feature sequence; encoding each modality's information for the target term separately to obtain encoded features for each modality of the target term; performing a weighted fusion of the encoded features of all modalities of the target term to obtain a multimodal feature of the target term; and decoding the encoded feature sequence based on the multimodal feature of the target term to obtain the target language text segment.
In one possible implementation, the multimodal information of each term further includes an image related to the term, and the text modality is weighted more heavily than the image modality.
In one possible implementation, translating the source language speech segment based on each modality's information for the target term and pre-configured weights for different modalities comprises: replacing the sub-speech