
CN-121983023-A - Multi-model translation and speech synthesis method and device based on intelligent scheduling

CN121983023A

Abstract

The application relates to a multi-model translation and speech synthesis method and device based on intelligent scheduling. The method comprises: obtaining speech to be translated together with its corresponding task information and device attribute information; extracting speech features and grammar features of the speech to be translated; determining a target model identifier for each processing stage according to the task information, the device attribute information, the speech features and the grammar features; constructing a speech processing route for the speech to be translated from the target model identifiers; and, following the processing sequence of the stages and the target model identifiers in the route, calling the target model corresponding to each identifier to process the speech to be translated and obtain the target speech. By adopting the method, the accuracy of speech translation can be improved.
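The abstract describes assembling a per-request route (recognition, translation, generation) and then invoking the chosen model at each stage in order. The following is a minimal sketch of that scheduling idea; the registry, model identifiers, and selection rules are all hypothetical placeholders, since the application does not disclose concrete models or conditions.

```python
# Illustrative sketch of the scheduling idea in the abstract: pick one model
# identifier per stage from the request context, then run the stages in order.
# Every name and rule below is a hypothetical stand-in, not the patent's.

REGISTRY = {  # stage -> {model identifier: callable}
    "recognition": {
        "asr_small": lambda x, c: "transcript (small ASR)",
        "asr_large": lambda x, c: "transcript (large ASR)",
    },
    "translation": {
        "mt_api":   lambda x, c: "translated: " + x,
        "mt_local": lambda x, c: "translated locally: " + x,
    },
    "generation": {
        "tts_fast": lambda x, c: x.encode("utf-8"),
    },
}

def build_route(task_info, device_info, voice_feats, grammar_feats):
    """Pick one target model identifier per stage (toy rules)."""
    asr = "asr_small" if device_info.get("memory_mb", 0) < 2048 else "asr_large"
    mt = "mt_api" if task_info.get("real_time") else "mt_local"
    return [("recognition", asr), ("translation", mt), ("generation", "tts_fast")]

def run_route(route, speech, ctx=None):
    """Call each stage's target model following the route's processing order."""
    data = speech
    for stage, model_id in route:
        data = REGISTRY[stage][model_id](data, ctx)
    return data  # the target voice

route = build_route({"real_time": True}, {"memory_mb": 1024}, {}, {})
target = run_route(route, b"raw-pcm")
# target == b"translated: transcript (small ASR)"
```

The route is built once per request, so different inputs (a low-memory device, a non-real-time task) flow through different model chains rather than one fixed pipeline.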

Inventors

  • HE MINGYANG
  • GU WEI
  • CHAI WEI
  • CHEN CI

Assignees

  • 凌锐蓝信科技(北京)有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-02-05

Claims (10)

  1. A multi-model translation and speech synthesis method based on intelligent scheduling, characterized by comprising the following steps: acquiring speech to be translated, and task information and device attribute information corresponding to the speech to be translated, and extracting speech features and grammar features of the speech to be translated; determining a target model identifier for each processing stage according to the task information, the device attribute information, the speech features and the grammar features, and constructing a speech processing route for the speech to be translated according to the target model identifiers; and calling the target model corresponding to each target model identifier to perform speech processing on the speech to be translated, based on the processing sequence of the processing stages and the target model identifiers in the speech processing route, to obtain target speech.
  2. The method according to claim 1, wherein acquiring the speech to be translated and the corresponding task information and device attribute information, and extracting the speech features and grammar features of the speech to be translated, comprises: acquiring the speech to be translated from a user terminal, together with the task information and device attribute information corresponding to the speech to be translated; determining the original language of the speech to be translated, and extracting the speech features of the speech to be translated according to a speech feature perception model; and converting the speech to be translated into an initial translation text, and extracting the grammar features of the initial translation text.
  3. The method according to claim 2, wherein acquiring the speech to be translated from the user terminal and the corresponding task information and device attribute information comprises: acquiring initial speech to be translated sent by the user terminal, and acquiring the device attribute information; preprocessing the sampling rate and volume of the initial speech to obtain the speech to be translated; and acquiring a task scenario, a target language and a configuration mode, and constructing the task information corresponding to the speech to be translated according to the task scenario, the target language and the configuration mode.
  4. The method according to claim 2, wherein determining the original language of the speech to be translated and extracting the speech features of the speech to be translated according to the speech feature perception model comprises: detecting the language of the speech to be translated to obtain its original language; performing voice activity detection on the speech to be translated to obtain the speaking time of the speaker; and extracting the speaking features and vocal features of the speaker according to the speech feature perception model, to obtain the speech features of the speech to be translated.
  5. The method according to claim 1, wherein the processing stages comprise a speech recognition stage, a translation stage and a speech generation stage, the target model identifiers comprise a target speech recognition model identifier, a target translation model identifier and a target speech generation model identifier, and determining the target model identifier for each processing stage according to the task information, the device attribute information, the speech features and the grammar features comprises: determining the target speech recognition model identifier from the speech recognition model identifiers of the speech recognition stage according to the device attribute information, the grammar features and the speech features; determining the target translation model identifier from the translation model identifiers of the translation stage according to the task information and the grammar features; and determining the target speech generation model identifier from the speech generation model identifiers of the speech generation stage according to the task information and the speech features.
  6. The method according to claim 5, wherein the speech features comprise a voice quality feature and a speech rate feature, the grammar features comprise the original language, and determining the target speech recognition model identifier from the speech recognition model identifiers of the speech recognition stage according to the device attribute information, the grammar features and the speech features comprises: if the original language is a first language, determining a preset first speech recognition model identifier as the target speech recognition model identifier; if the device attribute information satisfies a preset first speech recognition condition, determining a preset second speech recognition model identifier as the target speech recognition model identifier; and if the voice quality feature and the speech rate feature among the speaking features satisfy a preset second speech recognition condition, determining a preset third speech recognition model identifier as the target speech recognition model identifier.
  7. The method according to claim 5, wherein the task information comprises a task scenario, and determining the target translation model identifier from the translation model identifiers of the translation stage according to the task information and the grammar features comprises: if the task scenario belongs to a real-time scenario type, determining a translation model identifier of an interface service class as the target translation model identifier among the translation model identifiers of the translation stage; and if the grammar features satisfy a preset first translation condition, determining a preset first translation model identifier as the target translation model identifier among the translation model identifiers of the translation stage.
  8. The method according to claim 5, wherein the task information comprises a task scenario, and determining the target speech generation model identifier from the speech generation model identifiers of the speech generation stage according to the task information and the speech features comprises: if the speech features satisfy a preset first speech generation condition, determining a first speech generation model identifier as the target speech generation model identifier; if the task scenario satisfies a preset second speech generation condition, determining a second speech generation model identifier as the target speech generation model identifier; and if the task scenario satisfies a preset third speech generation condition, determining a third speech generation model identifier as the target speech generation model identifier.
  9. The method according to claim 1 or 5, wherein calling the target model corresponding to each target model identifier to perform speech processing on the speech to be translated, based on the processing sequence of the processing stages and the target model identifiers in the speech processing route, to obtain the target speech comprises: calling, in the sequence of the processing stages, the target speech recognition model corresponding to the target speech recognition model identifier, the target translation model corresponding to the target translation model identifier, and the target speech generation model corresponding to the target speech generation model identifier; converting the speech to be translated into text to be translated through the target speech recognition model, and translating the text to be translated into target text in the target language according to the target translation model; and generating a target speech file corresponding to the target text based on the target speech generation model, and generating the target speech according to the task information and the target speech file.
  10. A multi-model translation and speech synthesis device based on intelligent scheduling, characterized by comprising: an acquisition module, configured to acquire speech to be translated and the task information and device attribute information corresponding to the speech to be translated, and to extract speech features and grammar features of the speech to be translated; a determining module, configured to determine a target model identifier for each processing stage according to the task information, the device attribute information, the speech features and the grammar features, and to construct a speech processing route for the speech to be translated according to the target model identifiers; and a calling module, configured to call the target model corresponding to each target model identifier to perform speech processing on the speech to be translated, based on the processing sequence of the processing stages and the target model identifiers in the speech processing route, to obtain target speech.
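Claims 5 through 8 describe conditional, per-stage selection: a chain of "preset conditions" routes each stage to one of several numbered model identifiers (for example, a real-time task scenario routes to an interface-service translation model). The claims name the conditions only abstractly, so in the sketch below every threshold, language code, and identifier is an invented placeholder used purely to show the shape of the condition chains.

```python
# Sketch of the per-stage selection in claims 5-8. The claims disclose only
# "preset conditions" and numbered identifiers; all concrete values here
# (languages, core counts, scenario names) are hypothetical placeholders.

def pick_recognition_model(device_info, grammar_feats, voice_feats):
    # Claim 6: original language first, then device attributes, then
    # voice quality / speech rate features.
    if grammar_feats.get("original_language") == "zh":        # "first language"
        return "recognition_model_1"
    if device_info.get("cpu_cores", 0) >= 8:                  # "first condition"
        return "recognition_model_2"
    if voice_feats.get("quality", 1.0) < 0.5 or voice_feats.get("rate_wps", 0) > 4:
        return "recognition_model_3"                          # "second condition"
    return "recognition_model_default"

def pick_translation_model(task_info, grammar_feats):
    # Claim 7: real-time scenarios use an interface-service (API-class) model.
    if task_info.get("scenario") == "real_time":
        return "translation_api_service"
    if grammar_feats.get("avg_sentence_len", 0) > 30:         # "first condition"
        return "translation_model_1"
    return "translation_model_default"

def pick_generation_model(task_info, voice_feats):
    # Claim 8: a voice-feature condition first, then two task-scenario conditions.
    if voice_feats.get("needs_voice_clone"):
        return "generation_model_1"
    if task_info.get("scenario") == "broadcast":
        return "generation_model_2"
    if task_info.get("scenario") == "dialogue":
        return "generation_model_3"
    return "generation_model_default"

route = [
    ("recognition", pick_recognition_model({"cpu_cores": 8}, {}, {})),
    ("translation", pick_translation_model({"scenario": "real_time"}, {})),
    ("generation", pick_generation_model({"scenario": "dialogue"}, {})),
]
```

The ordering of the `if` branches matters: as in the claims, an earlier condition (original language, voice-clone need) takes precedence over later device- or scenario-based conditions.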

Description

Multi-model translation and speech synthesis method and device based on intelligent scheduling

Technical Field

The present application relates to the field of speech processing technologies, and in particular to a multi-model translation and speech synthesis method and apparatus based on intelligent scheduling, a computer device, and a computer-readable storage medium.

Background

In recent years, with the rapid development of speech generation and voice cloning technology, more and more audio in an original language needs to be translated into audio in a target language while preserving the original style and expressiveness. In the conventional art, a fixed Whisper model, NLLB (No Language Left Behind) model, and FastSpeech2 model are preset in the server, and the server translates the speech to be translated into target speech in the target language based on these fixed models. However, because the speech to be translated comes in different languages and carries different translation requirements, it cannot be translated accurately based only on a fixed chain of single model nodes. The accuracy of current speech translation is therefore low.

Disclosure of Invention

In view of the above, it is necessary to provide a multi-model translation and speech synthesis method, apparatus, computer device and computer-readable storage medium based on intelligent scheduling.
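The conventional art described above chains the same three models for every input. A toy rendering of that fixed, single-route pipeline is shown below; the three functions are stubs standing in for Whisper (ASR), NLLB (text translation) and FastSpeech2 (TTS), and none of the real libraries or their APIs are invoked.

```python
# The conventional-art pipeline: one fixed model per stage, with no
# per-request scheduling. Stub functions stand in for the real models.

def whisper_asr(audio: bytes) -> str:
    return "source-language transcript"        # stub for Whisper

def nllb_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"           # stub for NLLB

def fastspeech2_tts(text: str) -> bytes:
    return text.encode("utf-8")                # stub for FastSpeech2

def fixed_pipeline(audio: bytes, target_lang: str) -> bytes:
    # The same route for every input -- the limitation the application targets:
    # no branch accounts for the input's language, quality, or task scenario.
    return fastspeech2_tts(nllb_translate(whisper_asr(audio), target_lang))

out = fixed_pipeline(b"...", "fr")
```

Contrast this with the claimed method, in which the identifier chosen for each stage varies with the task information, device attributes, and extracted features.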
In a first aspect, the present application provides a multi-model translation and speech synthesis method based on intelligent scheduling, including: acquiring speech to be translated, and task information and device attribute information corresponding to the speech to be translated, and extracting speech features and grammar features of the speech to be translated; determining a target model identifier for each processing stage according to the task information, the device attribute information, the speech features and the grammar features, and constructing a speech processing route for the speech to be translated according to the target model identifiers; and calling the target model corresponding to each target model identifier to perform speech processing on the speech to be translated, based on the processing sequence of the processing stages and the target model identifiers in the speech processing route, to obtain target speech. In one embodiment, acquiring the speech to be translated and the corresponding task information and device attribute information, and extracting the speech features and grammar features of the speech to be translated, includes: acquiring the speech to be translated from a user terminal, together with the task information and device attribute information corresponding to the speech to be translated; determining the original language of the speech to be translated, and extracting its speech features according to a speech feature perception model; and converting the speech to be translated into an initial translation text, and extracting the grammar features of the initial translation text.
In one embodiment, acquiring the speech to be translated from the user terminal and the corresponding task information and device attribute information includes: acquiring initial speech to be translated sent by the user terminal, and acquiring the device attribute information; preprocessing the sampling rate and volume of the initial speech to obtain the speech to be translated; and acquiring a task scenario, a target language and a configuration mode, and constructing the task information corresponding to the speech to be translated according to the task scenario, the target language and the configuration mode. In one embodiment, determining the original language of the speech to be translated and extracting its speech features according to the speech feature perception model includes: detecting the language of the speech to be translated to obtain its original language; performing voice activity detection on the speech to be translated to obtain the speaking time of the speaker; and extracting the speaking features and vocal features of the speaker according to the speech feature perception model, to obtain the speech features of the speech to be translated. In one embodiment, each processing stage includes a speech recognition stage, a translation stage, and a speech generation stage; each target model identifier includes a target speech recognition model identifier, a target translation model identifier, and a target speech generation model identifier; and determining the target model identifier
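The feature-extraction step above uses voice activity detection to obtain the speaker's speaking time. A minimal sketch of that idea follows, using a toy frame-energy VAD; the frame length, threshold, and sample rate are assumptions, and the "speech feature perception model" named by the application is not disclosed, so a trained VAD would replace this in practice.

```python
# Toy sketch of the perception step: frame-level energy VAD to estimate the
# speaker's total speaking time. The threshold and frame size are hypothetical.

def frame_energies(samples, frame_len=160):
    """Mean absolute amplitude per frame (160 samples = 10 ms at 16 kHz)."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speaking_time_seconds(samples, sample_rate=16000, frame_len=160, threshold=0.02):
    """Count frames whose energy exceeds a (hypothetical) speech threshold."""
    voiced = sum(1 for e in frame_energies(samples, frame_len) if e > threshold)
    return voiced * frame_len / sample_rate

# Synthetic signal: 0.5 s of "speech" (amplitude 0.5) then 0.5 s of silence.
sig = [0.5] * 8000 + [0.0] * 8000
print(speaking_time_seconds(sig))  # prints 0.5
```

In the claimed method this speaking-time estimate, together with voice quality and speech rate features, feeds the model-selection conditions of the later stages.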