
CN-122024696-A - Real-time voice translation method based on multi-mode fusion and intelligent terminal

CN 122024696 A

Abstract

The invention provides a real-time voice translation method based on multi-modal fusion, and an intelligent terminal, relating to the technical field of voice signal processing. The method comprises: step 1, locating three facial feature points, namely the left-eye pupil center, the right-eye pupil center and the lip center, and constructing, from the motion coordinates of the three facial feature points, a time-series trajectory function with time on the horizontal axis and the normalized motion coordinates on the vertical axis; and step 2, performing a discrete-time Laplace transform on the trajectory function to obtain a complex-frequency-domain characteristic function, performing partial fraction expansion and an inverse Z-transform on the characteristic function within its region of convergence, reconstructing a time-domain emotion state vector, and computing an emotion intensity harmonic factor. By fusing facial emotion features with speech recognition, translation and synthesis, the invention improves the precision, fluency and emotional appropriateness of real-time voice translation, and ensures a coherent translated text stream and synthesized speech that fits the scene and the speaker's emotional tendency.

Inventors

  • HUANG ZHENG
  • XU CHENGSHENG

Assignees

  • 杭州讯意迪科技有限公司 (Hangzhou Xunyidi Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-04-10

Claims (10)

  1. A real-time voice translation method based on multi-modal fusion, characterized by comprising the following steps: step 1, locating three facial feature points, namely the left-eye pupil center, the right-eye pupil center and the lip center, and constructing, from the motion coordinates of the three facial feature points, a time-series trajectory function with time on the horizontal axis and the normalized motion coordinates on the vertical axis; step 2, performing a discrete-time Laplace transform on the trajectory function to obtain a complex-frequency-domain characteristic function, performing partial fraction expansion and an inverse Z-transform on the characteristic function within its region of convergence, reconstructing a time-domain emotion state vector, and computing an emotion intensity harmonic factor; step 3, feeding the continuous clean voice data blocks into a pre-trained streaming voice recognition model to obtain the text segment at the current time point, and incrementally updating a recognition candidate word graph based on that segment; step 4, detecting whether a preliminarily meaningful phrase or clause structure has formed in the candidate word graph and, when one is detected, performing a preliminary translation of the phrase or clause structure to generate candidate translation fragments; step 5, judging the semantic boundaries of the candidate translation fragments, triggering a pre-trained context translation model when a fragment is judged to have reached a complete semantic boundary, generating a final optimized translation, and correcting or replacing the candidate translation fragments with the final optimized translation to form a translated text stream; and step 6, extracting prosodic features from the source-language audio stream, combining the emotion intensity harmonic factor with the regulation instructions issued by the adaptive regulation center, performing speech synthesis on the translated text stream, and generating and playing the target-language voice stream in real time.
  2. The real-time voice translation method based on multi-modal fusion according to claim 1, wherein, before step 1, the method further comprises: capturing or receiving a source-language audio stream and a speaker video stream in real time; and performing real-time noise reduction and voice endpoint detection on the captured or received source-language audio stream to generate continuous blocks of clean voice data, and detecting the face region from the captured or received speaker video stream (see the VAD sketch following the claims).
  3. The real-time voice translation method based on multi-modal fusion according to claim 2, wherein step 1 comprises: locating the three facial feature points, namely the left-eye pupil center, the right-eye pupil center and the lip center, within the face region; calculating the motion coordinates of each feature point from its pixel coordinates in consecutive video frames, and normalizing them to obtain normalized motion coordinates; and constructing, from the normalized motion coordinates and the corresponding timestamps, a time-series trajectory function with time on the horizontal axis and the normalized motion coordinates on the vertical axis (see the trajectory sketch following the claims).
  4. The real-time voice translation method based on multi-modal fusion according to claim 3, wherein step 2 comprises: performing a discrete-time Laplace transform on the time-series trajectory function to obtain a complex-frequency-domain characteristic function of the complex frequency variable; analyzing the expression of the characteristic function and solving for the range of the complex frequency variable over which the sequence converges, thereby obtaining the region of convergence of the characteristic function; within the region of convergence, decomposing the characteristic function into a sum of first-order partial fractions to complete the partial fraction expansion; numerically differentiating the time-series trajectory function, computing its first and second derivative values at each sampling time point, detecting the positions where the sign of the second derivative changes, judging these positions to be inflection points of the trajectory function, and extracting the absolute value of the first derivative at each inflection point as the instantaneous change intensity; and computing an inflection point modulation factor from the number of detected inflection points and the mean instantaneous change intensity, and computing the emotion intensity harmonic factor by weighted fusion of the denominator pole value of each first-order fraction after the partial fraction expansion, the corresponding numerator residue value, and the inflection point modulation factor (see the numerical sketch following the claims).
  5. The real-time voice translation method based on multi-modal fusion according to claim 4, wherein step 3 comprises: feeding the continuous clean voice data blocks into the pre-trained streaming voice recognition model in chronological order, the model processing the blocks and outputting one or more candidate words at the current time point together with the probability of each candidate word; at the initial moment, initializing a recognition candidate word graph of nodes and directed edges with the first candidate word as the starting point, the nodes representing candidate words and the directed edges representing transitions between candidate words, weighted by the corresponding candidate-word probabilities; and at each subsequent time point, adding each new candidate word to the graph as a new node, establishing a directed edge from the selectable nodes of the previous time point to the current new node with the candidate word's probability as its weight, and simultaneously updating the cumulative probability of every path from the starting point to the current new node, thereby completing the incremental update and maintenance of the recognition candidate word graph (see the word-graph sketch following the claims).
  6. The real-time voice translation method based on multi-modal fusion according to claim 5, wherein step 4 comprises: after the recognition candidate word graph has been updated, tracing back through the graph along the directed edges from the new node corresponding to the current time point, and obtaining a plurality of candidate paths ending at that node; performing part-of-speech tagging and shallow dependency analysis on the token sequence of each candidate path and, when the analysis judges that the token sequence of a candidate path matches a preset phrase or clause grammar template, judging that a preliminarily meaningful phrase or clause structure has formed; and extracting the source-language token sequence corresponding to the phrase or clause structure, converting it into coherent source-language text, and performing a real-time preliminary translation of that text to generate candidate translation fragments (see the template-matching sketch following the claims).
  7. The real-time voice translation method based on multi-modal fusion according to claim 6, wherein step 5 comprises: obtaining, from the recognition candidate word graph, the token sequence of the path with the highest cumulative probability from the starting point to the current node; extracting the semantic embedding vector of that token sequence, retrieving the cached historical token sequence's semantic embedding vector, computing the cosine similarity between the two, judging that a complete semantic boundary has been reached when the punctuation prediction yields a sentence-final symbol and the cosine similarity falls below a preset threshold, and converting the token sequence into a complete source-language recognition text once the boundary is judged to have been reached; inputting the complete source-language recognition text, the cached historical dialogue text, preset domain knowledge base information and the reconstructed time-domain emotion state vector together into the pre-trained context translation model, which performs contextual semantic fusion and disambiguating translation to generate the final optimized translation; and checking whether the source-language phrase or clause structure corresponding to an already generated candidate translation fragment is contained in the complete source-language recognition text; if so, replacing the candidate translation fragment with the translation of the corresponding part of the final optimized translation to form a coherent translated text stream, and otherwise inserting the final optimized translation into the translated text stream as a new translation fragment (see the boundary-detection sketch following the claims).
  8. The real-time voice translation method based on multi-modal fusion according to claim 7, wherein step 6 comprises: extracting a fundamental frequency (F0) track, a phoneme duration sequence and an energy envelope from the source-language audio stream as prosodic features; inputting the translated text stream, the prosodic features, the emotion intensity harmonic factor and the regulation instruction issued by the adaptive regulation center into a speech synthesis module, the module modulating the emotional intensity of the F0 track and the energy envelope, selecting the corresponding acoustic model parameters according to the regulation instruction, and fusing the modulated prosodic features to synthesize the translated text stream into a target-language voice stream; and playing the target-language voice stream in real time through the audio playback device (see the prosody sketch following the claims).
  9. A real-time speech translation intelligent terminal based on multi-modal fusion, implementing the method according to any one of claims 1 to 8, and comprising: a system bus, a processor, a memory, a power supply component, a network component, a display screen, a loudspeaker, a first camera, a second camera, a first microphone and a second microphone; the memory storing a computer program, and the processor being configured to execute the computer program to implement the steps of: collecting the source-language voice signals of at least one speaker through the first and second microphones; acquiring facial expression and posture image information of the corresponding speaker through the first and second cameras; performing noise reduction and voice endpoint detection on the source-language voice signals, and performing streaming voice recognition on the processed signals to obtain source-language recognition text; performing multi-modal feature fusion on the source-language recognition text and the expression feature information, and performing machine translation based on the fused features to generate target-language translated text; extracting prosodic features from the source-language voice signals and synthesizing the target-language translated text into a target-language voice signal in combination with the expression feature information; and playing the target-language voice signal through the loudspeaker; wherein the processor is further configured to monitor the environmental noise intensity in real time during voice capture and to dynamically adjust the noise-reduction parameters and pickup directivity parameters of the first and second microphones (see the capture-policy sketch following the claims).
  10. The real-time speech translation intelligent terminal based on multi-modal fusion according to claim 9, further comprising: a main body module that houses the processor, the memory, the power supply component, the network component, the display screen and the loudspeaker; a camera module pitchably mounted on the main body module via a damped rotating shaft, integrating the first camera and the first microphone and used to collect audio and video information in a first direction; a main-body cable outlet provided in the main body module for the connecting cable to pass through; a main-body bracket detachably connected to the main body module and used to support the main body module on a flat surface; a peripheral module internally integrating the second camera and the second microphone and used to collect audio and video information in a second direction; a peripheral cable outlet provided in the peripheral module for the connecting cable to pass through, the peripheral module being electrically connected to the processor in the main body module by a cable routed through the peripheral cable outlet and the main-body cable outlet; and a peripheral fixing piece used to fixedly mount the peripheral module on an external supporting structure.
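
Claim 2's preprocessing stage (noise reduction plus voice endpoint detection yielding continuous clean voice data blocks) can be pictured with a minimal energy-threshold detector. The patent does not name a VAD algorithm; the frame length and threshold below are illustrative stand-ins, not the claimed method.

```python
import numpy as np

def energy_vad(audio: np.ndarray, sr: int, frame_ms: float = 30.0,
               threshold_db: float = -35.0) -> list[tuple[int, int]]:
    """Toy energy-based voice activity detector.

    Returns (start, end) sample ranges of speech-like regions. The patent
    does not specify the detector; this is only an illustrative stand-in.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    segments, start = [], None
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms_db = 20 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-12)
        voiced = rms_db > threshold_db
        if voiced and start is None:
            start = i * frame_len          # speech segment opens
        elif not voiced and start is not None:
            segments.append((start, i * frame_len))  # segment closes
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```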
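The trajectory construction of claim 3 reduces to normalizing per-frame pixel coordinates and pairing them with timestamps. A minimal sketch, assuming a landmark detector (not specified in the patent) has already produced per-frame pixel positions for the three feature points:

```python
import numpy as np

def build_trajectories(landmarks: np.ndarray, timestamps: np.ndarray,
                       frame_w: int, frame_h: int) -> dict[str, np.ndarray]:
    """Normalize pixel coordinates of the three feature points and pair
    them with timestamps, per claim 3.

    `landmarks` has shape (n_frames, 3, 2) in pixel coordinates, ordered
    (left pupil, right pupil, lip center); the landmark detector itself
    is assumed, not specified by the patent.
    """
    norm = landmarks / np.array([frame_w, frame_h], dtype=float)
    names = ("left_pupil", "right_pupil", "lip_center")
    # Each trajectory is an (n_frames, 3) array of [t, x_norm, y_norm]:
    return {name: np.column_stack([timestamps, norm[:, i, 0], norm[:, i, 1]])
            for i, name in enumerate(names)}
```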
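For claim 4, the partial fraction expansion can be carried out with scipy.signal.residuez once the trajectory's Z-transform has been fitted by a rational function (the fitting step, e.g. an ARMA fit, is assumed), and the inflection analysis is plain numerical differentiation. The patent does not disclose the weighting used to fuse pole values, residue values and the inflection point modulation factor, so the final combination below is only a placeholder:

```python
import numpy as np
from scipy.signal import residuez

def emotion_intensity_factor(y: np.ndarray, t: np.ndarray,
                             b: np.ndarray, a: np.ndarray) -> float:
    """Sketch of the claim-4 computation, with a placeholder fusion rule.

    y, t : sampled trajectory values and timestamps (one feature point)
    b, a : numerator/denominator coefficients of a rational fit to the
           trajectory's Z-transform (the fitting step is assumed)
    """
    # Partial fraction expansion over the region of convergence:
    residues, poles, _ = residuez(b, a)

    # Numerical first and second derivatives of the trajectory:
    d1 = np.gradient(y, t)
    d2 = np.gradient(d1, t)

    # Inflection points = sign changes of the second derivative:
    flips = np.where(np.diff(np.sign(d2)) != 0)[0]
    if len(flips) == 0:
        return 0.0
    inflection_mod = len(flips) * np.mean(np.abs(d1[flips]))

    # Placeholder fusion of pole magnitudes, residue magnitudes and the
    # inflection modulation factor (the actual weights are not disclosed):
    pole_term = np.sum(np.abs(residues) / (np.abs(poles) + 1e-12))
    return float(pole_term * inflection_mod)
```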
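The incremental candidate word graph of claim 5 is essentially a word lattice with cumulative path probabilities. A minimal sketch, assuming a streaming ASR model that emits (word, probability) pairs at each time point; a full lattice would keep edges from every node of the previous step, whereas this sketch records only the best predecessor, which suffices to recover the highest-probability path:

```python
import math

class CandidateWordGraph:
    """Minimal incremental word lattice in the spirit of claim 5."""

    def __init__(self):
        self.nodes = []        # node id -> word
        self.best_logp = []    # best cumulative log-probability per node
        self.back = []         # back-pointer for path recovery
        self.frontier = []     # node ids added at the previous time point

    def add_step(self, candidates):
        """candidates: list of (word, prob) from the ASR at this time point."""
        new_frontier = []
        for word, prob in candidates:
            nid = len(self.nodes)
            self.nodes.append(word)
            logp = math.log(prob)
            if self.frontier:
                # Directed edge from the best previous-step node, weighted
                # by the candidate word's probability:
                prev = max(self.frontier, key=lambda i: self.best_logp[i])
                self.best_logp.append(self.best_logp[prev] + logp)
                self.back.append(prev)
            else:
                # Initial time point: the first candidates are start nodes.
                self.best_logp.append(logp)
                self.back.append(None)
            new_frontier.append(nid)
        self.frontier = new_frontier

    def best_path(self):
        """Trace the highest cumulative-probability path back to the start."""
        nid = max(self.frontier, key=lambda i: self.best_logp[i])
        words = []
        while nid is not None:
            words.append(self.nodes[nid])
            nid = self.back[nid]
        return list(reversed(words))
```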
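Claim 6 detects a preliminarily meaningful phrase or clause by matching tagged token sequences against preset grammar templates. The templates below are invented examples (the patent lists none), and the part-of-speech tagger and shallow dependency parser are assumed:

```python
# Hypothetical POS templates; the patent only speaks of "preset phrase or
# clause grammar templates" without enumerating them.
TEMPLATES = [
    ("DET", "NOUN", "VERB"),           # e.g. "the design works"
    ("PRON", "VERB", "DET", "NOUN"),   # e.g. "we approve the plan"
]

def forms_clause(pos_tags: tuple[str, ...]) -> bool:
    """True if the tagged token sequence ends with a template match."""
    return any(pos_tags[-len(t):] == t
               for t in TEMPLATES if len(pos_tags) >= len(t))
```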
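Claim 7's semantic boundary test combines a punctuation prediction with an embedding-similarity check. A minimal sketch, assuming an external embedding model and punctuation predictor; the similarity threshold is a placeholder, since the patent does not disclose its value:

```python
import numpy as np

def is_semantic_boundary(curr_vec: np.ndarray, hist_vec: np.ndarray,
                         predicted_punct: str,
                         sim_threshold: float = 0.6) -> bool:
    """Claim-7 test: a sentence-final punctuation prediction AND low cosine
    similarity between the current and cached history embeddings."""
    cos = float(np.dot(curr_vec, hist_vec) /
                (np.linalg.norm(curr_vec) * np.linalg.norm(hist_vec) + 1e-12))
    return (predicted_punct in {".", "?", "!", "。", "？", "！"}
            and cos < sim_threshold)
```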
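The prosodic features of claim 8 (F0 track and energy envelope; the phoneme duration sequence would come from the recognizer's alignments) can be sketched with frame-level autocorrelation and RMS energy. A production system would use a proper pitch tracker, and the emotion modulation at the end is a placeholder mapping, since the patent gives no exact formula:

```python
import numpy as np

def prosody_features(audio: np.ndarray, sr: int, frame_ms: float = 25.0):
    """Crude F0 track and energy envelope (claim 8), for illustration only."""
    n = int(sr * frame_ms / 1000)
    f0, energy = [], []
    for i in range(0, len(audio) - n, n):
        frame = audio[i:i + n] * np.hanning(n)
        energy.append(np.sqrt(np.mean(frame ** 2)))
        # Autocorrelation peak picking over a 60-400 Hz search band:
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        lo, hi = sr // 400, sr // 60
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0.append(sr / lag if ac[lag] > 0 else 0.0)
    return np.array(f0), np.array(energy)

def modulate(f0: np.ndarray, energy: np.ndarray, factor: float):
    """Placeholder emotion modulation: widen the F0 excursion and scale the
    energy by the emotion intensity harmonic factor (mapping not disclosed)."""
    mean_f0 = np.mean(f0[f0 > 0]) if np.any(f0 > 0) else 0.0
    return mean_f0 + (f0 - mean_f0) * (1 + factor), energy * (1 + factor)
```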
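Claim 9's terminal also monitors ambient noise and adapts the microphones' noise-reduction and pickup-directivity parameters. A toy policy sketch; the thresholds and parameter names are invented, as the patent specifies none:

```python
def adjust_capture(noise_db: float) -> dict:
    """Toy capture policy: stronger suppression and a tighter pickup beam
    as ambient noise rises. All values here are hypothetical."""
    if noise_db > 70:
        return {"suppression": "high", "beam_width_deg": 30}
    if noise_db > 50:
        return {"suppression": "medium", "beam_width_deg": 60}
    return {"suppression": "low", "beam_width_deg": 120}
```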

Description

Real-time voice translation method based on multi-mode fusion and intelligent terminal

Technical Field

The invention relates to the technical field of voice signal processing, and in particular to a real-time voice translation method based on multi-modal fusion and an intelligent terminal.

Background

In the field of real-time speech translation, the prior art generally adopts a cascaded pipeline combining speech recognition with machine translation, which provides basic language-conversion support for scenarios such as international conferences and business negotiations. In communication involving emotion and expressive detail, however, existing systems remain limited. Most current translation systems rely chiefly on audio for text conversion and speech synthesis and make little use of the speaker's visual emotional cues, so the emotional state of the original speech is largely absent from the intonation and rhythm of the output. For example, in a remote collaborative design discussion, a designer often conveys intent and emotional tendency through expressions and gestures while presenting creative concepts, smiling to endorse a scheme or frowning to question a detail. An existing translation device can generally render only the textual content of the speech, and the synthesized translation is mostly flat and lacks emotional variation, so the designer's immediate emotional cues are lost; remote collaborators may then fail to perceive the designer's attitude and emphasis accurately, indirectly harming communication efficiency and collaborative effect.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a real-time voice translation method based on multi-modal fusion and an intelligent terminal that dynamically balance real-time performance against translation accuracy, meeting the dual requirements of immediate response and accurate information transfer in cross-language communication.
In order to solve the above technical problems, the technical scheme of the invention is as follows.

In a first aspect, a real-time speech translation method based on multi-modal fusion comprises: step 1, locating three facial feature points, namely the left-eye pupil center, the right-eye pupil center and the lip center, and constructing, from the motion coordinates of the three facial feature points, a time-series trajectory function with time on the horizontal axis and the normalized motion coordinates on the vertical axis; step 2, performing a discrete-time Laplace transform on the trajectory function to obtain a complex-frequency-domain characteristic function, performing partial fraction expansion and an inverse Z-transform on the characteristic function within its region of convergence, reconstructing a time-domain emotion state vector, and computing an emotion intensity harmonic factor; step 3, feeding the continuous clean voice data blocks into a pre-trained streaming voice recognition model to obtain the text segment at the current time point, and incrementally updating a recognition candidate word graph based on that segment; step 4, detecting whether a preliminarily meaningful phrase or clause structure has formed in the candidate word graph and, when one is detected, performing a preliminary translation of the phrase or clause structure to generate candidate translation fragments; step 5, judging the semantic boundaries of the candidate translation fragments, triggering a pre-trained context translation model when a fragment is judged to have reached a complete semantic boundary, generating a final optimized translation, and correcting or replacing the candidate translation fragments with the final optimized translation to form a translated text stream; and step 6, extracting prosodic features from the source-language audio stream, combining the emotion intensity harmonic factor with the regulation instructions issued by the adaptive regulation center, performing speech synthesis on the translated text stream, and generating and playing the target-language voice stream in real time.

In a second aspect, a real-time speech translation intelligent terminal based on multi-modal fusion comprises: a system bus, a processor, a memory, a power supply component, a network component, a display screen, a loudspeaker, a first camera, a second camera, a first microphone and a second microphone; the memory stores a computer program, and the processor is configured to execute the computer program to implement the