CN-121997951-A - Real-time conference simultaneous interpretation method and system based on hybrid architecture

Abstract

The invention relates to the technical field of voice information processing and real-time machine translation, and in particular to a real-time conference simultaneous interpretation method and system based on a hybrid architecture. The system comprises a user interaction end, a server processing end and a real-time communication link. The user interaction end comprises a web page application and local desktop software. After entering the user interaction end, the user selects a device and a language and clicks start, whereupon a WebSocket connection is established and the user interface is entered. The hybrid user interaction end combines the Web page with local desktop software, balancing ease of use against stable, high-performance recording. An end-to-end, low-delay streaming pipeline runs from audio acquisition and streaming transmission through streaming speech recognition to streaming machine translation, so that a genuinely simultaneous interpretation effect is achieved.
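As an illustration of the streaming transmission step, the sketch below splits captured audio into small fixed-duration chunks so that each one can be sent as soon as it is available. The claims refer to a PCM audio stream but give no audio parameters, so the 16 kHz sample rate, 16-bit mono format, 100 ms chunk duration and the names `chunk_size_bytes` and `iter_pcm_chunks` are all assumptions made for this example, not details of the invention.

```python
from typing import Iterator

# All three values are assumptions; the source does not specify audio parameters.
SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
CHUNK_MS = 100            # assumed chunk duration in milliseconds


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     bytes_per_sample: int = BYTES_PER_SAMPLE,
                     chunk_ms: int = CHUNK_MS) -> int:
    """Number of bytes of raw PCM that cover one chunk of the given duration."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def iter_pcm_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive chunks of at most `size` bytes; the last may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


if __name__ == "__main__":
    silence = bytes(7000)  # 7000 bytes of zeroed PCM standing in for microphone input
    for chunk in iter_pcm_chunks(silence, chunk_size_bytes()):
        print(len(chunk))
```

In a real client each chunk could be sent as one binary message over the established connection. Smaller chunks lower the delay before recognition can begin, at the cost of more messages per second.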

Inventors

TIAN YONG

Assignees

深圳火星语盟科技股份有限公司

Dates

Publication Date: 20260508
Application Date: 20260128

Claims (7)

1. A real-time conference simultaneous interpretation method and system based on a hybrid architecture, characterized in that the system comprises a user interaction end, a server processing end and a real-time communication link; the user interaction end comprises a web page application and local desktop software; after entering the user interaction end, the user selects a device and a language and then clicks start; after start is clicked, a WebSocket connection is established; the user interface allows error correction terms to be submitted; after the WebSocket connection is established, recording starts, an audio stream is collected, and the PCM audio stream is transmitted in real time; term data uploaded by the user is sent to the server processing end; the server processing end comprises a speech recognition module and a machine translation module; the server processing end returns recognized and translated text in real time for real-time text display, the displayed text comprising source text and translated text; the server processing end dynamically updates a term library and optimizes subsequent processing; and at the real-time text display the user can choose whether to end the conference, the connection being disconnected once the end of the conference is confirmed.
2. The method and system for real-time conference simultaneous interpretation based on a hybrid architecture as set forth in claim 1, wherein both forms of the user interaction end provide a consistent user interface for configuration and display, and the user can use the interface directly through the Web page or launch the more stable local software via a button on the Web page.
3. The method and system for real-time conference simultaneous interpretation based on a hybrid architecture as set forth in claim 1, wherein the speech recognition module is based on a deep learning end-to-end model and is responsible for converting the audio stream into source language text.
4. The method and system for real-time conference simultaneous interpretation based on a hybrid architecture as set forth in claim 1, wherein the machine translation module is preferably a large language model or a neural machine translation model, and is responsible for rapidly translating source language text into target language text.
5. The method and system for real-time conference simultaneous interpretation based on a hybrid architecture as set forth in claim 4, wherein the real-time communication link is based on WebSocket or a similar long-lived connection protocol, and establishes bidirectional communication between the user interaction end and the server processing end for transmitting audio streams and returning text results.
6. The method and system for real-time conference simultaneous interpretation based on a hybrid architecture as set forth in claim 3, wherein the server processing end pushes the recognized source language text and the translated target language text to the user interaction end in real time through the same WebSocket connection.
7. The method and system for real-time conference simultaneous interpretation based on a hybrid architecture as set forth in claim 2, wherein the user can submit an "error correction term" in real time in a dedicated input box at the user interaction end, and the speech recognition module and the machine translation module at the server processing end dynamically add the term to the optimized lexicon of the current session and correct subsequent recognition and translation in real time, so as to continuously improve translation accuracy for the current conference.
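The session-level term correction described in the claims can be pictured with a small sketch. The claims do not say how the recognition and translation modules use a submitted term, so the example below substitutes a much simpler technique: a per-session lookup table applied as text post-processing to every later result. The class `SessionTermLibrary`, its methods and the sample terms are hypothetical names chosen for this illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SessionTermLibrary:
    """Hypothetical per-session store of user-submitted error correction terms."""
    # Maps a lowercased misrecognized phrase to its corrected form.
    corrections: Dict[str, str] = field(default_factory=dict)

    def submit(self, wrong: str, right: str) -> None:
        """Record a correction; it affects every later call to apply()."""
        self.corrections[wrong.lower()] = right

    def apply(self, text: str) -> str:
        """Replace known misrecognitions in `text`, ignoring case."""
        if not self.corrections:
            return text
        # Try longer phrases first so they win over shorter overlapping ones.
        keys = sorted(self.corrections, key=len, reverse=True)
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
            re.IGNORECASE,
        )
        return pattern.sub(lambda m: self.corrections[m.group(0).lower()], text)


if __name__ == "__main__":
    library = SessionTermLibrary()
    print(library.apply("We use Web Socket links"))   # unchanged: no terms yet
    library.submit("web socket", "WebSocket")         # user submits a correction
    print(library.apply("We use Web Socket links"))   # later text is corrected
```

Text post-processing of this kind only repairs the displayed output. A system that follows the claims more closely would presumably also feed the term into the recognition and translation models themselves, which this sketch does not attempt.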

Description

Real-time conference simultaneous interpretation method and system based on hybrid architecture

Technical Field

The invention relates to the technical field of voice information processing and real-time machine translation, and in particular to a real-time conference simultaneous interpretation method and system based on a hybrid architecture.

Background

In the context of globalization, international conferences, academic exchanges and similar events are increasingly frequent. In these settings the speaker often uses a common language such as English, and part of the audience may face a language barrier. The traditional solution employs professional human simultaneous interpreters, but this is costly and limited by how long an interpreter can work and by the language pairs the interpreter covers.

In the prior art, some translation software and online services based on speech recognition and machine translation already exist. For example, some mobile phone applications provide near real-time speech translation. However, these general-purpose schemes have significant drawbacks in conference scenarios:

1. High delay: translation is usually discontinuous and sentence-level, so true simultaneous interpretation cannot be achieved and the continuity of the conference is affected.
2. Poor specialization: there is no optimization for the professional terminology of a conference, so recognition and translation accuracy drops noticeably in professional scenarios.
3. Low usability: most are mobile applications, unsuited to stable, long-duration use while presenting PPT slides on a PC. Web page applications are limited by browser permissions and performance, and their recording stability may be poor.

The main disadvantage of the prior art is that it cannot provide a low-delay, accurate, stable and reliable real-time simultaneous interpretation service in the conference presentation scenario.
Disclosure of Invention

The invention aims to provide a real-time conference simultaneous interpretation method and system based on a hybrid architecture, to solve the problem raised in the background: how to build a system that collects the voice of a conference presenter in real time, recognizes and translates it with high accuracy, displays the result to listeners in near real time, and allows translation quality to be dynamically optimized during the conference, thereby overcoming the cost limits of human simultaneous interpretation and the poor scenario fit of existing translation software.

To achieve this purpose, the invention provides the following technical scheme: a real-time conference simultaneous interpretation method and system based on a hybrid architecture, in which the system comprises a user interaction end, a server processing end and a real-time communication link. The user interaction end comprises a web page application and local desktop software. After entering the user interaction end, the user selects a device and a language and then clicks start; after start is clicked, a WebSocket connection is established. The user interface allows error correction terms to be submitted. After the WebSocket connection is established, recording starts, an audio stream is collected, and the PCM audio stream is transmitted in real time. Term data uploaded by the user is sent to the server processing end, which comprises a speech recognition module and a machine translation module. The server processing end returns recognized and translated text in real time for real-time text display, the displayed text comprising source text and translated text. The server processing end dynamically updates a term library and optimizes subsequent processing. At the real-time text display the user can choose whether to end the conference, and the connection is disconnected once the end of the conference is confirmed.

Preferably, both forms of the user interaction end provide a consistent user interface for configuration and display, and the user can use the interface directly through the Web page or launch the more stable local software via a button on the Web page.

Preferably, the speech recognition module is based on a deep learning end-to-end model and is responsible for converting the audio stream into source language text.

Preferably, the machine translation module is a large language model or a neural machine translation model, and is responsible for rapidly translating source language text into target language text.

Preferably, the real-time communication link is based on WebSocket or a similar long-lived connection protocol, and establishes bidirectional communication between the user interaction end and the server processing end for transmitting audio streams and returnin