CN-121983057-A - Voice text bidirectional conversion system and method based on asynchronous processing and message queue


Abstract

The invention discloses a voice-text bidirectional conversion system and method based on asynchronous processing and message queues. The system comprises a server, an OBS object storage service, a service system, a conversion gateway and a plurality of universal gateways. The conversion gateway is connected with the server and is internally configured with a voice-to-text module and a text-to-voice module. The voice-to-text module is built on the FunASR streaming recognition framework, communicates over the WebSocket protocol, and combines an asynchronous task processing mechanism to carry out the speech recognition flow. The text-to-voice module is built on the ChatTTS service and combines a RabbitMQ message queue to implement an asynchronous task dispatch and result feedback mechanism. Through a unified interface design and a modular processing flow, the scheme achieves efficient conversion between voice and text and improves the response speed, accuracy and scalability of the voice processing system.

Inventors

WANG QI
LIAN FU
JIN GUANGXUN

Assignees

启明信息技术股份有限公司

Dates

Publication Date: 20260505
Application Date: 20260206

Claims (9)

1. A voice-text bidirectional conversion system based on asynchronous processing and message queues, characterized by comprising a server, an OBS object storage service, a service system, a conversion gateway and a plurality of universal gateways, wherein the conversion gateway is connected with the server and is internally configured with a voice-to-text module and a text-to-voice module, wherein: the voice-to-text module is built on the FunASR streaming recognition framework, communicates through the WebSocket protocol, and combines an asynchronous task processing mechanism to carry out the speech recognition flow; and the text-to-voice module is built on the ChatTTS service and combines a RabbitMQ message queue to implement an asynchronous task dispatch and result feedback mechanism.
2. The voice-text bidirectional conversion system based on asynchronous processing and message queues according to claim 1, wherein the data processing performed by the server includes audio chunking, speech recognition result encapsulation, task deduplication and scheduling, task query and status tracking, and exception handling and logging.
3. The voice-text bidirectional conversion system based on asynchronous processing and message queues according to claim 1, wherein the server, the OBS object storage service and the service system are connected in sequence, with one universal gateway arranged between the server and the OBS object storage service and another between the OBS object storage service and the service system; the internal functions of the OBS object storage service comprise loading audio and converting it into binary data, sending instructions and using data; and the internal functions of the service system comprise sending instructions and using data.
4. The voice-text bidirectional conversion system based on asynchronous processing and message queues according to claim 1, wherein the internal functions of the universal gateways and of the conversion gateway each comprise security isolation, VPN access and traffic management.
5. The voice-text bidirectional conversion system based on asynchronous processing and message queues according to claim 1, further comprising an error capture and logging module for recording the entire system flow, including request source, processing status, exception information and processing time.
6. A voice-text bidirectional conversion method based on asynchronous processing and message queues, implemented on the voice-text bidirectional conversion system based on asynchronous processing and message queues according to any one of claims 1-4, characterized by comprising ASR voice-to-text conversion and TTS text-to-voice conversion, wherein the ASR voice-to-text conversion comprises the following steps: Step A1, after receiving an ASR voice-to-text request initiated by a user, creating a task record and initializing it to the PENDING state, while loading the audio file through the OBS object storage service and converting it into binary data; Step A2, checking the audio format and the sampling rate; if they do not meet the standard, directly marking the task as failed; otherwise, segmenting the audio by duration and feeding the segments into the FunASR streaming recognition engine for recognition; Step A3, after all audio chunks have been processed, calling the closing method of the FunASR recognizer to obtain the tail output, and finally integrating the outputs into a complete recognition text; and Step A4, writing the recognition result into a database, updating the task state to completed, and encapsulating the result as JSON output in a unified structure for integrated calling and storage.
7. The voice-text bidirectional conversion method based on asynchronous processing and message queues according to claim 6, wherein the TTS text-to-voice conversion comprises the following steps: Step B1, performing idempotent processing on the request based on the task ID: if the task exists and is in the processing or completed state, directly returning the corresponding result to avoid repeated processing; if the task is new, encoding the input text in UTF-8, storing it in an object, and then generating a new task record that enters the pending state; and Step B2, pushing the TTS text-to-voice task to a background synthesis service through message middleware, calling the ChatTTS speech synthesis framework to perform speech modeling and audio generation on the input text, and, after storing the synthesis result, updating the database record and providing URL output.
8. The voice-text bidirectional conversion method based on asynchronous processing and message queues according to claim 7, further comprising querying the state of a speech recognition or speech synthesis task through a standard interface provided by the system, which returns the task processing progress, timestamps and result content, and supports real-time tracking and batch management of voice processing tasks by the user.
9. The voice-text bidirectional conversion method based on asynchronous processing and message queues according to claim 8, further comprising embedding an error capture and logging module throughout the entire flow, which records each operation, exception stack and model response result in detail and provides traceback information upon task failure.
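
The task flows recited in claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: `StubRecognizer` is a hypothetical stand-in that does not use the FunASR API, the in-memory dictionaries and `queue.Queue` stand in for the database, object storage and RabbitMQ, and the PROCESSING state name, the JSON field names, the 16 kHz expected rate and the 600 ms chunk length are all assumptions made for illustration.

```python
import json
import queue
from dataclasses import dataclass
from enum import Enum


class TaskState(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Task:
    task_id: str
    state: TaskState = TaskState.PENDING  # Step A1 / B1: new tasks start pending
    result: str = ""


def to_json(task: Task) -> str:
    """Uniform JSON envelope for a task (field names are assumptions)."""
    return json.dumps(
        {"task_id": task.task_id, "state": task.state.value, "result": task.result}
    )


class StubRecognizer:
    """Hypothetical stand-in for a streaming recognizer; not the FunASR API."""

    def __init__(self) -> None:
        self._fed = 0

    def feed(self, chunk: bytes) -> str:
        self._fed += 1
        return f"chunk{self._fed} "

    def close(self) -> str:
        return "tail"


def split_by_duration(pcm: bytes, sample_rate: int, chunk_ms: int,
                      sample_width: int = 2) -> list[bytes]:
    """Cut raw mono PCM into pieces of at most chunk_ms milliseconds."""
    step = max(1, sample_rate * chunk_ms // 1000) * sample_width
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


def run_asr(task: Task, pcm: bytes, sample_rate: int, recognizer,
            expected_rate: int = 16000, chunk_ms: int = 600) -> str:
    # Step A2: only the sampling rate is checked here; a real service
    # would also validate the audio format.
    if sample_rate != expected_rate:
        task.state = TaskState.FAILED
        return to_json(task)
    task.state = TaskState.PROCESSING
    parts = [recognizer.feed(c) for c in split_by_duration(pcm, sample_rate, chunk_ms)]
    parts.append(recognizer.close())  # Step A3: flush the tail output
    task.result = "".join(parts)
    task.state = TaskState.COMPLETED  # Step A4: a database write would go here
    return to_json(task)


class TtsDispatcher:
    """In-memory stand-ins for the task table, object storage and message queue."""

    def __init__(self) -> None:
        self.tasks: dict[str, Task] = {}
        self.texts: dict[str, bytes] = {}
        self.outbox: queue.Queue = queue.Queue()

    def submit(self, task_id: str, text: str) -> Task:
        existing = self.tasks.get(task_id)
        # Step B1: a repeated task ID returns the known task. Also returning a
        # task that is still PENDING is this sketch's choice; the claim names
        # only the processing and completed states.
        if existing is not None and existing.state is not TaskState.FAILED:
            return existing
        self.texts[task_id] = text.encode("utf-8")
        task = Task(task_id)
        self.tasks[task_id] = task
        self.outbox.put(task_id)  # Step B2: hand off to the synthesis worker
        return task
```

For example, `run_asr(Task("asr-1"), bytes(48000), 16000, StubRecognizer())` splits 1.5 seconds of 16-bit mono audio into three chunks and returns a JSON string whose state is COMPLETED, while calling `TtsDispatcher.submit` twice with the same task ID enqueues only one message.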

Description

Voice text bidirectional conversion system and method based on asynchronous processing and message queue

Technical Field

The invention relates to the technical field of artificial intelligence voice interaction, and in particular to a voice-text bidirectional conversion system and method based on asynchronous processing and message queues.

Background

With the rapid development of automatic speech recognition (ASR) and text-to-speech (TTS) technologies, voice interaction has been widely applied in scenarios such as intelligent customer service, in-vehicle systems and virtual assistants. Open source deep-learning models such as Paraformer and Transformer currently provide the technical basis for converting voice to text and text to voice. However, the following significant problems remain in practical deployment and engineering applications:

Complex integration and tight module coupling: in traditional systems, speech recognition and speech synthesis usually run as two independent processes, so deployment and invocation are complicated, which hinders unified interface integration and module decoupling.

Lack of real-time interaction capability: some systems still rely on synchronous HTTP requests and lack support for long-lived connections such as WebSocket, so they cannot meet the requirements of real-time voice stream processing and continuous interaction.

Inflexible task scheduling that cannot cope with high concurrency: existing systems generally lack a unified asynchronous message scheduling mechanism, so blocking and timeouts occur easily under high concurrency, and system stability and scalability are insufficient.

High resource usage and redundant interface calls: in some business scenarios, conversion tasks between text and voice switch frequently, and existing schemes often reload the model repeatedly, which wastes system resources and lengthens interface response time.

Lack of an end-to-end processing path: most current schemes implement only one-way conversion (voice to text or text to voice) and lack closed-loop support for bidirectional voice-text conversion, which limits their effectiveness in intelligent interaction and closed-loop processing scenarios.

Therefore, existing voice processing systems still have obvious deficiencies in integrated deployment, real-time performance, concurrent processing capacity and bidirectional closed-loop processing capacity.

Disclosure of Invention

In view of the above technical problems, the invention provides a voice-text bidirectional conversion system and method based on asynchronous processing and message queues. It integrates speech recognition and speech synthesis functions, realizes closed-loop processing of bidirectional voice-text conversion through a unified service architecture and task management mechanism, significantly improves the real-time performance, concurrency and scalability of the system, and is suitable for a variety of human-machine voice interaction scenarios that require high interactivity and high reliability.

The invention is realized by the following technical scheme: a voice-text bidirectional conversion system based on asynchronous processing and message queues comprises a server, an OBS object storage service, a service system, a conversion gateway and a plurality of universal gateways, wherein the conversion gateway is connected with the server and is internally configured with a voice-to-text module and a text-to-voice module, wherein: the voice-to-text module is built on the FunASR streaming recognition framework, communicates through the WebSocket protocol, and combines an asynchronous task processing mechanism to carry out the speech recognition flow; and the text-to-voice module is built on the ChatTTS service and combines a RabbitMQ message queue to implement an asynchronous task dispatch and result feedback mechanism.

Specifically, the data processing performed by the server includes audio chunking, speech recognition result encapsulation, task deduplication and scheduling, task query and status tracking, and exception handling and logging.

The server, the OBS object storage service and the service system are connected in sequence, with one universal gateway arranged between the server and the OBS object storage service and another between the OBS object storage service and the service system. The internal functions of the OBS object storage service comprise loading audio and converting it into binary data, sending instructions and using data, and the internal functions of the service system comprise sending instructions and using data. Speci