
CN-121565135-B - Text-to-speech real-time streaming conversion method, system, device, medium and program product

CN121565135B

Abstract

The invention discloses a text-to-speech real-time streaming conversion method, system, device, medium, and program product. The method first receives a user request and generates text content blocks from it; it identifies each block's type by monitoring the model's thinking state and writes the block into the corresponding queue, namely a reasoning queue or an answer queue. Once the reasoning content in the reasoning queue meets a first preset condition, it is synthesized using first preset speech-synthesis parameters, and the reasoning content and reasoning audio blocks are streamed out in a unified data structure. After the remaining reasoning content in the reasoning queue has been fully output and the queue emptied, the answer content and answer audio blocks are streamed out in the same unified data structure. The invention significantly reduces the first-packet latency of answer text and audio, and improves fluency when switching between reasoning and answer content.

Inventors

  • Cai Meijie
  • Liu Baifeng
  • Deng Chengjing

Assignees

  • 北京点富科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-18

Claims (10)

  1. A method for real-time streaming conversion of text to speech, the method comprising: receiving a request input by a user; generating text content blocks based on the user request; identifying each text content block's type by monitoring the thinking state and writing the block into the corresponding content queue, wherein the content queues comprise a reasoning queue and an answer queue; after the reasoning content in the reasoning queue meets a first preset condition, synthesizing the reasoning content using first preset speech-synthesis parameters to obtain a reasoning audio block, and streaming the reasoning content and the reasoning audio block in a unified data structure; when answer content is first written to the answer queue, switching to the answer stage and synthesizing the answer content using second preset speech-synthesis parameters to obtain an answer audio block; and after all remaining reasoning content in the reasoning queue has been output and the queue emptied, streaming the answer content and the answer audio block in the unified data structure.
  2. The method according to claim 1, wherein the unified data structure carries both the text segment stream and the audio segment stream; the data types of the unified data structure comprise text segments, audio-segment start signals, and audio-segment end signals; the audio-segment start signal triggers the local player to start up and pre-buffer before the answer audio block arrives; and the audio-segment end signal determines the audio playback end point and triggers resource reclamation.
  3. The method according to claim 1, wherein, after the reasoning content in the reasoning queue meets the first preset condition, synthesizing the reasoning content using the first preset speech-synthesis parameters comprises: judging whether the length of the reasoning content in the reasoning queue is smaller than a preset length threshold; if the length is smaller than the preset length threshold, continuing to wait for reasoning content to aggregate in the reasoning queue; and if the length is greater than or equal to the preset length threshold, synthesizing the reasoning content in the reasoning queue using the first preset speech-synthesis parameters to obtain a reasoning audio block.
  4. The method according to claim 1, wherein switching to the answer stage after answer content is first written to the answer queue, and synthesizing the answer content using the second preset speech-synthesis parameters to obtain an answer audio block, comprises: switching to the answer stage after answer content is first written to the answer queue; judging whether the length of the answer content in the answer queue meets a second preset condition; if the answer content meets the second preset condition, continuing to wait for answer content to aggregate in the answer queue; and if the answer content does not meet the second preset condition, synthesizing the answer content in the answer queue using the second preset speech-synthesis parameters to obtain an answer audio block.
  5. The method according to claim 1, wherein streaming the answer content and the answer audio block in the unified data structure after all remaining reasoning content in the reasoning queue has been output and the queue emptied comprises: triggering a flush of the reasoning queue after switching to the answer stage; synthesizing the remaining reasoning content in the reasoning queue using the first preset speech-synthesis parameters to obtain a remaining reasoning audio block, streaming the remaining reasoning content and the remaining reasoning audio block in the unified data structure, and emptying the reasoning queue; and after the reasoning queue is emptied, streaming the answer content and the answer audio block in the unified data structure.
  6. The method according to claim 1, further comprising: generating a distributed audio stream associated with the session from the audio blocks; subscribing each terminal to the distributed audio stream to acquire and play audio fragments, wherein the terminals comprise a web terminal, an app terminal, and an applet terminal; and wherein the distributed audio stream supports synchronous playback and catch-up playback across all terminals.
  7. A text-to-speech real-time streaming conversion system, comprising an LLM output layer, a state-sensing layer, a TTS scheduling layer, a buffer manager, and a distributed publish-subscribe module; the LLM output layer is configured to receive a request input by a user and generate text content blocks based on the user request; the state-sensing layer is configured to identify each text content block's type by monitoring the thinking state and write the block into the corresponding content queue, wherein the content queues comprise a reasoning queue and an answer queue; the TTS scheduling layer and the buffer manager are configured to: after the reasoning content in the reasoning queue meets a first preset condition, synthesize the reasoning content using first preset speech-synthesis parameters to obtain a reasoning audio block, and stream the reasoning content and the reasoning audio block in a unified data structure; when answer content is first written to the answer queue, switch to the answer stage and synthesize the answer content using second preset speech-synthesis parameters to obtain an answer audio block; and after all remaining reasoning content in the reasoning queue has been output and the queue emptied, stream the answer content and the answer audio block in the unified data structure; and the distributed publish-subscribe module is configured to generate a distributed audio stream associated with the session from the audio blocks, with each terminal subscribing to the distributed audio stream to acquire and play audio fragments.
  8. A text-to-speech real-time streaming conversion device, comprising a processor and a memory; the memory is configured to store one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the steps of the text-to-speech real-time streaming conversion method according to any one of claims 1 to 6.
  9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text-to-speech real-time streaming conversion method according to any one of claims 1 to 6.
  10. A computer program product comprising computer program instructions which, when executed by a processor, implement the steps of the text-to-speech real-time streaming conversion method according to any one of claims 1 to 6.
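The unified data structure of claim 2, which interleaves text segments with audio-segment start and end signals, could be modeled along the following lines. This is a hypothetical sketch: the names `ChunkType`, `Stage`, and `StreamChunk` and their fields are illustrative assumptions, not the patent's actual wire format.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ChunkType(Enum):
    # Data types named in claim 2: text segments plus audio start/end signals.
    TEXT_SEGMENT = auto()
    AUDIO_SEGMENT_START = auto()
    AUDIO_SEGMENT_END = auto()

class Stage(Enum):
    REASONING = auto()
    ANSWER = auto()

@dataclass
class StreamChunk:
    """One element of the unified text-plus-audio output stream."""
    chunk_type: ChunkType
    stage: Stage                   # reasoning phase vs. answer phase
    text: Optional[str] = None     # set for TEXT_SEGMENT chunks
    audio: Optional[bytes] = None  # audio payload, if the format carries one

# Example: a reasoning text segment followed by its audio block markers.
stream = [
    StreamChunk(ChunkType.TEXT_SEGMENT, Stage.REASONING, text="Let me think..."),
    StreamChunk(ChunkType.AUDIO_SEGMENT_START, Stage.REASONING),
    StreamChunk(ChunkType.AUDIO_SEGMENT_END, Stage.REASONING),
]
```

Carrying both streams in one chunk type lets a single consumer loop render text and drive the player without correlating two separate channels.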
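To make the queue-routing logic of claims 1, 3, and 5 concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the class name `TextToSpeechRouter`, the threshold value, and the stubbed `synthesize` method stand in for the patent's layers, and claim 4's answer-side aggregation condition is simplified to immediate synthesis.

```python
from collections import deque

LENGTH_THRESHOLD = 20  # hypothetical "preset length threshold" (claim 3)

class TextToSpeechRouter:
    """Routes LLM text chunks into a reasoning or answer queue based on the
    thinking state, aggregates reasoning text until a length threshold, and
    flushes leftover reasoning when the answer stage begins (claims 1, 3, 5)."""

    def __init__(self):
        self.reasoning_queue = deque()
        self.answer_queue = deque()
        self.in_answer_stage = False
        self.emitted = []  # (stage, text) pairs handed to TTS, in order

    def synthesize(self, stage, text):
        # Stand-in for calling TTS with stage-specific synthesis parameters.
        self.emitted.append((stage, text))

    def push(self, chunk, is_thinking):
        if is_thinking:
            self.reasoning_queue.append(chunk)
            joined = "".join(self.reasoning_queue)
            if len(joined) >= LENGTH_THRESHOLD:  # first preset condition met
                self.synthesize("reasoning", joined)
                self.reasoning_queue.clear()
        else:
            if not self.in_answer_stage:
                # First answer write: switch stage and flush remaining reasoning.
                self.in_answer_stage = True
                if self.reasoning_queue:
                    self.synthesize("reasoning", "".join(self.reasoning_queue))
                    self.reasoning_queue.clear()
            # Simplified: synthesize answer content immediately (claim 4's
            # second preset condition is omitted here).
            self.answer_queue.append(chunk)
            self.synthesize("answer", "".join(self.answer_queue))
            self.answer_queue.clear()
```

Usage: short reasoning chunks are buffered until the threshold is reached, but any buffered remainder is forced out the moment the first answer chunk arrives, which is what keeps the reasoning-to-answer transition gapless.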

Description

Text-to-speech real-time streaming conversion method, system, device, medium and program product

Technical Field

The invention relates to the technical field of large models, and in particular to a text-to-speech real-time streaming conversion method, system, device, medium, and program product.

Background

Existing AI voice dialogue systems mostly call Text-To-Speech (TTS) on the whole text after generation is complete, which has the following defects:

  1. First-packet latency is high and users wait a long time.
  2. The reasoning process (REASONING) is indistinguishable in speech from the final answer (TEXT), so users have difficulty following the AI's thinking state.
  3. A mid-stream interruption (Stop/Cancel) acts only on the text side, so trailing-audio delay or desynchronization between audio and interface often occurs on the audio side.
  4. When distributed multi-terminal clients (web page, app, applet, etc.) connect, audio must be repeatedly pulled and spliced, and a unified streaming distribution mechanism is lacking.
  5. The buffering strategy is mostly a fixed block size or a simple time slice, and cannot simultaneously provide content continuity and a fast first-packet response for the answer content.

Therefore, a text-to-speech real-time streaming conversion method is needed to solve the problems of high first-packet latency and easy confusion between the reasoning and answering processes in the prior art.

Disclosure of Invention

In view of the above, embodiments of the present invention provide a text-to-speech real-time streaming conversion method, system, device, medium, and program product that at least partially solve the problems in the prior art. Other features and advantages of the invention will be apparent from the following detailed description, or may be learned by practice of the invention.
To achieve the above object, the embodiments of the present invention provide the following technical solutions.

According to a first aspect of an embodiment of the present invention, there is provided a text-to-speech real-time streaming conversion method, the method including: receiving a request input by a user; generating text content blocks based on the user request; identifying each text content block's type by monitoring the thinking state and writing the block into the corresponding content queue, wherein the content queues comprise a reasoning queue and an answer queue; after the reasoning content in the reasoning queue meets a first preset condition, synthesizing the reasoning content using first preset speech-synthesis parameters to obtain a reasoning audio block, and streaming the reasoning content and the reasoning audio block in a unified data structure; when answer content is first written to the answer queue, switching to the answer stage and synthesizing the answer content using second preset speech-synthesis parameters to obtain an answer audio block; and after all remaining reasoning content in the reasoning queue has been output and the queue emptied, streaming the answer content and the answer audio block in the unified data structure.

Further, the unified data structure carries both the text segment stream and the audio segment stream; the data types of the unified data structure comprise text segments, audio-segment start signals, and audio-segment end signals; the audio-segment start signal triggers the local player to start up and pre-buffer before the answer audio block arrives; and the audio-segment end signal determines the audio playback end point and triggers resource reclamation.
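The player-side behavior of the audio-segment start and end signals described above might be sketched as follows. This is a hypothetical consumer: the `LocalPlayer` class and its string chunk-type tags are illustrative, and the `audio_data` payload type is an assumption not enumerated among the data types named in the claims.

```python
class LocalPlayer:
    """Sketch of the consumer side of the unified stream: the audio-segment
    start signal spins up and pre-buffers the player; the end signal marks
    the playback end point and reclaims resources."""

    def __init__(self):
        self.buffer = bytearray()
        self.playing = False
        self.log = []

    def on_chunk(self, chunk_type, payload=b""):
        if chunk_type == "audio_start":
            self.playing = True          # start up and pre-buffer
            self.buffer.clear()
            self.log.append("start")
        elif chunk_type == "audio_data":
            self.buffer.extend(payload)  # accumulate audio while playing
        elif chunk_type == "audio_end":
            self.playing = False         # audio playback end point reached
            self.log.append("end")
            self.buffer = bytearray()    # resource reclamation
```

Because the end point is signaled explicitly rather than inferred from silence, the player can release its buffer immediately without the trailing-audio delay described in the Background section.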
Further, after the reasoning content in the reasoning queue meets the first preset condition, synthesizing the reasoning content using the first preset speech-synthesis parameters includes: judging whether the length of the reasoning content in the reasoning queue is smaller than a preset length threshold; if the length is smaller than the preset length threshold, continuing to wait for reasoning content to aggregate in the reasoning queue; and if the length is greater than or equal to the preset length threshold, synthesizing the reasoning content in the reasoning queue using the first preset speech-synthesis parameters to obtain a reasoning audio block.

Further, after answer content is first written to the answer queue, switching to the answer stage and synthesizing the answer content using second preset speech-synthesis parameters to obtain an answer audio block includes: switching to the answer stage after answer content is first written to the answer queue; judging whether the length of the answer content in the answer queue meets a second preset condition; if the answer content meets the second preset condition, continuing to wait for answer content to aggregate in the answer queue.