CN-122024714-A - Intelligent dialogue system and method supporting real-time voice interrupt

CN122024714ACN 122024714 ACN122024714 ACN 122024714ACN-122024714-A

Abstract

The invention discloses an intelligent dialogue system and method supporting real-time voice interruption, and relates to the technical field of man-machine interaction. The system comprises a voice activity detection module, a voice recognition module, a dialogue management and LLM interaction module, a voice synthesis module and a breaking control module. The interactive system comprises a conversation management and LLM interaction module, a large language model, a disruption control module, a voice synthesis module and a voice activity detection module, wherein the conversation management and LLM interaction module is used for managing conversation states and contexts and calling LLM to generate a reply text, the context and a new instruction text are recombined when the conversation is broken, the large language model is called to generate a new reply text, the disruption control module is used for sending an interrupt signal to the voice synthesis module when the voice activity detection module detects effective user input, and triggering the conversation management and LLM interaction module to recombined the context to generate a new reply, and the voice activity detection module and the voice synthesis module operate simultaneously and are not blocked. The invention supports interruption at any time, ensures the consistency of the session semantics after interruption through a context management mechanism, and ensures smooth user experience.

Inventors

WANG PEIQI
WANG WEIDI
YUAN JIACHENG

Assignees

福建星网智慧科技有限公司
福建星网锐捷通讯股份有限公司

Dates

Publication Date: 20260512
Application Date: 20251223

Claims (6)

1. An intelligent dialog system supporting real-time speech disruption, comprising: the voice activity detection module is used for detecting whether user voice appears in the audio stream in real time and outputting effective voice fragments; The voice recognition module is connected with the voice activity detection module and used for converting the effective voice fragments output by the voice activity detection module into instruction texts; the dialogue management and LLM interaction module is connected with the voice recognition module and is used for managing dialogue states and contexts and calling a large language model to generate a reply text; The voice synthesis module is connected with the dialogue management and LLM interaction module and is used for converting text replies generated by the large language model into audio streams and playing the audio streams; The interrupt control module is connected with the voice activity detection module, the dialogue management and LLM interaction module and the voice synthesis module and is used for sending an interrupt signal to the voice synthesis module when the voice activity detection module detects effective user input and triggering the dialogue management and LLM interaction module to re-integrate the context to generate a new reply; The voice activity detection module and the voice synthesis module operate simultaneously and are not blocked.
2. The system of claim 1, wherein the voice activity detection module uses an energy-based endpoint detection algorithm to quickly respond to voice initiation.
3. The system of claim 1, wherein the interrupt control module monitors the trigger event of the voice activity detection module and the broadcast state of the voice synthesis module simultaneously to make an interrupt decision.
4. The system of claim 1, wherein the input stream processing pipeline of the voice activity detection module and the output stream processing pipeline of the voice synthesis module are implemented and maintain a persistent connection using a full duplex communication mechanism.
5. The system of claim 1, wherein the full duplex communication mechanism comprises multithreading, asynchronous I/O, message queuing, or WebSocket.
6. A method of intelligent conversations supporting real-time speech disruption, wherein a system according to any one of claims 1-5 is provided, the method comprising: monitoring the audio flow of the audio input interface in real time through a voice activity detection module, and after the voice of a user is obtained, sending the effective voice fragments into a voice recognition module to be transcribed into instruction texts; the voice recognition module transmits the instruction text to the dialogue management and LLM interaction module, and after the context is organized, the large language model is called to generate a reply text, and the reply text is converted into audio through the voice synthesis module and played; The voice activity detection module continuously works during the period of playing the audio stream by the voice synthesis module, if the user voice is detected again during the period, the interrupt control module immediately triggers the voice synthesis module to send an interrupt signal to immediately stop playing, the conversation management and LLM interaction module is notified, the incomplete conversation context is integrated with new user input to generate a continuous new reply, and then the voice synthesis module is triggered again to play.

Description

Intelligent dialogue system and method supporting real-time voice interrupt Technical Field The invention relates to the technical field of man-machine interaction, in particular to an intelligent dialogue system and method supporting real-time voice interruption. Background The existing intelligent voice system (such as an intelligent sound box and a mobile phone voice assistant) has the common problem that the system cannot effectively monitor user input during voice broadcasting (TTS), and a user can send a new instruction after waiting for the system to finish broadcasting, so that interaction flow is stiff and low in efficiency. The system can not understand the semantic association of the interrupt instruction and the previous dialog, the consistency of the reply is poor, the traditional voice endpoint detection is only used for judging the beginning and the end of the voice in the interaction scene, and the traditional voice endpoint detection can not be deeply combined with a dialog state machine to realize the real-time interrupt function. Disclosure of Invention The invention aims to solve the technical problem of providing an intelligent dialogue system and method supporting real-time voice interruption, which break through the limit that 'input can not be performed during output' in traditional voice interaction, ensure the consistency of dialogue semantics after interruption through a context management mechanism and ensure smooth user experience. In a first aspect, the present invention provides an intelligent dialog system supporting real-time speech disruption, comprising: the voice activity detection module is used for detecting whether user voice appears in the audio stream in real time and outputting effective voice fragments; The voice recognition module is connected with the voice activity detection module and used for converting the effective voice fragments output by the voice activity detection module into instruction texts; the dialogue management and LLM interaction module is connected with the voice recognition module and is used for managing dialogue states and contexts and calling a large language model to generate a reply text; The voice synthesis module is connected with the dialogue management and LLM interaction module and is used for converting text replies generated by the large language model into audio streams and playing the audio streams; The interrupt control module is connected with the voice activity detection module, the dialogue management and LLM interaction module and the voice synthesis module and is used for sending an interrupt signal to the voice synthesis module when the voice activity detection module detects effective user input and triggering the dialogue management and LLM interaction module to re-integrate the context to generate a new reply; The voice activity detection module and the voice synthesis module operate simultaneously and are not blocked. Further, the voice activity detection module adopts an energy-based endpoint detection algorithm to quickly respond to voice initiation. Further, the interrupt control module monitors the triggering event of the voice activity detection module and the broadcasting state of the voice synthesis module at the same time, and makes an interrupt decision. Further, the input stream processing pipeline of the voice activity detection module and the output stream processing pipeline of the voice synthesis module adopt a full duplex communication mechanism to realize and maintain a persistent connection. Further, the full duplex communication mechanism includes multithreading, asynchronous I/O, message queuing, or WebSocket. In a second aspect, the present invention provides an intelligent dialogue method for supporting real-time speech interruption, which needs to provide a system as described in the first aspect, the method includes: monitoring the audio flow of the audio input interface in real time through a voice activity detection module, and after the voice of a user is obtained, sending the effective voice fragments into a voice recognition module to be transcribed into instruction texts; the voice recognition module transmits the instruction text to the dialogue management and LLM interaction module, and after the context is organized, the large language model is called to generate a reply text, and the reply text is converted into audio through the voice synthesis module and played; The voice activity detection module continuously works during the period of playing the audio stream by the voice synthesis module, if the user voice is detected again during the period, the interrupt control module immediately triggers the voice synthesis module to send an interrupt signal to immediately stop playing, the conversation management and LLM interaction module is notified, the incomplete conversation context is integrated with new user input to generate a continuous new reply, and then the voice synthesis module i