CN-122024738-A - Multi-user intelligent dialogue breaking processing method and system based on voiceprint separation

CN 122024738 A

Abstract

The invention discloses a multi-user intelligent dialogue interruption processing method and system based on voiceprint separation, relating to the technical field of artificial intelligence and voice interaction. The method comprises the following steps: extracting voiceprint features from audio data packets; separating the audio according to the voiceprint features to distinguish different speakers; assigning a unique session ID to each speaker and judging whether each audio data packet contains valid speech; performing multi-dimensional analysis of the recognized text and the original audio based on a large model, the analysis comprising semantic analysis as well as emotion and intonation analysis; routing the semantic analysis results into the independent dialogue thread of the corresponding user according to the session ID; judging whether to interrupt according to the emotion and intonation analysis results and the current state of the system; and generating a corresponding text answer based on the large model and the complete context information of each user. The method and system realize natural, fluent, anthropomorphic, and interruptible personalized voice dialogue in a multi-user environment.
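
The abstract outlines a seven-stage pipeline. As an illustration only, the sketch below shows one plausible way to wire those stages together in Python; every class and function name (AudioPacket, separator, asr, analyzer, and so on) is a hypothetical placeholder, not taken from the patent.

    # Minimal, hypothetical sketch of the pipeline summarized in the abstract.
    from dataclasses import dataclass

    @dataclass
    class AudioPacket:
        samples: bytes            # preprocessed mixed-audio frame
        session_id: str = ""      # filled in after voiceprint separation
        is_valid_speech: bool = False

    def run_pipeline(packet, separator, asr, analyzer, manager, responder, tts):
        # 1-2. Voiceprint extraction/separation: tag the packet with the speaker's
        #      session ID and decide whether it contains valid speech.
        packet = separator.separate(packet)
        if not packet.is_valid_speech:
            return None
        # 3. Speech recognition: valid speech with a session ID -> instruction text.
        text = asr.transcribe(packet)
        # 4. Multi-dimensional analysis: semantics plus emotion/intonation,
        #    from both the text and the original audio.
        analysis = analyzer.analyze(text, packet)
        # 5. Multi-thread dialogue management: route by session ID, then decide
        #    whether to interrupt given emotion, intonation, and system state.
        thread = manager.route(packet.session_id, text, analysis)
        # 6. Response generation: answer from the large model plus the user's full context.
        answer = responder.generate(thread, text)
        # 7. Speech synthesis: text answer -> speech output.
        return tts.synthesize(answer)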

Inventors

  • WANG WEIDI
  • YUAN JIACHENG

Assignees

  • 福建星网智慧科技有限公司
  • 福建星网锐捷通讯股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-23

Claims (10)

  1. A multi-user intelligent dialogue interruption processing method based on voiceprint separation, characterized by comprising the following steps: an audio collection and processing process, in which mixed speech audio is collected in real time and preprocessed to generate audio data packets; a voiceprint extraction and separation process, in which voiceprint features are extracted from the audio data packets, the audio is separated according to the voiceprint features to distinguish different speakers, a unique session ID is assigned to each speaker, and whether each audio data packet contains valid speech is judged; a speech recognition process, in which the separated valid speech carrying session IDs is converted into instruction text; a multi-dimensional analysis process, in which the text and the original audio are analyzed in multiple dimensions based on a large model, the analysis comprising semantic analysis as well as emotion and intonation analysis; a multi-thread dialogue management process, in which the semantic analysis results are routed into the independent dialogue thread of the corresponding user according to the session ID, and whether to interrupt is judged according to the emotion and intonation analysis results and the current state of the system; a response generation process, in which a corresponding text answer is generated based on the large model and the complete context information of each user; and a speech synthesis process, in which the text answer is converted into speech output.
  2. The method of claim 1, wherein the multi-dimensional analysis process further comprises analyzing acoustic features of the audio to determine a speech input mode, and the response generation process further comprises setting an output mode according to the speech input mode and the emotion and intonation analysis results.
  3. The method of claim 2, wherein setting the output mode according to the speech input mode and the emotion and intonation analysis results specifically comprises: if the mode is a private call, triggering a private response mode in which the speech synthesis module outputs speech at very low volume or with a whisper timbre, or displays text only on a paired device; if the emotion is confused, appending an active follow-up inquiry after the answer is generated; if the emotion is urgent, generating a concise and efficient instructive reply.
  4. The method of claim 1, wherein in the multi-thread dialogue management process, each user's independent dialogue thread comprises: an independent dialogue collector for accumulating the user's latest and historical dialogue text; an independent context manager for maintaining the user's answered content, preferences, and conversation history; and an interruption judge for judging whether an interruption occurs according to the semantics, the emotion, and the current state of the system.
  5. The method of claim 1, wherein in the multi-thread dialogue management process, judging whether to interrupt according to the emotion and intonation analysis results and the current state of the system specifically comprises: if the system is responding to another user, judging whether to interrupt the current broadcast in order to respond to the current user, by combining the current user's emotion label and instruction content; judging whether the new input constitutes a self-interruption, and if so, immediately stopping the current response and storing the instruction text in the dialogue collector; if the system is idle, proceeding directly to the next step.
  6. A multi-user intelligent dialogue interruption processing system based on voiceprint separation, characterized by comprising: an audio collection and processing module for collecting mixed speech audio in real time and preprocessing it to generate audio data packets; a voiceprint extraction and separation module, connected to the audio collection and processing module, for extracting voiceprint features from the audio data packets, separating the audio according to the voiceprint features to distinguish different speakers, assigning a unique session ID to each speaker, and judging whether each audio data packet contains valid speech; a speech recognition module, connected to the voiceprint extraction and separation module, for converting the separated valid speech carrying session IDs into instruction text; a multi-dimensional analysis module, connected to the speech recognition module, for performing multi-dimensional analysis of the instruction text based on a large model, the analysis comprising semantic analysis as well as emotion and intonation analysis; a multi-thread dialogue management module, connected to the multi-dimensional analysis module, for routing the semantic analysis results into the independent dialogue thread of the corresponding user according to the session ID and judging whether to interrupt according to the emotion and intonation analysis results and the current state of the system; a response generation module, connected to the multi-thread dialogue management module, for generating a corresponding text answer based on the large model and the complete context information of each user; and a speech synthesis module, connected to the response generation module, for converting the text answer into speech output.
  7. The system of claim 6, wherein the multi-dimensional analysis module is further configured to analyze acoustic features of the audio to determine a speech input mode, and the response generation module is further configured to set an output mode according to the speech input mode and the emotion and intonation analysis results.
  8. The system of claim 7, wherein setting the output mode according to the speech input mode and the emotion and intonation analysis results specifically comprises: if the mode is a private call, triggering a private response mode in which the speech synthesis module outputs speech at very low volume or with a whisper timbre, or displays text only on a paired device; if the emotion is confused, appending an active follow-up inquiry after the answer is generated; if the emotion is urgent, generating a concise and efficient instructive reply.
  9. The system of claim 6, wherein in the multi-thread dialogue management module, each user's independent dialogue thread comprises: an independent dialogue collector for accumulating the user's latest and historical dialogue text; an independent context manager for maintaining the user's answered content, preferences, and conversation history; and an interruption judge for judging whether an interruption occurs according to the semantics, the emotion, and the current state of the system.
  10. The system of claim 6, wherein the multi-thread dialogue management module judges whether to interrupt according to the emotion and intonation analysis results and the current state of the system, specifically comprising: if the system is responding to another user, judging whether to interrupt the current broadcast in order to respond to the current user, by combining the current user's emotion label and instruction content; judging whether the new input constitutes a self-interruption, and if so, immediately stopping the current response and storing the instruction text in the dialogue collector; if the system is idle, proceeding directly to the next step.
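
Claims 4 and 9 list three components inside each user's independent dialogue thread. The following sketch shows one plausible arrangement of those components; the class names and the placeholder interruption rule are assumptions, not part of the claims.

    # Hypothetical layout of one per-user dialogue thread (cf. claims 4 and 9).
    class DialogueCollector:
        """Accumulates the user's latest and historical dialogue text."""
        def __init__(self):
            self.utterances = []
        def add(self, text: str) -> None:
            self.utterances.append(text)

    class ContextManager:
        """Maintains answered content, preferences, and conversation history for one user."""
        def __init__(self):
            self.answered = []
            self.preferences = {}
            self.history = []

    class InterruptionJudge:
        """Decides whether an interruption occurs from semantics, emotion, and system state."""
        def occurs(self, semantics: dict, emotion: str, system_state: str) -> bool:
            # Placeholder policy; the concrete rule is left to the implementation.
            return system_state == "speaking" and (emotion == "urgent" or semantics.get("is_command", False))

    class DialogueThread:
        def __init__(self, session_id: str):
            self.session_id = session_id
            self.collector = DialogueCollector()
            self.context = ContextManager()
            self.judge = InterruptionJudge()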

Description

Multi-user intelligent dialogue breaking processing method and system based on voiceprint separation

Technical Field

The invention relates to the technical field of artificial intelligence and voice interaction, and in particular to a multi-user intelligent dialogue interruption processing method and system based on voiceprint separation.

Background

Existing intelligent dialogue systems have obvious limitations in multi-user scenarios. When multiple people speak simultaneously or alternately, the system cannot effectively distinguish different speakers, so instructions become mixed up, the dialogue context becomes disordered, and accurate personalized responses cannot be provided. In addition, existing systems lack an effective mechanism for handling a user's mid-response interruption; the interaction flow is passive and rigid, the user's emotional state is difficult to perceive, and an active dialogue cannot be initiated at an appropriate time. In the prior art, although some systems support simple voiceprint recognition, it is mostly used for identity authentication and cannot be deeply fused with dialogue management, interruption handling, context isolation, and affective computing, making it difficult to realize a fluent, natural, and anthropomorphic dialogue experience in a complex multi-user real-time interaction environment. Furthermore, the prior art lacks intelligent recognition of and differentiated response mechanisms for specific speech patterns (e.g., low-pitched whispering), and fails to provide a considerate interactive experience in situations where quietness or privacy protection is desired.

Disclosure of Invention

The invention aims to solve the technical problem of providing a multi-user intelligent dialogue interruption processing method and system based on voiceprint separation, which realize natural, fluent, anthropomorphic, and interruptible personalized voice dialogue in a multi-user environment.

In a first aspect, the present invention provides a multi-user intelligent dialogue interruption processing method based on voiceprint separation, comprising: an audio collection and processing process, in which mixed speech audio is collected in real time and preprocessed to generate audio data packets; a voiceprint extraction and separation process, in which voiceprint features are extracted from the audio data packets, the audio is separated according to the voiceprint features to distinguish different speakers, a unique session ID is assigned to each speaker, and whether each audio data packet contains valid speech is judged; a speech recognition process, in which the separated valid speech carrying session IDs is converted into instruction text; a multi-dimensional analysis process, in which the instruction text is analyzed in multiple dimensions based on a large model, the analysis comprising semantic analysis as well as emotion and intonation analysis; a multi-thread dialogue management process, in which the semantic analysis results are routed into the independent dialogue thread of the corresponding user according to the session ID, and whether to interrupt is judged according to the emotion and intonation analysis results and the current state of the system; a response generation process, in which a corresponding text answer is generated based on the large model and the complete context information of each user; and a speech synthesis process, in which the text answer is converted into speech output.
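
In the first aspect, a unique session ID is assigned per speaker from voiceprint features, and packets that do not contain valid speech are discarded. The sketch below shows one common way this could be done, using cosine similarity between voiceprint embeddings and a simple energy gate; the embedding source, the 0.75 similarity threshold, and the energy threshold are assumptions rather than values from the patent.

    # Hypothetical speaker-to-session-ID assignment via voiceprint embeddings.
    import uuid
    import numpy as np

    known_speakers: dict[str, np.ndarray] = {}   # session_id -> reference voiceprint

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def is_valid_speech(samples: np.ndarray, energy_threshold: float = 1e-3) -> bool:
        # Crude energy gate standing in for the patent's "valid speech" judgment.
        return float(np.mean(samples ** 2)) > energy_threshold

    def assign_session_id(embedding: np.ndarray, threshold: float = 0.75) -> str:
        # Reuse an existing session ID when the voiceprint matches a known speaker;
        # otherwise register a new speaker under a fresh unique ID.
        for session_id, reference in known_speakers.items():
            if cosine(embedding, reference) >= threshold:
                return session_id
        new_id = uuid.uuid4().hex
        known_speakers[new_id] = embedding
        return new_id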
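The same aspect routes semantic analysis results into an independent dialogue thread per user, so that contexts from different speakers never mix. A small sketch of such routing, keyed by session ID; the dictionary-based thread record is an assumption.

    # Hypothetical routing of analysis results into per-user dialogue threads by session ID.
    threads: dict[str, dict] = {}

    def route(session_id: str, instruction_text: str, analysis: dict) -> dict:
        # Each speaker gets an isolated thread, so contexts never mix across users.
        thread = threads.setdefault(session_id, {"session_id": session_id,
                                                 "collector": [], "history": []})
        thread["collector"].append(instruction_text)
        thread["history"].append({"text": instruction_text, "analysis": analysis})
        return thread
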
Further, the multi-dimensional analysis process further comprises analyzing the acoustic characteristics of the audio to determine a speech input mode, and the response generation process further comprises setting an output mode according to the speech input mode and the emotion and intonation analysis results.

Further, setting the output mode according to the speech input mode and the emotion and intonation analysis results specifically includes: if the mode is a private call, a private response mode is triggered, and the speech synthesis module outputs speech at very low volume or with a whisper timbre, or displays text only on a paired device; if the emotion is confused, an active follow-up inquiry is appended after the answer is generated; if the emotion is urgent, a concise and efficient instructive reply is generated.

Further, in the multi-thread dialogue management process, each user's independent dialogue thread includes: an independent dialogue collector for accumulating the user's latest and historical dialogue text; an independent context manager for maintaining the user's answered content, preferences, and conversation history; and an interruption judge for judging whether an interruption occurs according to the semantics, the emotion, and the current state of the system.

Further, in the multi-thread dialogue management process, judging whether to interrupt according to the emotion and intonation analysis results and the current state of the system specifically comprises: if the system is responding to another user, judging whether to interrupt the current broadcast in order to respond to the current user, by combining the current user's emotion label and instruction content; judging whether the new input constitutes a self-interruption, and if so, immediately stopping the current response and storing the instruction text in the dialogue collector; if the system is idle, proceeding directly to the next step.
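
The "private call / confused / urgent" cases above map naturally onto an output-mode selector. A hedged sketch follows; the field names and mode labels are illustrative only.

    # Hypothetical output-mode selection for the private-call, confused, and urgent cases.
    def select_output_mode(speech_input_mode: str, emotion: str) -> dict:
        mode = {"volume": "normal", "timbre": "default",
                "text_only_on_paired_device": False,
                "append_followup_question": False,
                "style": "normal"}
        if speech_input_mode == "private_call":   # e.g. low-pitched whispering detected
            mode.update(volume="very_low", timbre="whisper",
                        text_only_on_paired_device=True)
        if emotion == "confused":                 # append an active follow-up inquiry
            mode["append_followup_question"] = True
        if emotion == "urgent":                   # concise, efficient, instructive reply
            mode["style"] = "concise_instructive"
        return mode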
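
The interruption judgment distinguishes three situations: the system is responding to another user, the new input may constitute a self-interruption, or the system is idle. The sketch below encodes that branching; the backchannel heuristic and the urgency/keyword rule are assumptions about how the judgment might be realized.

    # Hypothetical interruption decision from emotion label, instruction content,
    # and the current system state.
    def decide_interruption(system_state: str, speaker_is_current_user: bool,
                            emotion: str, instruction_text: str,
                            collector: list) -> str:
        if system_state == "idle":
            return "proceed"                          # go straight to response generation
        if speaker_is_current_user:
            # The user being answered speaks again: check for a self-interruption.
            if len(instruction_text.split()) > 2:     # crude stand-in for the real judgment
                collector.append(instruction_text)    # keep the text in the dialogue collector
                return "stop_current_response"
            return "ignore"
        # The system is responding to another user: interrupt the current broadcast
        # only when the new user's emotion label or instruction content is pressing.
        if emotion == "urgent" or instruction_text.startswith(("stop", "cancel", "wait")):
            return "interrupt_broadcast"
        return "queue"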
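
Finally, the answer is generated from the large model together with the user's complete context. A minimal sketch of per-user prompt assembly; call_large_model() is a placeholder for whatever model interface the implementation actually uses.

    # Hypothetical per-user answer generation with a placeholder large-model call.
    def call_large_model(prompt: str) -> str:
        return "(model reply)"                        # stand-in; replace with a real LLM client

    def generate_answer(history: list[dict], preferences: dict,
                        instruction_text: str, reply_style: str) -> str:
        recent = "\n".join(f"user: {turn['text']}" for turn in history[-10:])
        prompt = (f"Known user preferences: {preferences}\n"
                  f"{recent}\n"
                  f"Current instruction: {instruction_text}\n"
                  f"Reply style: {reply_style}")
        return call_large_model(prompt)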