CN-120853551-B - Real-time voice interaction method and system based on large model
Abstract
The invention provides a real-time voice interaction method and system based on a large model. Multiple rounds of a user's historical dialogue data are collected, the context is deeply understood, and subsequent strategies are adjusted according to the evolution of the historical dialogue content. Through accurate construction of a dynamic context and the powerful semantic understanding capability of the large model, the system better understands user intention and dialogue logic, the generated replies are more in line with human language habits, and interaction naturalness is greatly improved. The intelligent decisions of the large model are optimized based on a reinforcement learning algorithm and human feedback, so that the large model continuously learns during real-time interaction and adjusts its reply strategy according to user feedback and conversation progress, improving the relevance, consistency and user satisfaction of the replies. By setting an interrupt mechanism, the user's interrupt intention can be effectively handled during real-time voice interaction, ensuring the effectiveness of the interaction and thereby improving the accuracy and fluency of real-time voice interaction.
Inventors
- WANG YUSEN
Assignees
- Guangdong Chaoteng Information Technology Co., Ltd. (广东超腾信息科技有限公司)
Dates
- Publication Date
- 20260505
- Application Date
- 20250814
Claims (5)
- 1. A real-time voice interaction method based on a large model, characterized by comprising the following steps: S1, acquiring historical dialogue data of a user for preprocessing, wherein the dialogue data comprises voice data, text data and corresponding transcription and interaction data; S2, extracting audio features from the preprocessed data, wherein the audio features comprise MFCC features and prosodic features, inputting the audio features into an ASR model, converting them into text stream data, and constructing a dynamic context based on the text stream data, wherein the dynamic context comprises the conversation history, scene information of the current conversation, user portrait information and the confidence of the ASR model; S3, loading a large model with pre-configured parameters based on the constructed dynamic context to perform fine-tuning training, and optimizing the intelligent decisions of the large model based on a reinforcement learning algorithm and human feedback; S4, generating the latest reply text according to the dynamic context constructed from the user voice data in the current scene, combined with large model parameters updated in real time according to the user dialogue progress, synthesizing voice based on a TTS model, and setting an interrupt mechanism to judge the playing timing for real-time voice interaction; in step S2, the ASR model is an end-to-end model based on a Transformer architecture; the fused feature sequence of MFCC features and prosodic features is input into the ASR model, which converts it into the corresponding text stream data by performing temporal modeling and acoustic modeling on the fused feature sequence; in step S2, a dynamic context update mechanism is designed to update the contents of each field in real time: for the dialogue history field, the processed user text stream and the system reply are added to the dialogue history and marked with corresponding timestamps to form a time-ordered record of user input and system output; for the scene information field, scene labels are updated according to scene-switching keywords extracted from the text stream, in combination with preset scene recognition rules, and business knowledge corresponding to the new scene is synchronously loaded from a pre-constructed knowledge base; if no scene switch is detected, the current scene information is kept unchanged and real-time business knowledge related to the scene is supplemented according to the conversation progress, wherein the knowledge base comprises a relational database, a document database and a vector database; for the user portrait field, new user attribute information extracted from the text stream is added to the user portrait information field, and if it conflicts with existing user information, it is processed according to preset rules; for the confidence field of the ASR model, the confidence corresponding to the current text stream output by the ASR model is recorded in the field, so that the subsequent system can conveniently adopt different strategies according to the confidence; in step S2, the conversation history length is managed by a sliding window mechanism: when the history record becomes too long after many conversation rounds, the latest N rounds of key conversation content are retained and earlier redundant information is deleted, ensuring the context information stays within the input length limit of the large model; invalid content in the scene information and user portrait information is filtered out, and after each conversation round the integrity and consistency of the dynamic context are checked; the execution process of S3 comprises the following steps: constructing training samples, wherein the training samples comprise dynamic context samples and human feedback data samples, and the feedback data samples comprise ranking data, scoring data and correction data; taking the dynamic context as the core and the large model to be optimized as the agent, defining the elements of a reinforcement learning framework, wherein the elements comprise states, actions and environments, and designing a multidimensional reward function comprising relevance rewards, consistency rewards, task completion rewards, user satisfaction rewards and ASR-confidence-related rewards; selecting a large model loaded with preset parameters as the initial policy model, and performing supervised fine-tuning of the initial policy model with dynamic context-ideal reply pairs as training data; training a reward model for evaluating reply quality, with the human feedback data samples as training samples, the reward model outputting a reward value; based on the initial policy model, combining the rewards output by reinforcement learning and the rewards output by the reward model as reward signals, and iteratively optimizing the policy model through the PPO algorithm so that the generated replies better conform to the dynamic context requirements and human preferences; an interrupt mechanism is set during real-time voice interaction, comprising the following steps: while playing voice, ASR recognition is started in parallel and the user's voice activity is detected in real time in combination with VAD to judge whether the user intends to interrupt; if an interrupt intention is detected, playing of the current TTS voice is stopped immediately and one of the following strategies is executed: a. storing the unplayed reply text and generating a brief acknowledgement; b. regenerating a reply that integrates the unplayed content and the user's new intention, based on the content of the user's interruption; c. ignoring the unplayed content and directly responding to the user's interruption; if no interrupt intention is detected, the ASR is automatically started after the current TTS voice finishes playing, waiting for the user's next round of input.
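The interrupt (barge-in) mechanism of claim 1 can be illustrated with a minimal sketch: while TTS audio plays, VAD output is checked per frame, and on detected user speech the playback stops and one of the three strategies (a/b/c) is selected. All class, method and threshold names here are illustrative assumptions, not part of the patent.

```python
class BargeInController:
    """Sketch of the claim-1 interrupt mechanism: while TTS audio is
    playing, ASR + VAD run in parallel; on detected user speech the
    playback stops and one of three strategies (a/b/c) is applied.
    Names and the selection policy are illustrative assumptions."""

    STRATEGY_STORE = "a"      # store unplayed text, emit a brief acknowledgement
    STRATEGY_MERGE = "b"      # merge unplayed content with the new user intent
    STRATEGY_DISCARD = "c"    # drop unplayed content, answer the interruption

    def __init__(self, vad_threshold=0.5):
        self.vad_threshold = vad_threshold
        self.unplayed_text = ""

    def on_vad_frame(self, speech_prob, remaining_text):
        """Called once per audio frame while TTS plays; returns an action."""
        if speech_prob < self.vad_threshold:
            return ("continue", None)
        # User speech detected: stop playback and pick a strategy.
        self.unplayed_text = remaining_text
        return ("stop", self.choose_strategy(remaining_text))

    def choose_strategy(self, remaining_text):
        # Illustrative policy: if much content is left, merge it into the
        # next reply; if a little is left, store it; otherwise discard.
        if len(remaining_text) > 50:
            return self.STRATEGY_MERGE
        if remaining_text:
            return self.STRATEGY_STORE
        return self.STRATEGY_DISCARD
```

A real system would run this check inside the audio callback thread and feed the stored `unplayed_text` back into the large model's dynamic context when strategy b is chosen.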
- 2. The real-time voice interaction method based on a large model according to claim 1, wherein before the iterative optimization of the policy model by the PPO algorithm, a value model is first initialized to predict the expected reward value of the current dynamic context and assist the calculation of the advantage function, whose formula is: A_t = G_t - V(s); where A_t represents the difference between the actual return and the expected return (a positive value indicates the return exceeded expectation, a negative value that it fell below expectation) and is used to guide the policy update direction, G_t represents the cumulative return at the current time over multiple sessions, V(s) represents the value model's expected reward value for the current state, and s is the dynamic context.
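The advantage calculation of claim 2 is a simple per-step difference between realized return and the value model's prediction; a minimal sketch (the function name is illustrative, and a production pipeline would typically use GAE rather than this plain difference):

```python
def advantage(returns, values):
    """Per-step advantage A_t = G_t - V(s) as in claim 2: a positive
    value means the realized return exceeded the value model's
    expectation, steering the policy toward that action."""
    return [g - v for g, v in zip(returns, values)]
```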
- 3. The real-time voice interaction method based on the large model according to claim 2, wherein a total loss function is obtained by combining the PPO algorithm with the value model, calculated as: L_total = L_CLIP + λ·L_VF; where L_total denotes the total loss function, L_CLIP denotes the clipped objective function used to optimize the return preference of the policy model, L_VF denotes the value function loss used to optimize the prediction accuracy of the value model, and λ denotes the value-loss weight; the total loss is minimized by gradient descent to update the parameters of the new policy model and the value model.
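The combined objective of claim 3 can be sketched for a single sample, assuming the standard PPO form: the clipped policy surrogate is negated so the whole expression is minimized, and the value loss is a squared prediction error. Function and parameter names are illustrative; a real implementation would operate on batched tensors with autograd.

```python
def ppo_total_loss(ratio, advantage, value_pred, value_target,
                   clip_eps=0.2, lam=0.5):
    """Scalar sketch of L_total = L_CLIP + λ * L_VF from claim 3.
    `ratio` is the new/old policy probability ratio; the surrogate is
    clipped to [1-eps, 1+eps] and negated for gradient-descent
    minimization; L_VF is the value model's squared error."""
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    l_clip = -min(ratio * advantage, clipped * advantage)  # maximize surrogate
    l_vf = (value_pred - value_target) ** 2
    return l_clip + lam * l_vf
```

With `ratio` far from 1, the clipped term caps the policy update, which is what keeps PPO steps conservative.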
- 4. A real-time voice interaction system based on a large model, applying the real-time voice interaction method based on a large model as claimed in any one of claims 1-3, characterized in that it comprises the following mutually communicating modules: a data processing module for acquiring a user's dialogue data and preprocessing it; a feature extraction module for extracting audio features from the preprocessed data, wherein the audio features comprise MFCC features and prosodic features; a conversion module for converting the extracted audio features into text data based on an ASR model, and for synthesizing and playing the reply text generated by the large model based on a TTS model; a dynamic context construction module for constructing a dynamic context based on the text data converted by the ASR model, wherein the context comprises the conversation history, scene information of the current conversation, user portrait information and the confidence of the ASR model; and a reply text generation module which loads the large model according to the historical dialogue data and the pre-configured parameters for fine-tuning training, optimizes the intelligent decisions of the large model based on the reinforcement learning algorithm and human feedback, and generates the latest reply text according to the dynamic context constructed from the user voice data in the current scene, with large model parameters updated in real time in combination with the user dialogue progress.
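The dynamic context construction module of claim 4, together with the sliding-window history management of claim 1, can be sketched as a small container that keeps only the latest N dialogue rounds so the context stays within the large model's input length limit. Field names mirror the claims (history, scene, user portrait, ASR confidence), but the implementation details are assumptions.

```python
from collections import deque

class DynamicContext:
    """Illustrative sketch of the claim-4 dynamic-context module with
    the claim-1 sliding window: a bounded deque retains only the most
    recent rounds, implicitly deleting earlier redundant information."""

    def __init__(self, max_rounds=5):
        self.history = deque(maxlen=max_rounds)  # sliding window of rounds
        self.scene = {}            # scene label + loaded business knowledge
        self.user_profile = {}     # user portrait information
        self.asr_confidence = None # confidence of the latest ASR output

    def add_round(self, user_text, system_reply, timestamp, confidence):
        # Each round is a timestamped user-input / system-output pair;
        # appending past maxlen silently evicts the oldest round.
        self.history.append((timestamp, user_text, system_reply))
        self.asr_confidence = confidence

    def to_prompt(self):
        # Flatten the retained rounds into a prompt for the large model.
        lines = [f"[{t}] U: {u} | S: {s}" for t, u, s in self.history]
        return "\n".join(lines)
```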
- 5. The large-model-based real-time voice interaction system according to claim 4, wherein the conversion module comprises an ASR unit, a TTS unit and a voice interaction unit; the ASR unit is used for converting the extracted audio features into text stream data based on the ASR model; the TTS unit is used for converting the reply text generated by the large model into voice based on the TTS model; and the voice interaction unit is used for playing the voice synthesized by the TTS unit, with an interrupt mechanism set to judge the playing timing for real-time voice interaction.
Description
Real-time voice interaction method and system based on large model
Technical Field
The invention relates to the technical field of voice processing, in particular to a real-time voice interaction method and system based on a large model.
Background
With the rapid development of artificial intelligence technology, real-time voice interaction is gradually evolving from a traditional manual mode toward intelligence and automation. The current state of the art still has limitations. First, existing systems reply only by recognizing keywords in the user's speech and triggering preset fixed scripts or rule conditions; once the question exceeds the preset range, no effective reply can be given and the dialogue flow is interrupted. Such systems require frequent modification and updating of the preset scripts and rules for different business scenarios, consuming substantial manpower and material resources for maintenance; the update cycle is long, it is difficult to adapt quickly to market changes, and frequent modification easily introduces new errors. Second, current systems can only judge based on a single round of dialogue and lack deep understanding of multi-round dialogue context, so they cannot adjust subsequent strategies according to the evolution of the historical dialogue content, breaking the dialogue logic and making it difficult to solve complex problems or mine user intentions. Third, when the robot plays a reply voice, existing systems cannot effectively handle the user's interrupt intention: either interruption is not allowed at all, giving a poor user experience, or the interrupt mechanism is too simple and misjudgment causes conversational disorder, affecting communication efficiency and user satisfaction.
In summary, the invention aims to solve the technical problem of how to improve the accuracy and fluency of real-time voice interaction.
Disclosure of Invention
In order to solve the above problems, the invention provides a real-time voice interaction method and system based on a large model: a dynamic context is built and input into the large model, and a reinforcement learning algorithm and human feedback are introduced to optimize the intelligent decisions of the large model, so that the large model continuously learns during real-time interaction and can adjust its reply strategy according to user feedback and conversation progress; an interrupt mechanism is also set so that the user's interrupt intention can be effectively handled during real-time voice interaction, thereby improving the accuracy and fluency of real-time voice interaction. In order to achieve the above purpose, the invention adopts the following technical scheme. The invention provides a real-time voice interaction method based on a large model, comprising the following steps: S1, acquiring historical dialogue data of a user for preprocessing, wherein the dialogue data comprises voice data, text data and corresponding transcription and interaction data; S2, extracting audio features from the preprocessed data, wherein the audio features comprise MFCC features and prosodic features, inputting the audio features into an ASR model, converting them into text stream data, and constructing a dynamic context based on the text stream data, wherein the dynamic context comprises the conversation history, scene information of the current conversation, user portrait information and the confidence of the ASR model; S3, loading a large model with pre-configured parameters based on the constructed dynamic context to perform fine-tuning training, and optimizing the intelligent decisions of the large model based on a reinforcement learning algorithm and human feedback; and S4, generating the latest reply text according to the dynamic context constructed from the user voice data in the current scene and the large model parameters updated in real time in combination with the user dialogue progress, synthesizing the voice based on the TTS model, and setting an interrupt mechanism to judge the playing timing for real-time voice interaction. In step S2, the ASR model is an end-to-end model based on a Transformer architecture; the fused feature sequence of the MFCC features and the prosodic features is input into the ASR model, which converts it into the corresponding text stream data by performing temporal modeling and acoustic modeling on the fused feature sequence. In step S2, a dynamic context update mechanism is designed to update the contents of each field in real time: for the dialogue history field, adding the processed user text stream and the system reply into the dialogue history, and marki