CN-122025196-A - Large-model intelligent epidemiological investigation dialogue method, system, device and storage medium

Abstract

The invention relates to the technical field of artificial intelligence and specifically provides a large-model intelligent epidemiological investigation dialogue method, system, device and storage medium. The method includes: acquiring a user's voice input and generating text information through speech recognition; constructing a dialogue context based on the text information and generating semantic enhancement information through RAG (retrieval-augmented generation) knowledge retrieval; inputting the semantic enhancement information into a large language model to perform natural language processing analysis and generate a dialogue decision result; and generating response content based on the dialogue decision result, outputting interaction information through a response generation technique, and storing the dialogue data in structured form. The invention makes epidemiological investigation question answering more professional and more structured, and facilitates rapid recording and subsequent analysis of epidemiological investigation data.
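As a minimal sketch of the structured-storage step, assuming SQLite stands in for the unspecified relational database: each turn's decision result is serialized to JSON and appended with an INSERT, in the way claim 6 describes. The table name `dialogue_turns`, its columns, the function names and the sample field values are hypothetical and do not come from the patent.

```python
import json
import sqlite3


def store_turn(conn, session_id, turn_no, decision):
    """Serialize one dialogue decision result and append it to the table."""
    # Hypothetical schema: one row per dialogue turn.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dialogue_turns ("
        "session_id TEXT, turn_no INTEGER, intent TEXT, record_json TEXT)"
    )
    # Key-value structure holding the intent, entity and inference fields.
    record = {
        "intent": decision["intent"],
        "entities": decision["entities"],
        "inference": decision.get("inference", ""),
    }
    # Parameterized INSERT; the JSON string keeps the full record.
    conn.execute(
        "INSERT INTO dialogue_turns VALUES (?, ?, ?, ?)",
        (session_id, turn_no, record["intent"],
         json.dumps(record, ensure_ascii=False)),
    )
    conn.commit()


def load_session(conn, session_id):
    """Return the stored records of one session in turn order."""
    rows = conn.execute(
        "SELECT record_json FROM dialogue_turns "
        "WHERE session_id = ? ORDER BY turn_no",
        (session_id,),
    ).fetchall()
    return [json.loads(row[0]) for row in rows]
```

Keeping the intent in its own column while storing the whole record as JSON is only one possible layout; it lets later analysis filter by intent without parsing every record.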

Inventors

HOU YONGSHENG
SUN CHENGXU
YANG GAOCHAO

Assignees

山东浪潮智慧医疗科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251202

Claims (10)

1. A large-model intelligent epidemiological investigation dialogue method, comprising: acquiring a user's voice input and generating text information through speech recognition; constructing a dialogue context based on the text information, and generating semantic enhancement information through RAG knowledge retrieval; inputting the semantic enhancement information into a large language model to perform natural language processing analysis and generate a dialogue decision result; and generating response content based on the dialogue decision result, outputting interaction information through a response generation technique, and storing the dialogue data in structured form.
2. The method of claim 1, wherein acquiring the user's voice input and generating text information through speech recognition comprises: performing audio acquisition on the user's voice input, receiving a continuous audio stream by means of an audio acquisition module; performing format unification on the user's voice input, converting the raw voice data into an input format accepted by the speech recognition model; performing feature extraction on the user's voice input, generating acoustic features from the speech signal; when generating text information through speech recognition, inputting the acoustic features into the speech recognition model and performing acoustic decoding and language decoding to generate preliminary text information; and after the text information is generated through speech recognition, performing confidence screening and result calibration on the preliminary text information to generate valid text information, wherein, while the text information is being generated, the valid text information is output in a streaming recognition mode.
3. The method of claim 1, wherein constructing the dialogue context based on the text information comprises: segmenting the text information to obtain initial text units usable for context assembly; arranging the initial text units in order to generate text units that carry the temporal structure of the dialogue content; performing semantic annotation on the text units to obtain semantic tags; combining the semantic tags with the text units to obtain a context structure that can be input into the retrieval module; performing content screening on the context structure, retaining text fragments whose relevance to the current dialogue turn is higher than a preset threshold; and adding the text information of the current turn to the context structure in a streaming context update mode, so that the context structure maintains a continuous text source.
4. The method of claim 1, wherein generating semantic enhancement information through RAG knowledge retrieval comprises: vectorizing the dialogue context, using token embedding based on a word segmentation sequence to split the text into a token sequence along natural semantic boundaries; encoding the text segments with a vector encoding algorithm based on pre-trained language representations, inputting the token sequence into a Transformer encoder and performing multi-head attention computation and feedforward network computation to obtain a fixed-dimension vector; inputting the fixed-dimension vector into a vector retriever and obtaining a plurality of first candidate knowledge lists using a vector similarity retrieval structure based on cosine similarity or inner-product similarity; performing keyword retrieval on the first candidate knowledge lists, generating a search term list using keyword-list-based query construction; performing term matching on the text content based on an inverted index, scanning the index entries corresponding to the search term list in the inverted index table, extracting the text records that contain all the search terms, and generating a plurality of second candidate knowledge lists; merging the first candidate knowledge lists and the second candidate knowledge lists, comparing the unique identifiers in the knowledge lists one by one, removing duplicates that share an identifier and adding items with different identifiers, to generate a comprehensive candidate knowledge set; ranking the comprehensive candidate knowledge set, computing for each knowledge item in the set its vector similarity score and its keyword match count, generating a comprehensive ranking score through weighted summation, and sorting the set from high to low by comprehensive ranking score to obtain a ranked knowledge list; and writing the ranked knowledge list into the RAG cache in a key-value structure and, at the inference stage, inputting the ranked knowledge list and the dialogue context into the large language model by sequential concatenation to generate the semantic enhancement information.
5. The method of claim 4, wherein inputting the semantic enhancement information into the large language model to perform natural language processing analysis and generate the dialogue decision result comprises: sequentially concatenating the semantic enhancement information and the dialogue context to generate the input sequence of the large language model; performing tokenization on the input sequence, converting it into a token sequence; inputting the token sequence into the encoder side of the large language model, performing embedding mapping, positional encoding computation and multi-head attention computation to generate a context representation; inputting the context representation into a feedforward network structure, performing linear transformation and activation function computation to generate a semantic reasoning representation; performing intent classification inference on the semantic reasoning representation, outputting an intent prediction result using a classification head over preset intent classes; performing entity extraction inference on the semantic reasoning representation, outputting entity boundaries and entity categories using a sequence labeling structure; and performing decision assembly on the intent prediction result and the entity extraction result, generating the dialogue decision result based on preset rules.
6. The method of claim 1, wherein generating response content based on the dialogue decision result, outputting interaction information through a response generation technique, and storing the dialogue data in structured form comprises: when generating the response content based on the dialogue decision result, performing syntax tree construction on the intent prediction result and the entity extraction result, generating a syntax tree using a dependency parser; performing rule replacement on the syntax tree, replacing node labels using preset syntax rules to generate a response text sequence; performing symbol normalization on the response text sequence, applying regular-expression replacement to numbers, times, places and names to obtain parsable text segments; inputting the text segments into a speech synthesis module, performing text-to-phoneme mapping and generating a phoneme sequence using a phoneme dictionary; performing acoustic feature generation on the phoneme sequence, generating an acoustic parameter sequence based on the vocoder; inputting the acoustic parameter sequence into a vocoder decoder to generate the audio data of the interaction information; when storing the dialogue data in structured form, performing field parsing on the dialogue decision result, converting the intent field, the entity field and the inference field into a key-value structure using JSON serialization rules; performing database writing on the key-value structure, writing it into a preset data table using the INSERT operation of a relational database; and performing append storage on the data table, appending the key-value record of the current turn to the dialogue data record set to form structured dialogue data.
7. A large-model intelligent epidemiological investigation dialogue system, comprising: a text information recognition module for acquiring a user's voice input and generating text information through speech recognition; a semantic information generation module for constructing a dialogue context based on the text information and generating semantic enhancement information through RAG knowledge retrieval; a dialogue result generation module for inputting the semantic enhancement information into a large language model to perform natural language processing analysis and generate a dialogue decision result; and a dialogue data storage module for generating response content based on the dialogue decision result, outputting interaction information through a response generation technique, and storing the dialogue data in structured form.
8. The system of claim 7, wherein the semantic information generation module comprises: a natural semantic segmentation unit for vectorizing the dialogue context, using token embedding based on a word segmentation sequence to split the text into a token sequence along natural semantic boundaries; a fixed-dimension vector generation unit for encoding the text segments with a vector encoding algorithm based on pre-trained language representations, inputting the token sequence into a Transformer encoder and performing multi-head attention computation and feedforward network computation to obtain a fixed-dimension vector; a first candidate knowledge list generation unit for inputting the fixed-dimension vector into a vector retriever and obtaining a plurality of first candidate knowledge lists using a vector similarity retrieval structure based on cosine similarity or inner-product similarity; a keyword retrieval unit for performing keyword retrieval on the first candidate knowledge lists and generating a search term list using keyword-list-based query construction; a second candidate list generation unit for performing term matching on the text content based on an inverted index, scanning the index entries corresponding to the search term list in the inverted index table, extracting the text records that contain all the search terms, and generating a plurality of second candidate knowledge lists; a comprehensive candidate knowledge set generation unit for merging the first candidate knowledge lists and the second candidate knowledge lists, comparing the unique identifiers in the knowledge lists one by one, removing duplicates that share an identifier and adding items with different identifiers, to generate a comprehensive candidate knowledge set; a ranking unit for ranking the comprehensive candidate knowledge set, computing for each knowledge item in the set its vector similarity score and its keyword match count, generating a comprehensive ranking score through weighted summation, and sorting the set from high to low by comprehensive ranking score to obtain a ranked knowledge list; and a semantic enhancement information generation unit for writing the ranked knowledge list into the RAG cache in a key-value structure and, at the inference stage, inputting the ranked knowledge list and the dialogue context into the large language model by sequential concatenation to generate the semantic enhancement information.
9. A large-model intelligent epidemiological investigation device, comprising: a memory for storing a large-model intelligent epidemiological investigation dialogue program; and a processor for implementing the steps of the large-model intelligent epidemiological investigation dialogue method according to any one of claims 1-6 when executing the large-model intelligent epidemiological investigation dialogue program.
10. A computer-readable storage medium storing a computer program, characterized in that the readable storage medium stores a large-model intelligent epidemiological investigation dialogue program which, when executed by a processor, implements the steps of the large-model intelligent epidemiological investigation dialogue method according to any one of claims 1-6.
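The hybrid retrieval and ranking recited in claim 4 can be illustrated with a small sketch. It is not the patented implementation: the knowledge entries, the 3-dimensional embeddings, the weights `w_vec` and `w_kw`, and all function names are hypothetical, and keyword matching here runs over the whole toy knowledge base, which is only one possible reading of the claim. Real embeddings would come from a Transformer encoder.

```python
import math
from collections import defaultdict

# Hypothetical toy knowledge base: id -> (text, made-up embedding).
KNOWLEDGE = {
    "k1": ("fever onset time and symptom checklist", [0.9, 0.1, 0.0]),
    "k2": ("close contact definition and exposure settings", [0.1, 0.9, 0.1]),
    "k3": ("travel history and activity trajectory questions", [0.2, 0.3, 0.9]),
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_candidates(query_vec, top_k=2):
    """First candidate list: top_k entries by cosine similarity."""
    scored = [(kid, cosine(query_vec, emb)) for kid, (_, emb) in KNOWLEDGE.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return dict(scored[:top_k])


def build_inverted_index():
    """Map each whitespace-separated term to the ids that contain it."""
    index = defaultdict(set)
    for kid, (text, _) in KNOWLEDGE.items():
        for term in text.split():
            index[term].add(kid)
    return index


def keyword_candidates(terms, index):
    """Second candidate list: ids of records containing all search terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def hybrid_rank(query_vec, terms, w_vec=0.7, w_kw=0.3):
    """Merge both lists by id, then sort by a weighted sum of both signals."""
    first = vector_candidates(query_vec)
    second = keyword_candidates(terms, build_inverted_index())
    merged = set(first) | second  # the set union de-duplicates shared ids
    ranked = []
    for kid in merged:
        text, emb = KNOWLEDGE[kid]
        sim = first.get(kid, cosine(query_vec, emb))
        hits = sum(1 for term in terms if term in text.split())
        ranked.append((kid, w_vec * sim + w_kw * hits))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

For example, `hybrid_rank([0.9, 0.1, 0.0], ["contact"])` pulls `k2` into the result through the keyword path even though it is not among the top two vector matches, which is the situation the merge step is meant to cover.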

Description

Large-model intelligent epidemiological investigation dialogue method, system, device and storage medium

Technical Field

The invention belongs to the technical field of artificial intelligence, and in particular relates to a large-model intelligent epidemiological investigation dialogue method, system, device and storage medium.

Background

In the fields of public health emergency management and infectious disease prevention and control, epidemiological investigation is the core basic work for identifying sources of infection, tracing transmission chains and formulating prevention and control measures. A typical epidemiological investigation revolves around key information such as symptom onset time, activity trajectory, contact history and exposure settings, and generally requires voice interviews, multiple rounds of questions and answers, key information extraction, and standardized record keeping. With the development of artificial intelligence technologies such as speech recognition, speech synthesis, large language models and retrieval-augmented generation, the industry has begun to use intelligent means to improve the efficiency of information collection and the quality of records, shifting the epidemiological investigation process from one that depends heavily on manual question-by-question interviewing to an automated, structured mode of intelligent interaction.
The existing epidemiological investigation mode still relies mainly on manual telephone interviews, which has obvious technical shortcomings in efficiency, data quality and professionalism. On the one hand, manual questioning and recording is time-consuming, making it difficult to quickly collect information from large populations during the spread of an infectious disease. On the other hand, questioning styles, comprehension and recording standards differ between investigators, so the information is poorly structured and data quality is uneven, which makes it hard to use directly for subsequent analysis and decision-making. Meanwhile, traditional speech recognition has limited accuracy on medical terms and professional expressions, and a general-purpose large language model tends to generate inaccurate content when it is not constrained by professional knowledge, so the standardization and rigor of epidemiological investigation question answering are hard to guarantee. In addition, many existing systems offer only basic data entry and lack systematic integration of voice interaction, knowledge enhancement, intelligent decision-making and standardized output, so they cannot form a closed-loop automated epidemiological investigation process.

Disclosure of Invention

In view of the above shortcomings of the prior art, the invention provides a large-model intelligent epidemiological investigation dialogue method, system, device and storage medium to solve the above technical problems.
In a first aspect, the invention provides a large-model intelligent epidemiological investigation dialogue method, including: acquiring a user's voice input and generating text information through speech recognition; constructing a dialogue context based on the text information, and generating semantic enhancement information through RAG knowledge retrieval; inputting the semantic enhancement information into a large language model to perform natural language processing analysis and generate a dialogue decision result; and generating response content based on the dialogue decision result, outputting interaction information through a response generation technique, and storing the dialogue data in structured form.

In an alternative embodiment, acquiring the user's voice input and generating text information through speech recognition includes: performing audio acquisition on the user's voice input, receiving a continuous audio stream by means of an audio acquisition module; performing format unification on the user's voice input, converting the raw voice data into an input format accepted by the speech recognition model; performing feature extraction on the user's voice input, generating acoustic features from the speech signal; when generating text information through speech recognition, inputting the acoustic features into the speech recognition model and performing acoustic decoding and language decoding to generate preliminary text information; and after the text information is generated through speech recognition, performing confidence screening and result calibration on the preliminary text information to generate valid text information, wherein, while the text information is being generated, the valid text information is output in a streaming recognition mode.

In an alternative embodiment, constructing the dialogue context based on the text information includes: segmenting the text information to obtain initial text units that can be used