CN-121999775-A - Intelligent voice interaction method and system

CN121999775A

Abstract

The application discloses an intelligent voice interaction method and system in the technical field of voice interaction. The method comprises: acquiring user voice and determining an emotion label and pronunciation rule features; outputting a language mode; extracting confusion words, filler word features, and reference words based on standard word frequency; determining the accurate meaning of each confusion word based on a resolution rule and the accurate meaning of each reference word according to a layering mechanism; generating response voice and operation instructions from the accurate meanings in combination with the emotion label and filler word features; and executing the operation instructions through a closed-loop flow of instruction resolution, safety verification, decision planning, execution monitoring, and feedback updating, in tight cooperation with the automatic driving core module. This ensures that instructions are accurately converted into vehicle behaviors, accurately judges the user's voice mode, effectively addresses insufficient understanding of non-standard languages and spoken language, improves the accuracy of semantic resolution, and makes voice interaction conform better to natural language logic.
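As a reading aid, the closed-loop execution flow named in the abstract can be pictured as a simple driver function. The following is a minimal, hypothetical Python sketch: the stage callables (resolve, verify_safety, plan, execute_step, update_feedback) and the returned dictionaries are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of the closed-loop flow from the abstract:
# instruction resolution -> safety verification -> decision planning
# -> execution monitoring -> feedback updating. All stage functions
# are illustrative placeholders, not the patent's implementation.

def run_closed_loop(instruction, vehicle_state, resolve, verify_safety,
                    plan, execute_step, update_feedback):
    intent = resolve(instruction)                  # instruction resolution
    if not verify_safety(intent, vehicle_state):   # safety verification
        return {"status": "rejected", "reason": "safety check failed"}
    steps = plan(intent, vehicle_state)            # decision planning
    for step in steps:                             # execution monitoring
        result = execute_step(step)
        update_feedback(step, result)              # feedback updating
        if not result.get("ok", False):
            return {"status": "aborted", "failed_step": step}
    return {"status": "done"}
```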

Inventors

  • YAO QING

Assignees

  • 鸿灌环境技术有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-03-25

Claims (10)

  1. An intelligent voice interaction method, characterized by comprising the following steps: S1, acquiring user voice, determining emotion labels and pronunciation rule features, inputting the pronunciation rule features into a language recognition model, and outputting a language mode, wherein the language mode is marked as a standard mode if the language is a standard language with standard grammar, as a non-standard mode if the language is a non-standard language, and as a spoken mode if the language is a standard language with non-standard grammar (a mode-dispatch sketch follows the claims); S2, for the standard mode, performing word segmentation on the voice text to obtain a standard word frequency and generating response voice and operation instructions in combination with the emotion labels; for the non-standard mode and the spoken mode, extracting confusion words, filler word features, and reference words based on the standard word frequency, determining the accurate meaning of each confusion word based on a resolution rule, determining the accurate meaning of each reference word according to a layering mechanism, generating response voice and operation instructions from the accurate meanings in combination with the emotion labels and filler word features, and executing the operation instructions according to vehicle real-time data, environment data, and user instruction control priority; and S3, calculating in real time an attenuation curve of reference resolution accuracy, calculating layered storage efficiency, data processing efficiency, and recognition accuracy from the historical data and the attenuation curve, and identifying the trend of the curve peaks to determine an optimal recognition rule.
  2. The method of claim 1, wherein the user's speech is converted in real time using an automatic speech recognition model and a speech text is output; filler word positions are marked during the speech recognition stage, prosodic features of the filler words are extracted, and a user-personalized filler word pattern library is created; the user's speech text is obtained and compared with the filler word pattern library, and filler word density, filler word type, and filler word positions are identified as the filler word features (see the filler-word sketch after the claims).
  3. The method of claim 1, wherein extracting the confusion words is based on a pre-trained language model that calculates the semantic confidence of each word in context, comprising: pre-labeling a dataset in which each word is labeled as a confusion word or not, as training data; training the model with the training set and verifying parameters with the verification set; optimizing the model parameters to obtain a fully trained pre-trained language model; outputting a list of semantic confidence scores for each word, wherein the words with the lowest confidence are identified as confusion words; and marking the positions of the low-confidence words using an attention weight matrix (see the confusion-word sketch after the claims).
  4. The method of claim 1, wherein determining the accurate meaning of each confusion word based on the resolution rule comprises: centering on the confusion word and taking a plurality of words before and after it to form a context window; extracting the window content in real time with a sliding window algorithm; obtaining a vector representation of each meaning of the confusion word from a word vector model; calculating the similarity between each meaning vector and the context vector; presetting a relevance similarity threshold, eliminating any meaning whose similarity is below the threshold, and marking the remaining meanings as candidate meanings; inputting the candidate meanings and their meaning data features into a conditional random field model to obtain a probability ranking; and selecting the meaning with the highest probability as the accurate meaning and outputting its confidence, wherein the meaning data features include the word, part-of-speech label, position information, semantic similarity score, and confidence (see the disambiguation sketch after the claims).
  5. The method of claim 1, wherein the accurate meaning of each reference word is determined according to a layering mechanism, in which the historical importance score and storage level of the reference word are looked up; omitted subjects are identified based on semantic role labeling; a plurality of candidate entities are generated by capturing referring relations across dialogue rounds with an attention mechanism; an entity-relation graph is built in which nodes are entities and edges are relations; indirect referring relations are identified and neighbor nodes are determined; multi-round message passing is performed with a GNN; matching scores between the reference word and the candidate entities are calculated; and the candidate entity with the highest matching score is selected as the accurate meaning of the reference word (see the graph-matching sketch after the claims).
  6. The method of claim 1, wherein a memory importance assessment model is constructed to generate an importance score for each reference word, and the reference words are divided by importance score into storage levels comprising a short-term storage layer, a medium-term storage layer, and a long-term storage layer; the utilization rate of each reference word in each storage layer is monitored, cleaning is performed if the utilization rate is below 40%, and the reference word is automatically migrated to the upper storage layer if the utilization rate is above 70%; the importance score is calculated according to the following formula: $S_i = w_1 C_i + w_2 F_i + w_3 R_i + w_4 P_i + w_5 T_i$, wherein $i$ indexes the i-th reference word, $S_i$ is the importance score, $C_i$ is the reference-chain integrity score, $F_i$ is the frequency-of-use score, $R_i$ is the context correlation score, $P_i$ is the personalized weight score, and $T_i$ is the time-series continuity score; $w_1, \ldots, w_5$ are the weight coefficients of the corresponding scores, initialized to preset values whose sum is 1 (see the tiered-storage sketch after the claims).
  7. The method of claim 1, wherein the language recognition model comprises a non-standard language prosody template library and a standard grammar index; similarity is calculated between the user's pronunciation rule features and the non-standard language prosody template library against a preset similarity threshold; if the similarity exceeds the threshold, the non-standard language type with the highest similarity is selected as the reference non-standard language and the speech is marked as the non-standard mode; otherwise the speech is marked as a standard language and its features are compared with the standard grammar index, with standard grammar marked as the standard mode and non-standard grammar marked as the spoken mode; the standard grammar index comprises sentence integrity, vocabulary standardization, and sentence complexity (the mode-dispatch sketch after the claims covers this decision).
  8. The method of claim 1, wherein basic acoustic features and prosodic features are extracted from the user's voice, the basic acoustic features comprising a fundamental frequency sequence, energy features, spectral features, and duration features, and the prosodic features comprising phoneme-level, syllable-level, and sentence-level prosodic features.
  9. The method of claim 8, wherein a tone rule vector is generated from the phoneme-level and syllable-level fundamental frequency features: phoneme-level fundamental frequency patterns and syllable-level fundamental frequency contours are taken as feature inputs, a Gaussian mixture model performs cluster analysis on the fundamental frequency features, each cluster center represents a tone mode, and the matching degree of the current voice to each tone mode is calculated by maximum a posteriori estimation (see the GMM sketch after the claims); a rhythm rule vector is generated from the duration and pause features, with phoneme duration features, syllable duration rules, and sentence-level pause patterns as feature inputs, and the statistical characteristics of each rhythm parameter are computed to obtain the rhythm rule vector; an intensity rule vector is generated from the energy features and stress patterns, with phoneme-level energy distribution, syllable-level energy envelope, and sentence-level energy trend as feature inputs; a dynamic time warping algorithm aligns the energy envelope templates, principal component analysis extracts the main patterns of the energy distribution, stress patterns are identified, and the intensity rule vector is generated.
  10. An intelligent voice interaction system applying the method of any one of claims 1 to 9, characterized in that the system comprises a voice feature extraction module, a language mode recognition module, a word frequency division module, a semantic resolution module, and a response generation and optimization module, wherein the voice feature extraction module acquires user voice in real time for preprocessing and determines emotion labels and pronunciation rule features; the language mode recognition module judges the language mode of the user voice according to the pronunciation rule features and the grammar index and extracts key semantic information; the word frequency division module generates standard word frequencies according to the different language modes and their word segmentation standards, and extracts confusion words, filler word features, and reference words; the semantic resolution module determines the accurate meaning of each confusion word according to the resolution rule and the accurate meaning of each reference word according to the layering mechanism; and the response generation and optimization module generates emotion-matched, context-adapted response voice and operation instructions from the semantic representation, emotion labels, and vehicle state, and optimizes the interaction experience.
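The three-way mode decision of claims 1 and 7 can be made concrete as follows. This is a minimal sketch only: the template library, the cosine similarity measure, the threshold value, and the helper `meets_grammar_index` are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_mode(pron_features, template_library, sim_threshold,
                  meets_grammar_index):
    """Claims 1 & 7 sketch: non-standard vs standard vs spoken mode."""
    # Compare pronunciation rule features against each non-standard
    # prosody template; keep the best match.
    best_lang, best_sim = None, -1.0
    for lang, template in template_library.items():
        s = cosine(pron_features, template)
        if s > best_sim:
            best_lang, best_sim = lang, s
    if best_sim > sim_threshold:
        return {"mode": "non-standard", "reference_language": best_lang}
    # Standard language: check the grammar index (sentence integrity,
    # vocabulary standardization, sentence complexity).
    if meets_grammar_index(pron_features):
        return {"mode": "standard"}
    return {"mode": "spoken"}
```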
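The filler word features of claim 2 (density, type, position) reduce to a comparison against the personalized pattern library. A minimal sketch, assuming the speech text is already tokenized and the library is a plain set of words:

```python
def filler_word_features(tokens, pattern_library):
    """Claim 2 sketch: density, types, and positions of filler words.

    `tokens` is the recognized speech text as a word list;
    `pattern_library` is the user-personalized set of filler words
    (e.g. {"um", "uh", "like"}), illustrative contents only.
    """
    positions = [i for i, w in enumerate(tokens) if w in pattern_library]
    types = sorted({tokens[i] for i in positions})
    density = len(positions) / max(len(tokens), 1)
    return {"density": density, "types": types, "positions": positions}

# Example: filler_word_features("um turn uh left".split(), {"um", "uh"})
# -> {'density': 0.5, 'types': ['uh', 'um'], 'positions': [0, 2]}
```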
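Claim 3 flags the lowest-confidence words as confusion words. The sketch below assumes the per-token semantic confidences have already been produced by the pre-trained language model; the threshold value is an illustrative assumption.

```python
def extract_confusion_words(scored_tokens, confidence_threshold=0.35):
    """Claim 3 sketch: flag low-confidence words as confusion words.

    `scored_tokens` is a list of (word, semantic_confidence) pairs as
    a pre-trained language model would emit per token; the threshold
    is an illustrative assumption.
    """
    flagged = [(i, w, c) for i, (w, c) in enumerate(scored_tokens)
               if c < confidence_threshold]
    # The flagged positions stand in for the attention-weight marking
    # described in the claim.
    return flagged
```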
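The context-window disambiguation of claim 4 can be sketched with plain vector arithmetic. The window size, threshold, and `word_vec` embedding function are assumptions; the CRF re-ranking step from the claim is omitted here for brevity.

```python
import numpy as np

def disambiguate(tokens, idx, sense_vectors, word_vec, window=3,
                 relevance_threshold=0.2):
    """Claim 4 sketch: context-window sense selection for one confusion word.

    `sense_vectors` maps each candidate meaning to a vector from a word
    vector model; `word_vec` embeds a context word.
    """
    # Context window centered on the confusion word.
    left, right = max(0, idx - window), min(len(tokens), idx + window + 1)
    context = [t for j, t in enumerate(tokens[left:right], start=left)
               if j != idx]
    ctx = np.mean([word_vec(t) for t in context], axis=0)

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Drop meanings below the relevance threshold, then pick the best
    # of the remaining candidate meanings.
    scored = {m: cos(v, ctx) for m, v in sense_vectors.items()}
    candidates = {m: s for m, s in scored.items() if s >= relevance_threshold}
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, candidates[best]  # meaning and its similarity score
```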
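For claim 5, the entity-relation graph matching can be pictured with a toy message-passing pass. Note the hedge: plain neighbor averaging below is a stand-in for the GNN named in the claim, and all data shapes are assumptions.

```python
import numpy as np

def match_reference(ref_vec, entity_vecs, adjacency, rounds=2):
    """Claim 5 sketch: score candidate entities for a reference word.

    `entity_vecs` is {entity: feature vector}; `adjacency` is
    {entity: [neighbor entities]} for the entity-relation graph.
    Neighbor averaging stands in for GNN message passing.
    """
    h = {e: np.asarray(v, dtype=float) for e, v in entity_vecs.items()}
    for _ in range(rounds):
        h = {e: (h[e] + sum((h[n] for n in adjacency.get(e, [])),
                            np.zeros_like(h[e])))
                / (1 + len(adjacency.get(e, [])))
             for e in h}
    scores = {e: float(np.dot(v, ref_vec)) for e, v in h.items()}
    return max(scores, key=scores.get), scores
```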
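The scoring formula and the 40%/70% migration thresholds of claim 6 translate directly into code. The weight values are not given by the patent (only that they sum to 1), so the sketch takes them as parameters:

```python
def importance_score(scores, weights):
    """Claim 6 formula: S_i = sum_k w_k * x_k with sum(w) == 1.

    `scores` = (chain integrity, use frequency, context correlation,
    personalized weight, time-series continuity); weight values are
    caller-supplied since the patent's initial values are not given.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * x for w, x in zip(weights, scores))

def migrate(utilization):
    """Claim 6 thresholds: clean below 40%, promote above 70%."""
    if utilization < 0.40:
        return "clean"
    if utilization > 0.70:
        return "promote to upper layer"
    return "keep"
```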
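Finally, the tone-mode matching of claim 9 (GMM clustering plus maximum a posteriori matching) maps onto a standard mixture model: the posterior component probabilities are exactly the per-mode match degrees. A minimal sketch with synthetic data; the feature dimensionality and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Claim 9 sketch: cluster fundamental-frequency (F0) features with a
# Gaussian mixture; each component center stands for a tone mode, and
# predict_proba gives the posterior match of new speech to each mode.

rng = np.random.default_rng(0)
f0_features = rng.normal(size=(200, 4))      # e.g. phoneme/syllable F0 stats
gmm = GaussianMixture(n_components=3, random_state=0).fit(f0_features)

current_utterance = rng.normal(size=(1, 4))
match = gmm.predict_proba(current_utterance)  # posterior per tone mode
best_mode = int(np.argmax(match))
```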

Description

Intelligent voice interaction method and system

Technical Field

The invention relates to the technical field of voice interaction, and in particular to an intelligent voice interaction method and system.

Background

With the development of automatic driving technology, voice interaction is gradually gaining attention as a more natural and convenient control mode, while vehicle control systems are becoming increasingly complex. Traditional man-machine interaction modes (such as physical buttons and touch screens) disperse the user's attention in a driving scene and thus affect safety. Voice interaction provides a more natural and convenient control mode. However, existing vehicle-mounted voice systems are mostly limited to infotainment, simple navigation, or vehicle body control (e.g. air conditioning, car windows), and are poorly integrated with the autopilot core decision-making module. In an automatic driving scene, the driver needs to interact with the vehicle efficiently and accurately through voice to achieve fine-grained control of, and real-time feedback from, the automatic driving system. Instruction understanding is commonly rigid: only preset fixed sentence patterns can be processed, and natural language instructions containing complex intentions and multi-condition constraints cannot be understood. Instruction understanding is also divorced from the vehicle's current state (such as speed, position, and driving mode), the traffic environment, and the dialogue history, so misoperation easily arises. It is difficult to safely and consistently translate high-level user intent (such as "follow the lead vehicle") into fine control parameters executable by an autopilot system. Existing voice systems struggle to meet these requirements: they cannot accurately understand voice commands from drivers with non-standard accents, handle spoken and non-standard expressions poorly, perform badly on complex semantics and reference relations, and suffer reduced data processing efficiency and recognition accuracy. These problems severely restrict the wide application of voice interaction in the field of autopilot.

Disclosure of Invention

The intelligent voice interaction method and system solve the prior-art problems that the voice system is limited in function and cannot accurately understand or rapidly process complex semantics and reference relations; they achieve accurate recognition of the user's voice mode, generate context-adapted response voice and operation instructions, and improve the accuracy and naturalness of voice interaction. The application provides an intelligent voice interaction method comprising the following steps: S1, acquiring user voice, determining emotion labels and pronunciation rule features, inputting the pronunciation rule features into a language recognition model, and outputting a language mode, which is marked as a standard mode if the language is a standard language with standard grammar, as a non-standard mode if the language is a non-standard language, and as a spoken mode if the language is a standard language with non-standard grammar.
S2, for the standard mode, performing word segmentation on the voice text to obtain a standard word frequency, and generating response voice and operation instructions in combination with the emotion labels; for the non-standard mode and the spoken mode, extracting confusion words, filler word features, and reference words based on the standard word frequency, determining the accurate meaning of each confusion word based on a resolution rule, determining the accurate meaning of each reference word according to a layering mechanism, and generating response voice and operation instructions from the accurate meanings in combination with the emotion labels and filler word features; and S3, calculating in real time an attenuation curve of reference resolution accuracy, calculating layered storage efficiency, data processing efficiency, and recognition accuracy from the historical data and the attenuation curve, and identifying the trend of the curve peaks to determine an optimal recognition rule. Further, an automatic speech recognition model converts the user's voice in real time and outputs a speech text; filler word positions are marked in the speech recognition stage, prosodic features of the filler words are extracted, and a personalized filler word pattern library is established; the user's speech text is obtained and compared with the filler word pattern library, and filler word density, filler word type, and filler word positions are identified as the filler word features. Further, extracting the confusion words is based on a pre-trained language model that calculates the semantic confidence of each word in context, an