CN-122024707-A - Intelligent voice recognition and interaction system and method integrating AI visual information
Abstract
The invention discloses an intelligent voice recognition and interaction system and method that fuse AI visual information, relating to the technical field of voice recognition. The method effectively solves the problems of low recognition rate in complex sound field environments, ambiguous referring expressions, and cumbersome wake-up. A cross-modal dynamic gating fusion network combines the real-time signal-to-noise ratio with a visual confidence to dynamically distribute audio-video weights, achieving highly robust lip-reading-assisted recognition in high-noise environments. For referential ambiguity in natural language, gaze and gesture rays are projected into a three-dimensional semantic map and tested for collisions, realizing "what you see is what you control" intention analysis.
Inventors
- LIN MINYI
- LIU JIEQI
- CHEN JIANGLIANG
Assignees
- 福建云端智能科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-15
Claims (10)
- 1. An intelligent voice recognition and interaction method integrating AI visual information, characterized by comprising the following steps: S1, collecting sound field signals and light field signals in the environment, and aligning them through timestamps to obtain an audio stream and an RGB-D video stream; S2, extracting acoustic feature vectors from the audio stream of S1, and visual speech vectors and three-dimensional spatial interaction vectors from the RGB-D video stream; S3, mapping the acoustic feature vectors and visual speech vectors extracted in S2 into a latent semantic space through a cross-modal dynamic gating fusion network, dynamically distributing the fusion weights of the auditory and visual channels according to a signal-to-noise ratio index and a lip occlusion rate index, generating a multi-modal speech semantic representation, and decoding it to obtain a preliminary speech text sequence; S4, performing deictic-word trigger detection on the preliminary speech text sequence generated in S3; when a deictic word is detected, casting a sight ray and a gesture ray in the constructed local three-dimensional semantic map using the three-dimensional spatial interaction vectors extracted in S2, calculating the intersection probability and spatial distance score between each ray and the object bounding boxes in the scene, and resolving the physical object ID referred to by the deictic word; and S5, monitoring in real time the dwell time and lip movement state of the three-dimensional spatial interaction vectors of S2 relative to a preset interaction region, and, when an intention trigger is determined, combining the preliminary speech text sequence generated in S3 with the physical object ID resolved in S4 to generate a final device control instruction.
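The claims are implementation-agnostic; to make the individual steps concrete, short illustrative Python sketches follow the relevant claims below. None of them is the patented implementation; names, thresholds and formulas not stated in the claims are assumptions. This first sketch covers the timestamp alignment of S1, pairing each video frame with its nearest audio frame; the nearest-neighbor policy and the `max_skew` tolerance are assumed, not part of the claim.

```python
import numpy as np

def align_streams(audio_ts, video_ts, max_skew=0.02):
    """Pair each video frame with the nearest audio frame by timestamp.

    audio_ts, video_ts: 1-D arrays of capture timestamps (seconds).
    Returns (video_idx, audio_idx) pairs whose skew is within max_skew;
    frames without a close enough partner are dropped.
    """
    audio_ts = np.asarray(audio_ts)
    pairs = []
    for vi, vt in enumerate(np.asarray(video_ts)):
        ai = int(np.argmin(np.abs(audio_ts - vt)))  # nearest audio frame
        if abs(audio_ts[ai] - vt) <= max_skew:
            pairs.append((vi, ai))
    return pairs

# e.g. 30 fps video frames against 100 Hz audio frames
video_ts = np.arange(0, 1, 1 / 30)
audio_ts = np.arange(0, 1, 0.01)
print(align_streams(audio_ts, video_ts)[:3])  # [(0, 0), (1, 3), (2, 7)]
```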
- 2. The intelligent voice recognition and interaction method fusing AI visual information of claim 1, wherein S1 comprises the following steps: S11, configuring a six-microphone annular array, setting the audio sampling rate and bit depth, simultaneously configuring an RGB-D depth camera, and setting the video resolution, frame rate and depth measuring range, to obtain a multi-modal perception hardware environment; S12, executing hardware-level time synchronization: an FPGA generates a unified synchronization pulse that simultaneously triggers the analog-to-digital conversion of the microphone array and the exposure of the depth camera, yielding a raw multi-channel audio stream and an RGB-D video stream; and S13, performing beamforming on the multi-channel audio stream, calculating the sound source direction of arrival (DOA) with a generalized cross-correlation algorithm, enhancing the speech signal in the target direction and suppressing background noise to output the audio stream; and, based on the RGB-D video stream, locating the user's face region with a face detection algorithm and removing background pixels based on depth information, retaining only the foreground visual information within the user's interaction range, to obtain the denoised RGB-D video stream.
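S13 computes the direction of arrival with a generalized cross-correlation algorithm. A common instantiation is GCC-PHAT; the sketch below estimates the time difference of arrival (TDOA) for one microphone pair and converts it to a broadside angle under a far-field model. Extending it to the claimed six-microphone annular array (pairwise TDOAs plus fusion) is omitted; the PHAT weighting and the 5 cm spacing are standard assumptions, not details from the claim.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """TDOA between two microphone signals via GCC-PHAT whitening."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                   # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # TDOA in seconds

def doa_from_tdoa(tau, mic_dist, c=343.0):
    """Broadside DOA (radians) for a two-mic pair separated by mic_dist."""
    return np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0))

rng = np.random.default_rng(0)
fs = 16000
s = rng.standard_normal(4096)
delayed = np.roll(s, 2)                      # simulate a 2-sample lag
tau = gcc_phat_tdoa(delayed, s, fs)          # ~ 2 / fs = 125 microseconds
angle_deg = np.degrees(doa_from_tdoa(tau, mic_dist=0.05))
```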
- 3. The intelligent voice recognition and interaction method fusing AI visual information of claim 1, wherein S2 comprises the following steps: S21, using a pre-trained Conformer encoder: framing and windowing the audio stream output by S1, extracting Mel filter-bank features, splicing on first-order and second-order differences, capturing long-term dependencies through multi-layer self-attention, and outputting the acoustic feature vector; S22, cropping the region of interest (ROI) around the user's lips in the RGB-D video stream and normalizing it into a grayscale image sequence, extracting the continuous lip motion pattern along the time dimension with a ResNet-3D network, and outputting the visual speech vector; and S23, detecting the user's hand skeleton key points and eyeball center points in the RGB-D video stream with a skeleton detection model, and back-projecting their two-dimensional pixel coordinates into the camera coordinate system using the corresponding Z-axis values of the depth map, generating a three-dimensional hand key-point coordinate set and a three-dimensional gaze direction vector, which together form the three-dimensional spatial interaction vector.
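S23 back-projects 2-D pixel coordinates into the camera coordinate system using the depth map. Under the usual pinhole model this is one multiplication per axis; the intrinsics below are illustrative values for a 640x480 depth sensor, not parameters from the patent.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth Z (metres)
    into camera coordinates (X, Y, Z)."""
    z = float(depth)
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Illustrative intrinsics (assumed values, not from the patent)
fx = fy = 525.0
cx, cy = 319.5, 239.5
fingertip_3d = backproject(400, 220, 1.2, fx, fy, cx, cy)
print(fingertip_3d)   # [ 0.184  -0.0446  1.2  ] approximately
```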
- 4. The intelligent voice recognition and interaction method fusing AI visual information of claim 1, wherein S3 comprises the following steps: S31, calculating the ratio of the signal energy of the current audio frame to the background noise energy of non-speech segments to obtain the environmental signal-to-noise ratio; S32, calculating an initial visual confidence, then checking whether the lip area is occluded, whether the illumination intensity is below a preset threshold, and whether the head posture deflection exceeds a yaw-angle threshold; if any condition holds, reducing the initial visual confidence to obtain the final visual confidence, otherwise taking the initial value as the final visual confidence; S33, constructing a cross-modal dynamic gating unit that defines an audio gating coefficient and a visual gating coefficient, each computed by a fully-connected layer followed by a Sigmoid activation; inputting the acoustic feature vector, the visual speech vector, the normalized environmental signal-to-noise ratio and the normalized final visual confidence into the gating unit to obtain the current audio and visual gating coefficients; S34, computing the fused features with a fusion formula based on the current audio and visual gating coefficients; and S35, feeding the fused features into a Transformer decoder and outputting the highest-probability preliminary speech text sequence with a beam search algorithm.
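The claim specifies that the gating coefficients come from a fully-connected layer plus a Sigmoid over the two feature vectors and the two normalized scalars, but does not spell out the fusion formula; the gated sum below is one plausible reading. The weight matrix here is random, standing in for learned parameters, and the feature dimension is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # illustrative feature dimension

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical learned parameters of the gating unit's fully-connected layer
W = rng.standard_normal((2, 2 * D + 2)) * 0.1   # two gates: audio, visual
b = np.zeros(2)

def gated_fusion(h_a, h_v, snr_norm, vis_conf):
    """Cross-modal dynamic gating (S33-S34): the gate input concatenates
    both feature vectors with the normalized SNR and the final visual
    confidence; the fusion is read here as a gated sum."""
    x = np.concatenate([h_a, h_v, [snr_norm, vis_conf]])
    g_a, g_v = sigmoid(W @ x + b)
    return g_a * h_a + g_v * h_v

fused = gated_fusion(rng.standard_normal(D), rng.standard_normal(D), 0.3, 0.9)
```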
- 5. The intelligent voice recognition and interaction method fusing AI visual information of claim 1, wherein S4 comprises the following steps: S41, constructing a deictic word dictionary containing words with spatial directivity; traversing the speech text sequence generated in S3, marking a trigger point whenever a dictionary word matches, and intercepting the three-dimensional spatial interaction vectors within a preset time window around the trigger point to obtain an average sight vector and an average gesture vector; S42, performing instance segmentation and bounding-box fitting on the objects in the scene using the point cloud acquired by the depth camera, generating an object list containing object category label IDs, center coordinates and three-dimensional dimensions, to obtain the local three-dimensional semantic map; S43, in the local three-dimensional semantic map, constructing the sight ray and the gesture ray with the average sight vector and the average gesture vector of S41 as directions and, respectively, the point between the user's eyebrows and the user's index fingertip as origins; S44, for each object in the scene, calculating the intersection depth and the Euclidean distance of the sight ray and the gesture ray with respect to the object bounding box, defining a fused pointing score formula, and computing the final fused pointing score with adjustable weights; and S45, selecting the object with the highest fused pointing score as the pointed object and binding its object ID to the voice intention.
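S44 needs an intersection depth and a ray-to-box distance for every object. A standard way to get both is the slab method for ray/box intersection plus the perpendicular point-to-ray distance; the sketch below assumes axis-aligned bounding boxes, which the claim does not mandate.

```python
import numpy as np

def ray_aabb(origin, direction, box_min, box_max):
    """Slab-method ray / axis-aligned bounding-box test.
    Returns (hit, t_enter): whether the ray intersects the box and the
    entry depth along the ray (the 'intersection depth' of S44)."""
    d = direction / np.linalg.norm(direction)
    inv = 1.0 / np.where(d == 0, 1e-12, d)   # guard against zero components
    t1 = (box_min - origin) * inv
    t2 = (box_max - origin) * inv
    t_near = np.max(np.minimum(t1, t2))
    t_far = np.min(np.maximum(t1, t2))
    return (t_far >= max(t_near, 0.0)), t_near

def ray_to_center_distance(origin, direction, center):
    """Perpendicular distance from the box center to the ray."""
    d = direction / np.linalg.norm(direction)
    v = center - origin
    return np.linalg.norm(v - np.dot(v, d) * d)

hit, depth = ray_aabb(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                      np.array([-0.2, -0.2, 1.8]), np.array([0.2, 0.2, 2.2]))
```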
- 6. The intelligent voice recognition and interaction method fusing AI visual information of claim 5, wherein defining the fused pointing score formula in S44 and computing the final score with adjustable weights comprises the following steps: S441, defining the sight ray, the gesture ray and the object bounding box; S442, calculating the Euclidean distance from each ray to the center of the object bounding box, and the Boolean value of whether each of the sight ray and the gesture ray intersects the object bounding box; and S443, defining a sight confidence weight and a gesture confidence weight: detecting in real time the height of the user's wrist relative to the abdomen, judging explicit pointing if it exceeds a preset height threshold, and otherwise judging implicit gaze-based pointing; based on each ray's distance to the bounding-box center and its intersection Boolean, computing a sight ray score and a gesture ray score through a single-ray score formula; and, based on the sight ray score and the gesture ray score, computing the final fused pointing score.
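The patent names a "single-ray score formula" and a weighted fusion but discloses neither closed form. The sketch below is one plausible instantiation: a Gaussian falloff on the ray-to-center distance, boosted when the ray actually hits the box, combined with the explicit/implicit weight switch of S443. All numeric constants are assumptions.

```python
import numpy as np

def single_ray_score(distance, hit, sigma=0.3):
    """Illustrative single-ray score: Gaussian falloff on the ray-to-center
    distance, boosted when the ray intersects the bounding box."""
    return (1.0 if hit else 0.5) * np.exp(-(distance / sigma) ** 2)

def fused_pointing_score(gaze_d, gaze_hit, hand_d, hand_hit,
                         wrist_above_abdomen, h_thresh=0.10):
    # Explicit pointing: trust the gesture ray more; implicit: trust the gaze
    if wrist_above_abdomen > h_thresh:
        w_gaze, w_hand = 0.3, 0.7        # assumed weights, not from the claim
    else:
        w_gaze, w_hand = 0.8, 0.2
    return (w_gaze * single_ray_score(gaze_d, gaze_hit)
            + w_hand * single_ray_score(hand_d, hand_hit))

score = fused_pointing_score(0.15, True, 0.40, False, wrist_above_abdomen=0.25)
```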
- 7. The intelligent voice recognition and interaction method fusing AI visual information of claim 1, wherein S5 comprises the following steps: S51, setting a conical virtual space in front of the user's field of view that covers the physical area containing the smart devices and their sensors, to obtain the interaction region ROI; S52, calculating in real time the angle between the user's sight vector and the interaction region ROI, and judging a gaze-locked state if the angle is below an angle threshold and the duration exceeds a duration threshold; S53, calculating the opening amplitude and change rate of the lip key points from S2, and judging an intention trigger if a continuous lip movement signal exceeding a silence threshold is detected while in the gaze-locked state; S54, upon receiving the intention trigger, transitioning from the standby monitoring state to the active listening state, and combining the preliminary speech text sequence generated in S3 with the physical object ID resolved in S4 to generate the final device control instruction; and S55, after processing, directly issuing the control instruction if its instruction confidence is above a first confidence threshold; initiating a voice query for confirmation if the confidence lies between a second confidence threshold and the first; and automatically returning to the standby monitoring state if the line of sight stays outside the ROI longer than a sight move-out time threshold.
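S52 reduces to an angle test between the gaze vector and the cone axis plus a dwell timer. A minimal monitor follows; the cone half-angle and dwell duration are illustrative, since the claim leaves both thresholds unspecified.

```python
import numpy as np

ANGLE_THRESH = np.deg2rad(12.0)   # assumed cone half-angle
DWELL_THRESH = 0.8                # assumed seconds of sustained gaze

class GazeLockMonitor:
    """Tracks how long the gaze stays inside the conical ROI (S52)."""
    def __init__(self):
        self.dwell = 0.0

    def update(self, gaze_dir, cone_axis, dt):
        g = gaze_dir / np.linalg.norm(gaze_dir)
        a = cone_axis / np.linalg.norm(cone_axis)
        angle = np.arccos(np.clip(np.dot(g, a), -1.0, 1.0))
        # accumulate dwell while inside the cone, reset on leaving
        self.dwell = self.dwell + dt if angle < ANGLE_THRESH else 0.0
        return self.dwell >= DWELL_THRESH     # True => gaze-locked

monitor = GazeLockMonitor()
for _ in range(30):                           # 30 frames at ~30 fps
    locked = monitor.update(np.array([0.05, 0.0, 1.0]),
                            np.array([0.0, 0.0, 1.0]), dt=1 / 30)
```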
- 8. The intelligent voice recognition and interaction method fusing AI visual information of claim 1, further comprising an adaptive online learning mechanism comprising the following steps: S61, defining positive and negative feedback signals; S62, constructing a contrastive loss function that, for negative samples, uses the feature distance between the predicted object and the truly intended object as a penalty term and, for positive samples, reduces the prediction error under the current environmental parameters; and S63, collecting the user's feedback on the interaction result as the current positive and negative feedback signals, and iteratively updating the weight parameters of the cross-modal dynamic gating fusion network and the ray weighting coefficients of the three-dimensional spatial semantic alignment engine based on these signals and the loss function.
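S62 describes a contrastive loss with a distance penalty for negative feedback and error reduction for positive feedback. The claim's wording is ambiguous about the exact form; the margin-based pairing below is the textbook contrastive loss and should be read as one possible reading, not the patented formula.

```python
import numpy as np

def feedback_loss(pred_feat, true_feat, positive, margin=1.0):
    """Contrastive-style loss in the spirit of S62 (margin is assumed).
    Positive feedback pulls the prediction toward the intended object's
    features; negative feedback pushes mistaken matches apart."""
    d = np.linalg.norm(pred_feat - true_feat)
    if positive:
        return d ** 2                       # shrink the prediction error
    return max(0.0, margin - d) ** 2        # penalize close wrong matches

loss = feedback_loss(np.array([0.2, 0.9]), np.array([0.1, 1.0]), positive=True)
```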
- 9. The intelligent voice recognition and interaction method fusing AI visual information of claim 8, wherein iteratively updating the weight parameters of the cross-modal dynamic gating fusion network and the ray weighting coefficients of the three-dimensional spatial semantic alignment engine in S63 comprises the following steps: S631, constructing a chromosome population in which each chromosome encodes the weight parameters of the cross-modal dynamic gating fusion network and the ray weighting coefficients of the three-dimensional spatial semantic alignment engine; and S632, performing selection, crossover and mutation on the chromosomes according to the contrastive loss, iterating until the set maximum number of iterations is reached or the loss falls below a preset threshold, taking the best chromosome, and applying its corresponding gating-network weight parameters and ray weighting coefficients to the cross-modal dynamic gating fusion network and the three-dimensional spatial semantic alignment engine, respectively.
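Claim 9 describes a classical genetic algorithm over a real-valued chromosome concatenating the gating-network weights and the ray weighting coefficients. Below is a toy GA with truncation selection, single-point crossover and Gaussian mutation, stopping at a generation cap or a loss target; the operators, population size and hyperparameters are all assumptions, and the quadratic `loss_fn` stands in for the contrastive loss evaluated on logged feedback.

```python
import numpy as np

rng = np.random.default_rng(1)

def evolve(loss_fn, n_params, pop_size=30, generations=100,
           loss_target=1e-3, mut_sigma=0.05):
    """Toy GA over a real-valued chromosome (S631-S632)."""
    pop = rng.standard_normal((pop_size, n_params))
    for _ in range(generations):
        loss = np.array([loss_fn(c) for c in pop])
        if loss.min() < loss_target:                  # early stop on target
            break
        order = np.argsort(loss)
        parents = pop[order[:pop_size // 2]]          # truncation selection
        kids = []                                     # single-point crossover
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_params)
            kids.append(np.concatenate([a[:cut], b[cut:]]))
        pop = np.vstack([parents, kids])
        pop += rng.normal(0, mut_sigma, pop.shape)    # Gaussian mutation
    return pop[np.argmin([loss_fn(c) for c in pop])]  # best chromosome

best = evolve(lambda c: float(np.sum(c ** 2)), n_params=6)
```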
- 10. An intelligent voice recognition and interaction system fusing AI visual information, characterized in that it implements the intelligent voice recognition and interaction method fusing AI visual information of any one of claims 1-9, the system comprising: a multi-modal sensing front-end module, which integrates the microphone array and the depth camera and is responsible for synchronous acquisition and preprocessing of data to obtain the multi-channel audio stream and the RGB-D video stream; a dual-stream feature extraction module, which comprises parallel acoustic and visual feature extraction units and processes the multi-channel audio stream and the RGB-D video stream to generate the acoustic feature vector, the visual speech vector and the three-dimensional spatial interaction vector; a fusion reasoning engine module, which runs the cross-modal dynamic gating network and the three-dimensional spatial semantic alignment algorithm, processes the acoustic feature vector, the visual speech vector and the three-dimensional spatial interaction vector, and performs intention analysis and object locking to obtain the preliminary speech text sequence and the physical object ID; and an interaction control center module, which comprises state machine logic and an instruction dispatch interface and realizes the system's state transitions and device control based on the preliminary speech text sequence and the physical object ID.
Description
Intelligent voice recognition and interaction system and method integrating AI visual information

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to an intelligent voice recognition and interaction system and method integrating AI visual information.

Background

With the rapid development of artificial intelligence and the Internet of Things, intelligent human-computer interaction has become a core technology in fields such as smart homes, smart cockpits and medical care. The global intelligent voice market keeps growing at a high rate every year, and users' demands for natural interaction are becoming increasingly stringent. From early push-button control to today's voice control, human-machine interaction is evolving toward more intuitive and efficient forms.

Existing mainstream voice interaction schemes depend heavily on the audio modality alone. Their core flow is generally: a microphone array performs beamforming and sound source localization to capture the user's voice; the device is activated by a specific wake-up word and enters a listening state; the system then runs audio-only automatic speech recognition (ASR) on the speech segment to convert the acoustic signal into a text instruction. The camera, where used at all, only performs face detection and tracking to assist sound source localization and speaker discrimination, or the extracted lip visual features and audio features are merely concatenated or averaged after the fact. Such methods cannot achieve deep interaction and collaborative representation learning of audio-visual features at the encoder level, and lack a mechanism for adaptive weight adjustment under dynamic conditions such as real-time environmental noise and visual occlusion, so their robustness in complex scenes is limited.

The prior art therefore has the following problems: 1. poor noise immunity: recognition that relies on audio alone degrades rapidly in high-noise environments and is easily disturbed by background voices or television audio; 2. unresolved reference semantics: users habitually use natural language with demonstrative pronouns, such as "open that" or "turn that up a bit", and the prior art cannot understand the semantic information carried by spatial position, causing interaction failure; 3. unnatural interaction experience: mandatory wake-up words interrupt the user's natural flow of expression and easily cause false or failed wake-ups.

Disclosure of Invention

(I) Technical problems to be solved

Aiming at the problems in the related art, the invention provides an intelligent voice recognition and interaction system and method integrating AI visual information, so as to overcome the above technical problems in the prior art.
(II) Technical solution

In order to solve the above technical problems, the invention is realized by the following technical solution: S1, collecting sound field signals and light field signals in the environment, and aligning them through timestamps to obtain an audio stream and an RGB-D video stream; S2, extracting acoustic feature vectors from the audio stream of S1, and visual speech vectors and three-dimensional spatial interaction vectors from the RGB-D video stream; S3, mapping the acoustic feature vectors and visual speech vectors extracted in S2 into a latent semantic space through a cross-modal dynamic gating fusion network, dynamically distributing the fusion weights of the auditory and visual channels according to a signal-to-noise ratio index and a lip occlusion rate index, generating a multi-modal speech semantic representation, and decoding it to obtain a preliminary speech text sequence; S4, performing deictic-word trigger detection on the preliminary speech text sequence generated in S3; when a deictic word is detected, casting a sight ray and a gesture ray in the constructed local three-dimensional semantic map using the three-dimensional spatial interaction vectors extracted in S2, calculating the intersection probability and spatial distance score between each ray and the object bounding boxes in the scene, and resolving the physical object ID referred to by the deictic word; and S5, monitoring in real time the dwell time and lip movement state of the three-dimensional spatial interaction vectors of S2 relative to a preset interaction region, and, when an intention trigger is determined, combining the preliminary speech text sequence generated in S3 with the physical object ID resolved in S4 to generate the final device control instruction.

Preferably, step S1 includes the following steps: S11, configuring a six-microphone annular array, setting the audio sampling rate and bit depth, simultaneously configuring an RGB-D depth camera, and setting the video resolution, frame rate and depth measuring range, to obtain a multi-modal perception hardware environment;