CN-122024701-A - Intelligent voice interaction method and device

CN122024701A

Abstract

The intelligent voice interaction method comprises the following steps: an audio input module collects an original audio signal and converts it into a digital audio signal; a wake-up module performs feature extraction and classification on the digital audio signal; a main control module performs keyword detection on the extracted features according to a wake-up word detection model; an audio encoding and decoding module converts the original audio signal into a digital audio signal, and the main control module sends the digital audio signal to a cloud server to generate reply text and digital audio data; the main control module sends the reply text to an interaction module for display, converts the digital audio data into an analog audio signal, and sends the analog audio signal to an audio power amplification module; the audio power amplification module converts the analog audio signal into sound to complete audio playing; the main control module then re-enables keyword detection by the wake-up word detection model to realize continuous voice conversation. The design requires no manual operation and can realize low-power-consumption operation.

Inventors

  • WANG LIN
  • LIU HENGYI

Assignees

  • Wuhan Business University (武汉商学院)

Dates

Publication Date
2026-05-12
Application Date
2025-12-23

Claims (10)

  1. An intelligent voice interaction method, characterized by comprising the following steps: collecting an original audio signal through an audio input module (2) and converting it into a digital audio signal; performing, by a wake-up module (4), feature extraction and classification on the digital audio signal; detecting, by a main control module (9), wake-up words according to a wake-up word detection model, the intelligent voice interaction device remaining in low-power-consumption standby if no wake-up word is detected; acquiring, by an audio encoding and decoding module (3), the original audio signal of the audio input module (2) in real time according to the wake-up signal and converting it into a digital audio signal, the main control module (9) sending the digital audio signal to a cloud server (1) to generate reply text and digital audio data; sending, by the main control module (9), the reply text to an interaction module (6) for display, the audio encoding and decoding module (3) converting the digital audio data into an analog audio signal and sending it to an audio power amplification module (5), which converts the analog audio signal into sound to complete audio playing; and re-enabling, by the main control module (9), the wake-up word detection model to detect wake-up words, the above steps being repeated to realize continuous voice dialogue.
  2. The method of claim 1, wherein, before the feature extraction and classification, the digital audio signal is preprocessed and the continuous digital audio signal is segmented into frame-by-frame audio signals, after which a window function is applied to window each frame of audio signal.
  3. The intelligent voice interaction method of claim 2, wherein the preprocessing of the digital audio signal comprises: analyzing the audio characteristics of the digital audio signal in non-speech sections, estimating the spectral characteristics of the background noise, calculating the power-spectrum ratio of speech to background noise, designing an optimal filter, and removing the background noise through the optimal filter; delaying and filtering the digital audio signal to match the transmission path of the echo received by the audio input module (2), updating the filter coefficients so that the filter output approaches the echo signal, and subtracting the predicted echo from the digital audio signal; and boosting the energy of the high-frequency part of the digital audio signal through a high-pass filter to enhance its high-frequency resolution.
  4. The intelligent voice interaction method of claim 2, wherein the feature extraction and classification of the digital audio signal comprises: performing a short-time Fourier transform on each windowed frame of audio signal to convert the time-domain signal into the frequency domain and obtain the spectrum of each frame; uniformly distributing a group of filters on the Mel frequency scale and converting them to the linear frequency domain to obtain the corresponding Mel filter bank; passing the spectrum of each frame through the Mel filter bank, calculating the output value of each filter, and taking the logarithm of each output to obtain the log-Mel energies; and performing a discrete cosine transform on the log-Mel energies to convert them from the frequency domain to the cepstral domain and extract the main spectral features.
  5. The intelligent voice interaction method of claim 1, wherein the wake-up word detection on the extracted features comprises: normalizing the extracted spectral features and feeding them to the input layer of a neural network; performing convolution operations on the input features in the convolutional layers of the network to extract local features; applying an activation function to the output of the convolutional layers; flattening the feature vector and feeding it into a fully connected layer for global information integration; converting the output of the fully connected layer into a probability distribution with a function at the output layer; and making a classification decision according to a set probability threshold, the input being judged a wake-up word if the probability of the wake-up-word class exceeds the threshold, and a non-wake-up word otherwise.
  6. The intelligent voice interaction method of claim 1, wherein the main control module (9) sending the digital audio signal to the cloud server to generate the reply text and digital audio data comprises the following steps: the main control module (9) starts recording and continuously reads the digital audio signal; when continuous multi-frame silence is detected or the maximum recording duration is reached, recording ends and the recorded audio data is sent to the cloud server (1); and the cloud server (1) converts the audio data into text, inputs the text into an AI model to generate reply text and digital audio data, and sends both to the main control module (9).
  7. The method of claim 6, wherein continuously reading the digital audio signal until continuous multi-frame silence is detected or the maximum recording duration is reached comprises: detecting each frame of the digital audio signal in turn, moving to the next frame when no data is detected in the current frame, and starting recording once data is detected, sampling each frame and counting frames at the same time; detecting whether the sample values of each frame fall below a set threshold, a frame below the threshold representing silence, counting such frames, and ending recording when silence is detected over consecutive frames and the count reaches the set number of silence frames; and ending recording when the frame count reaches the set maximum recording duration.
  8. The method of claim 7, wherein inputting the text into the AI model to generate reply text and digital audio data comprises: passing the text to the AI model through an API; the AI model converting the input text into a token sequence, mapping it into high-dimensional vectors, and capturing long-range dependencies and semantic associations in the input sequence; generating an output sequence step by step through autoregressive decoding, selecting at each step the next token with the highest probability given the current context and the tokens generated so far; restoring the generated token sequence into natural-language reply text through a decoder; and calling a TTS service, combined with emotion parameters, to synthesize anthropomorphic digital audio data.
  9. An intelligent voice interaction device, characterized by being applied to the intelligent voice interaction method of claim 1 and comprising a main control module (9), an audio input module (2), an audio encoding and decoding module (3), a wake-up module (4) and an audio power amplification module (5), wherein the output of the audio input module (2) is connected to the inputs of the audio encoding and decoding module (3) and of the wake-up module (4), the output of the audio encoding and decoding module (3) is connected to the input of the audio power amplification module (5), the output of the wake-up module (4) is connected to the main control module (9), the main control module (9) is connected to the audio encoding and decoding module (3) and the audio power amplification module (5), and the main control module (9) is connected to a cloud server (1) through a wireless module (8); the audio input module (2) is used for acquiring audio signals; the audio encoding and decoding module (3) is used for converting the acquired audio signals into digital signals and sending them to the main control module (9) for processing, and for converting the digital audio signals processed by the main control module (9) into analog signals and sending them to the audio power amplification module (5); the wake-up module (4) is used for detecting wake-up words in the audio signal and signaling the main control module (9); and the main control module (9) is used for converting the digital signals into text, transmitting the text to the cloud server (1) to generate reply text, converting the reply text into digital audio signals, switching from the standby state to the working state to start interaction according to the signal sent by the wake-up module (4), and switching back from the working state to the standby state after the interaction ends.
  10. The intelligent voice interaction device of claim 9, characterized in that the audio input module (2) comprises a microphone (21) and an electret microphone (22), the microphone (21) being connected to the input of the audio encoding and decoding module (3) and the electret microphone (22) being connected to the input of the wake-up module (4).
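The last preprocessing step of claim 3 (boosting high-frequency energy with a high-pass filter) is conventionally done with a first-order pre-emphasis filter. The sketch below assumes the common coefficient 0.97; the patent does not specify a filter structure or coefficient.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1],
    boosting high-frequency energy as in claim 3 (alpha is a common
    choice, not specified in the patent)."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

x = np.ones(4)              # constant (pure DC) input
print(pre_emphasis(x))      # DC content is strongly attenuated after the first sample
```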
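The feature-extraction pipeline of claim 4 (short-time Fourier transform, Mel filter bank, logarithm, discrete cosine transform) can be sketched in NumPy as follows. The frame length, filter count, coefficient count, and sample rate are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced uniformly on the Mel scale, then
    mapped back to linear-frequency FFT bins (claim 4, step 2)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fb, n_coeffs=13):
    """One windowed frame -> cepstral coefficients (claim 4's four steps)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum of the frame
    log_mel = np.log(fb @ spectrum + 1e-10)           # log-Mel energies
    n = len(log_mel)
    # DCT-II converts the log-Mel energies to the cepstral domain
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                 * np.arange(n_coeffs)[:, None])
    return dct @ log_mel

sr, n_fft = 16000, 512
fb = mel_filterbank(26, n_fft, sr)
frame = np.hamming(n_fft) * np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr)
coeffs = mfcc_frame(frame, fb)
print(coeffs.shape)   # (13,)
```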
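The classification decision of claim 5 reduces to converting the output-layer values into a probability distribution and comparing the wake-up-word class against a threshold. This sketch assumes a softmax output function and an arbitrary threshold of 0.8; the patent names neither.

```python
import numpy as np

WAKE_THRESHOLD = 0.8   # illustrative probability threshold, not from the patent

def softmax(logits):
    """Convert output-layer values into a probability distribution."""
    e = np.exp(logits - np.max(logits))   # shift for numerical stability
    return e / e.sum()

def is_wake_word(output_layer_logits, wake_class=0):
    """Claim 5's decision rule: wake word iff the wake-up-word class
    probability exceeds the set threshold."""
    probs = softmax(np.asarray(output_layer_logits, dtype=float))
    return bool(probs[wake_class] > WAKE_THRESHOLD)

print(is_wake_word([4.0, 0.5, 0.2]))   # dominant wake-word logit -> True
print(is_wake_word([1.0, 0.9, 0.8]))   # ambiguous logits -> False
```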
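The recording-endpoint logic of claim 7 (start on first frame with data, stop on enough consecutive silent frames or at the maximum duration) can be sketched as below. The amplitude threshold, silence-frame count, and frame cap are illustrative, not values from the patent.

```python
import numpy as np

SILENCE_THRESHOLD = 500   # illustrative amplitude threshold (16-bit samples)
SILENCE_FRAMES = 20       # consecutive silent frames that end recording
MAX_FRAMES = 500          # cap corresponding to the maximum recording duration

def record_until_silence(frames):
    """Claim 7: skip leading empty frames, record from the first frame
    containing data, stop after SILENCE_FRAMES consecutive silent
    frames or once MAX_FRAMES have been recorded."""
    recorded, silent_run, started = [], 0, False
    for frame in frames:
        samples = np.asarray(frame)
        if not started:
            if np.max(np.abs(samples)) > SILENCE_THRESHOLD:
                started = True       # data detected: start recording
            else:
                continue             # keep waiting for data
        recorded.append(samples)
        if np.max(np.abs(samples)) < SILENCE_THRESHOLD:
            silent_run += 1          # count consecutive silent frames
            if silent_run >= SILENCE_FRAMES:
                break                # continuous multi-frame silence
        else:
            silent_run = 0
        if len(recorded) >= MAX_FRAMES:
            break                    # maximum recording duration reached
    return recorded

# Toy stream: 5 silent frames, 10 loud frames, then a long silence.
loud, quiet = np.full(160, 3000), np.zeros(160)
stream = [quiet] * 5 + [loud] * 10 + [quiet] * 100
rec = record_until_silence(stream)
print(len(rec))   # 10 loud frames + 20 closing silent frames = 30
```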
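The autoregressive decoding of claim 8, where each step selects the next token with the highest probability given the context so far, amounts to greedy decoding. The sketch below uses a toy stand-in for the AI model's next-token distribution; the model, token set, and end-of-sequence token are all illustrative.

```python
import numpy as np

def greedy_decode(next_token_probs, prompt, eos, max_len=20):
    """Claim 8's autoregressive loop: at each step pick the
    highest-probability next token given the sequence so far,
    stopping at the end-of-sequence token or a length cap."""
    seq = list(prompt)
    while len(seq) < max_len:
        probs = next_token_probs(seq)   # model's distribution over next tokens
        tok = int(np.argmax(probs))     # highest-probability next token
        seq.append(tok)
        if tok == eos:
            break
    return seq

def toy_model(context):
    """Toy 'model': strongly prefers token (last + 1) mod 5."""
    probs = np.full(5, 0.05)
    probs[(context[-1] + 1) % 5] = 0.8
    return probs

print(greedy_decode(toy_model, prompt=[0], eos=4))   # [0, 1, 2, 3, 4]
```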

Description

Intelligent voice interaction method and device

Technical Field

The invention relates to the technical field of intelligent voice control, and in particular to an intelligent voice interaction method and device.

Background

As voice interaction and intelligent control technologies develop, devices that have a voice recognition function and can perform operations according to spoken input are becoming more common. To realize real-time voice interaction and recognition, the voice interaction function must stay active for long periods, and its relatively high power consumption leads to high unnecessary system energy use. If instead a voice recognition switch is turned on manually before voice interaction, spoken input is only executed after the switch is activated, so a manual operation is needed before every interaction; fully automatic voice interaction cannot be achieved, which reduces the convenience of intelligent voice interaction.

Disclosure of Invention

The invention aims to overcome the defects of high power consumption and poor convenience in the prior art and to provide an intelligent voice interaction method and device with lower power consumption and better convenience.
In order to achieve the above purpose, the technical solution of the present invention is an intelligent voice interaction method comprising the following steps: collecting an original audio signal through an audio input module and converting it into a digital audio signal; the wake-up module performing feature extraction and classification on the digital audio signal; the main control module detecting the wake-up word according to the extracted features, the intelligent voice interaction device remaining in low-power-consumption standby if no wake-up word is detected; the audio encoding and decoding module acquiring the original audio signal of the audio input module in real time according to the wake-up signal and converting it into a digital audio signal, and the main control module sending the digital audio signal to the cloud server to generate reply text and digital audio data; the main control module sending the reply text to the interaction module for display, the audio encoding and decoding module converting the digital audio data into an analog audio signal and sending it to the audio power amplification module, which converts it into sound to complete audio playing; and the main control module re-enabling the wake-up word detection model to detect wake-up words, the steps being repeated to realize continuous voice dialogue. Before feature extraction and classification, the digital audio signal is preprocessed: the continuous digital audio signal is segmented into frame-by-frame audio signals, and a window function is then applied to window each frame.
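The framing and windowing step described above can be sketched as follows; a minimal NumPy version assuming a Hamming window, a 25 ms frame, and a 10 ms hop at 16 kHz, none of which are specified in the patent.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Segment a continuous digital audio signal into overlapping
    frames and apply a window function to each frame (Hamming here;
    the patent does not name a specific window)."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])

sig = np.random.default_rng(0).standard_normal(16000)  # 1 s of audio at 16 kHz
frames = frame_and_window(sig)
print(frames.shape)   # (98, 400)
```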
The preprocessing of the digital audio signal comprises: analyzing the audio characteristics of the digital audio signal in non-speech sections, estimating the spectral characteristics of the background noise, calculating the power-spectrum ratio of speech to background noise, designing an optimal filter, and removing the background noise through the optimal filter; delaying and filtering the digital audio signal to match the transmission path of the echo received by the audio input module, updating the filter coefficients so that the filter output approaches the echo signal, and subtracting the predicted echo from the digital audio signal; and boosting the energy of the high-frequency part of the digital audio signal through a high-pass filter to enhance its high-frequency resolution. The feature extraction and classification of the digital audio signal comprises: performing a short-time Fourier transform on each windowed frame to convert the time-domain signal into the frequency domain and obtain the spectrum of each frame; uniformly distributing a group of filters on the Mel frequency scale and converting them to the linear frequency domain to obtain the corresponding Mel filter bank; passing the spectrum of each frame through the Mel filter bank, calculating the output of each filter, and taking the logarithm of each output to obtain the log-Mel energies; and performing a discrete cosine transform on the log-Mel energies to convert them from the frequency domain to the cepstral domain and extract the main spectral features. The wake-up word detection of the extracted features comprises the following steps: Normalizing the extrac