CN-119673153-B - Voice interaction method and related equipment

CN119673153B

Abstract

An embodiment of the application provides a voice interaction method and related equipment. The method comprises: collecting user voice; inputting the user voice into a first voice model of an audio signal processor to obtain a preliminary screening confidence; when the preliminary screening confidence is greater than or equal to a preliminary screening confidence threshold, inputting the user voice into a second voice model of an application processor to obtain a target confidence; calculating, by the application processor, a sound source angle corresponding to the user voice and determining a target confidence threshold corresponding to that sound source angle; and when the target confidence is greater than or equal to the target confidence threshold, determining that the user voice is a voice command and executing voice interaction in response to the voice command. In this embodiment, the first voice model on the audio signal processor side preliminarily screens the user voice, which reduces system power consumption, while the second voice model on the application processor side accurately judges whether the user voice is a voice command.

Inventors

  • GAO FEI
  • WANG ZHICHAO
  • WU BIAO
  • CHANG WENLEI
  • GAO HUAN
  • XIA RISHENG

Assignees

  • Honor Device Co., Ltd. (荣耀终端股份有限公司)

Dates

Publication Date
2026-05-08
Application Date
2023-09-12

Claims (20)

  1. A voice interaction method applied to an electronic device, the method comprising: collecting audio data; inputting the audio data into a first voice model of an audio signal processor of the electronic device to obtain a preliminary screening confidence, wherein the preliminary screening confidence represents the accuracy of the first voice model's judgment that the audio data is close-range voice, the close-range voice being voice output by a sound source whose distance from the electronic device is less than or equal to a first preset distance; when the preliminary screening confidence is greater than or equal to a preliminary screening confidence threshold, inputting the audio data into a second voice model of an application processor of the electronic device to obtain a target confidence, wherein the target confidence represents the accuracy of the second voice model's judgment that the audio data is a voice instruction; calculating, by the application processor, a sound source angle corresponding to the audio data, determining the preset sound source angle range within which the sound source angle falls, and determining the target confidence threshold corresponding to that preset sound source angle range according to a correspondence between a plurality of preset sound source angle ranges and a plurality of target confidence thresholds; and when the target confidence is greater than or equal to the target confidence threshold, determining that the audio data is the voice instruction and executing voice interaction in response to the voice instruction.
  2. The voice interaction method according to claim 1, wherein the first voice model is generated by training with audio features corresponding to a plurality of voice instructions and a plurality of non-voice instructions as training data; the voice instructions, which are voice output by a sound source whose distance from the electronic device is less than or equal to the first preset distance, serve as positive samples for training the first voice model, and the non-voice instructions, which are voice output by a sound source whose distance from the electronic device is greater than or equal to a second preset distance, serve as negative samples for training the first voice model.
  3. The voice interaction method of claim 2, wherein the first voice model is a convolutional neural network model, and inputting the audio data into the first voice model of the audio signal processor of the electronic device to obtain the preliminary screening confidence comprises: preprocessing the audio data and extracting audio features of the preprocessed audio data; inputting the audio features into the first voice model and extracting convolution features of the audio features through a convolution layer of the first voice model; performing nonlinear transformation on the convolution features through an activation layer of the first voice model; pooling the convolution features through a pooling layer of the first voice model; and classifying the convolution features through a fully connected layer of the first voice model, determining the judgment result for the audio features, and calculating the preliminary screening confidence.
  4. The voice interaction method according to claim 1, wherein the second voice model includes a plurality of sub-voice models generated by training with audio features corresponding to a plurality of voice instructions and a plurality of non-voice instructions as training data; the voice instructions, which are voice output by a sound source whose distance from the electronic device is less than or equal to the first preset distance, serve as positive samples for training the second voice model, and the non-voice instructions, which are voice output by a sound source whose distance from the electronic device is greater than or equal to a second preset distance, serve as negative samples for training the second voice model.
  5. The voice interaction method of claim 4, wherein the plurality of sub-voice models of the second voice model include a convolutional neural network model and a long short-term memory (LSTM) network, and inputting the audio data into the second voice model of the application processor of the electronic device to obtain the target confidence comprises: calculating a corresponding confidence through the convolutional neural network model in the second voice model; calculating a corresponding confidence through the LSTM network in the second voice model; and calculating the target confidence from the confidences calculated by the convolutional neural network model and the LSTM network.
  6. The voice interaction method of claim 5, wherein the target confidence is an average, a weighted average, or a median of the confidences calculated by the plurality of sub-voice models.
  7. The voice interaction method according to claim 1, wherein the number of model parameters of the second voice model is greater than that of the first voice model, and/or the number of feature extraction layers of the second voice model is greater than that of the first voice model, and/or the number of feature samples for training the second voice model is greater than that for training the first voice model.
  8. The voice interaction method of claim 1, wherein calculating, by the application processor, the sound source angle corresponding to the audio data comprises: receiving two-channel audio data through the application processor and extracting audio features of the two-channel audio data; performing cross-correlation calculation on the two-channel audio data through a generalized cross-correlation with phase transform (GCC-PHAT) algorithm to obtain a cross-correlation function between the two channels; and normalizing the cross-correlation function, applying phase-transform weighting to the normalized cross-correlation function, converting the temporal correlation of the two-channel audio signals into frequency-domain phase information, and determining the angle of the sound source of the two-channel audio data relative to the microphones according to the frequency-domain phase information.
  9. The voice interaction method of claim 1, wherein the method further comprises: inputting the audio data into a third voice model of the application processor when the target confidence is greater than or equal to the target confidence threshold; and when the third voice model judges that the audio data is voice of a designated user, determining that the audio data is the voice instruction and executing voice interaction in response to the voice instruction.
  10. The voice interaction method of claim 9, wherein the method further comprises: acquiring a plurality of voice segments collected by a microphone of the electronic device and extracting voiceprint features of each voice segment; training a Gaussian mixture model corresponding to the voice of the designated user according to the voiceprint features of each voice segment; and determining the trained Gaussian mixture model as a text-independent voiceprint model corresponding to the voice of the designated user.
  11. The voice interaction method of claim 10, wherein inputting the audio data into the third voice model of the application processor comprises: extracting voiceprint features of the audio data; and inputting the voiceprint features of the audio data into the text-independent voiceprint model corresponding to the voice of the designated user to obtain a judgment result of whether the audio data is the voice of the designated user and the confidence of the judgment result.
  12. The voice interaction method of claim 1, wherein the method further comprises: performing first noise reduction processing on the audio data through the audio signal processor.
  13. The voice interaction method of claim 12, wherein performing, by the audio signal processor, the first noise reduction processing on the audio data comprises: acquiring the audio data respectively through two microphones of the electronic device, and performing, by the audio signal processor, noise reduction processing on the two-channel audio data collected by the two microphones through an acoustic echo cancellation algorithm.
  14. The voice interaction method of claim 13, wherein the audio signal processor performing noise reduction processing on the two-channel audio data collected by the two microphones through an acoustic echo cancellation algorithm comprises: the audio signal processor preprocessing the two-channel audio data and converting it into frequency-domain signals; performing adaptive noise reduction on the frequency-domain signals corresponding to the two-channel audio data using a regularized minimum mean square error filtering algorithm; adjusting the filter weights according to the difference between the frequency-domain signal and an expected output signal, the regularized minimum mean square error filtering algorithm completing noise reduction of the two-channel audio data when that difference is less than or equal to a preset difference value; smoothing the two-channel audio data using a Kalman filtering algorithm and establishing a state model and a measurement model of the two-channel audio data; predicting the state of the two-channel audio data at the current moment according to a state transition equation, the state model of the two-channel audio data acquired at the previous moment, and the measurement model; correcting the predicted state at the current moment according to the state of the two-channel audio data actually observed at the current moment; and predicting the state of the two-channel audio data at the next moment according to the state transition equation, the state model of the two-channel audio data at the predicted current moment, and the measurement model, until the difference between two adjacent state estimates is less than or equal to a preset state estimate value and the covariance matrix is less than or equal to a preset covariance matrix, thereby completing noise reduction of the two-channel audio data.
  15. The voice interaction method of claim 1, wherein the method further comprises: performing second noise reduction processing on the audio data through the application processor.
  16. The voice interaction method of claim 15, wherein performing, by the application processor, the second noise reduction processing on the audio data comprises: preprocessing the audio data by the application processor to obtain a plurality of segments of sub-audio data; extracting audio features of each segment of sub-audio data and identifying, through the application processor, whether each segment of sub-audio data is clean voice data; if a segment of sub-audio data is determined to be clean voice data, outputting that segment; or, if a segment of sub-audio data is determined not to be clean voice data, subtracting the clean voice data from that segment to obtain the difference between the segment and the clean voice data, and filtering out the noise in the segment according to that difference, thereby completing the second noise reduction processing of the audio data.
  17. The voice interaction method of claim 1, wherein the method further comprises: judging, through the audio signal processor, whether an audio parameter of the audio data is greater than or equal to a preset audio parameter threshold; and if the audio parameter of the audio data is greater than or equal to the preset audio parameter threshold, inputting the audio data into the first voice model to obtain the preliminary screening confidence.
  18. The voice interaction method of claim 17, wherein the audio parameter is at least one of the energy, amplitude, and intensity of the audio data.
  19. The voice interaction method of claim 1, wherein the method further comprises: judging, through the audio signal processor, whether an audio parameter of the audio data is greater than or equal to a preset audio parameter threshold, and judging whether a first pose parameter of the electronic device is greater than or equal to a first preset pose parameter threshold; and if the audio parameter of the audio data is greater than or equal to the preset audio parameter threshold and the first pose parameter of the electronic device is greater than or equal to the first preset pose parameter threshold, inputting the audio data into the first voice model of the audio signal processor to obtain the preliminary screening confidence.
  20. The voice interaction method of claim 19, wherein the first pose parameter is the variance of the triaxial acceleration of the electronic device, and the first preset pose parameter threshold comprises a preset triaxial acceleration threshold.
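The angle estimation in claim 8 can be illustrated with a minimal GCC-PHAT sketch in Python. This is not the patent's implementation: the function name, sampling rate, microphone spacing, and the far-field delay model (tau = d·sin(theta)/c) are assumptions chosen for the example.

```python
import numpy as np

def gcc_phat(sig, ref, fs, mic_distance, c=343.0):
    """Estimate the delay between two microphone channels with GCC-PHAT
    and convert it to a sound-source angle (degrees, far-field model)."""
    n = sig.shape[0] + ref.shape[0]
    # Cross-power spectrum of the two channels
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # Phase transform: discard magnitude, keep only phase information
    R /= np.abs(R) + 1e-15
    cc = np.fft.irfft(R, n=n)
    # Re-center so index 0 of the search window corresponds to -max_shift
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    # Far-field geometry: tau = mic_distance * sin(theta) / c
    sin_theta = np.clip(tau * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```

For a noise burst delayed by two samples between two microphones 10 cm apart at 16 kHz, the function recovers an angle of roughly 25 degrees, matching the geometric model.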

Description

Voice interaction method and related equipment

Technical Field

The application relates to the technical field of terminals, belongs to voice processing technology, and particularly relates to a voice interaction method and related equipment.

Background

With the development of terminal technology, electronic devices such as smart phones, personal computers, and smart speakers have a voice assistant function, so that a user can perform voice interaction with an electronic device to control it to execute specified operations, such as playing music, making a call, or setting an alarm clock or to-do item. When the user interacts with the electronic device by voice, the electronic device collects the user voice and recognizes, through a voice recognition model, whether it is a voice command; if so, the corresponding operation is executed in response to the voice command. The voice recognition model is generally deployed in an audio signal processor (ADSP) of the electronic device. Because the audio signal processor has a smaller memory and limited processing capacity, the network structure of the voice recognition model is simpler and its iteration capability is weaker, so its voice recognition accuracy is lower. As a result, a voice instruction may fail to be recognized, or user voice may be mistakenly recognized as a voice instruction, and the electronic device cannot respond correctly to the user instruction to perform voice interaction.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a voice interaction method and related device that solve the problem that the voice recognition accuracy of a voice recognition model deployed in the audio signal processor is low, so that the electronic device cannot respond to user commands correctly to perform voice interaction.
The voice interaction method comprises: collecting audio data; inputting the audio data into a first voice model of an audio signal processor of the electronic device to obtain a preliminary screening confidence, which represents the accuracy of the first voice model's judgment that the audio data is close-range voice; when the preliminary screening confidence is greater than or equal to a preliminary screening confidence threshold, inputting the audio data into a second voice model of an application processor of the electronic device to obtain a target confidence, which represents the accuracy of the second voice model's judgment that the audio data is a voice command; calculating, by the application processor, a sound source angle corresponding to the audio data and determining the target confidence threshold corresponding to that sound source angle; and when the target confidence is greater than or equal to the target confidence threshold, determining that the audio data is the voice command and executing voice interaction in response to it. Through this technical scheme, collected audio data is preliminarily judged by the first voice model, which screens out audio data that obviously does not belong to close-range voice, so the second voice model does not need to process such data; moreover, because the first voice model is simpler and runs on the audio signal processor side, the system power consumption of the voice recognition process is lower.
After the first voice model preliminarily judges that the audio data is close-range voice, the audio data is sent to the second voice model on the application processor side. Since the application processor has a larger memory and stronger computing power, the second voice model is more complex than the first voice model and its judgment result is more accurate, so whether the audio data is a voice instruction can be accurately determined. In one possible implementation, the first voice model is generated by training with the audio features corresponding to a plurality of voice instructions and a plurality of non-voice instructions as training data: the voice instructions, which are voice output by a sound source whose distance from the electronic device is less than or equal to the first preset distance, serve as positive samples for training the first voice model, and the non-voice instructions, which are voice output by a sound source whose distance from the electronic device is greater than or equal to the second preset distance, serve as negative samples. Through this technical scheme, the audio features corresponding to the voice instructions are used as positive samples, the audio features corresponding to the non-voice instructions are used as negative samples, the first voice model is trained, and
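The two-stage screening described above can be sketched as a small gating function. This is a minimal illustration, not the patent's implementation: the threshold values, angle ranges, and all callable names are invented for the example, and the patent leaves the concrete values to the implementation.

```python
# Illustrative correspondence between preset sound source angle ranges
# and target confidence thresholds (values are assumptions).
ANGLE_THRESHOLDS = [
    ((0.0, 60.0), 0.80),     # frontal sources: stricter threshold
    ((60.0, 120.0), 0.70),
    ((120.0, 180.0), 0.60),  # lateral/rear sources: looser threshold
]
PRESCREEN_THRESHOLD = 0.5    # assumed preliminary screening threshold

def is_voice_command(audio, small_model, large_model, estimate_angle):
    """Two-stage screening: a lightweight model (audio signal processor
    side) gates a larger model (application processor side)."""
    # Stage 1 (ADSP side): cheap close-range-speech check
    prescreen_conf = small_model(audio)
    if prescreen_conf < PRESCREEN_THRESHOLD:
        return False  # obviously not close-range speech; AP stays idle
    # Stage 2 (AP side): accurate classification, angle-dependent threshold
    target_conf = large_model(audio)
    angle = estimate_angle(audio)
    for (lo, hi), threshold in ANGLE_THRESHOLDS:
        if lo <= angle < hi:
            return target_conf >= threshold
    return False
```

The design point the patent emphasizes is that most audio never reaches stage 2, so the expensive model (and the application processor) runs only on candidates that already look like close-range speech.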