CN-122024730-A - Voice recognition method and system of Bluetooth headset

Abstract

The invention relates to the technical field of voice recognition and provides a voice recognition method and system for a Bluetooth headset. When the similarity is greater than the energy decision threshold, subsequent environmental audio data is sent to the mobile terminal, starting from the short-time audio frame whose similarity exceeds the threshold, and the mobile terminal recognizes the voice instruction corresponding to that audio data. Compared with traditional time-domain analysis, discrimination based on spectral features is more robust to noise, so the Bluetooth headset can effectively recognize the user's wake-up instruction even in a noisy environment.

Inventors

WEI XINWEI
ZHANG ZUODONG
Lin Tingchi

Assignees

深圳市高为通信技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260413

Claims (10)

1. A voice recognition method for a Bluetooth headset, characterized by comprising the following steps: the Bluetooth headset acquires real-time environmental audio data and segments it into a plurality of short-time audio frames; when the audio energy of a short-time audio frame is greater than a preset energy, calculating an energy decision threshold and the similarity between the spectral profile of the short-time audio frame and the spectral profile of the wake-up word; and when the similarity is greater than the energy decision threshold, sending subsequent environmental audio data to a mobile terminal, starting from the short-time audio frame whose similarity is greater than the energy decision threshold, and recognizing, by the mobile terminal, the voice instruction corresponding to the subsequent environmental audio data.
2. The voice recognition method for a Bluetooth headset according to claim 1, wherein, after the step of sending subsequent environmental audio data to the mobile terminal, starting from the short-time audio frame whose similarity is greater than the energy decision threshold, and recognizing, by the mobile terminal, the voice instruction corresponding to the subsequent environmental audio data, the method further comprises: after the mobile terminal recognizes the voice instruction, the mobile terminal sends an end instruction to the Bluetooth headset; and after receiving the end instruction, the Bluetooth headset switches to a low-power monitoring mode, in which a monitoring task is executed at a preset power consumption.
3. The voice recognition method for a Bluetooth headset according to claim 1, wherein the step of calculating, when the audio energy of the short-time audio frame is greater than the preset energy, the energy decision threshold and the similarity between the spectral profile of the short-time audio frame and the spectral profile of the wake-up word comprises: calculating the spectral entropy of the short-time audio frame, wherein the spectral entropy measures how uniformly energy is distributed across the frequency spectrum; when the spectral entropy is smaller than a preset value and the audio energy of the short-time audio frame is greater than the preset energy, calculating the spectral profile of the short-time audio frame; and calculating the energy decision threshold based on the spectral entropy, wherein energy decision threshold = basic energy threshold × (1 + α × (spectral entropy value - spectral entropy reference value)), α is an adjustment factor, and the spectral entropy reference value is an estimate of the spectral entropy of experimental background noise.
4. The voice recognition method for a Bluetooth headset according to claim 3, wherein the step of calculating the spectral entropy of the short-time audio frame comprises: acquiring a frequency-domain energy spectrum of the short-time audio frame; dividing the frequency-domain energy spectrum into M sub-bands, wherein the M sub-bands comprise 0 Hz to 500 Hz, 500 Hz to 1500 Hz, 1500 Hz to 2500 Hz, and 2500 Hz to 4000 Hz; calculating the energy ratio of each sub-band; and calculating, from the energy ratios of the sub-bands, the relative difference between the maximum value and the average value, and taking the relative difference as the spectral entropy of the short-time audio frame.
5. The voice recognition method for a Bluetooth headset according to claim 1, wherein the step of sending subsequent environmental audio data to the mobile terminal when the similarity is greater than the energy decision threshold, starting from the short-time audio frame whose similarity is greater than the energy decision threshold, and recognizing, by the mobile terminal, the voice instruction corresponding to the subsequent environmental audio data comprises: when the similarity is greater than the energy decision threshold, sending subsequent environmental audio data to the mobile terminal, starting from the short-time audio frame whose similarity is greater than the energy decision threshold; the mobile terminal receives the input voice signal and inputs it into a first Transformer encoder in an acoustic model to obtain a first acoustic feature sequence; inputting the first acoustic feature sequence into an intermediate predictor to obtain a first intermediate text hypothesis sequence, wherein the intermediate predictor consists of a linear projection layer and a Softmax layer; inputting the first intermediate text hypothesis sequence into a dynamic language model to obtain a corresponding first text semantic vector, wherein the dynamic language model consists of a basic language model and an adapter network, the basic language model has a Transformer decoder architecture, and the adapter network is inserted into at least one Transformer layer of the basic language model by means of a residual connection; generating a first language context vector from the first acoustic feature sequence and the first text semantic vector; performing weighted fusion of the first acoustic feature sequence and the first language context vector through a language context gate to obtain a first fused feature sequence; inputting the first fused feature sequence into a subsequent Transformer encoder to obtain a second acoustic feature sequence, wherein the subsequent Transformer encoder is positioned after the first Transformer encoder; iterating the processing used to obtain the second acoustic feature sequence through the N encoder layers of the acoustic model to obtain a final acoustic feature sequence; and inputting the final acoustic feature sequence into a time-sequence classification decoder to obtain the voice instruction.
6. The voice recognition method for a Bluetooth headset according to claim 5, wherein generating a first language context vector from the first acoustic feature sequence and the first text semantic vector comprises: linearly projecting the first acoustic feature sequence to obtain an acoustic query vector; concatenating the acoustic query vector with the first text semantic vector; and inputting the concatenated vector into the dynamic language model, which computes and outputs the first language context vector through its internal attention mechanism.
7. The voice recognition method for a Bluetooth headset according to claim 5, wherein the step of performing weighted fusion of the first acoustic feature sequence and the first language context vector through a language context gate to obtain a first fused feature sequence comprises: concatenating the first acoustic feature sequence with the first language context vector; inputting the concatenated vector into a fully connected layer and applying a Sigmoid activation function to obtain gating weights; and using the gating weights as the weights of the language context gate to perform weighted fusion of the first acoustic feature sequence and the first language context vector, obtaining the first fused feature sequence.
8. A voice recognition system for a Bluetooth headset, the system comprising: an acquisition unit, configured to acquire real-time environmental audio data through the Bluetooth headset and segment it into a plurality of short-time audio frames; a computing unit, configured to calculate, when the audio energy of a short-time audio frame is greater than a preset energy, an energy decision threshold and the similarity between the spectral profile of the short-time audio frame and the spectral profile of the wake-up word; and a communication unit, configured to send, when the similarity is greater than the energy decision threshold, subsequent environmental audio data to a mobile terminal, starting from the short-time audio frame whose similarity is greater than the energy decision threshold, so that the mobile terminal recognizes the voice instruction corresponding to the subsequent environmental audio data.
9. A terminal device, characterized in that it comprises a memory, a processor, and a voice recognition program for a Bluetooth headset that is stored in the memory and executable on the processor, the voice recognition program being configured to implement the steps of the voice recognition method for a Bluetooth headset according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the voice recognition method for a Bluetooth headset according to any one of claims 1 to 7.
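
The sub-band measure of claim 4 and the threshold formula of claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several points are assumptions made only for illustration: the "relative difference" is taken as (maximum - mean) / mean, the energy spectrum comes from a naive DFT, each sub-band is treated as a half-open interval, and all function and parameter names are hypothetical.

```python
import cmath
import math

# Sub-band edges in Hz, as listed in claim 4.
SUB_BANDS_HZ = [(0, 500), (500, 1500), (1500, 2500), (2500, 4000)]


def energy_spectrum(frame, sample_rate):
    """Return (frequency_hz, energy) pairs up to Nyquist using a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        spectrum.append((k * sample_rate / n, abs(coeff) ** 2))
    return spectrum


def sub_band_energy_ratios(frame, sample_rate):
    """Share of the total energy that falls in each sub-band."""
    spectrum = energy_spectrum(frame, sample_rate)
    # Assumption: bands are half-open [lo, hi), so an edge bin is counted once.
    band_energy = [
        sum(e for f, e in spectrum if lo <= f < hi) for lo, hi in SUB_BANDS_HZ
    ]
    total = sum(band_energy)
    if total == 0.0:
        return [0.0] * len(SUB_BANDS_HZ)
    return [e / total for e in band_energy]


def spectral_entropy_measure(ratios):
    """Relative difference between the largest and the mean energy ratio.

    Assumption: "relative" is read as normalisation by the mean.
    """
    mean = sum(ratios) / len(ratios)
    if mean == 0.0:
        return 0.0
    return (max(ratios) - mean) / mean


def energy_decision_threshold(base_threshold, alpha, entropy, entropy_reference):
    """threshold = base * (1 + alpha * (entropy - reference)), per claim 3."""
    return base_threshold * (1.0 + alpha * (entropy - entropy_reference))


# Usage: a 1000 Hz tone sampled at an assumed 8000 Hz, 64 samples per frame.
tone = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(64)]
ratios = sub_band_energy_ratios(tone, 8000)
measure = spectral_entropy_measure(ratios)
threshold = energy_decision_threshold(0.5, 0.1, measure, 1.0)
```

With these illustrative values, nearly all of the tone's energy falls in the 500 Hz to 1500 Hz band, so the measure is close to 3 and the threshold close to 0.6. The base threshold 0.5, adjustment factor 0.1, and reference value 1.0 are placeholders and do not come from the source.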

Description

Voice recognition method and system of Bluetooth headset

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to a voice recognition method and system for a Bluetooth headset.

Background

With the development of wireless technology, Bluetooth headsets have come into wide use in daily life as portable audio devices. Users' demands on Bluetooth headset functionality keep increasing, especially for voice recognition and human-machine interaction. Traditional voice recognition technology relies mainly on high-performance computing equipment, and its recognition accuracy and real-time performance suffer when environmental noise is high.

When performing voice recognition, existing Bluetooth headsets generally use a fixed audio input mode and lack real-time analysis of, and adaptation to, environmental audio. As a result, the user's voice instructions may not be accurately recognized in noisy environments, which degrades the user experience.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a voice recognition method and system for a Bluetooth headset, to solve the technical problem that existing Bluetooth headsets generally use a fixed audio input mode when performing voice recognition and lack real-time analysis of, and adaptation to, environmental audio.

A first aspect of an embodiment of the present invention provides a voice recognition method for a Bluetooth headset, the method including: the Bluetooth headset acquires real-time environmental audio data and segments it into a plurality of short-time audio frames; when the audio energy of a short-time audio frame is greater than a preset energy, calculating an energy decision threshold and the similarity between the spectral profile of the short-time audio frame and the spectral profile of the wake-up word; and when the similarity is greater than the energy decision threshold, sending subsequent environmental audio data to a mobile terminal, starting from the short-time audio frame whose similarity is greater than the energy decision threshold, and recognizing, by the mobile terminal, the voice instruction corresponding to the subsequent environmental audio data.

Further, after the step of sending subsequent environmental audio data to the mobile terminal, starting from the short-time audio frame whose similarity is greater than the energy decision threshold, and recognizing, by the mobile terminal, the voice instruction corresponding to the subsequent environmental audio data, the method further includes: after the mobile terminal recognizes the voice instruction, the mobile terminal sends an end instruction to the Bluetooth headset; and after receiving the end instruction, the Bluetooth headset switches to a low-power monitoring mode, in which a monitoring task is executed at a preset power consumption.

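
The headset-side flow of the first aspect (framing, energy gate, similarity test, then forwarding audio from the triggering frame) can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the method as implemented by the applicant: the source does not name the similarity measure, so cosine similarity is used here as a stand-in, and `profile_fn` and `threshold_fn` are hypothetical callables that stand for the spectral-profile extraction and the energy decision threshold computation.

```python
import math


def split_frames(samples, frame_len):
    """Split samples into consecutive non-overlapping frames (tail dropped)."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude of one short-time frame."""
    return sum(x * x for x in frame) / len(frame)


def cosine_similarity(a, b):
    """Assumed similarity measure; the source does not specify one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_wake_frame(frames, profile_fn, wake_profile, preset_energy, threshold_fn):
    """Index of the first frame whose similarity exceeds the threshold, or None."""
    for index, frame in enumerate(frames):
        if frame_energy(frame) <= preset_energy:
            continue  # low-energy frame: skip the similarity computation
        similarity = cosine_similarity(profile_fn(frame), wake_profile)
        if similarity > threshold_fn(frame):
            return index
    return None


def audio_to_forward(samples, frame_len, wake_index):
    """Audio sent to the mobile terminal, starting at the triggering frame."""
    return samples[wake_index * frame_len:]


# Usage with toy data; the profile and threshold functions are placeholders.
samples = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 5, 5, 5, 5]
frames = split_frames(samples, 4)
wake_index = first_wake_frame(
    frames,
    profile_fn=lambda frame: [abs(x) for x in frame],
    wake_profile=[0, 1, 0, 1],
    preset_energy=0.1,
    threshold_fn=lambda frame: 0.9,
)
forwarded = audio_to_forward(samples, 4, wake_index)
```

In this toy run the first frame is skipped by the energy gate, the second passes the gate but does not match the wake-word profile, and the third triggers, so forwarding starts at the third frame. Skipping low-energy frames before any spectral work is what keeps the always-on monitoring path cheap.
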
Further, the step of calculating, when the audio energy of the short-time audio frame is greater than the preset energy, the energy decision threshold and the similarity between the spectral profile of the short-time audio frame and the spectral profile of the wake-up word includes: calculating the spectral entropy of the short-time audio frame, wherein the spectral entropy measures how uniformly energy is distributed across the frequency spectrum; when the spectral entropy is smaller than a preset value and the audio energy of the short-time audio frame is greater than the preset energy, calculating the spectral profile of the short-time audio frame; and calculating the energy decision threshold based on the spectral entropy, wherein energy decision threshold = basic energy threshold × (1 + α × (spectral entropy value - spectral entropy reference value)), α is an adjustment factor, and the spectral entropy reference value is an estimate of the spectral entropy of experimental background noise.

Further, the step of calculating the spectral entropy of the short-time audio frame includes: acquiring a frequency-domain energy spectrum of the short-time audio frame; dividing the frequency-domain energy spectrum into M sub-bands, wherein the M sub-bands comprise 0 Hz to 500 Hz, 500 Hz to 1500 Hz, 1500 Hz to 2500 Hz, and 2500 Hz to 4000 Hz; calculating the energy ratio of each sub-band; and calculating, from the energy ratios of the sub-bands, the relative difference between the maximum value and the average value, and taking the relative difference as the spectral entropy of the short-time audio frame.

Further, when the similarity is greater than the energy decision threshold, the ste