CN-118782064-B - Voice signal processing method, device and equipment


Abstract

The application discloses a voice signal processing method, a device and equipment. The method comprises the steps of obtaining a first sound source signal to be played, performing frequency division compression processing on the first sound source signal to obtain a second sound source signal, obtaining a reference signal for performing echo cancellation according to the second sound source signal, continuously transmitting the second sound source signal and playing the second sound source signal through a loudspeaker, obtaining a voice collection signal to be recognized, wherein the voice collection signal is an audio signal collected by a microphone during the period of playing the second sound source signal, the voice collection signal comprises a target audio signal to be recognized, performing echo cancellation on the voice collection signal based on the reference signal, and performing voice recognition on the voice signal after echo cancellation to obtain a voice recognition result corresponding to the target audio signal to be recognized. By adopting the method, the problems of low voice recognition rate and low voice wake-up rate are solved.

Inventors

WANG HONGYU
SONG GANG
LIANG XIAOTAO

Assignees

浙江未来精灵人工智能科技有限公司

Dates

Publication Date: 20260508
Application Date: 20240702

Claims (13)

1. A voice signal processing method applied to an audio device comprising a loudspeaker and a microphone, the method comprising: acquiring a first sound source signal to be played, and performing frequency division compression processing on the first sound source signal so that the signal intensity of a partial frequency band is compressed, thereby obtaining a second sound source signal, wherein the partial frequency band is determined according to the frequency range of a target audio signal to be recognized; obtaining a reference signal for echo cancellation according to the second sound source signal, and continuously transmitting the second sound source signal and playing it through the loudspeaker; acquiring a voice acquisition signal to be recognized, wherein the voice acquisition signal is an audio signal acquired by the microphone while the loudspeaker plays the second sound source signal, and comprises the target audio signal to be recognized; and performing echo cancellation on the voice acquisition signal based on the reference signal, and performing voice recognition on the echo-cancelled voice signal to obtain a voice recognition result corresponding to the target audio signal to be recognized.
2. The method of claim 1, wherein performing voice recognition on the echo-cancelled voice signal comprises: increasing, in the voice recognition algorithm used for voice recognition, the recognition weight of the part of the echo-cancelled voice signal that corresponds to the partial frequency band.
3. The method of claim 1, wherein acquiring the first sound source signal to be played and performing frequency division compression processing on the first sound source signal, so that the signal intensity of the partial frequency band is compressed to obtain the second sound source signal, comprises: dividing the total bandwidth of the first sound source signal into a series of non-overlapping frequency bands, and selecting some of the frequency bands for signal intensity compression; wherein selecting some of the frequency bands for signal intensity compression comprises selecting the frequency bands to be compressed uniformly or non-uniformly.
4. The method of claim 3, wherein dividing the total bandwidth of the first sound source signal into a series of non-overlapping frequency bands and selecting some of the frequency bands for signal intensity compression comprises: performing a Fourier transform on the first sound source signal to obtain a corresponding first frequency signal; dividing the first frequency signal into a plurality of non-overlapping frequency bands, and identifying each frequency band by an index representing the frequency band information, each index being a frequency point of the first frequency signal; determining compression frequency points from among the frequency points of the first frequency signal, compressing the frequency band signals identified by the compression frequency points, and taking the partially compressed first frequency signal as a second frequency signal; and performing an inverse Fourier transform on the second frequency signal, and taking the resulting time-domain signal as the second sound source signal.
5. The method of claim 4, wherein dividing the first frequency signal into a plurality of non-overlapping frequency bands comprises: determining a frequency point bandwidth for dividing the frequency bands according to the voice signal sampling frequency and the number of points of the Fourier transform, and dividing the first frequency signal into a plurality of non-overlapping frequency bands according to the frequency point bandwidth; and wherein determining compression frequency points from among the frequency points of the first frequency signal comprises: selecting some of the frequency points, uniformly or non-uniformly, as the compression frequency points.
6. The method of claim 4, further comprising: dynamically setting a compression ratio at a compression frequency point according to the voice signal intensity at that compression frequency point, or setting the compression ratio according to a linear relationship with the voice signal intensity, wherein the compression ratio represents the signal compression amplitude of the frequency band identified by the compression frequency point; and/or performing emptying processing on some of the compression frequency points.
7. The method of claim 4, further comprising: controlling the proportion of compression frequency points in the total number of frequency points, so as to reduce the influence of the compression processing on the playback sound quality of the first sound source signal.
8. The method of claim 4, wherein performing echo cancellation on the voice acquisition signal based on the reference signal and performing voice recognition on the echo-cancelled voice signal comprises: the voice acquisition signal comprising a loudspeaker signal and the target audio signal to be recognized, wherein the target audio signal to be recognized is a voice wake-up signal, and the loudspeaker signal is produced by the loudspeaker playing the second sound source signal that has been compressed at the compression frequency points; performing analog-to-digital conversion on the voice acquisition signal to obtain a digital microphone signal; taking the reference signal as a digital echo estimate corresponding to the loudspeaker signal, removing the reference signal from the digital microphone signal, and taking the resulting signal, which is cleaner at the compression frequency points, as the echo-cancelled voice signal; and increasing, in a voice wake-up algorithm used for voice recognition, the recognition weight of the voice signal corresponding to the compression frequency points, so as to recognize the voice wake-up word corresponding to the voice wake-up signal.
9. The method of claim 3, wherein obtaining a reference signal for echo cancellation according to the second sound source signal comprises: after digital-to-analog conversion and power amplification of the second sound source signal, taking the signal before it enters the loudspeaker as a first reference signal, and taking the digital reference signal obtained by analog-to-digital conversion of the first reference signal as the reference signal for echo cancellation; or taking the signal before digital-to-analog conversion of the second sound source signal as a second reference signal, and taking the second reference signal as the reference signal for echo cancellation.
10. A voice signal processing device applied to an audio device comprising a loudspeaker and a microphone, the voice signal processing device comprising: a frequency division compression unit, configured to acquire a first sound source signal to be played and perform frequency division compression processing on the first sound source signal so that the signal intensity of a partial frequency band is compressed to obtain a second sound source signal, wherein the partial frequency band is determined according to the frequency range of a target audio signal to be recognized; a reference signal acquisition unit, configured to obtain a reference signal for echo cancellation according to the second sound source signal, and to continuously transmit the second sound source signal and play it through the loudspeaker; a sound pickup unit, configured to acquire a voice acquisition signal to be recognized, wherein the voice acquisition signal is an audio signal acquired by the microphone while the loudspeaker plays the second sound source signal, and comprises the target audio signal to be recognized; and a recognition unit, configured to perform echo cancellation on the voice acquisition signal based on the reference signal, and to perform voice recognition on the echo-cancelled voice signal to obtain a voice recognition result corresponding to the target audio signal to be recognized.
11. An audio device comprising a speaker, a microphone and an apparatus as claimed in claim 10.
12. A voice wake-up system, comprising a loudspeaker, a plurality of microphones, a frequency division compression module, a reference signal acquisition module, a digital-to-analog converter, a power amplifier, an analog-to-digital converter, an acoustic echo canceller and a wake-up processing module; wherein the frequency division compression module is configured to acquire a first sound source signal to be played and perform frequency division compression processing on the first sound source signal so that the signal intensity of a partial frequency band is compressed to obtain a second sound source signal, the partial frequency band being determined according to the frequency range of a voice wake-up signal; the reference signal acquisition module is configured to obtain a reference signal for echo cancellation according to the second sound source signal, the reference signal being transmitted to the acoustic echo canceller; the microphones are configured to acquire a voice acquisition signal to be recognized, wherein the voice acquisition signal is an audio signal acquired by the microphones while the loudspeaker plays the second sound source signal, and comprises the voice wake-up signal; the acoustic echo canceller is configured to perform echo cancellation on the digital microphone signal according to the reference signal to obtain an echo-cancelled voice signal, the echo-cancelled voice signal being a cleaner voice wake-up signal; and the wake-up processing module is configured to perform voice recognition on the echo-cancelled voice signal to obtain a voice recognition result corresponding to the voice wake-up signal, wherein the voice recognition result comprises a wake-up word used for triggering a voice wake-up function.
13. An electronic device, comprising: a memory for storing a computer program which, when executed by a processor, performs the method of any one of claims 1-9.
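
The frequency division compression recited in claims 3 to 5 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the naive DFT (standing in for whatever Fourier transform a real device would use), the fixed gain and the "every other bin" selection rule are all assumptions made for illustration. The frequency point bandwidth follows claim 5: sampling frequency divided by the number of transform points.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (illustrative stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def bin_bandwidth(sample_rate_hz, n_points):
    """Width of one frequency point: sampling frequency / number of transform points."""
    return sample_rate_hz / n_points


def select_compression_bins(sample_rate_hz, n_points, band_hz, step=2):
    """Uniformly pick every `step`-th frequency point inside band_hz = (low, high).

    `band_hz` plays the role of the frequency range of the target (wake-up)
    signal; the uniform `step` rule is an assumption, not taken from the patent.
    """
    width = bin_bandwidth(sample_rate_hz, n_points)
    low, high = band_hz
    in_band = [k for k in range(1, n_points // 2) if low <= k * width <= high]
    return in_band[::step]


def compress_frame(frame, bins, gain):
    """Scale the selected frequency points by `gain` (0 <= gain <= 1).

    The mirrored point n - k is scaled too, so the spectrum stays conjugate
    symmetric and the output is a real signal. A fixed gain is an assumption.
    """
    n = len(frame)
    spectrum = dft(frame)            # first frequency signal
    for k in bins:
        spectrum[k] *= gain
        spectrum[n - k] *= gain
    return idft(spectrum)            # second sound source signal (time domain)


if __name__ == "__main__":
    fs, n = 16000, 64                # hypothetical sampling rate and frame size
    frame = [math.sin(2 * math.pi * 1000 * t / fs)
             + math.sin(2 * math.pi * 3000 * t / fs) for t in range(n)]
    bins = select_compression_bins(fs, n, (500, 2000), step=2)
    out = compress_frame(frame, bins, gain=0.25)
    print("frequency point bandwidth (Hz):", bin_bandwidth(fs, n))
    print("compression frequency points:", bins)
    print("1 kHz magnitude before/after:", abs(dft(frame)[4]), abs(dft(out)[4]))
```

In this sketch the 1 kHz component falls on a compression frequency point and is attenuated, while the 3 kHz component lies outside the selected band and passes through unchanged. Choosing a larger `step` lowers the proportion of compressed frequency points, which is the knob claim 7 describes for limiting the impact on playback quality. A signal-dependent gain (claim 6) is not modelled here.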

Description

Voice signal processing method, device and equipment

Technical Field

The present application relates to the field of signal processing technologies, and in particular to a voice signal processing method, a voice signal processing device, and an electronic device. The application also relates to an audio device and a voice wake-up system.

Background

With the development of voice signal processing technology, voice wake-up has gradually become an important mode of voice interaction. Intelligent audio devices that provide a voice wake-up function (e.g., smart speakers, smart phones, smart televisions) have both microphones and loudspeakers to collect and play sound signals. Such devices often integrate other audio playback functions, so performing voice wake-up while the loudspeaker is playing audio is a common application scenario in practice. The microphone picks up the loudspeaker signal played by the loudspeaker together with other audio signals, such as a voice wake-up signal from the user; the loudspeaker signal picked up by the microphone is referred to as an echo. The echo cancellation effect is an important factor affecting the recognition rate of the voice wake-up signal.

In the prior art, an intelligent audio device generally applies an Acoustic Echo Canceller (AEC) when performing voice recognition on an audio signal such as a voice wake-up signal. The reference signal of the AEC is taken from a point ahead of the loudspeaker, for example the signal after power amplification or the signal before digital-to-analog conversion. However, this reference signal differs considerably from the signal that has undergone the nonlinear distortion caused by the loudspeaker channel and the structure of the acoustic cavity. The echo cancellation effect is therefore poor, and the recognition rate of the voice wake-up signal is low.
In particular, when audio is played at high volume, the loudspeaker signal picked up by the microphone is far stronger than the voice wake-up signal, so the signal-to-noise ratio is low, the voice recognition rate drops further, and the device becomes difficult to wake up or wakes up by mistake.

How to address the low voice recognition rate and the low wake-up rate is therefore a problem to be solved.

The above information disclosed in the background section is only for enhancement of understanding of the background of the application, and therefore it may contain information that does not form prior art already known to a person of ordinary skill in the art.

Disclosure of Invention

The voice signal processing method provided by the embodiments of the application addresses the low voice recognition rate and low voice wake-up rate, and improves human-machine interaction efficiency.

An embodiment of the application provides a voice signal processing method applied to an audio device comprising a loudspeaker and a microphone. The method comprises: acquiring a first sound source signal to be played, and performing frequency division compression processing on the first sound source signal so that the signal intensity of a partial frequency band is compressed to obtain a second sound source signal; obtaining a reference signal for echo cancellation according to the second sound source signal, and continuously transmitting the second sound source signal and playing it through the loudspeaker; acquiring a voice acquisition signal to be recognized, wherein the voice acquisition signal is an audio signal acquired by the microphone while the loudspeaker plays the second sound source signal, and comprises a target audio signal to be recognized; and performing echo cancellation on the voice acquisition signal based on the reference signal, and performing voice recognition on the echo-cancelled voice signal to obtain a voice recognition result corresponding to the target audio signal to be recognized.

Optionally, performing voice recognition on the echo-cancelled voice signal comprises increasing, in the voice recognition algorithm used for voice recognition, the recognition weight of the voice signal corresponding to the partial frequency band.

Optionally, acquiring the first sound source signal to be played and performing frequency division compression processing on the first sound source signal, so that the signal intensity of a partial frequency band is compressed to obtain the second sound source signal, comprises: dividing the total bandwidth of the first sound source signal into a series of non-overlapping frequency bands, and selecting some of the frequency bands for signal intensity compression, wherein selecting some of the frequency bands for signal intensity compression comprises selecting the frequency bands to be compressed uniformly or non-uniformly.

Optionally, the dividing the total bandwidth of th