CN-116193313-B - Speech enhancement method, device, electronic apparatus, storage medium, and program

Abstract

The disclosure provides a speech enhancement method, a speech enhancement device, an electronic apparatus, a storage medium, and a program, and relates to the technical field of audio processing. The method comprises: in response to detecting that a user is wearing an earphone, emitting prompt audio data through a loudspeaker in the earphone and collecting audio data in the ear canal; determining a target occlusion effect curve according to the prompt audio data and the audio data in the ear canal; filtering a second sound signal to obtain a third sound signal; calculating a noise signal according to a first sound signal; and enhancing the third sound signal. By determining the frequency response curve in the ear canal from the prompt audio data and the in-ear-canal audio data, and determining the occlusion effect curve from that frequency response curve, the disclosure enhances the third sound signal while preventing speech enhancement from occupying a large share of computing resources, and improves the efficiency of speech enhancement.

Inventors

ZHOU LINGSONG

Assignees

小米科技(武汉)有限公司
北京小米移动软件有限公司
北京小米松果电子有限公司

Dates

Publication Date: 20260505
Application Date: 20221223

Claims (13)

1. A speech enhancement method, comprising: in response to detecting that a user is wearing an earphone, emitting prompt audio data through a loudspeaker in the earphone, and collecting audio data in an ear canal; calculating a frequency response curve according to the prompt audio data and the audio data in the ear canal, and determining a target occlusion effect curve according to the frequency response curve; collecting a first sound signal outside the ear canal and a second sound signal inside the ear canal, wherein the first sound signal is collected by a call microphone; filtering the second sound signal according to the target occlusion effect curve to obtain a third sound signal; and calculating a noise signal according to the first sound signal, and enhancing the third sound signal according to the noise signal to obtain an enhanced sound signal.
2. The method according to claim 1, wherein the step of calculating a frequency response curve according to the prompt audio data and the audio data in the ear canal comprises: performing a Fourier transform on the time domain signal of the prompt audio data to generate a prompt audio frequency domain signal, and performing a Fourier transform on the time domain signal of the audio data in the ear canal to generate an in-ear-canal frequency domain signal; calculating a cross power spectrum of the prompt audio frequency domain signal and the in-ear-canal frequency domain signal, and calculating an auto power spectrum of the prompt audio frequency domain signal; and dividing the cross power spectrum by the auto power spectrum to obtain the frequency response curve.
3. The method according to claim 1, wherein, after the step of calculating a noise signal according to the first sound signal, the method further comprises: point-wise multiplying a preset passive noise reduction curve with the noise signal to obtain a leakage noise signal.
4. The method according to claim 1, wherein the step of determining a target occlusion effect curve according to the frequency response curve comprises: obtaining a preset mapping table, wherein the mapping table comprises correspondences between preset frequency response curves and preset occlusion effect curves; and determining the preset occlusion effect curve corresponding to the frequency response curve in the mapping table as the target occlusion effect curve.
5. The method according to claim 1, wherein the step of filtering the second sound signal according to the target occlusion effect curve to obtain a third sound signal comprises: performing a convolution operation on the target occlusion effect curve and the second sound signal to obtain the third sound signal.
6. The method according to claim 1, wherein the step of enhancing the third sound signal according to the noise signal comprises: acquiring a preset reference power spectrum, wherein the reference power spectrum is the power spectrum of a signal containing only human voice; dividing the reference power spectrum by the power spectrum of the noise signal to obtain a posterior signal-to-noise ratio; calculating, from the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the previous signal frame, a prior signal-to-noise ratio predicted value of the current signal frame; and determining a filter function according to the prior signal-to-noise ratio, and enhancing the third sound signal according to the filter function.
7. The method according to claim 6, wherein the posterior signal-to-noise ratio is formulated as: $\gamma(n,k) = P_y(n,k) / P_d(n,k)$, wherein $\gamma(n,k)$ is the posterior signal-to-noise ratio, $n$ is the signal frame, $k$ is the frequency bin, $P_y(n,k)$ is the power spectrum of the third sound signal, and $P_d(n,k)$ is the power spectrum of the noise signal; and the prior signal-to-noise ratio is formulated as: $\xi(n,k) = P_s(n,k) / P_d(n,k)$, wherein $\xi(n,k)$ is the prior signal-to-noise ratio and $P_s(n,k)$ is the reference power spectrum.
8. The method according to claim 7, wherein calculating, from the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the previous signal frame, the prior signal-to-noise ratio predicted value of the current signal frame is formulated as: $\hat{\xi}(n,k) = \alpha \, \hat{P}_s(n-1,k) / P_d(n-1,k) + (1-\alpha) \max(\gamma(n,k) - 1,\ 0)$, wherein $\alpha$ is a smoothing factor, $\hat{\xi}(n,k)$ is the prior signal-to-noise ratio predicted value, $\hat{P}_s(n-1,k)$ is the reference power spectrum predicted value of the frame preceding the current signal frame, $P_d(n-1,k)$ is the power spectrum of the noise signal of the frame preceding the current signal frame, $\gamma(n,k)$ is the posterior signal-to-noise ratio of the current signal frame, and $\max(\gamma(n,k) - 1,\ 0)$ selects the maximum value between $\gamma(n,k) - 1$ and 0.
9. The method according to claim 8, wherein the step of determining a filter function according to the prior signal-to-noise ratio and enhancing the third sound signal according to the filter function comprises: obtaining the filter function according to the prior signal-to-noise ratio predicted value of the current signal frame, formulated as: [formula not recoverable from the source text]; and performing gain control on the third sound signal according to the filter function and the posterior signal-to-noise ratio, so as to enhance the third sound signal and obtain the enhanced sound signal.
10. A speech enhancement apparatus, comprising: a first acquisition module, configured to, in response to detecting that a user is wearing an earphone, emit prompt audio data through a loudspeaker in the earphone and collect audio data in an ear canal; an occlusion effect curve acquisition module, configured to calculate a frequency response curve according to the prompt audio data and the audio data in the ear canal, and to determine a target occlusion effect curve according to the frequency response curve; a second acquisition module, configured to collect a first sound signal outside the ear canal and a second sound signal inside the ear canal, wherein the first sound signal is collected by a call microphone; a filtering module, configured to filter the second sound signal according to the target occlusion effect curve to obtain a third sound signal; and a speech enhancement module, configured to calculate a noise signal according to the first sound signal and to enhance the third sound signal according to the noise signal to obtain an enhanced sound signal.
11. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
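
As an illustration only, the per-frame signal-to-noise ratio recursion described in claims 6 to 9 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `decision_directed_gain`, its parameter names, the default smoothing factor of 0.98, the `max(gamma - 1, 0)` term (taken from the standard decision-directed form), and the Wiener-type gain `xi / (1 + xi)` are all assumptions, because this text does not reproduce the formulas themselves.

```python
import numpy as np


def decision_directed_gain(signal_psd, noise_psd, prev_ref_psd, prev_noise_psd,
                           alpha=0.98, eps=1e-12):
    """Return (gain, prior_snr, posterior_snr) for one frame, per frequency bin.

    signal_psd     : power spectrum of the third sound signal, current frame
    noise_psd      : power spectrum of the noise signal, current frame
    prev_ref_psd   : reference (voice-only) power spectrum estimate, previous frame
    prev_noise_psd : power spectrum of the noise signal, previous frame
    alpha          : smoothing factor (hypothetical default)
    """
    # Posterior SNR: signal power over noise power in the current frame.
    posterior_snr = signal_psd / (noise_psd + eps)
    # Prior SNR of the previous frame: reference power over noise power.
    prev_prior_snr = prev_ref_psd / (prev_noise_psd + eps)
    # Smoothed prior SNR prediction for the current frame (assumed form).
    prior_snr = (alpha * prev_prior_snr
                 + (1.0 - alpha) * np.maximum(posterior_snr - 1.0, 0.0))
    # Assumed Wiener-type filter function; the source formula is not available.
    gain = prior_snr / (1.0 + prior_snr)
    return gain, prior_snr, posterior_snr
```

In use, the returned `gain` would be multiplied bin by bin with the spectrum of the third sound signal, and the squared magnitude of the result could serve as `prev_ref_psd` for the next frame.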

Description

Speech enhancement method, device, electronic apparatus, storage medium, and program

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular to a speech enhancement method, a device, an electronic apparatus, a storage medium, and a program.

Background

In the related art, speech enhancement was applied to voice calls early on to reduce the ambient noise around the speaker. Early approaches relied mainly on conventional signal processing: noise was estimated under a Gaussian-like assumption, a gain value was calculated from the estimated noise, and the noisy signal was modulated by that gain to obtain a clean speech signal. With the recent increase in computing power, neural networks have been used to improve algorithm performance, particularly in multi-speaker and non-stationary noise scenarios. However, current speech enhancement algorithms occupy considerable computing resources and consume too much power, which shortens the battery life of earphone products.

Disclosure of Invention

The disclosure provides a speech enhancement method, a device, an electronic apparatus, a storage medium, and a program, so as to at least solve the problem in the related art that speech enhancement occupies considerable computing resources.
The technical scheme of the present disclosure is as follows.
According to a first aspect of embodiments of the present disclosure, there is provided a speech enhancement method, including: in response to detecting that a user is wearing an earphone, emitting prompt audio data through a loudspeaker in the earphone, and collecting audio data in an ear canal; calculating a frequency response curve according to the prompt audio data and the audio data in the ear canal, and determining a target occlusion effect curve according to the frequency response curve; collecting a first sound signal outside the ear canal and a second sound signal inside the ear canal; filtering the second sound signal according to the target occlusion effect curve to obtain a third sound signal; and calculating a noise signal according to the first sound signal, and enhancing the third sound signal according to the noise signal to obtain an enhanced sound signal.
Optionally, the step of calculating the frequency response curve according to the prompt audio data and the audio data in the ear canal specifically includes: performing a Fourier transform on the time domain signal of the prompt audio data to generate a prompt audio frequency domain signal, and performing a Fourier transform on the time domain signal of the audio data in the ear canal to generate an in-ear-canal frequency domain signal; calculating a cross power spectrum of the prompt audio frequency domain signal and the in-ear-canal frequency domain signal, and calculating an auto power spectrum of the prompt audio frequency domain signal; and dividing the cross power spectrum by the auto power spectrum to obtain the frequency response curve.
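
The frequency response calculation just described (cross power spectrum divided by the power spectrum of the prompt signal itself) can be sketched as follows. This is a hedged illustration, not the disclosed implementation: the function name `estimate_frequency_response`, the frame length of 512 samples, the Hann window, and the averaging over non-overlapping frames are assumptions that the source does not specify.

```python
import numpy as np


def estimate_frequency_response(prompt, in_ear, frame_len=512, eps=1e-12):
    """Estimate H(k) = S_xy(k) / S_xx(k) from two equal-rate time signals.

    prompt : time domain samples of the prompt audio sent to the loudspeaker
    in_ear : time domain samples recorded inside the ear canal
    """
    n_frames = min(len(prompt), len(in_ear)) // frame_len
    window = np.hanning(frame_len)
    n_bins = frame_len // 2 + 1
    cross = np.zeros(n_bins, dtype=complex)  # accumulated cross power spectrum
    auto = np.zeros(n_bins)                  # accumulated auto power spectrum
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        X = np.fft.rfft(window * prompt[seg])  # prompt audio, frequency domain
        Y = np.fft.rfft(window * in_ear[seg])  # in-ear-canal audio, frequency domain
        cross += np.conj(X) * Y
        auto += np.abs(X) ** 2
    # eps guards against division by zero in bins where the prompt has no energy.
    return cross / (auto + eps)
```

For example, if the in-ear recording were simply the prompt attenuated by half, the estimated response would be close to 0.5 in every bin where the prompt carries energy.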
Optionally, after the step of calculating the noise signal according to the first sound signal, the method further includes: point-wise multiplying a preset passive noise reduction curve with the noise signal to obtain a leakage noise signal.
Optionally, the step of determining a target occlusion effect curve according to the frequency response curve specifically includes: obtaining a preset mapping table, wherein the mapping table comprises correspondences between preset frequency response curves and preset occlusion effect curves; and determining the preset occlusion effect curve corresponding to the frequency response curve in the mapping table as the target occlusion effect curve.
Optionally, the step of filtering the second sound signal according to the target occlusion effect curve to obtain a third sound signal specifically includes: performing a convolution operation on the target occlusion effect curve and the second sound signal to obtain the third sound signal.
Optionally, the step of enhancing the third sound signal according to the noise signal specifically includes: acquiring a preset reference power spectrum, wherein the reference power spectrum is the power spectrum of a signal containing only human voice; dividing the reference power spectrum by the power spectrum of the noise signal to obtain a posterior signal-to-noise ratio; calculating, from the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the previous signal frame, a prior signal-to-noise ratio predicted value of the current signal frame; and determining a filter function according to the prior signal-to-noise ratio, and enhancing the third sound signal according to the filter function.
Optionally, the formulation of the po