CN-120913582-B - Speech processing method, medium, electronic device, and program product
Abstract
The application relates to the technical field of speech processing and discloses a speech processing method, a medium, an electronic device, and a program product. The method first assesses how noisy the environment of the noisy speech data to be processed is, for example by obtaining a parameter that characterizes the noise level, such as the signal-to-noise ratio of the noisy speech data, and then sets a different number of denoising inference steps based on that parameter, so that noisy speech data with different signal-to-noise ratios are denoised with different numbers of inference steps. The denoising effect is thereby guaranteed while avoiding wasted computing resources during denoising.
Inventors
- SHI QIANG
- LI YU
- TANG WEI
- WANG TAIHUI
Assignees
- Honor Device Co., Ltd. (荣耀终端股份有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-04-30
Claims (20)
- 1. A speech processing method, the method comprising: acquiring first voice data to be denoised; calculating a noise-level parameter of a first acquisition environment of the first voice data; determining a target inference count according to the noise-level parameter, wherein the higher the environmental noise level indicated by the noise-level parameter, the larger the corresponding target inference count; and denoising the first voice data with a diffusion model according to the target inference count to obtain denoised second voice data.
- 2. The method of claim 1, wherein the noise-level parameter is a signal-to-noise ratio, and a larger signal-to-noise ratio indicates a lower environmental noise level.
- 3. The method of claim 2, wherein the signal-to-noise ratio is calculated by: detecting a voice activity sequence of the first voice data, wherein data taking a first value in the voice activity sequence is voice data and data taking a second value is noise data; counting a first energy and a first data amount of the data taking the first value, and a second energy and a second data amount of the data taking the second value; and calculating the signal-to-noise ratio by the formula snr = 10·log₁₀((p_s/l_s)/(p_n/l_n)), where snr is the signal-to-noise ratio, p_s is the first energy, l_s is the first data amount, p_n is the second energy, and l_n is the second data amount.
- 4. The method of claim 3, wherein the voice activity sequence is detected using at least one of an energy detection algorithm, zero-crossing detection, a hidden Markov model (HMM), or a long short-term memory (LSTM) network.
- 5. The method of claim 2, wherein the target inference count is determined by: determining that the signal-to-noise ratio falls within a target threshold range, and taking the inference count corresponding to the target threshold range as the target inference count; wherein the target threshold range is one of a plurality of preset threshold ranges, and a preset threshold range with a larger minimum value corresponds to a smaller inference count.
- 6. The method of claim 1, wherein the second voice data is obtained by: converting the first voice data from the time domain to the frequency domain to obtain first frequency-domain data; inputting the first frequency-domain data into the diffusion model and denoising it over the target inference count to obtain second frequency-domain data, wherein the first frequency-domain data and the second frequency-domain data have a preset linear relationship; and converting the second frequency-domain data from the frequency domain to the time domain to obtain the second voice data.
- 7. The method of claim 6, wherein the second frequency-domain data is obtained by: acquiring N sampling time points according to the value N of the target inference count, and inputting the N sampling time points into the diffusion model; for i < N, at the ith sampling time point, the diffusion model acquires ith correction information corresponding to ith frequency-domain data and denoises the ith frequency-domain data with the ith correction information to obtain (i+1)th frequency-domain data, wherein the 1st frequency-domain data is the first frequency-domain data, the ith and (i+1)th frequency-domain data have the preset linear relationship, and i is a positive integer; and for i = N, the (i+1)th frequency-domain data is taken as the second frequency-domain data.
- 8. The method of claim 7, wherein the (i+1)th frequency-domain data is obtained by: at the ith sampling time point, differentiating the ith frequency-domain data with respect to time via the diffusion model to obtain the ith correction information, which indicates the rate of data change from the ith frequency-domain data to the (i+1)th frequency-domain data; and denoising the ith frequency-domain data according to the ith correction information using a correction algorithm to obtain the (i+1)th frequency-domain data.
- 9. The method of claim 8, wherein the correction algorithm comprises at least one of the Euler algorithm and a Richardson-extrapolated Runge-Kutta (RK) algorithm.
- 10. The method of claim 8, wherein the (i+1)th frequency-domain data is obtained by adding to the ith frequency-domain data the ratio of the ith correction information to N.
- 11. The method of any one of claims 6 to 10, wherein the diffusion model is trained by: acquiring first training time-domain data and second training time-domain data of training speech, wherein the second training time-domain data is the first training time-domain data superposed with preset time-domain noise data; converting the first and second training time-domain data from the time domain to the frequency domain to obtain first and second training frequency-domain data, respectively; and training the diffusion model on the first and second training frequency-domain data to obtain the trained diffusion model.
- 12. The method of claim 11, wherein the noise indicated by the preset time-domain noise data comprises at least one of wind noise, whistling, keyboard noise, or human speech.
- 13. The method of claim 11, wherein training the diffusion model on the first and second training frequency-domain data comprises: acquiring a training sampling time; performing, according to the training sampling time, linear interpolation between the first and second training frequency-domain data based on the preset linear relationship to obtain third training frequency-domain data; superposing preset frequency-domain noise data on the third training frequency-domain data to obtain fourth training frequency-domain data, wherein the third training frequency-domain data and the preset frequency-domain noise data have the same dimensions; inputting the fourth training frequency-domain data into the diffusion model and differentiating it with respect to time via the diffusion model to obtain first training correction information; and adjusting network parameters of the diffusion model according to the difference between the first training correction information and preset correction information.
- 14. The method of claim 13, wherein the third training frequency-domain data x_t is obtained by the formula x_t = t·x + (1−t)·y, wherein x represents the first training frequency-domain data, y represents the second training frequency-domain data, t represents the training sampling time, and x and y have the preset linear relationship.
- 15. The method of claim 13, wherein the fourth training frequency-domain data x_t_noise is obtained by x_t_noise = μ + δ·z, wherein μ represents the mean of the third training frequency-domain data, δ represents the standard deviation of the third training frequency-domain data, and z represents the preset frequency-domain noise data.
- 16. The method of claim 15, wherein 0 ≤ δ ≤ 1, and the value of δ first increases and then decreases as the training sampling time increases.
- 17. The method of claim 13, wherein the preset frequency-domain noise data is Gaussian noise data with a mean of 0 and a variance of 1.
- 18. The method of claim 14, wherein the preset correction information is y − x.
- 19. The method of claim 18, wherein the difference between the first training correction information v and the preset correction information is measured by the loss function loss = mse(v, y − x), wherein mse(·) denotes the mean square error.
- 20. The method of claim 11, wherein the first frequency-domain data, the first training frequency-domain data, and the second training frequency-domain data are obtained using the short-time Fourier transform (STFT).
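The SNR computation of claims 3 and 4 can be sketched as below. The formula itself is not legible in this text, so the standard dB form over per-sample average power, snr = 10·log₁₀((p_s/l_s)/(p_n/l_n)), is assumed; the function name and the 1/0 encoding of the voice activity sequence are illustrative choices, not taken from the patent.

```python
import numpy as np

def snr_from_vad(samples: np.ndarray, vad: np.ndarray) -> float:
    """Estimate SNR (dB) from a voice activity sequence, per claims 3-4.

    `vad` marks each sample of `samples`: 1 (the first value) = voice data,
    0 (the second value) = noise data.
    """
    speech = samples[vad == 1]
    noise = samples[vad == 0]
    p_s, l_s = np.sum(speech ** 2), len(speech)   # first energy / first data amount
    p_n, l_n = np.sum(noise ** 2), len(noise)     # second energy / second data amount
    # Assumed dB form: ratio of average speech power to average noise power.
    return 10.0 * np.log10((p_s / l_s) / (p_n / l_n))
```

In practice the `vad` array would come from one of the detectors listed in claim 4 (energy detection, zero-crossing detection, HMM, or LSTM).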
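Claims 6 to 19 describe an Euler-style iterative denoiser driven by a learned correction (velocity) field, trained against the target y − x on linearly interpolated data. A minimal numpy sketch follows; the zero-returning `velocity_model` is a placeholder for the trained diffusion network, and all names, the time grid, and the step size 1/N are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def velocity_model(x_freq: np.ndarray, t: float) -> np.ndarray:
    # Placeholder for the trained diffusion model of claims 7-8: a real
    # network would estimate the correction information (rate of change
    # toward clean data, approximately y - x per claim 18). Returns zeros here.
    return np.zeros_like(x_freq)

def denoise(first_freq_data: np.ndarray, n_steps: int) -> np.ndarray:
    """Forward-Euler denoising over N sampling time points (claims 7-10)."""
    x = first_freq_data                        # 1st frequency-domain data
    for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
        v = velocity_model(x, t)               # ith correction information
        x = x + v / n_steps                    # Euler step: x_{i+1} = x_i + v_i / N
    return x                                   # second frequency-domain data

def training_step(x_clean, y_noisy, t, rng):
    """One training loss evaluation following claims 13-19."""
    x_t = t * x_clean + (1 - t) * y_noisy      # claim 14: linear interpolation
    mu, delta = x_t.mean(), x_t.std()          # claim 15 as stated literally
    z = rng.standard_normal(x_t.shape)         # claim 17: Gaussian, mean 0, var 1
    x_t_noise = mu + delta * z                 # fourth training frequency-domain data
    v = velocity_model(x_t_noise, t)           # first training correction information
    return np.mean((v - (y_noisy - x_clean)) ** 2)   # claim 19: mse(v, y - x)
```

With the placeholder model, `denoise` returns its input unchanged; substituting a trained network that predicts y − x turns the loop into the claimed noisy-to-clean trajectory.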
Description
Speech processing method, medium, electronic device, and program product

Technical Field

The present application relates to the field of speech processing technology, and in particular to a speech processing method, a medium, an electronic device, and a program product.

Background

With the popularization of intelligent terminal devices such as mobile phones and smart speakers, voice interaction scenarios such as voice calls and human-machine interaction are increasingly widespread. In a voice interaction scenario, the device collects speech through a microphone, and the collected speech is often contaminated by noise, which seriously degrades the call experience and the quality of human-machine interaction. For example, during a voice call, the microphone may pick up not only effective speech but also environmental noise such as whistling, so that a user cannot hear what the other party is saying.

At present, as shown in fig. 1, the conventional approach denoises microphone speech with a discriminative speech denoising model. During training of the model shown in fig. 1, training noisy speech 1 is input into a neural network for denoising, and corresponding training denoised speech 3 is output. The network weights are then adjusted by comparing the difference between training denoised speech 3 and the corresponding training clean speech 2, yielding the trained speech denoising model. During the test process shown in fig. 1, noisy speech 4 collected in real time by the microphone is input into the denoising model, and denoised speech 5 is output. Training such a discriminative denoising model requires a large amount of labeled training data, such as noisy speech 1, in which the speech and the noise are annotated.
If this labeling is erroneous or inaccurate, the model may fail to distinguish effective speech from noise, so that effective speech in training noisy speech 1 is misrecognized as noise and removed. At test time, the model may likewise misrecognize effective speech in noisy speech 4 as noise, especially when the model has not seen data like noisy speech 4 during training, thereby damaging the effective speech during denoising.

Disclosure of Invention

The embodiments of the present application provide a speech processing method, a medium, an electronic device, and a program product, which automatically match the number of denoising inference steps of a diffusion model to the noise level of the environment, guaranteeing the denoising effect while avoiding wasted computing resources during denoising.

In a first aspect, an embodiment of the present application provides a speech processing method. The method includes: obtaining first speech data to be denoised (hereinafter, the speech data to be processed); calculating a noise-level parameter of the first collection environment of the first speech data; determining a target inference count according to the noise-level parameter, where the higher the environmental noise level indicated by the parameter, the larger the corresponding target inference count; and denoising the first speech data with a diffusion model according to the target inference count to obtain denoised second speech data (hereinafter, the denoised speech data). The application can thus automatically match the denoising inference count to a parameter characterizing how noisy the speech data is, guaranteeing the denoising effect.
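The SNR-to-inference-count mapping of the first aspect (detailed in claim 5) can be sketched as a lookup over preset threshold ranges. The concrete thresholds and counts below are invented for demonstration; the patent specifies only that ranges with a larger minimum value map to a smaller inference count.

```python
# Illustrative preset threshold ranges (lower bound inclusive, upper bound
# exclusive, in dB) paired with an inference count. Values are hypothetical.
SNR_RANGES = [
    (20.0, float("inf"), 5),    # quiet environment: few inference steps
    (10.0, 20.0, 15),
    (0.0, 10.0, 30),
    (float("-inf"), 0.0, 50),   # very noisy environment: many inference steps
]

def target_inference_count(snr_db: float) -> int:
    """Return the inference count for the threshold range containing snr_db."""
    for lo, hi, n_steps in SNR_RANGES:
        if lo <= snr_db < hi:
            return n_steps
    return SNR_RANGES[-1][2]
```

Note the ordering satisfies claim 5: the range with the largest minimum value (20 dB and up) carries the smallest inference count.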
In a quiet environment, this avoids performing unnecessary denoising inference, and thus avoids wasting computing resources during denoising.

In a possible implementation of the first aspect, the noise-level parameter is a signal-to-noise ratio, and a larger signal-to-noise ratio indicates a lower environmental noise level.

In a possible implementation of the first aspect, the signal-to-noise ratio is calculated by detecting a voice activity sequence of the first voice data, wherein data taking a first value in the voice activity sequence is voice data and data taking a second value is noise data; counting a first energy and a first data amount of the data taking the first value, and a second energy and a second data amount of the data taking the second value; and calculating the signal-to-noise ratio by the formula snr = 10·log₁₀((p_s/l_s)/(p_n/l_n)), wherein snr is the signal-to-noise ratio, p_s is the first energy, l_s is the first data amount, p_n is the second energy, and l_n is the second data amount.