
CN-120913576-B - Voice enhancement method, device, equipment and medium based on stream matching


Abstract

The invention relates to the technical field of data processing and discloses a speech enhancement method, apparatus, device, and medium based on stream matching, applicable to the finance and medical fields. The method comprises: performing stream-matching input construction based on normalized time-frequency representation data and feeding the result into a stream matching model to obtain enhanced time-frequency representation data; performing inverse normalization on the enhanced time-frequency representation data to obtain inverse-normalized time-frequency representation data; and performing time-domain reconstruction on the inverse-normalized time-frequency representation data to obtain enhanced speech waveform data. Aiming at the poor fidelity and large inference latency caused by the discrete quantization common in existing speech enhancement methods, the invention introduces stream-matching modeling and a consistent time-frequency-to-time-domain processing chain, thereby improving the fidelity of speech detail and reducing latency.

Inventors

  • SHI YAN
  • CHEN MINCHUAN

Assignees

  • Ping An Technology (Shanghai) Co., Ltd. (平安科技(上海)有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-08-26

Claims (8)

  1. A method of stream matching-based speech enhancement, comprising: acquiring noisy speech data and performing time-frequency conversion to obtain noise time-frequency representation data; performing normalization on the noise time-frequency representation data to obtain normalized time-frequency representation data; performing stream-matching input construction based on the normalized time-frequency representation data to obtain stream-matching input representation data; inputting the stream-matching input representation data into a feature encoding submodule of a stream matching model to obtain stream-matching feature representation data, performing context modeling on the stream-matching feature representation data to obtain context-enhanced feature representation data, inputting the context-enhanced feature representation data into a velocity-field estimation submodule to output velocity-field estimation data, computing a time-frequency state increment from the velocity-field estimation data and a preset numerical step parameter to obtain time-frequency state increment data, performing ordinary-differential-equation numerical solution on the intermediate-state time-frequency tensor in the stream-matching input representation data using the time-frequency state increment data to obtain an updated intermediate-state time-frequency tensor, and feeding the updated intermediate-state time-frequency tensor back into the stream matching model for further updates until a target path parameter is reached, obtaining enhanced time-frequency representation data; performing inverse normalization on the enhanced time-frequency representation data to obtain inverse-normalized time-frequency representation data; and performing time-domain reconstruction on the inverse-normalized time-frequency representation data to obtain enhanced speech waveform data; wherein the normalization of the noise time-frequency representation data comprises: extracting an amplitude spectrum matrix and a phase spectrum matrix from the noise time-frequency representation data to obtain an amplitude spectrum matrix to be normalized and a phase spectrum matrix; performing numerical stabilization on the amplitude spectrum matrix to be normalized to obtain a stabilized amplitude spectrum matrix; performing dynamic-range compression on the stabilized amplitude spectrum matrix to obtain a compressed amplitude spectrum matrix; computing mean and standard-deviation statistics of the compressed amplitude spectrum matrix along a preset dimension to obtain normalization statistical parameters; performing mean-variance normalization on the compressed amplitude spectrum matrix using the normalization statistical parameters to obtain a normalized amplitude spectrum matrix; and encoding and combining the normalized amplitude spectrum matrix and the phase spectrum matrix according to a preset time-frequency feature format to obtain the normalized time-frequency representation data; and wherein the stream-matching input construction comprises: extracting a feature tensor matching the model input size from the normalized time-frequency representation data to obtain a normalized time-frequency tensor; generating a random noise tensor with zero mean and unit variance based on the normalized time-frequency tensor to obtain a noise time-frequency initial tensor; sampling a path parameter in a preset interval and performing linear interpolation between the noise time-frequency initial tensor and the normalized time-frequency tensor to obtain an intermediate-state time-frequency tensor; performing frame-block division and position-index encoding on the intermediate-state time-frequency tensor to obtain a position-encoded intermediate-state time-frequency tensor; inputting the position-encoded intermediate-state time-frequency tensor together with the path parameter into a time encoding module to output a condition vector; and performing feature concatenation of the condition vector with the position-encoded intermediate-state time-frequency tensor followed by channel mapping to obtain the stream-matching input representation data.
  2. The stream matching-based speech enhancement method according to claim 1, wherein acquiring the noisy speech data and performing time-frequency conversion to obtain the noise time-frequency representation data comprises: receiving an audio input stream and parsing its file header information to obtain raw noisy speech data; performing sampling-rate unification and channel conversion on the raw noisy speech data to obtain formatted noisy speech data; performing DC offset removal and pre-emphasis on the formatted noisy speech data to obtain preprocessed noisy speech data; framing the preprocessed noisy speech data according to a preset window length and frame shift to obtain a noisy speech frame sequence; applying a preset window function frame by frame to the noisy speech frame sequence to obtain a windowed noisy speech frame sequence; performing a short-time Fourier transform on the windowed noisy speech frame sequence to obtain a complex spectrum matrix; and computing the magnitude spectrum and phase spectrum of the complex spectrum matrix and combining them according to a preset time-frequency feature format to obtain the noise time-frequency representation data.
  3. The stream matching-based speech enhancement method according to claim 1, wherein sampling a path parameter in a preset interval and performing linear interpolation between the noise time-frequency initial tensor and the normalized time-frequency tensor to obtain the intermediate-state time-frequency tensor comprises: parsing the start and end boundaries of the preset interval and configuring a sampling strategy to obtain a path parameter; performing relative-position conversion and range constraint on the target path parameter to obtain an interpolation weight; performing size, channel, and stride alignment on the noise time-frequency initial tensor and the normalized time-frequency tensor based on the interpolation weight to obtain an aligned tensor set; and performing linear weighted synthesis on the aligned tensor set to obtain the intermediate-state time-frequency tensor.
  4. The stream matching-based speech enhancement method according to claim 1, wherein performing inverse normalization on the enhanced time-frequency representation data to obtain the inverse-normalized time-frequency representation data comprises: extracting an enhanced normalized magnitude spectrum matrix and an enhanced phase spectrum matrix from the enhanced time-frequency representation data; performing inverse-transformation calculation on the enhanced normalized magnitude spectrum matrix to obtain a de-normalized magnitude spectrum matrix; performing inverse compression on the de-normalized magnitude spectrum matrix according to preset dynamic-range compression configuration parameters to obtain an inverse-compressed magnitude spectrum matrix; performing numerical restoration on the inverse-compressed magnitude spectrum matrix to obtain a restored magnitude spectrum matrix; encoding and combining the restored magnitude spectrum matrix and the enhanced phase spectrum matrix to obtain candidate inverse-normalized time-frequency representation data; and performing dimension rearrangement on the candidate inverse-normalized time-frequency representation data to obtain the inverse-normalized time-frequency representation data.
  5. The stream matching-based speech enhancement method according to claim 1, wherein performing time-domain reconstruction on the inverse-normalized time-frequency representation data to obtain the enhanced speech waveform data comprises: extracting an inverse-normalized magnitude spectrum matrix and an inverse-normalized phase spectrum matrix from the inverse-normalized time-frequency representation data; performing complex synthesis on the inverse-normalized magnitude spectrum matrix and the inverse-normalized phase spectrum matrix to obtain a reconstructed complex spectrum matrix; performing an inverse short-time Fourier transform on the reconstructed complex spectrum matrix to obtain a windowed time-domain frame sequence; performing de-windowing compensation and overlap-add on the windowed time-domain frame sequence to obtain a continuous time-domain speech sequence; performing DC offset correction on the continuous time-domain speech sequence to obtain corrected time-domain speech data; and performing sampling-rate and bit-depth output encoding on the corrected time-domain speech data to obtain the enhanced speech waveform data.
  6. A stream matching-based speech enhancement apparatus, comprising: a noise acquisition module for acquiring noisy speech data and performing time-frequency conversion to obtain noise time-frequency representation data; a noise processing module for performing normalization on the noise time-frequency representation data to obtain normalized time-frequency representation data; a data processing module for performing stream-matching input construction based on the normalized time-frequency representation data to obtain stream-matching input representation data; a model optimization module for inputting the stream-matching input representation data into a feature encoding submodule of a stream matching model to obtain stream-matching feature representation data, performing context modeling on the stream-matching feature representation data to obtain context-enhanced feature representation data, inputting the context-enhanced feature representation data into a velocity-field estimation submodule to output velocity-field estimation data, computing a time-frequency state increment from the velocity-field estimation data and a preset numerical step parameter to obtain time-frequency state increment data, performing ordinary-differential-equation numerical solution on the intermediate-state time-frequency tensor in the stream-matching input representation data using the time-frequency state increment data to obtain an updated intermediate-state time-frequency tensor, and feeding the updated intermediate-state time-frequency tensor back into the stream matching model for further updates until a target path parameter is reached, obtaining enhanced time-frequency representation data; a data inverse-normalization module for performing inverse normalization on the enhanced time-frequency representation data to obtain inverse-normalized time-frequency representation data; and a reconstruction output module for performing time-domain reconstruction on the inverse-normalized time-frequency representation data to obtain enhanced speech waveform data; wherein the noise processing module is specifically configured to extract an amplitude spectrum matrix and a phase spectrum matrix from the noise time-frequency representation data to obtain an amplitude spectrum matrix to be normalized and a phase spectrum matrix, perform numerical stabilization on the amplitude spectrum matrix to be normalized to obtain a stabilized amplitude spectrum matrix, perform dynamic-range compression on the stabilized amplitude spectrum matrix to obtain a compressed amplitude spectrum matrix, compute mean and standard-deviation statistics of the compressed amplitude spectrum matrix along a preset dimension to obtain normalization statistical parameters, perform mean-variance normalization on the compressed amplitude spectrum matrix using the normalization statistical parameters to obtain a normalized amplitude spectrum matrix, and encode and combine the normalized amplitude spectrum matrix and the phase spectrum matrix according to a preset time-frequency feature format to obtain the normalized time-frequency representation data; and wherein the data processing module is specifically configured to extract a feature tensor matching the model input size from the normalized time-frequency representation data to obtain a normalized time-frequency tensor, generate a random noise tensor with zero mean and unit variance based on the normalized time-frequency tensor to obtain a noise time-frequency initial tensor, sample a path parameter in a preset interval and perform linear interpolation between the noise time-frequency initial tensor and the normalized time-frequency tensor to obtain an intermediate-state time-frequency tensor, perform frame-block division and position-index encoding on the intermediate-state time-frequency tensor to obtain a position-encoded intermediate-state time-frequency tensor, input the position-encoded intermediate-state time-frequency tensor together with the path parameter into a time encoding module to output a condition vector, and perform feature concatenation of the condition vector with the position-encoded intermediate-state time-frequency tensor followed by channel mapping to obtain the stream-matching input representation data.
  7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the stream matching-based speech enhancement method according to any one of claims 1 to 5 when executing the computer program.
  8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the stream matching-based speech enhancement method according to any one of claims 1 to 5.
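The front-end of claim 2 is a conventional framing/windowing/STFT chain. Below is a minimal NumPy sketch; the window length (512), frame shift (128), pre-emphasis coefficient (0.97), and Hann window are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def noise_tf_representation(x, win_len=512, hop=128, preemph=0.97):
    """Claim-2 pipeline: DC removal -> pre-emphasis -> framing ->
    windowing -> STFT -> (magnitude spectrum, phase spectrum)."""
    x = x - x.mean()                                         # DC offset removal
    x = np.concatenate(([x[0]], x[1:] - preemph * x[:-1]))   # pre-emphasis
    n_frames = 1 + (len(x) - win_len) // hop                 # framing by window length / frame shift
    frames = np.stack([x[i * hop : i * hop + win_len] for i in range(n_frames)])
    frames = frames * np.hanning(win_len)                    # preset window function, frame by frame
    spec = np.fft.rfft(frames, axis=1)                       # short-time Fourier transform
    return np.abs(spec), np.angle(spec)                      # noise time-frequency representation

mag, phase = noise_tf_representation(
    np.random.default_rng(0).standard_normal(4096))
```

For a 4096-sample input this yields 29 frames of 257 frequency bins; the magnitude and phase matrices together play the role of the claim's "preset time-frequency feature format".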
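Claims 1 and 4 describe a normalization chain (numerical stabilization, dynamic-range compression, mean-variance normalization) and its exact inverse. A sketch assuming log compression, which is a common choice; the patent says "dynamic-range compression" without fixing the compressor, so the `log`/`exp` pair and the `EPS` floor are assumptions:

```python
import numpy as np

EPS = 1e-8  # numerical stabilization floor (illustrative value)

def normalize_mag(mag):
    """Forward chain of claim 1: stabilize -> compress -> mean-variance normalize."""
    comp = np.log(np.maximum(mag, EPS))        # stabilization + log compression
    mu, sigma = comp.mean(), comp.std()        # normalization statistical parameters
    return (comp - mu) / sigma, (mu, sigma)

def denormalize_mag(norm, stats):
    """Inverse chain of claim 4: de-normalize -> inverse compression."""
    mu, sigma = stats
    return np.exp(norm * sigma + mu)           # exact inverse of the forward chain

mag = np.abs(np.random.default_rng(1).standard_normal((8, 16))) + 0.1
norm, stats = normalize_mag(mag)
restored = denormalize_mag(norm, stats)        # round-trips back to mag
```

Because each forward step is invertible and the statistics are retained, the inverse chain recovers the input magnitude matrix exactly (up to floating-point error), which is what makes claim 4's reconstruction lossless with respect to the normalization.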
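The interpolation in claim 3 is the straight-line path used in flow-matching-style training: sample a path parameter t, constrain it to [0, 1], check tensor alignment, and blend the noise and clean tensors by linear weighted synthesis. A minimal sketch:

```python
import numpy as np

def intermediate_state(noise_tf, clean_tf, rng, t_min=0.0, t_max=1.0):
    """Claim-3 steps: sample path parameter -> range constraint ->
    alignment check -> linear weighted synthesis."""
    t = rng.uniform(t_min, t_max)                # sample in the preset interval
    t = float(np.clip(t, 0.0, 1.0))              # range constraint -> interpolation weight
    assert noise_tf.shape == clean_tf.shape      # size/channel alignment
    x_t = (1.0 - t) * noise_tf + t * clean_tf    # linear weighted synthesis
    return x_t, t

rng = np.random.default_rng(2)
z = rng.standard_normal((4, 6))                  # noise time-frequency initial tensor
c = np.ones((4, 6))                              # normalized time-frequency tensor
x_t, t = intermediate_state(z, c, rng)
```

At t = 0 the intermediate state is pure noise, at t = 1 it is the clean tensor; the model is trained to move states along this path.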
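The inner loop of claim 1 (velocity-field estimate, state increment from a preset numerical step parameter, repeated update until the target path parameter) is fixed-step Euler integration of an ordinary differential equation. A sketch with a stand-in field; `velocity_field` here is a hypothetical callable replacing the trained estimation submodule:

```python
import numpy as np

def euler_flow_sampler(x0, velocity_field, num_steps=8, t_target=1.0):
    """Integrate dx/dt = v(x, t) from t=0 to the target path parameter."""
    dt = t_target / num_steps              # preset numerical step parameter
    x, t = x0, 0.0
    for _ in range(num_steps):
        v = velocity_field(x, t)           # velocity-field estimation data
        x = x + dt * v                     # apply the time-frequency state increment
        t += dt                            # march toward the target path parameter
    return x                               # enhanced time-frequency representation

# With the constant straight-line field v = clean - x0, Euler integration
# transports the noise tensor exactly onto the clean tensor.
clean = np.full((3, 5), 2.0)
x0 = np.zeros((3, 5))
enhanced = euler_flow_sampler(x0, lambda x, t: clean - x0, num_steps=4)
```

This is why the approach needs only a handful of solver steps, in contrast to the many inverse-diffusion iterations the description's Background criticizes.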
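Claim 5's reconstruction is complex synthesis followed by an inverse STFT and weighted overlap-add with window compensation. A sketch assuming a Hann-windowed forward transform (window length and frame shift are illustrative); the round-trip below shows interior samples are recovered:

```python
import numpy as np

def reconstruct_waveform(mag, phase, win_len=512, hop=128):
    """Claim-5 steps: complex synthesis -> inverse STFT -> de-windowing
    compensation + overlap-add -> DC offset correction."""
    spec = mag * np.exp(1j * phase)                  # reconstructed complex spectrum
    frames = np.fft.irfft(spec, n=win_len, axis=1)   # windowed time-domain frames
    n_frames = frames.shape[0]
    w = np.hanning(win_len)
    out = np.zeros((n_frames - 1) * hop + win_len)
    wsum = np.zeros_like(out)
    for i in range(n_frames):                        # overlap-add
        out[i * hop : i * hop + win_len] += frames[i] * w
        wsum[i * hop : i * hop + win_len] += w * w   # window compensation weights
    nz = wsum > 1e-8
    out[nz] /= wsum[nz]                              # de-windowing compensation
    return out - out.mean()                          # DC offset correction

# Round trip against a matching Hann-windowed forward STFT.
x = np.sin(np.linspace(0.0, 50.0, 4096))
n = 1 + (len(x) - 512) // 128
fwd = np.stack([x[i * 128 : i * 128 + 512] for i in range(n)]) * np.hanning(512)
spec = np.fft.rfft(fwd, axis=1)
y = reconstruct_waveform(np.abs(spec), np.angle(spec))
```

Dividing by the accumulated squared window is the weighted-overlap-add normalization that makes the analysis/synthesis pair reconstruct the waveform exactly away from the edges.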

Description

Voice enhancement method, device, equipment and medium based on stream matching

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a method, apparatus, device, and medium for speech enhancement based on stream matching.

Background

Existing speech enhancement techniques fall into three main categories. First, discrete modeling methods based on language models (LMs) map speech into discrete tokens before modeling; the discrete quantization inevitably introduces quantization loss, producing artifacts, degrading sound quality, and harming speaker similarity and intelligibility. Second, diffusion-model methods (such as CDiffuSE, SGMSE, and StoRM) typically rely on many gradual inverse-diffusion iterations to map noise back to the data distribution; the large number of inference steps is time-consuming and makes real-time requirements hard to meet. Third, generative methods based on VAEs or GANs improve perceptual quality but suffer from complex model structures, poor training stability, and a tendency toward mode collapse or over-smoothed synthesis. In summary, the prior art suffers from poor fidelity and large inference latency in speech representation, for example in applications in the finance and medical fields.

Disclosure of Invention

The invention provides a speech enhancement method, apparatus, device, and medium based on stream matching, to solve the technical problems of poor fidelity and large inference latency caused by the discrete quantization common in existing speech enhancement methods.
In a first aspect, a stream matching-based speech enhancement method is provided, including: acquiring noisy speech data and performing time-frequency conversion to obtain noise time-frequency representation data; performing normalization on the noise time-frequency representation data to obtain normalized time-frequency representation data; performing stream-matching input construction based on the normalized time-frequency representation data to obtain stream-matching input representation data; inputting the stream-matching input representation data into a feature encoding submodule of a stream matching model to obtain stream-matching feature representation data, performing context modeling on the stream-matching feature representation data to obtain context-enhanced feature representation data, inputting the context-enhanced feature representation data into a velocity-field estimation submodule to output velocity-field estimation data, computing a time-frequency state increment from the velocity-field estimation data and a preset numerical step parameter to obtain time-frequency state increment data, performing ordinary-differential-equation numerical solution on the intermediate-state time-frequency tensor in the stream-matching input representation data using the time-frequency state increment data to obtain an updated intermediate-state time-frequency tensor, and feeding the updated intermediate-state time-frequency tensor back into the stream matching model for further updates until a target path parameter is reached, obtaining enhanced time-frequency representation data; performing inverse normalization on the enhanced time-frequency representation data to obtain inverse-normalized time-frequency representation data; and performing time-domain reconstruction on the inverse-normalized time-frequency representation data to obtain enhanced speech waveform data.

In a second aspect, a stream matching-based speech enhancement apparatus is provided, comprising: a noise acquisition module for acquiring noisy speech data and performing time-frequency conversion to obtain noise time-frequency representation data; a noise processing module for performing normalization on the noise time-frequency representation data to obtain normalized time-frequency representation data; a data processing module for performing stream-matching input construction based on the normalized time-frequency representation data to obtain stream-matching input representation data; and a model optimization module for inputting the stream-matching input representation data into a feature encoding submodule of a stream matching model to obtain stream-matching feature representation data, performing context modeling on the stream-matching feature representation data to obtain context-enhanced feature representation data, inputting the context-enhanced feature representation data into a velocity-field estimation submodule to output velocity-field estimation data, computing a time-frequency state increment from the velocity-field estimation data and a preset numerical step parameter to obtain time-frequency state increment data, performing ordinary-differential-equation numerical solution on the intermediate-state time-frequency tensor in the stream-matching input representation data using the time-frequency state increment data to obtain an updated intermediate-state time-frequency tensor, and feeding the updated intermediate-state time-frequency tensor back into the stream matching model for further updates until a target path parameter is reached, obtaining enhanced time-frequency representation data;
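Putting the first aspect's stages together: time-frequency conversion, normalization, a flow-matching ODE solve from a noise initialization, inverse normalization, and overlap-add reconstruction. A compressed sketch with illustrative parameters; `velocity_field` is a hypothetical callable standing in for the trained model, and the log compressor is an assumption:

```python
import numpy as np

def enhance(noisy, velocity_field, win=256, hop=64, steps=4, seed=0):
    """End-to-end pipeline sketch: STFT -> normalize -> Euler flow
    sampling -> denormalize -> iSTFT overlap-add."""
    w = np.hanning(win)
    n = 1 + (len(noisy) - win) // hop
    frames = np.stack([noisy[i*hop : i*hop+win] for i in range(n)]) * w
    spec = np.fft.rfft(frames, axis=1)                  # time-frequency conversion
    mag, phase = np.abs(spec), np.angle(spec)
    comp = np.log(np.maximum(mag, 1e-8))                # stabilize + compress
    mu, sd = comp.mean(), comp.std()                    # normalization statistics
    x = np.random.default_rng(seed).standard_normal(comp.shape)  # noise initial tensor
    dt, t = 1.0 / steps, 0.0
    for _ in range(steps):                              # Euler ODE solve
        x = x + dt * velocity_field(x, t)
        t += dt
    mag_hat = np.exp(x * sd + mu)                       # inverse normalization
    rec = np.fft.irfft(mag_hat * np.exp(1j * phase), n=win, axis=1)
    out = np.zeros((n - 1) * hop + win)
    wsum = np.zeros_like(out)
    for i in range(n):                                  # overlap-add reconstruction
        out[i*hop : i*hop+win] += rec[i] * w
        wsum[i*hop : i*hop+win] += w * w
    nz = wsum > 1e-8
    out[nz] /= wsum[nz]
    return out - out.mean()                             # DC offset correction

# Smoke run with a trivial contracting field (not a trained model).
y = enhance(np.sin(np.linspace(0.0, 20.0, 1024)), lambda x, t: -x)
```

With a trained velocity field in place of the toy lambda, the same loop structure implements the few-step sampling that gives the method its latency advantage over iterative diffusion.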