CN-118737119-B - Audio multi-scene noise adding processing method, device, equipment and medium
Abstract
The application relates to an audio multi-scene noise adding processing method, device, equipment and medium. In the method, an audio service system acquires the noise types in a target acoustic scene and the original audio that needs audio multi-scene noise adding processing. The audio service system transmits each noise type as a text embedding to a latent diffusion model in a noise generation system; in the latent diffusion model, a Gaussian noise distribution and the text embedding are adopted as starting points to gradually generate noise audio samples. Each noise audio sample is copied according to a plurality of preset volume multiple thresholds to determine the noise audio samples corresponding to those thresholds. One or more noise audio samples corresponding to a preset volume multiple threshold are then randomly selected from each noise type and synthesized with the original audio that needs audio multi-scene noise adding processing, so that the noise-added audio is obtained. The application enables a model to better adapt to the actual environment and improves its robustness.
Inventors
- ZENG BI
- CHEN ZIHAO
- LIN ZHENTAO
Assignees
- Guangdong University of Technology (广东工业大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-07-12
Claims (7)
- 1. An audio multi-scene noise adding processing method, characterized by comprising the following steps: in response to an audio multi-scene noise adding processing instruction, an audio service system acquires the noise types in a target acoustic scene and the original audio that needs audio multi-scene noise adding processing; the audio service system transmits each noise type as a text embedding to a latent diffusion model in a noise generation system, in which a Gaussian noise distribution and the text embedding are employed as starting points to gradually generate noise audio samples, comprising: the latent diffusion model comprises a diffusion process and a reverse diffusion process, wherein in the diffusion process, at each time step $n$, conditioned on the text embedding, the transition probability is given by the following formulas: $q(z_n \mid z_{n-1}) = \mathcal{N}\big(z_n;\, \sqrt{1-\beta_n}\, z_{n-1},\, \beta_n \mathbf{I}\big)$, $q(z_n \mid z_0) = \mathcal{N}\big(z_n;\, \sqrt{\bar{\alpha}_n}\, z_0,\, (1-\bar{\alpha}_n)\mathbf{I}\big)$, wherein $\beta_n$ is a predefined noise scale and meets $0 < \beta_1 < \beta_2 < \cdots < \beta_N < 1$, $\bar{\alpha}_n = \prod_{s=1}^{n}(1-\beta_s)$ is the re-parameterized coefficient of $z_0$, $1-\bar{\alpha}_n$ represents the noise level of each step, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is the standard Gaussian distribution of the injected noise, and at the last time step the latent $z_N$ approximates a standard isotropic Gaussian distribution; for model optimization, the re-weighted noise estimation training objective is employed: $\mathcal{L}_n(\theta) = \mathbb{E}_{z_0, \epsilon, n}\,\big\|\epsilon - \epsilon_\theta(z_n, n, E^x)\big\|_2^2$, wherein $\theta$ denotes the current model parameters, $\epsilon$ is the injected noise, $\epsilon_\theta$ is the predicted noise, $n$ is the time step, and $E^x$ is the audio embedding of the generated noise audio sample $x$ produced by the pre-trained audio encoder in the contrastive text-audio pre-training; in the reverse diffusion process, starting from the Gaussian noise distribution $z_N \sim \mathcal{N}(0, \mathbf{I})$ and the text embedding $E^y$, with the text embedding $E^y$ as the condition, the denoising process gradually generates the audio prior $z_0$ by: $p_\theta(z_{0:N} \mid E^y) = p(z_N)\prod_{n=1}^{N} p_\theta(z_{n-1} \mid z_n, E^y)$, $p_\theta(z_{n-1} \mid z_n, E^y) = \mathcal{N}\big(z_{n-1};\, \mu_\theta(z_n, n, E^y),\, \sigma_n^2 \mathbf{I}\big)$, the mean parameter being $\mu_\theta(z_n, n, E^y) = \frac{1}{\sqrt{\alpha_n}}\big(z_n - \frac{\beta_n}{\sqrt{1-\bar{\alpha}_n}}\,\epsilon_\theta(z_n, n, E^y)\big)$ with $\alpha_n = 1 - \beta_n$, and the variance parameter being $\sigma_n^2 = \frac{1-\bar{\alpha}_{n-1}}{1-\bar{\alpha}_n}\,\beta_n$, wherein $\epsilon_\theta$ is the predicted noise; during the training phase, the model learns to generate the audio prior $z_0$ based on the audio embedding $E^x$ of the noise audio sample, and in the prediction stage the text embedding $E^y$ is provided to predict the noise $\epsilon_\theta$; in the contrastive text-audio pre-training, $x$ represents the noise audio sample and $y$ represents the text description, and a text encoder $f_{\text{text}}(\cdot)$ and an audio encoder $f_{\text{audio}}(\cdot)$ are used to extract the text embedding $E^y$ and the audio embedding $E^x$, respectively; the variational autoencoder consists of an encoder and a decoder with stacked convolution modules, the encoder compresses the mel-spectrogram $X$ into the latent space $z \in \mathbb{R}^{C \times (T/r) \times (F/r)}$, wherein $r$ represents the compression ratio, the decoder constructs the mel-spectrogram $\hat{X}$ from the audio prior $z_0$ generated by the latent diffusion model, and a preset generative adversarial network is used as a vocoder to generate the noise audio sample $\hat{x}$ from the mel-spectrogram $\hat{X}$; the audio service system copies each noise audio sample according to a plurality of preset volume multiple thresholds, so as to determine the noise audio samples corresponding to the preset volume multiple thresholds; the audio service system randomly selects, from each noise type, one or more noise audio samples corresponding to a preset volume multiple threshold, and synthesizes them with the original audio that needs audio multi-scene noise adding processing, so as to obtain the noise-added audio.
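The reverse diffusion recursion in claim 1 is the standard DDPM ancestral-sampling loop, which can be sketched as follows in NumPy. The `eps_model` callable (standing in for the conditional noise predictor $\epsilon_\theta$) and the flat latent shape are hypothetical placeholders for illustration, not the patent's actual network.

```python
import numpy as np

def ddpm_sample(eps_model, text_emb, shape, betas, rng=np.random.default_rng(0)):
    """Ancestral-sampling sketch of the reverse diffusion process.

    eps_model(z, n, text_emb) -> predicted noise (hypothetical interface).
    betas is the predefined noise schedule beta_1..beta_N.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    z = rng.standard_normal(shape)           # z_N ~ N(0, I): Gaussian starting point
    for n in range(len(betas) - 1, -1, -1):
        eps = eps_model(z, n, text_emb)      # conditional noise prediction eps_theta
        # posterior mean: (z - beta_n / sqrt(1 - alpha_bar_n) * eps) / sqrt(alpha_n)
        mean = (z - betas[n] / np.sqrt(1.0 - alpha_bar[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:
            # posterior variance: (1 - alpha_bar_{n-1}) / (1 - alpha_bar_n) * beta_n
            var = (1.0 - alpha_bar[n - 1]) / (1.0 - alpha_bar[n]) * betas[n]
            z = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            z = mean                         # last step is deterministic
    return z                                 # the audio prior z_0 in latent space
```

The resulting latent would then be passed to the variational autoencoder's decoder and the vocoder, as claim 1 describes.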
- 2. The audio multi-scene noise adding processing method of claim 1, wherein the step in which the audio service system transmits each noise type as a text embedding to the latent diffusion model in the noise generation system, and a Gaussian noise distribution and the text embedding are employed as starting points to gradually generate noise audio samples, comprises: generating an audio prior based on contrastive text-audio pre-training in the latent diffusion model; adopting a variational autoencoder as a decoder, and reconstructing a mel-spectrogram according to the audio prior; and adopting a preset generative adversarial network as a vocoder, and generating the high-quality noise audio sample according to the mel-spectrogram.
- 3. The audio multi-scene noise adding processing method according to claim 1, wherein the step in which the audio service system randomly selects one or more noise audio samples corresponding to a preset volume multiple threshold from each noise type and synthesizes them with the original audio that needs audio multi-scene noise adding processing comprises the following steps: the audio service system opens and reads each audio file and converts it into a NumPy array; determining the maximum length of all the audio data, and zero-filling the data of insufficient length; stacking all the audio data into a two-dimensional array by columns, and flattening the two-dimensional array into a one-dimensional array; and creating a new WAV audio file and writing the flattened audio data into it.
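The file-handling steps enumerated in claim 3 can be sketched directly with the standard-library `wave` module and NumPy. The function name, the 16 kHz sample rate, and the mono 16-bit format are illustrative assumptions, not specified by the claim.

```python
import wave
import numpy as np

def merge_audio_files(paths, out_path="mixed.wav", sample_rate=16000):
    """Sketch of claim 3: read each file as a NumPy array, zero-pad to the
    maximum length, stack by columns, flatten, and write a new WAV file."""
    arrays = []
    for p in paths:
        with wave.open(p, "rb") as w:        # open and read each audio file
            data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        arrays.append(data)
    max_len = max(len(a) for a in arrays)    # maximum length of all audio data
    padded = [np.pad(a, (0, max_len - len(a))) for a in arrays]  # zero-fill
    stacked = np.stack(padded, axis=1)       # stack into a 2-D array by columns
    flat = stacked.flatten()                 # flatten into a 1-D array
    with wave.open(out_path, "wb") as w:     # create a new WAV audio file
        w.setnchannels(1)
        w.setsampwidth(2)                    # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(flat.astype(np.int16).tobytes())
    return out_path
```

Flattening in row-major order interleaves the column-stacked signals sample by sample, matching the claim's stack-then-flatten description.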
- 4. The audio multi-scene noise adding processing method according to any one of claims 1 to 3, wherein the infrastructure of the generative adversarial network is a HiFi-GAN generative adversarial network.
- 5. An audio multi-scene noise adding processing device, characterized by comprising: an audio acquisition module, configured to respond to an audio multi-scene noise adding processing instruction, the audio service system acquiring the noise types in a target acoustic scene and the original audio that needs audio multi-scene noise adding processing; a noise audio generation module, configured to transmit each noise type as a text embedding to the latent diffusion model in the noise generation system, in which a Gaussian noise distribution and the text embedding are employed as starting points to gradually generate noise audio samples, comprising: the latent diffusion model comprises a diffusion process and a reverse diffusion process, wherein in the diffusion process, at each time step $n$, conditioned on the text embedding, the transition probability is given by the following formulas: $q(z_n \mid z_{n-1}) = \mathcal{N}\big(z_n;\, \sqrt{1-\beta_n}\, z_{n-1},\, \beta_n \mathbf{I}\big)$, $q(z_n \mid z_0) = \mathcal{N}\big(z_n;\, \sqrt{\bar{\alpha}_n}\, z_0,\, (1-\bar{\alpha}_n)\mathbf{I}\big)$, wherein $\beta_n$ is a predefined noise scale and meets $0 < \beta_1 < \beta_2 < \cdots < \beta_N < 1$, $\bar{\alpha}_n = \prod_{s=1}^{n}(1-\beta_s)$ is the re-parameterized coefficient of $z_0$, $1-\bar{\alpha}_n$ represents the noise level of each step, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is the standard Gaussian distribution of the injected noise, and at the last time step the latent $z_N$ approximates a standard isotropic Gaussian distribution; for model optimization, the re-weighted noise estimation training objective is employed: $\mathcal{L}_n(\theta) = \mathbb{E}_{z_0, \epsilon, n}\,\big\|\epsilon - \epsilon_\theta(z_n, n, E^x)\big\|_2^2$, wherein $\theta$ denotes the current model parameters, $\epsilon$ is the injected noise, $\epsilon_\theta$ is the predicted noise, $n$ is the time step, and $E^x$ is the audio embedding of the generated noise audio sample $x$ produced by the pre-trained audio encoder in the contrastive text-audio pre-training; in the reverse diffusion process, starting from the Gaussian noise distribution $z_N \sim \mathcal{N}(0, \mathbf{I})$ and the text embedding $E^y$, with the text embedding $E^y$ as the condition, the denoising process gradually generates the audio prior $z_0$ by: $p_\theta(z_{0:N} \mid E^y) = p(z_N)\prod_{n=1}^{N} p_\theta(z_{n-1} \mid z_n, E^y)$, $p_\theta(z_{n-1} \mid z_n, E^y) = \mathcal{N}\big(z_{n-1};\, \mu_\theta(z_n, n, E^y),\, \sigma_n^2 \mathbf{I}\big)$, the mean parameter being $\mu_\theta(z_n, n, E^y) = \frac{1}{\sqrt{\alpha_n}}\big(z_n - \frac{\beta_n}{\sqrt{1-\bar{\alpha}_n}}\,\epsilon_\theta(z_n, n, E^y)\big)$ with $\alpha_n = 1 - \beta_n$, and the variance parameter being $\sigma_n^2 = \frac{1-\bar{\alpha}_{n-1}}{1-\bar{\alpha}_n}\,\beta_n$, wherein $\epsilon_\theta$ is the predicted noise; during the training phase, the model learns to generate the audio prior $z_0$ based on the audio embedding $E^x$ of the noise audio sample, and in the prediction stage the text embedding $E^y$ is provided to predict the noise $\epsilon_\theta$; in the contrastive text-audio pre-training, $x$ represents the noise audio sample and $y$ represents the text description, and a text encoder $f_{\text{text}}(\cdot)$ and an audio encoder $f_{\text{audio}}(\cdot)$ are used to extract the text embedding $E^y$ and the audio embedding $E^x$, respectively; the variational autoencoder consists of an encoder and a decoder with stacked convolution modules, the encoder compresses the mel-spectrogram $X$ into the latent space $z \in \mathbb{R}^{C \times (T/r) \times (F/r)}$, wherein $r$ represents the compression ratio, the decoder constructs the mel-spectrogram $\hat{X}$ from the audio prior $z_0$ generated by the latent diffusion model, and a preset generative adversarial network is used as a vocoder to generate the noise audio sample $\hat{x}$ from the mel-spectrogram $\hat{X}$; an audio sample copying module, configured such that the audio service system copies each noise audio sample according to a plurality of preset volume multiple thresholds, so as to determine the noise audio samples corresponding to the preset volume multiple thresholds; and an audio synthesis module, configured such that the audio service system randomly selects, from each noise type, one or more noise audio samples corresponding to a preset volume multiple threshold, and synthesizes them with the original audio that needs audio multi-scene noise adding processing, so as to obtain the noise-added audio.
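The audio sample copying module's volume-multiple duplication amounts to amplitude scaling of each noise sample, which can be sketched as below. The concrete multiples and the clipping to a float range of [-1, 1] are illustrative assumptions, not values fixed by the claims.

```python
import numpy as np

def copy_at_volume_multiples(noise_sample, volume_multiples=(0.5, 1.0, 1.5, 2.0)):
    """Sketch of the audio sample copying module: duplicate one noise audio
    sample at each preset volume multiple threshold (values illustrative)."""
    copies = {}
    for m in volume_multiples:
        scaled = noise_sample.astype(np.float32) * m   # scale amplitude by the multiple
        copies[m] = np.clip(scaled, -1.0, 1.0)         # keep within valid float range
    return copies
```

The synthesis module would then randomly pick one of these copies per noise type before mixing it into the original audio.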
- 6. An electronic device comprising a central processing unit and a memory, characterized in that the central processing unit is arranged to invoke a computer program stored in the memory so as to perform the steps of the method according to any one of claims 1 to 4.
- 7. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implementing the method according to any one of claims 1 to 4, which, when invoked by a computer, performs the steps comprised by the corresponding method.
Description
Technical Field
The present application relates to the field of audio processing, and in particular to an audio multi-scene noise adding processing method, a corresponding apparatus, an electronic device, and a computer-readable storage medium.
Background
With the continuous development of the market, audio task models have emerged in large numbers, providing services such as voice wake-up and voice recognition. However, most training data sets process audio signals directly and serve only as data augmentation, so their effect in a specific application scene cannot be guaranteed. There is therefore an urgent need for an audio scene-oriented method that can address a specific application scene, in which scene noise is added to an audio signal to simulate a real acoustic environment; this is the audio noise adding method in audio signal processing. After a product enters the market, most audio task models are affected by scene factors in practical application and gradually deviate from the requirements of different noise environments. Even when the content is unchanged, the acoustic scene evolves, so the actual performance of the model deviates from the expected effect and the user experience cannot be guaranteed. With the rapid development of hardware computing power, scene noise adding technology has advanced greatly. Scene noise adding methods fall mainly into two categories: one records sound by actually building the scene, and the other simulates the acoustic scene through software. Since actual construction consumes substantial human resources and time, more and more researchers tend to explore software simulation further.
Because these scene noise adding methods are constructed or simulated from recorded actual noise and existing scene information, they are also called non-generative scene noise adding methods. Although non-generative methods can purposefully enhance the effect of a model in the target acoustic scene, their application is limited to known or foreseeable scenes. In practical applications, re-recording an acoustic scene is time-consuming and labor-intensive, and it is difficult for researchers to collect enough scene-specific audio to train a model. In summary, the prior art suffers from the problems that actual construction requires a large amount of human resources and time, that the application scene is limited to known or foreseeable scenes, and that, in practical applications, re-recording an acoustic scene consumes time and effort while researchers can hardly collect enough scene-specific audio to train a model.
Disclosure of Invention
The present application is directed to solving the above-mentioned problems and provides an audio multi-scene noise adding processing method, a corresponding apparatus, an electronic device, and a computer-readable storage medium.
In order to achieve the purposes of the application, the following technical scheme is adopted. An audio multi-scene noise adding processing method according to one of the objects of the present application comprises: in response to an audio multi-scene noise adding processing instruction, an audio service system acquires the noise types in a target acoustic scene and the original audio that needs audio multi-scene noise adding processing; the audio service system transmits each noise type as a text embedding to a latent diffusion model in the noise generation system, and noise audio samples are gradually generated in the latent diffusion model using a Gaussian noise distribution and the text embedding as starting points; the audio service system copies each noise audio sample according to a plurality of preset volume multiple thresholds, so as to determine the noise audio samples corresponding to the preset volume multiple thresholds; the audio service system randomly selects, from each noise type, one or more noise audio samples corresponding to a preset volume multiple threshold, and synthesizes them with the original audio that needs audio multi-scene noise adding processing to obtain the noise-added audio. Optionally, the step in which the audio service system transmits each noise type as a text embedding to the latent diffusion model in the noise generation system, and noise audio samples are gradually generated in the latent diffusion model using the Gaussian noise distribution and the text embedding as starting points, comprises: generating an audio prior based on contrastive text-audio pre-training in the latent diffusion model; adopting a variational autoencoder as a decoder, and reconstructing a mel-spectrogram according to the audio prior; and adopting a preset generative adversarial network as a vocoder, and generating the high-quality noise audio sample according to the mel-spectrogram.
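The selection-and-synthesis step described above can be sketched as additive mixing of one randomly chosen noise sample per noise type onto the original audio. The dictionary layout of the noise bank, the float sample format, and the clipping to [-1, 1] are illustrative assumptions, not details fixed by the application.

```python
import random
import numpy as np

def add_scene_noise(original, noise_bank, rng=random.Random(0)):
    """Sketch of the synthesis step: randomly pick one noise audio sample per
    noise type and overlay it on the original audio (float arrays, same rate).

    noise_bank maps a noise-type name to a list of volume-multiple copies.
    """
    noisy = original.astype(np.float32).copy()
    for noise_type, samples in noise_bank.items():
        chosen = rng.choice(samples)         # randomly selected volume-multiple copy
        n = min(len(noisy), len(chosen))
        noisy[:n] += chosen[:n]              # additive mixing onto the original audio
    return np.clip(noisy, -1.0, 1.0)         # keep the result in a valid range
```

A noise bank built from the generated and volume-scaled samples would be passed in, yielding the noise-added audio described by the technical scheme.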