
CN-121983081-A - Audio generation method, device, equipment and storage medium

CN121983081A

Abstract

The application relates to an audio generation method, apparatus, device, and storage medium. The method comprises: extracting audio features from original audio data to obtain an audio feature set, the audio features comprising spectral features, phase features, timbre features, and transient features; performing a preset mapping on the audio features in the audio feature set to obtain an audio vector set; fusing the audio vectors in the audio vector set to obtain a fusion vector; and invoking a pre-trained capsule network to process the fusion vector and obtain target audio data. By generating audio from the fused spectral, phase, timbre, and transient features, the method captures the rich information of the audio signal comprehensively and at multiple levels, so the generated audio more accurately meets users' actual demand for high-quality audio.

Inventors

  • Sun Weijia
  • Yuan Min
  • Liang Si
  • Dai Zhennan
  • Li Lin

Assignees

  • MIGU Music Co., Ltd. (咪咕音乐有限公司)
  • MIGU Culture Technology Co., Ltd. (咪咕文化科技有限公司)
  • China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)

Dates

Publication Date
2026-05-05
Application Date
2025-12-17

Claims (10)

  1. An audio generation method, the method comprising: extracting audio features from original audio data to obtain an audio feature set, wherein the audio features comprise spectral features, phase features, timbre features, and transient features; performing a preset mapping on the audio features in the audio feature set to obtain an audio vector set; fusing the audio vectors in the audio vector set to obtain a fusion vector; and invoking a pre-trained capsule network to process the fusion vector and obtain target audio data.
  2. The method of claim 1, wherein the spectral features comprise a spectrogram, a spectral centroid, a spectral flatness, and a spectral bandwidth, and wherein extracting the audio features from the original audio data comprises: converting the original audio data into a spectrogram through a short-time Fourier transform; for each time window in the spectrogram, determining the spectral centroid based on the frequency components of the time window and their corresponding amplitudes; determining the spectral flatness according to the amplitudes corresponding to all the frequency components in the time window; and determining the spectral bandwidth according to the deviation of each frequency component in the time window from the spectral centroid.
  3. The method of claim 2, wherein extracting the audio features from the original audio data further comprises: determining MFCC features from the original audio data; invoking a pre-trained self-supervised learning model to process the spectrogram and the MFCC features and obtain audio global features; and detecting the original audio data and the audio global features with a timbre detection algorithm to obtain the timbre features.
  4. The method of claim 1, wherein extracting the audio features from the original audio data comprises: invoking a preset neural network to preliminarily capture short-time transient characteristics in the original audio data, obtaining first data; weighting different frequency components in the first data with a preset weighted convolution to obtain second data; capturing transient characteristics of different time spans in the second data with a preset multi-scale convolution to obtain third data; determining the transient peak positions and their corresponding intensities in the third data to obtain fourth data; and processing the fourth data with a preset dynamic convolution to obtain the transient features.
  5. The method of claim 1, wherein fusing the audio vectors in the audio vector set to obtain the fusion vector comprises: determining a weight coefficient corresponding to each audio vector; performing weighted fusion of the audio vectors in the audio vector set according to the weight coefficient corresponding to each audio vector, obtaining a preliminary fusion vector; and performing nonlinear mapping and processing on the preliminary fusion vector with a joint modeling model to obtain the fusion vector.
  6. The method of claim 1, wherein invoking the pre-trained capsule network to process the fusion vector and obtain the target audio data comprises: generating primary capsule vectors corresponding to each audio feature according to the fusion vector; determining a coupling coefficient corresponding to each primary capsule vector; determining a higher-level capsule vector according to each primary capsule vector and its corresponding coupling coefficient; and converting the higher-level capsule vector with a decoder to obtain the target audio data.
  7. The method of claim 6, wherein determining the higher-level capsule vector from each primary capsule vector and its corresponding coupling coefficient comprises: for each primary capsule vector, determining a feature score corresponding to the primary capsule vector; when the feature score corresponding to the primary capsule vector is greater than a preset threshold, determining an intermediate capsule vector according to the primary capsule vector and its corresponding coupling coefficient; and determining the higher-level capsule vector from all the intermediate capsule vectors.
  8. An audio generation apparatus, the apparatus comprising: an extraction unit configured to extract audio features from original audio data to obtain an audio feature set, wherein the audio features comprise spectral features, phase features, timbre features, and transient features; a mapping unit configured to perform a preset mapping on the audio features in the audio feature set to obtain an audio vector set; a fusion unit configured to fuse the audio vectors in the audio vector set to obtain a fusion vector; and an invoking unit configured to invoke a pre-trained capsule network to process the fusion vector and obtain target audio data.
  9. An audio generation device comprising at least one communication interface, at least one bus coupled to the at least one communication interface, at least one processor coupled to the at least one bus, and at least one memory coupled to the at least one bus, wherein the processor is configured to: extract audio features from original audio data to obtain an audio feature set, wherein the audio features comprise spectral features, phase features, timbre features, and transient features; perform a preset mapping on the audio features in the audio feature set to obtain an audio vector set; fuse the audio vectors in the audio vector set to obtain a fusion vector; and invoke a pre-trained capsule network to process the fusion vector and obtain target audio data.
  10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the audio generation method of any one of claims 1 to 7.
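The three spectral descriptors of claim 2 have standard definitions that can be sketched directly from an STFT magnitude spectrogram. The following is a minimal NumPy illustration, not the patent's implementation; the sample rate, frame size, hop length, and Hann window are illustrative assumptions, since the claims do not specify them.

```python
import numpy as np

def spectral_descriptors(audio, sr=22050, n_fft=1024, hop=256):
    """Per-frame spectral centroid, flatness, and bandwidth from an
    STFT magnitude spectrogram. All parameters are illustrative."""
    # Short-time Fourier transform -> magnitude spectrogram
    n_frames = 1 + (len(audio) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([audio[i*hop:i*hop + n_fft] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (frames, bins)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)       # bin center frequencies

    eps = 1e-10
    # Centroid: amplitude-weighted mean frequency of each time window
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + eps)
    # Flatness: geometric mean / arithmetic mean of the magnitudes
    flatness = (np.exp(np.log(mag + eps).mean(axis=1))
                / (mag.mean(axis=1) + eps))
    # Bandwidth: amplitude-weighted spread around the centroid
    dev = freqs[None, :] - centroid[:, None]
    bandwidth = np.sqrt((mag * dev**2).sum(axis=1)
                        / (mag.sum(axis=1) + eps))
    return centroid, flatness, bandwidth
```

A pure tone yields a centroid at its frequency with low flatness and bandwidth, while broadband noise yields high flatness and bandwidth, which matches the roles the claims assign to these descriptors.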

Description

Audio generation method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to an audio generation method, apparatus, device, and storage medium.

Background

With the rapid development of technology, audio-related application scenarios keep emerging: from intelligent voice assistants and the immersive audio experiences of virtual reality (VR) and augmented reality (AR) to high-quality music creation and synthesis, all of these place ever more stringent requirements on audio generation technology. Because the frequency spectrum presents an audio signal visually in the frequency domain, it can provide key information for audio generation, such as revealing the frequency composition and harmonic structure of the audio; the prior art therefore generates audio data from spectral features. However, audio generated from the spectrum alone rarely achieves the desired effect, so the generated audio cannot meet users' actual requirements.

Disclosure of Invention

The application provides an audio generation method, apparatus, device, and storage medium, which generate audio by fusing spectral, phase, timbre, and transient features, so that the rich information of the audio signal can be captured comprehensively and at multiple levels. The generated audio improves in fineness of sound quality, fullness of timbre, dynamic variation, and overall consistency of the sound, thereby more accurately meeting users' actual demand for high-quality audio.
In a first aspect, the present application provides an audio generation method comprising: extracting audio features from original audio data to obtain an audio feature set, wherein the audio features comprise spectral features, phase features, timbre features, and transient features; performing a preset mapping on the audio features in the audio feature set to obtain an audio vector set; fusing the audio vectors in the audio vector set to obtain a fusion vector; and invoking a pre-trained capsule network to process the fusion vector and obtain target audio data.

Optionally, the spectral features comprise a spectrogram, a spectral centroid, a spectral flatness, and a spectral bandwidth, and extracting the audio features from the original audio data includes: converting the original audio data into a spectrogram through a short-time Fourier transform; for each time window in the spectrogram, determining the spectral centroid based on the frequency components of the time window and their corresponding amplitudes; determining the spectral flatness according to the amplitudes corresponding to all the frequency components in the time window; and determining the spectral bandwidth according to the deviation of each frequency component in the time window from the spectral centroid.

Optionally, extracting the audio features from the original audio data includes: determining MFCC features from the original audio data; invoking a pre-trained self-supervised learning model to process the spectrogram and the MFCC features and obtain audio global features; and detecting the original audio data and the audio global features with a timbre detection algorithm to obtain the timbre features.
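The MFCC features mentioned above follow a conventional pipeline (power spectrogram, mel filterbank, logarithm, DCT). The patent does not specify the computation, so the following NumPy sketch uses common but assumed parameters (sample rate, frame size, 40 mel bands, 13 coefficients):

```python
import numpy as np

def mfcc(audio, sr=22050, n_fft=1024, hop=256, n_mels=40, n_mfcc=13):
    """Minimal MFCC sketch: STFT power spectrum -> mel filterbank ->
    log -> DCT-II. All parameter defaults are illustrative."""
    # Framed, Hann-windowed power spectrum (the STFT step)
    n_frames = 1 + (len(audio) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([audio[i*hop:i*hop + n_fft] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank between 0 Hz and sr/2
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies followed by a DCT-II (computed directly in NumPy)
    log_mel = np.log(power @ fb.T + 1e-10)
    k, n = np.arange(n_mfcc), np.arange(n_mels)
    dct_mat = np.cos(np.pi * k[:, None] * (2 * n[None, :] + 1) / (2 * n_mels))
    return log_mel @ dct_mat.T          # (frames, n_mfcc)
```

In the described method these coefficients would then be fed, together with the spectrogram, to the pre-trained self-supervised learning model to obtain the audio global features.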
Optionally, extracting the audio features from the original audio data includes: invoking a preset neural network to preliminarily capture short-time transient characteristics in the original audio data, obtaining first data; weighting different frequency components in the first data with a preset weighted convolution to obtain second data; capturing transient characteristics of different time spans in the second data with a preset multi-scale convolution to obtain third data; determining the transient peak positions and their corresponding intensities in the third data to obtain fourth data; and processing the fourth data with a preset dynamic convolution to obtain the transient features.

Optionally, fusing the audio vectors in the audio vector set to obtain the fusion vector includes: determining a weight coefficient corresponding to each audio vector; performing weighted fusion of the audio vectors in the audio vector set according to the weight coefficient corresponding to each audio vector, obtaining a preliminary fusion vector; and performing nonlinear mapping and processing on the preliminary fusion vector with a joint modeling model to obtain the fusion vector.

Optionally, invoking the pre-trained capsule network to process the fusion vector and obtain the target audio data includes: generating primary capsule vectors corresponding to each audio feature according to the fusion vector; determinin
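The coupling coefficients and higher-level capsule vector described for the capsule network resemble standard routing-by-agreement. The patent does not disclose its routing procedure, so the sketch below shows the conventional mechanism under assumed shapes: a "squash" nonlinearity plus iterative refinement of softmaxed coupling logits, with a single output capsule for brevity.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule 'squash' nonlinearity: keeps the direction of v,
    maps its length into [0, 1)."""
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def route(primary, n_iter=3):
    """Routing-by-agreement over primary capsule vectors (illustrative,
    not the patent's disclosed procedure). primary: (n_capsules, dim).
    Returns the higher-level capsule vector and coupling coefficients."""
    logits = np.zeros(len(primary))
    for _ in range(n_iter):
        c = np.exp(logits) / np.exp(logits).sum()   # coupling coefficients
        s = (c[:, None] * primary).sum(axis=0)      # weighted sum of capsules
        high = squash(s)                            # higher-level capsule
        logits = logits + primary @ high            # agreement update
    return high, c
```

The feature-score threshold of claim 7 would sit in front of this loop, discarding primary capsules whose score falls below the preset threshold before the coupling coefficients are refined.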