CN-121789697-B - Encoder training method, audio generation method and audio retrieval method
Abstract
The application provides an encoder training method, an audio generation method and an audio retrieval method, and relates to the technical field of audio processing. The encoder training method comprises: adjusting an audio quantization encoder to obtain a layered encoder, wherein an associated semantic codebook and an associated audio codebook are arranged in the layered encoder; and performing joint training on the layered encoder and a text encoder based on a joint loss to obtain an optimized encoder. The layered encoder is used for performing quantization compression processing on audio data to obtain fusion features, the fusion features comprise semantic features and audio features, the text encoder is used for processing text information associated with the audio data to obtain text vectors, and the fusion features have relevance with the text vectors. The audio generation method comprises: determining generation requirements, wherein the generation requirements comprise semantic requirements and/or audio requirements; and processing the generation requirements through the optimized encoder to obtain the required audio data.
Inventors
- XIAO JIE
- Bao Chengke
Assignees
- 成都开心音符科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-05
Claims (10)
- 1. A method of encoder training, the method comprising: adjusting an audio quantization encoder to obtain a layered encoder, wherein the layered encoder is provided with an associated semantic codebook and an associated audio codebook; and performing joint training on the layered encoder and a text encoder based on a joint loss to obtain an optimized encoder; wherein the layered encoder is used for performing quantization compression processing on audio data to obtain fusion features, the fusion features comprise semantic features and audio features, the text encoder is used for processing text information associated with the audio data to obtain text vectors, and the fusion features have relevance with the text vectors.
- 2. The method of claim 1, wherein the adjusting the audio quantization encoder to obtain a layered encoder comprises: determining the semantic codebook that quantizes semantic information of the audio data; determining the audio codebook, associated with the semantic codebook, that quantizes the audio data; and adjusting the audio quantization encoder based on the semantic codebook and the audio codebook to obtain the layered encoder.
- 3. The method of claim 2, wherein, in the layered encoder, semantically quantizing the audio data comprises: $z_s = Q_s(\mathrm{Attention}(W_s x))$; wherein $W_s$ is a semantic weight matrix, $\mathrm{Attention}$ is a multi-head attention mechanism, $z_s$ is the semantic feature, $x$ is the audio data, and $Q_s$ is a semantic feature quantization algorithm; in the layered encoder, audio-quantizing the audio data comprises: $z_a = Q_a(x - \alpha \cdot \mathrm{upsample}(z_s))$; wherein $\alpha$ is the hierarchical fusion coefficient, $\mathrm{upsample}$ is an upsampling operation, $z_a$ is the audio feature, and $Q_a$ is an audio feature quantization algorithm; and the fusion feature output by the layered encoder is: $z_f = \mathrm{upsample}(z_s) + \beta \cdot z_a$; wherein $z_f$ is the fusion feature and $\beta$ is the fusion coefficient.
- 4. The method of claim 1, wherein the performing joint training on the layered encoder and the text encoder based on the joint loss to obtain an optimized encoder comprises: determining the joint loss associated with the layered encoder and the text encoder; and performing joint training on the layered encoder and the text encoder based on the joint loss to obtain the optimized encoder.
- 5. The method of claim 4, wherein the determining the joint loss associated with the layered encoder and the text encoder comprises: determining a composite loss of the layered encoder; determining a language loss of a language model in the text encoder; and determining the joint loss based on the composite loss, the language loss, and a weight balance coefficient.
- 6. The method of claim 5, wherein the joint loss comprises: $L_{joint} = L_{comp} + \lambda \cdot L_{lang}$; wherein $L_{joint}$ is the joint loss, $L_{comp}$ is the composite loss, $\lambda$ is the weight balance coefficient, and $L_{lang}$ is the language loss; $L_{comp} = L_{recon} + \lambda_{commit} \cdot L_{commit} + \lambda_{con} \cdot L_{con}$; wherein $L_{recon}$ is the audio reconstruction loss, $\lambda_{commit}$ is a commitment coefficient, $L_{commit}$ is a commitment loss constraining the difference between the continuous coding features and the discrete codebook, $\lambda_{con}$ is a contrast coefficient, and $L_{con}$ is a contrast loss for semantically aligning the audio data with the text information; $L_{lang} = -\sum_{t=1}^{T} \log P(s_t \mid s_{<t})$; wherein $s_t$ is the t-th semantic token in the semantic features, $T$ is the length of the token sequence, and $P(s_t \mid s_{<t})$ is the conditional probability distribution output by the language model.
- 7. The method of claim 6, wherein the contrast loss comprises: $L_{con} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(z_s^i, t_i)/\tau)}{\sum_{j} \exp(\mathrm{sim}(z_s^i, t_j)/\tau)}$; wherein $z_s^i$ is the semantic feature of the i-th audio sample in the audio data, $t_i$ is the text vector associated with the i-th audio sample, $t_j$ ranges over the text vectors of all text samples in the text information, $\mathrm{sim}(\cdot, \cdot)$ is a similarity calculation, and $\tau$ is a temperature parameter (an illustrative code sketch of the losses in claims 5-7 follows the claims).
- 8. A method of audio generation, the method comprising: determining a generation requirement, wherein the generation requirement comprises a semantic requirement and/or an audio requirement; and processing the generation requirement through an optimized encoder to obtain required audio data, wherein the optimized encoder is trained by the method of any one of claims 1-7.
- 9. The method of claim 8, wherein the processing the generation requirement through the optimized encoder to obtain required audio data comprises: if the generation requirement is the semantic requirement, determining a requirement text vector based on the semantic requirement through the optimized encoder, determining an associated first requirement fusion feature based on the requirement text vector, and generating the required audio data based on the first requirement fusion feature; and if the generation requirement is the audio requirement, determining an input fusion feature based on the audio requirement through the optimized encoder, determining a second requirement fusion feature associated with the input fusion feature, and generating the required audio data based on the second requirement fusion feature.
- 10. An audio retrieval method, the method comprising: determining a retrieval fusion feature according to a retrieval requirement through an optimized encoder, wherein the optimized encoder is trained by the method of any one of claims 1-7; determining, through the optimized encoder, similar fusion features having a similarity to the retrieval fusion feature greater than or equal to a similarity threshold; and determining target audio based on the similar fusion features through the optimized encoder.
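The losses in claims 5-7 combine standard VQ-VAE training terms with a CLIP-style contrastive alignment and a next-token language-model loss. Below is a minimal PyTorch sketch of that combination; the function names, tensor shapes, and default coefficients (lambda_commit, lambda_con, lambda_lang, tau) are illustrative assumptions reconstructed from the claim definitions, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_sem, text_vecs, tau=0.07):
    """InfoNCE-style contrast loss of claim 7.

    audio_sem, text_vecs: (N, D) L2-normalised embeddings, so the dot
    product acts as the similarity function sim(., .)."""
    logits = audio_sem @ text_vecs.t() / tau               # sim(z_s^i, t_j) / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                # -mean_i log softmax_ii

def joint_loss(recon, target, z_cont, z_quant, audio_sem, text_vecs,
               token_logits, token_ids,
               lambda_commit=0.25, lambda_con=1.0, lambda_lang=0.5):
    """L_joint = L_comp + lambda * L_lang (claims 5-6), with
    L_comp = L_recon + lambda_commit * L_commit + lambda_con * L_con."""
    l_recon = F.mse_loss(recon, target)                    # audio reconstruction loss
    l_commit = F.mse_loss(z_cont, z_quant.detach())        # continuous features vs. codebook
    l_con = contrastive_loss(audio_sem, text_vecs)         # audio-text semantic alignment
    l_comp = l_recon + lambda_commit * l_commit + lambda_con * l_con
    # L_lang = -sum_t log P(s_t | s_<t): next-token cross-entropy over the
    # semantic token sequence; token_logits (B, T, V), token_ids (B, T).
    l_lang = F.cross_entropy(token_logits.transpose(1, 2), token_ids)
    return l_comp + lambda_lang * l_lang
```

Cross-entropy over the pairwise similarity matrix reproduces the fraction in claim 7: the diagonal entries are the matched audio-text pairs, and every other text vector in the batch serves as a negative.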
Description
Encoder training method, audio generation method and audio retrieval method

Technical Field

The application relates to the technical field of audio processing, and in particular to an encoder training method, an audio generation method and an audio retrieval method.

Background

To generate audio, create music, compress audio, and so on, it is currently common to use VQ-VAE (vector-quantized variational autoencoder) based audio discretization methods: an encoder based on convolutional neural networks maps audio waveforms or spectra into continuous latent vectors, the continuous latent vectors are replaced with the nearest discrete entries in a predefined or end-to-end learned codebook, and a decoder reconstructs the original audio from the discrete codebook entries. However, in current audio discretization methods, codebook entries are learned only through a reconstruction loss and therefore lack the ability to align with semantic information such as text and labels. The discrete tokens cannot express high-level semantics; the codebook entries are abstract symbols produced by unsupervised learning and are difficult to map to semantic units understandable by humans; and the audio discretization results are not associated with modalities such as text and images, so cross-modal retrieval or generation cannot be supported. As a result, the current audio processing effect is poor and cannot meet the usage requirements of different application scenarios.

Disclosure of Invention

Accordingly, an objective of the embodiments of the present application is to provide an encoder training method, an audio generation method and an audio retrieval method, so as to solve the problem of poor audio data processing effect in the prior art.

To solve the above problem, in a first aspect, an embodiment of the present application provides an encoder training method, including: adjusting an audio quantization encoder to obtain a layered encoder, wherein the layered encoder is provided with an associated semantic codebook and an associated audio codebook; and performing joint training on the layered encoder and a text encoder based on a joint loss to obtain an optimized encoder. The layered encoder is used for performing quantization compression processing on audio data to obtain fusion features, the fusion features comprise semantic features and audio features, the text encoder is used for processing text information associated with the audio data to obtain text vectors, and the fusion features have relevance with the text vectors.

In the implementation process, so that the encoder can understand and characterize the corresponding semantics, the original audio quantization encoder, which can only quantize audio features, can be adjusted to obtain a layered encoder that is provided with an associated semantic codebook and an audio codebook and can simultaneously express low-level acoustic features and high-level semantic information. The layered encoder performs quantization compression processing on the audio data to obtain fusion features containing the semantic features and the audio features.
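As one concrete, non-authoritative reading of this layered design and of the formulas in claim 3, the sketch below quantizes attention-weighted features against a semantic codebook and the residual against an audio codebook, then fuses the two levels. Everything here (the class name, dimensions, straight-through quantization, and the omission of the upsample step by assuming both levels share one temporal resolution) is an assumption of the sketch, not the patent's stated implementation.

```python
import torch
import torch.nn as nn

class LayeredQuantizer(nn.Module):
    """Sketch of a two-level quantizer: a semantic codebook over attended
    features and an audio codebook over the residual (cf. claim 3)."""
    def __init__(self, dim=256, sem_size=512, aud_size=1024, alpha=1.0, beta=1.0):
        super().__init__()
        self.w_s = nn.Linear(dim, dim)               # semantic weight matrix W_s
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.sem_book = nn.Embedding(sem_size, dim)  # semantic codebook
        self.aud_book = nn.Embedding(aud_size, dim)  # audio codebook
        self.alpha, self.beta = alpha, beta

    @staticmethod
    def quantize(z, book):
        # Replace each frame vector with its nearest codebook entry; the
        # straight-through estimator keeps gradients flowing to the encoder.
        dists = (z.unsqueeze(-2) - book.weight).pow(2).sum(-1)  # (B, T, K)
        q = book(dists.argmin(dim=-1))                          # (B, T, D)
        return z + (q - z).detach()

    def forward(self, x):
        # x: (B, T, D) continuous encoder features of the audio data.
        h = self.w_s(x)
        h, _ = self.attn(h, h, h)                    # Attention(W_s x)
        z_s = self.quantize(h, self.sem_book)        # z_s = Q_s(...)
        z_a = self.quantize(x - self.alpha * z_s, self.aud_book)  # residual audio level
        return z_s + self.beta * z_a                 # fusion feature z_f
```

Quantizing the residual rather than the raw features is a common hierarchical choice: the semantic level captures coarse, text-alignable structure, and the audio level only has to encode what the semantic level missed.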
In addition, considering the high relevance between text information and audio data in various application scenarios, the text encoder can process the text information related to the audio data to obtain text vectors, so that the text vectors are associated with the fusion features. To improve the tightness and effectiveness of the association between the text vectors and the fusion features, a corresponding joint loss can be set, and the layered encoder and the text encoder are jointly trained based on the joint loss to obtain an optimized encoder that can understand semantics and process the audio data on that basis. Through the layered design of the semantic codebook and the audio codebook and the joint training of multiple encoders, the effect of the optimized encoder processing audio data based on semantic information is effectively improved, and the usage requirements of different application scenarios are met.

Optionally, the adjusting the audio quantization encoder to obtain a layered encoder includes: determining the semantic codebook that quantizes semantic information of the audio data; determining the audio codebook, associated with the semantic codebook, that quantizes the audio data; and adjusting the audio quantization encoder based on the semantic codebook and the audio codebook to obtain the layered encoder.

In the implementation process, when the audio quantization encoder is adjusted, a semantic codebook for quantizing semantic information related to the audio data and an audio codebook associated with the semantic codebook for quantizing the audio data can be determined first, and then the audio quantization encoder can be adjusted based on the semantic codebook and the audio codebook to obtain the layered encoder.
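Putting the pieces together, a hypothetical single joint-training step might look as follows. The module interfaces (a layered encoder returning its intermediate tensors, a text encoder producing per-caption vectors, and a decoder reconstructing audio from the fusion features) are assumptions of this sketch, and `joint_loss` refers to the sketch given after the claims.

```python
def joint_train_step(layered_enc, text_enc, decoder, optimizer, audio, text_ids):
    """One joint-training step updating both encoders under the joint loss."""
    out = layered_enc(audio)        # assumed to return a dict of intermediate tensors
    text_vecs = text_enc(text_ids)  # text vectors for the paired text information
    recon = decoder(out["fused"])   # reconstruct audio from the fusion features
    loss = joint_loss(recon, audio,
                      z_cont=out["z_cont"], z_quant=out["z_quant"],
                      audio_sem=out["sem_pooled"], text_vecs=text_vecs,
                      token_logits=out["token_logits"], token_ids=out["sem_ids"])
    optimizer.zero_grad()
    loss.backward()                 # gradients reach both the layered and text encoders
    optimizer.step()
    return loss.item()
```

Because the contrastive term ties the semantic codebook to the text embedding space while the reconstruction term preserves acoustic fidelity, both encoders must be optimized together, which is the point of the joint loss in the second step of the first aspect.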