CN-115831133-B - Audio processing method, device, electronic equipment and storage medium


Abstract

The disclosure relates to an audio processing method, an audio processing device, electronic equipment, and a storage medium. The method comprises: obtaining a target audio fragment to be encoded and a target tone quality index, wherein the target tone quality index is the tone quality index specified for the target audio fragment; performing tone quality index prediction on the target audio fragment based on a tone quality prediction model to obtain the tone quality indexes corresponding to the target audio fragment at a plurality of preset code rates; constructing a mapping relation between code rate and tone quality index from the preset code rates and their corresponding tone quality indexes; and determining, based on the mapping relation, a target code rate corresponding to the target tone quality index, wherein the target code rate is used for encoding the target audio fragment. The technical scheme provided by the disclosure can improve the efficiency and flexibility of audio encoding.

Inventors

CHEN LIANWU
ZHENG XIGUANG
ZHANG CHEN

Assignees

北京达佳互联信息技术有限公司

Dates

Publication Date: 20260512
Application Date: 20220929

Claims (12)

1. An audio processing method, comprising: obtaining target audio to be encoded and a target tone quality index, wherein the target tone quality index is the tone quality index designated for a target audio fragment; segmenting the target audio based on a preset duration to obtain a plurality of audio fragments, wherein the preset duration is fixed or variable; obtaining any audio fragment from the plurality of audio fragments as the target audio fragment to be encoded; performing tone quality index prediction on the target audio fragment based on a tone quality prediction model to obtain the tone quality indexes respectively corresponding to the target audio fragment at a plurality of preset code rates; constructing a mapping relation between code rate and tone quality index according to the plurality of preset code rates and the tone quality indexes respectively corresponding to the plurality of preset code rates; and determining a target code rate corresponding to the target tone quality index based on the mapping relation, wherein the target code rate is used for encoding the target audio fragment.
2. The method according to claim 1, wherein performing tone quality index prediction on the target audio fragment based on the tone quality prediction model to obtain the tone quality indexes respectively corresponding to the target audio fragment at the plurality of preset code rates comprises: performing feature extraction on the target audio fragment to obtain audio features; and inputting the audio features into the tone quality prediction model for tone quality index prediction to obtain the tone quality indexes respectively corresponding to the target audio fragment at the plurality of preset code rates.
3. The method according to claim 2, wherein performing feature extraction on the target audio fragment to obtain audio features comprises: performing a frequency-domain transform on the target audio fragment to obtain spectral data of a plurality of audio frames; and obtaining the audio features from the spectral data.
4. The method according to claim 3, wherein obtaining the audio features from the spectral data comprises: extracting frame features from the spectral data of each audio frame; performing statistical processing on the frame features to obtain feature statistics of the plurality of audio frames; and obtaining the audio features based on the feature statistics.
5. The method according to any one of claims 1-4, wherein the mapping relation is a code rate tone quality curve, and constructing the mapping relation between code rate and tone quality index according to the plurality of preset code rates and the tone quality indexes respectively corresponding to the plurality of preset code rates comprises: constructing a two-dimensional coordinate system, wherein the coordinate axes of the two-dimensional coordinate system correspond to the tone quality index and the code rate respectively; marking, in the two-dimensional coordinate system, a first coordinate point set formed by each preset code rate and its corresponding tone quality index; fitting the code rate and the tone quality index between adjacent preset code rates by linear interpolation to obtain a second coordinate point set between the adjacent preset code rates; and obtaining the code rate tone quality curve based on the first coordinate point set and the second coordinate point set.
6. The method according to claim 1, wherein the method further comprises: acquiring a sample audio fragment and corresponding annotation data, wherein the annotation data represents reference tone quality indexes respectively corresponding to the sample audio fragment at the plurality of preset code rates; performing feature extraction on the sample audio fragment to obtain sample audio features; inputting the sample audio features into a preset neural network for tone quality index prediction to obtain predicted tone quality indexes respectively corresponding to the sample audio fragment at the plurality of preset code rates; determining loss information according to the annotation data and the predicted tone quality indexes; and training the preset neural network based on the loss information until a training iteration condition is met, so as to obtain the tone quality prediction model.
7. The method of claim 6, wherein acquiring the sample audio fragment and the corresponding annotation data comprises: acquiring a historical audio fragment as the sample audio fragment; encoding the sample audio fragment at each of the plurality of preset code rates to obtain a plurality of encoded audio fragments corresponding to the sample audio fragment, each encoded audio fragment corresponding to one preset code rate; and performing tone quality comparison analysis on the sample audio fragment and the corresponding plurality of encoded audio fragments to obtain reference tone quality indexes corresponding to the sample audio fragment, wherein the reference tone quality indexes serve as the annotation data corresponding to the sample audio fragment.
8. The method according to any one of claims 1-4, wherein obtaining the target tone quality index comprises: acquiring content category information and target bandwidth information of the target audio fragment; and determining the target tone quality index according to the content category information and/or the target bandwidth information.
9. An audio processing apparatus, comprising: a first acquisition module configured to acquire a target audio fragment to be encoded and a target tone quality index, wherein the target tone quality index is the tone quality index designated for the target audio fragment; a tone quality prediction module configured to perform tone quality index prediction on the target audio fragment based on a tone quality prediction model to obtain the tone quality indexes respectively corresponding to the target audio fragment at a plurality of preset code rates; a mapping relation construction module configured to construct a mapping relation between code rate and tone quality index according to the plurality of preset code rates and the tone quality indexes respectively corresponding to the plurality of preset code rates; and a code rate determining module configured to determine a target code rate corresponding to the target tone quality index based on the mapping relation, wherein the target code rate is used for encoding the target audio fragment; wherein the first acquisition module includes: a target audio acquisition unit configured to acquire target audio to be encoded; a segmentation processing unit configured to segment the target audio based on a preset duration to obtain a plurality of audio fragments, wherein the preset duration is fixed or variable; and a target audio fragment obtaining unit configured to obtain any audio fragment from the plurality of audio fragments as the target audio fragment.
10. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the audio processing method of any one of claims 1 to 8.
11. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio processing method of any one of claims 1 to 8.
12. A computer program product comprising computer instructions which, when executed by a processor, cause a computer to perform the audio processing method of any of claims 1 to 8.
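
The segmentation step recited in claim 1 can be illustrated with a minimal sketch. It assumes the audio is already decoded into a flat list of samples. The function name `segment_audio`, the unit of seconds, and the rule that the last listed duration is reused once a variable list runs out are all assumptions made for illustration and are not taken from the disclosure.

```python
from typing import List, Sequence, Union


def segment_audio(
    samples: Sequence[float],
    sample_rate: int,
    preset_duration_s: Union[float, Sequence[float]],
) -> List[Sequence[float]]:
    """Split samples into consecutive fragments of a preset duration.

    preset_duration_s is either one number (fixed duration) or a sequence
    of numbers (variable duration). Assumption: when the sequence is
    exhausted, its last value is reused for the remaining audio.
    """
    if isinstance(preset_duration_s, (int, float)):
        durations = [float(preset_duration_s)]
    else:
        durations = [float(d) for d in preset_duration_s]
    if not durations or any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")

    fragments = []
    start = 0
    index = 0
    while start < len(samples):
        duration = durations[min(index, len(durations) - 1)]
        length = max(1, int(round(duration * sample_rate)))
        # The final fragment may be shorter than the preset duration.
        fragments.append(samples[start:start + length])
        start += length
        index += 1
    return fragments
```

Any one of the returned fragments can then serve as the target audio fragment for the prediction and code rate selection steps.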

Description

Audio processing method, device, electronic equipment and storage medium

Technical Field

The disclosure relates to the field of computer technology, and in particular to an audio processing method, an audio processing device, electronic equipment, and a storage medium.

Background

Audio coding is the process of compressing an audio signal with coding techniques for transmission and storage. In the related art, the code rate for audio encoding is set manually. Such settings can draw on only a few features and a limited set of strategies, so they struggle to cover the full range of audio content. Moreover, different application scenarios call for different tone quality targets, and configuring a separate processing strategy for each target is tedious and time-consuming.

Disclosure of Invention

The disclosure provides an audio processing method, an audio processing device, electronic equipment, and a storage medium. The technical scheme of the present disclosure is as follows:

According to a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including: acquiring a target audio fragment to be encoded and a target tone quality index, wherein the target tone quality index is the tone quality index designated for the target audio fragment; performing tone quality index prediction on the target audio fragment based on a tone quality prediction model to obtain the tone quality indexes respectively corresponding to the target audio fragment at a plurality of preset code rates; constructing a mapping relation between code rate and tone quality index according to the plurality of preset code rates and the tone quality indexes respectively corresponding to the plurality of preset code rates; and determining a target code rate corresponding to the target tone quality index based on the mapping relation, wherein the target code rate is used for encoding the target audio fragment.
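
The last two steps of the first aspect (building the mapping relation and reading the target code rate from it) can be sketched as follows. This is only an illustration under stated assumptions: the predicted indexes are taken as given rather than produced by a real model, the mapping is treated as piecewise linear between adjacent preset code rates, the lowest code rate that reaches the target is chosen, and the highest preset is returned when the target is unreachable. The function name `select_target_code_rate` and the units are hypothetical.

```python
from typing import Sequence


def select_target_code_rate(
    preset_code_rates: Sequence[float],
    predicted_indexes: Sequence[float],
    target_index: float,
) -> float:
    """Return the lowest code rate whose interpolated index reaches the target.

    Assumptions (not from the disclosure): linear interpolation between
    adjacent presets, and a fallback to the highest preset code rate when
    no point on the curve reaches the target.
    """
    if len(preset_code_rates) != len(predicted_indexes) or not preset_code_rates:
        raise ValueError("need one predicted index per preset code rate")

    # First coordinate point set: (code rate, tone quality index), by code rate.
    points = sorted(zip(preset_code_rates, predicted_indexes))

    if points[0][1] >= target_index:
        return float(points[0][0])

    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        # Every earlier point is below the target, so q0 < target <= q1 here.
        if q1 >= target_index:
            return float(r0 + (target_index - q0) * (r1 - r0) / (q1 - q0))

    # Target not reachable within the preset range (assumed fallback).
    return float(points[-1][0])
```

For example, with presets of 32, 64, 96 and 128 and predicted indexes of 2.0, 3.0, 3.5 and 4.0, a target index of 3.25 falls halfway between the 64 and 96 presets and yields a code rate of 80.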
In one possible implementation, performing tone quality index prediction on the target audio fragment based on the tone quality prediction model to obtain the tone quality indexes respectively corresponding to the target audio fragment at the plurality of preset code rates includes: performing feature extraction on the target audio fragment to obtain audio features; and inputting the audio features into the tone quality prediction model for tone quality index prediction to obtain the tone quality indexes respectively corresponding to the target audio fragment at the plurality of preset code rates.
In one possible implementation, performing feature extraction on the target audio fragment to obtain audio features includes: performing a frequency-domain transform on the target audio fragment to obtain spectral data of a plurality of audio frames; and obtaining the audio features from the spectral data.
In one possible implementation, obtaining the audio features from the spectral data includes: extracting frame features from the spectral data of each audio frame; performing statistical processing on the frame features to obtain feature statistics of the plurality of audio frames; and obtaining the audio features based on the feature statistics.
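
The feature extraction chain described above (frequency-domain transform, per-frame features, statistics across frames) can be sketched as follows. The disclosure does not fix the transform, the frame features, or the statistics at this point, so every concrete choice here is an assumption made for illustration: a Hann-windowed naive DFT, log energy and spectral centroid as frame features, and mean and population standard deviation as statistics. The function names are hypothetical.

```python
import cmath
import math
from statistics import mean, pstdev


def frame_spectra(samples, frame_len=64, hop=32):
    """Magnitude spectra of overlapping Hann-windowed frames (naive DFT)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        spectra.append(spectrum)
    return spectra


def frame_features(spectrum):
    """Illustrative frame features: log energy and spectral centroid (bins)."""
    power = [m * m for m in spectrum]
    total = sum(power)
    log_energy = math.log10(total + 1e-12)
    centroid = sum(k * p for k, p in enumerate(power)) / (total + 1e-12)
    return [log_energy, centroid]


def audio_features(samples, frame_len=64, hop=32):
    """Mean and standard deviation of each frame feature across all frames."""
    spectra = frame_spectra(samples, frame_len, hop)
    if not spectra:
        raise ValueError("fragment is shorter than one frame")
    per_frame = [frame_features(s) for s in spectra]
    features = []
    for column in zip(*per_frame):
        features.extend([mean(column), pstdev(column)])
    return features
```

The resulting fixed-length vector is the kind of input that would be passed to the tone quality prediction model; a practical system would likely use a fast Fourier transform and a richer feature set.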
In one possible implementation, the mapping relation is a code rate tone quality curve, and constructing the mapping relation between code rate and tone quality index according to the plurality of preset code rates and the tone quality indexes respectively corresponding to the plurality of preset code rates includes: constructing a two-dimensional coordinate system, wherein the coordinate axes of the two-dimensional coordinate system correspond to the tone quality index and the code rate respectively; marking, in the two-dimensional coordinate system, a first coordinate point set formed by each preset code rate and its corresponding tone quality index; fitting the code rate and the tone quality index between adjacent preset code rates by linear interpolation to obtain a second coordinate point set between the adjacent preset code rates; and obtaining the code rate tone quality curve based on the first coordinate point set and the second coordinate point set.
In one possible implementation, the method further includes: acquiring a sample audio fragment and corresponding annotation data, wherein the annotation data represents reference tone quality indexes respectively corresponding to the sample audio fragment at the plurality of preset code rates; performing feature extraction on the sample audio fragment to obtain sample audio features; inputting the sample audio features into a preset neural network for tone quality index prediction to obtain predicted tone quality indexes corresponding to the sample audio fragments und