
CN-121983024-A - Model training method, device, electronic equipment, computer readable storage medium and computer program product

CN121983024A

Abstract

The application provides a model training method, apparatus, electronic device, computer-readable storage medium and computer program product. The method comprises: extracting an audio feature sequence from original audio data; determining a first phoneme feature sequence based on the original audio data and text data; performing alignment processing on the first phoneme feature sequence based on the audio feature sequence to obtain a second phoneme feature sequence; configuring an ignore flag in the second phoneme feature sequence based on the text data to obtain a third phoneme feature sequence; structurally associating the audio feature sequence and the third phoneme feature sequence to obtain structured data, and storing the structured data in a target storage file; and, when a training instruction is received, reading the structured data from the target storage file and updating model parameters of an audio synthesis model based on the data not configured with the ignore flag, to obtain a trained audio synthesis model. The application can improve the training efficiency and accuracy of the audio synthesis model.

Inventors

  • YANG MINGMING
  • FENG DAN

Assignees

  • Tencent Technology (Shenzhen) Company Limited

Dates

Publication Date
2026-05-05
Application Date
2026-04-03

Claims (20)

  1. A model training method, the method comprising: acquiring original audio data and text data corresponding to the original audio data, extracting an audio feature sequence from the original audio data, and determining a first phoneme feature sequence based on the original audio data and the text data; performing alignment processing on the first phoneme feature sequence based on the audio feature sequence to obtain a second phoneme feature sequence; configuring an ignore flag in the second phoneme feature sequence based on the text data to obtain a third phoneme feature sequence; structurally associating the audio feature sequence and the third phoneme feature sequence to obtain structured data, and storing the structured data into a target storage file; and, upon receiving a training instruction for an audio synthesis model to be trained, reading the structured data from the target storage file, and updating model parameters of the audio synthesis model to be trained based on the data in the read structured data that is not configured with the ignore flag, to obtain a trained audio synthesis model.
  2. The method of claim 1, wherein performing alignment processing on the first phoneme feature sequence based on the audio feature sequence to obtain the second phoneme feature sequence comprises: acquiring a first time resolution of the first phoneme feature sequence and a second time resolution of the audio feature sequence; and resampling the first phoneme feature sequence based on the first time resolution and the second time resolution to obtain the second phoneme feature sequence, wherein the sequence length of the second phoneme feature sequence is the same as that of the audio feature sequence.
  3. The method of claim 2, wherein resampling the first phoneme feature sequence based on the first time resolution and the second time resolution to obtain the second phoneme feature sequence comprises: determining a target scaling ratio based on the first time resolution and the second time resolution; and interpolating the first phoneme feature sequence according to the target scaling ratio to obtain the second phoneme feature sequence.
  4. The method of claim 3, wherein interpolating the first phoneme feature sequence according to the target scaling ratio to obtain the second phoneme feature sequence comprises: constructing an initial resampled sequence whose sequence length is the same as that of the audio feature sequence; determining, based on the target scaling ratio, a mapping position in the first phoneme feature sequence corresponding to a target position index contained in the initial resampled sequence; determining, from the first phoneme feature sequence, an original position index closest to the mapping position, and acquiring the original label value corresponding to the original position index in the first phoneme feature sequence; and filling the original label value into the target position index of the initial resampled sequence to obtain the second phoneme feature sequence.
  5. The method of claim 1, wherein configuring an ignore flag in the second phoneme feature sequence based on the text data to obtain the third phoneme feature sequence comprises: determining a non-pronunciation region contained in the second phoneme feature sequence based on the text data; and configuring the ignore flag in the non-pronunciation region of the second phoneme feature sequence to obtain the third phoneme feature sequence.
  6. The method of claim 1, wherein structurally associating the audio feature sequence and the third phoneme feature sequence to obtain structured data comprises: acquiring a sample identifier corresponding to the original audio data; and logically binding the audio feature sequence and the third phoneme feature sequence based on the sample identifier to obtain the structured data containing the sample identifier.
  7. The method of claim 1, wherein storing the structured data into a target storage file comprises: dividing the target storage file into a plurality of independent storage columns; and writing the audio feature sequence and the third phoneme feature sequence in the structured data into their corresponding storage columns respectively.
  8. The method of claim 7, wherein the structured data further comprises a voiceprint vector extracted from the original audio data, and writing the audio feature sequence and the third phoneme feature sequence in the structured data into their corresponding storage columns respectively comprises: constructing the audio feature sequence, the third phoneme feature sequence and the voiceprint vector in the structured data into mutually independent column data to be stored; determining an access frequency of each column data to be stored; compressing each column data to be stored based on the access frequency to obtain compressed column data; and writing each compressed column data into its corresponding storage column.
  9. The method of claim 8, wherein compressing each column data to be stored based on the access frequency to obtain compressed column data comprises: for the column data to be stored corresponding to the audio feature sequence or the third phoneme feature sequence, performing first compression processing on the column data to be stored when its access frequency is greater than or equal to a first preset threshold, to obtain compressed column data; and for the column data to be stored corresponding to the voiceprint vector, performing second compression processing on the column data to be stored when its access frequency is less than the first preset threshold, to obtain compressed column data, wherein the data compression ratio of the second compression processing is higher than that of the first compression processing.
  10. The method of claim 8, wherein after writing each compressed column data into its corresponding storage column, the method further comprises: for the compressed column data corresponding to the audio feature sequence or the third phoneme feature sequence, acquiring the physical storage positions of the compressed column data in the corresponding storage columns; determining a multi-level index of the compressed column data based on the physical storage positions; and writing the multi-level index into the target storage file.
  11. The method of claim 8, wherein after writing each compressed column data into its corresponding storage column, the method further comprises: configuring a deferred-read flag for the compressed column data corresponding to the voiceprint vector, wherein the deferred-read flag indicates that the read operation on the compressed column data corresponding to the voiceprint vector is skipped during the model training stage.
  12. The method of any one of claims 1 to 11, wherein after storing the structured data into the target storage file, the method further comprises: acquiring newly added audio data; performing feature extraction processing on the newly added audio data to obtain newly added structured data; splitting the newly added structured data into a plurality of newly added column data; and appending each newly added column data to its corresponding storage column in the target storage file.
  13. The method of any one of claims 1 to 11, wherein the audio synthesis model to be trained comprises a large language model, an audio prediction network, and a phoneme prediction network, and updating the model parameters of the audio synthesis model to be trained based on the data in the read structured data that is not configured with the ignore flag, to obtain the trained audio synthesis model, comprises: performing, through the large language model, feature extraction on the data in the structured data that is not configured with the ignore flag to obtain hidden state features; performing audio prediction on the hidden state features through the audio prediction network to obtain an audio prediction result; performing phoneme prediction on the hidden state features through the phoneme prediction network to obtain a phoneme prediction result; and updating the model parameters of the large language model, the audio prediction network and the phoneme prediction network based on the audio prediction result and the phoneme prediction result, to obtain the trained audio synthesis model.
  14. The method of claim 13, wherein updating the model parameters of the large language model, the audio prediction network, and the phoneme prediction network based on the audio prediction result and the phoneme prediction result to obtain the trained audio synthesis model comprises: determining a first loss value based on the audio prediction result; determining a second loss value based on the phoneme prediction result; determining a target loss value based on the first loss value and the second loss value; and updating the model parameters of the large language model, the audio prediction network and the phoneme prediction network based on the target loss value, to obtain the trained audio synthesis model.
  15. The method of claim 14, wherein determining the target loss value based on the first loss value and the second loss value comprises: acquiring the training stage that the audio synthesis model to be trained is currently in; determining the second loss value as the target loss value when the training stage is a phoneme-independent training stage; and determining the first loss value as the target loss value when the training stage is an audio-independent training stage.
  16. The method of any one of claims 1 to 11, wherein the audio synthesis model to be trained is deployed on a distributed computing cluster comprising a plurality of computing nodes, and the structured data further comprises the text data and a voiceprint vector extracted from the original audio data; before reading the structured data from the target storage file, the method further comprises: acquiring the text length of the text data, a first feature distribution state of the audio feature sequence, and a second feature distribution state of the third phoneme feature sequence in the structured data; performing block processing on a plurality of pieces of structured data in the target storage file based on the voiceprint vector, the text length, the first feature distribution state and the second feature distribution state, to obtain a plurality of data blocks; and distributing the plurality of data blocks to the plurality of computing nodes.
  17. A model training apparatus, the apparatus comprising: an acquisition module configured to acquire original audio data and text data corresponding to the original audio data, extract an audio feature sequence from the original audio data, and determine a first phoneme feature sequence based on the original audio data and the text data; an alignment module configured to perform alignment processing on the first phoneme feature sequence based on the audio feature sequence to obtain a second phoneme feature sequence; a configuration module configured to configure an ignore flag in the second phoneme feature sequence based on the text data to obtain a third phoneme feature sequence; a storage module configured to structurally associate the audio feature sequence and the third phoneme feature sequence to obtain structured data, and store the structured data into a target storage file; and a training module configured to, upon receiving a training instruction for an audio synthesis model to be trained, read the structured data from the target storage file, and update model parameters of the audio synthesis model to be trained based on the data in the read structured data that is not configured with the ignore flag, to obtain a trained audio synthesis model.
  18. An electronic device, comprising: a memory for storing computer-executable instructions or a computer program; and a processor for implementing the model training method of any one of claims 1 to 16 when executing the computer-executable instructions or computer program stored in the memory.
  19. A computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, implement the model training method of any one of claims 1 to 16.
  20. A computer program product comprising computer-executable instructions or a computer program which, when executed by a processor, implement the model training method of any one of claims 1 to 16.
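Claims 2–4 above describe aligning the phoneme sequence to the audio feature sequence by nearest-neighbor resampling: derive a target scaling ratio from the two time resolutions, map each target position index back to a source position, and copy in the closest original label value. The claims do not give concrete code; the sketch below illustrates that scheme in Python, with the function name, the use of plain lists, and the length-based ratio all chosen for the example:

```python
def resample_phonemes(phoneme_seq, target_len):
    """Nearest-neighbor resampling of a phoneme label sequence.

    Stretches (or shrinks) `phoneme_seq` so its length matches the
    audio feature sequence length `target_len`, as in claims 2-4:
    each target index is mapped back to a source position via the
    scaling ratio, and the nearest original label value is copied in.
    """
    src_len = len(phoneme_seq)
    # Target scaling ratio; here derived from the two sequence lengths,
    # standing in for the two time resolutions of claim 3.
    scale = src_len / target_len
    resampled = []
    for i in range(target_len):
        # Mapping position of target index i in the source sequence.
        mapped = i * scale
        # Original position index closest to the mapping position.
        nearest = min(src_len - 1, int(mapped + 0.5))
        resampled.append(phoneme_seq[nearest])
    return resampled

# Stretch 3 phoneme labels to 6 audio frames; the output length
# now matches the audio feature sequence, as claim 2 requires.
frames = resample_phonemes(["sil", "a", "b"], 6)
```

After this step the two sequences are frame-aligned, so a per-position ignore flag (claim 5) can be applied to both consistently.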

Description

Model training method, device, electronic equipment, computer readable storage medium and computer program product

Technical Field

The present application relates to the field of audio processing technology, and in particular to a model training method, apparatus, electronic device, computer-readable storage medium, and computer program product.

Background

Audio synthesis models are widely applied in the fields of human-computer interaction and audio generation. In related model training processes, multi-modal media data needs to be acquired, multi-dimensional feature extraction and sequence processing are performed on the read media data, and model training is then performed based on the extracted feature data. In model training scenarios involving large-scale data samples and multi-modal joint modeling, as the feature dimension increases, balancing the processing efficiency of multi-dimensional feature data against the accuracy of sequence feature characterization during the model training stage places higher demands on existing data processing and model training mechanisms.

Summary of the Invention

Embodiments of the present application provide a model training method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can improve the training efficiency and accuracy of an audio synthesis model.
The technical solution of the embodiments of the present application is implemented as follows. An embodiment of the present application provides a model training method, comprising: acquiring original audio data and text data corresponding to the original audio data, extracting an audio feature sequence from the original audio data, and determining a first phoneme feature sequence based on the original audio data and the text data; performing alignment processing on the first phoneme feature sequence based on the audio feature sequence to obtain a second phoneme feature sequence; configuring an ignore flag in the second phoneme feature sequence based on the text data to obtain a third phoneme feature sequence; structurally associating the audio feature sequence and the third phoneme feature sequence to obtain structured data, and storing the structured data into a target storage file; and, upon receiving a training instruction for an audio synthesis model to be trained, reading the structured data from the target storage file, and updating model parameters of the audio synthesis model to be trained based on the data in the read structured data that is not configured with the ignore flag, to obtain a trained audio synthesis model.
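The key training-time behavior of the method above is that positions carrying the ignore flag (e.g. non-pronunciation regions) contribute nothing to the parameter update. The patent does not specify how the flag is encoded or how the loss is computed; the plain-Python sketch below illustrates one common way to realize such masking, using a hypothetical sentinel value of -100 and a simple negative log-likelihood loss:

```python
import math

IGNORE_FLAG = -100  # hypothetical sentinel marking non-pronunciation positions

def masked_nll_loss(log_probs, labels, ignore_flag=IGNORE_FLAG):
    """Average negative log-likelihood over non-ignored positions only.

    `log_probs` is a list of per-position dicts mapping label -> log-prob.
    Positions whose label equals `ignore_flag` are skipped entirely, so
    they contribute nothing to the loss (and hence to the update).
    """
    total, count = 0.0, 0
    for lp, label in zip(log_probs, labels):
        if label == ignore_flag:
            continue  # ignore-flagged position: excluded from training
        total += -lp[label]
        count += 1
    if count == 0:
        return 0.0  # every position ignored: nothing to learn from
    return total / count

# Two aligned positions; the second is flagged as non-pronunciation,
# so only the first position determines the loss.
log_probs = [{"a": math.log(0.5)}, {"b": math.log(0.1)}]
labels = ["a", IGNORE_FLAG]
loss = masked_nll_loss(log_probs, labels)
```

Deep-learning frameworks offer the same idea built in (for instance an `ignore_index` argument on cross-entropy losses), which is why encoding the flag directly in the stored label sequence makes the training loop simple.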
An embodiment of the present application provides a model training apparatus, comprising: an acquisition module configured to acquire original audio data and text data corresponding to the original audio data, extract an audio feature sequence from the original audio data, and determine a first phoneme feature sequence based on the original audio data and the text data; an alignment module configured to perform alignment processing on the first phoneme feature sequence based on the audio feature sequence to obtain a second phoneme feature sequence; a configuration module configured to configure an ignore flag in the second phoneme feature sequence based on the text data to obtain a third phoneme feature sequence; a storage module configured to structurally associate the audio feature sequence and the third phoneme feature sequence to obtain structured data, and store the structured data into a target storage file; and a training module configured to, upon receiving a training instruction for an audio synthesis model to be trained, read the structured data from the target storage file, and update model parameters of the audio synthesis model to be trained based on the data in the read structured data that is not configured with the ignore flag, to obtain a trained audio synthesis model. An embodiment of the present application provides an electronic device, comprising: a memory for storing computer-executable instructions or a computer program; and a processor for implementing the model training method provided by the embodiments of the present application when executing the computer-executable instructions or computer program stored in the memory.
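Claims 7–9 and 11 refine the storage module: each feature stream becomes an independent storage column, and columns are compressed differently by access frequency, with frequently read columns (audio and phoneme features) getting fast, light compression and rarely read columns (voiceprint vectors) getting a higher-ratio codec plus a deferred-read flag. The patent names no concrete codecs or thresholds; the sketch below uses zlib compression levels as stand-ins for the two compression processings, and the threshold and field names are invented for illustration:

```python
import zlib

FREQ_THRESHOLD = 100  # hypothetical "first preset threshold" (reads per epoch)

def write_columns(structured_row, access_freq):
    """Pack each field of one structured sample into its own storage column.

    Hot columns (access frequency >= threshold) get light, fast compression;
    cold columns get a higher-ratio setting and are tagged for deferred
    reading, mirroring claims 7-9 and 11.
    """
    columns = {}
    for name, payload in structured_row.items():
        freq = access_freq.get(name, 0)
        if freq >= FREQ_THRESHOLD:
            blob = zlib.compress(payload, level=1)  # first compression: fast
            deferred = False
        else:
            blob = zlib.compress(payload, level=9)  # second compression: high ratio
            deferred = True  # skip this column's read during training
        columns[name] = {"blob": blob, "deferred_read": deferred}
    return columns

# One structured sample: audio/phoneme columns are read every step,
# the voiceprint column almost never.
row = {
    "audio_features": b"\x01\x02" * 1000,
    "phoneme_features": b"\x03\x04" * 1000,
    "voiceprint": b"\x05\x06" * 1000,
}
freqs = {"audio_features": 500, "phoneme_features": 500, "voiceprint": 1}
cols = write_columns(row, freqs)
```

This column-per-field layout is the same design choice made by columnar file formats generally: the training loop can decompress only the columns it needs and skip deferred columns entirely.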
An embodiment of the present application provides a computer-readable storage medium storing a computer program or computer-executable instructions for implementing the model training method provided by the embodiments of the present application when executed by a processor. An embodiment of the present application provides a computer program product, which comprises a computer program or a co