CN-115762557-B - Training method and system for self-supervised training predictor for speech separation
Abstract
The embodiment of the invention provides a training method and a training system for a self-supervised training predictor for speech separation. The method comprises: extracting the self-supervised training features of each single-source speech signal with a pre-training model; extracting, from those features, shallow features for speech representation and deep features for context information, and taking the shallow and deep features of each single-source speech signal as training labels for a self-supervised training predictor; inputting a training speech mixture generated from the single-source speech signals into the predictor to obtain estimated features for each source; and training the predictor with a loss function determined by the estimated features and the corresponding training labels. By training this predictor and applying it in a speech separation model, the embodiment improves the accuracy of the self-supervised training features, improves the performance of the speech separation system, and reduces model parameters and computational complexity.
Inventors
- QIAN YANMIN
- LI CHENDA
- QU BOWEN
Assignees
- 思必驰科技股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20221110
Claims (6)
- 1. A training method for a self-supervised training predictor for speech separation, comprising: extracting self-supervised training features of each single-source speech signal with a pre-training model; extracting, from the self-supervised training features, shallow features for speech representation and deep features for context information, and taking the shallow and deep features of each single-source speech signal as training labels for a self-supervised training predictor, wherein the shallow features are extracted by a time-domain convolutional neural network and the deep features by a Transformer; inputting a training speech mixture generated from the single-source speech signals into the self-supervised training predictor to obtain estimated features for each source, wherein the predictor comprises a time-domain convolutional neural network for extracting the time-domain speech signal and a bidirectional recurrent neural network for context modeling; and training the predictor with a loss function determined by the estimated features of each source and the corresponding training labels to obtain a trained self-supervised training predictor, wherein the loss function is a mean-squared-error loss used for permutation invariant training of the predictor.
- 2. The method of claim 1, wherein the pre-training model comprises a Wav2vec unsupervised pre-training model.
- 3. The method of claim 1, wherein after obtaining the trained self-supervised training predictor, the method further comprises: inputting a received speech mixture containing a plurality of speakers into a speech separation model, wherein the speech separation model comprises an encoder, a separator, and a decoder; the encoder encodes the mixture to obtain deep mixture features; the trained predictor determines estimated self-supervised training features for each speaker in the mixture, and fusion features are determined from each speaker's estimated features together with the deep mixture features; the separator determines a feature code for each speaker from the fusion features; and the decoder decodes the feature codes to obtain the separated speech of each speaker in the mixture.
- 4. A training system for a self-supervised training predictor for speech separation, comprising: a feature extraction program module for extracting self-supervised training features of each single-source speech signal with a pre-training model; a training label determining program module for extracting, from the self-supervised training features, shallow features for speech representation and deep features for context information, and taking the shallow and deep features of each single-source speech signal as training labels for a self-supervised training predictor, wherein the shallow features are extracted by a time-domain convolutional neural network and the deep features by a Transformer; an estimated feature determining program module for inputting a training speech mixture generated from the single-source speech signals into the predictor to obtain the estimated features of each source, wherein the predictor comprises a time-domain convolutional neural network for extracting the time-domain speech signal and a bidirectional recurrent neural network for context modeling; and a training program module for training the predictor with a loss function determined by the estimated features of each source and the corresponding training labels to obtain a trained self-supervised training predictor, wherein the loss function is a mean-squared-error loss used for permutation invariant training of the predictor.
- 5. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-3.
- 6. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method of any of claims 1-3.
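The mean-squared-error loss with permutation invariant training recited in claims 1 and 4 can be sketched as follows. Because the order of speakers in a mixture is arbitrary, every assignment of estimated feature streams to label streams is scored and the lowest error is kept. This is a minimal NumPy illustration; the function name, shapes, and interface are assumptions for the sketch, not from the patent.

```python
import numpy as np
from itertools import permutations

def pit_mse_loss(est, labels):
    """Permutation-invariant MSE for one training mixture.

    est, labels: arrays of shape (n_src, time, feat) holding the
    predictor's estimated features and the per-source training labels.
    Every permutation of source assignments is tried and the minimum
    mean-squared error is returned.
    """
    n_src = est.shape[0]
    best = np.inf
    for perm in permutations(range(n_src)):
        # Reorder estimated streams according to this assignment,
        # then score against the fixed label streams.
        mse = np.mean((est[list(perm)] - labels) ** 2)
        best = min(best, mse)
    return best
```

For two sources this reduces to comparing the two possible pairings and keeping the cheaper one; the exhaustive loop above generalizes to any number of sources at factorial cost.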
Description
Training method and system for self-supervised training predictor for speech separation

Technical Field

The invention relates to the field of intelligent speech, in particular to a training method and system for a self-supervised training predictor for speech separation.

Background

High recognition accuracy has been achieved for single-speaker speech recognition, but the cocktail party problem remains: when multiple speakers talk at the same time, the recognition rate for each speaker drops. To improve recognition accuracy in this setting, the prior art typically uses an unsupervised pre-training model. Large-scale unsupervised pre-training models use a mask-prediction criterion for self-supervised training on large amounts of unlabeled data. During training, the model attempts to capture the contextual information of the speech signal and thereby learns strong deep embedded features. Such models achieve good results on a variety of downstream speech tasks.

In the process of implementing the present invention, the inventors found at least the following problems in the related art. The training data of most pre-training models consists mainly of single-speaker speech, whereas the input to speech separation is typically a mixture of several speakers; directly using pre-training features from single-speaker models therefore works poorly on separation tasks. A further problem to be considered and optimized when applying pre-trained models to speech separation is the complexity and computational cost of the model.
Most pre-training models are designed for general downstream tasks and are trained on large-scale data sets; the resulting model may be too large for the speech separation task, and its computational cost may be prohibitive.

Disclosure of Invention

The invention aims at least to solve the prior-art problems of complexity and high cost when a pre-training model is used for speech separation tasks. In a first aspect, an embodiment of the present invention provides a training method for a self-supervised training predictor for speech separation, comprising: extracting self-supervised training features of each single-source speech signal with a pre-training model; extracting, from the self-supervised training features, shallow features for speech representation and deep features for context information, and taking the shallow and deep features of each single-source speech signal as training labels for a self-supervised training predictor; inputting a training speech mixture generated from the single-source speech signals into the predictor to obtain estimated features for each source; and training the predictor with a loss function determined by the estimated features of each source and the corresponding training labels to obtain the trained self-supervised training predictor.
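The label-construction step above can be sketched as follows. The idea is to run the pre-training model once per single-source utterance and keep two of its intermediate outputs as labels: an early (CNN front-end) layer as the shallow, speech-representation label and a late (Transformer) layer as the deep, context-information label. The list-of-layers interface here is a hypothetical stand-in; real SSL toolkits expose hidden states through their own APIs.

```python
import numpy as np

def make_predictor_labels(layer_outputs, shallow_layer=0, deep_layer=-1):
    """Build predictor training labels from the per-layer outputs of a
    pre-training model run on one single-source utterance.

    layer_outputs: list of (time, feat) arrays, one per model layer
    (an assumed interface for this sketch). The shallow label is taken
    from an early layer, the deep label from a late layer.
    """
    return {
        "shallow": layer_outputs[shallow_layer],  # speech-representation label
        "deep": layer_outputs[deep_layer],        # context-information label
    }
```

During training, the predictor's estimated shallow and deep features for each source in the mixture would then be scored against these labels with the permutation-invariant MSE loss.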
In a second aspect, embodiments of the present invention provide a training system for a self-supervised training predictor for speech separation, comprising: a feature extraction program module for extracting self-supervised training features of each single-source speech signal with the pre-training model; a training label determining program module for extracting, from the self-supervised training features, shallow features for speech representation and deep features for context information, and taking the shallow and deep features of each single-source speech signal as training labels for the self-supervised training predictor; an estimated feature determining program module for inputting the training speech mixture generated from the single-source speech signals into the predictor to obtain the estimated features of each source; and a training program module for training the predictor with a loss function determined by the estimated features of each source and the corresponding training labels to obtain the trained self-supervised training predictor. In a third aspect, an electronic device is provided comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executa