
US-20260128037-A1 - SPEECH RECOGNITION METHOD AND APPARATUS, AND COMPUTER-READABLE STORAGE MEDIUM


Abstract

This application relates to a speech recognition method and apparatus, and a computer-readable storage medium. The method includes: obtaining a first loss function of a speech separation and enhancement model and a second loss function of a speech recognition model; performing back propagation based on the second loss function to train an intermediate model bridging between the speech separation and enhancement model and the speech recognition model, to obtain a representation model; fusing the first loss function and the second loss function, to obtain a target loss function; and jointly training the speech separation and enhancement model, the representation model, and the speech recognition model based on the target loss function, and ending the training when a preset convergence condition is met.

Inventors

  • Jun Wang
  • Wing Yip Lam

Assignees

  • TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED

Dates

Publication Date
2026-05-07
Application Date
2025-12-18
Priority Date
2020-01-16

Claims (20)

  1. A speech recognition method, performed by a computer device, the method comprising: obtaining a first loss function of a speech separation and enhancement model and a second loss function of a speech recognition model; obtaining a representation model by training an intermediate model bridging between the speech separation and enhancement model and the speech recognition model; obtaining a target loss function by fusing the first loss function and the second loss function; and jointly training the speech separation and enhancement model, the representation model, and the speech recognition model based on the target loss function.
  2. The method according to claim 1, wherein obtaining the representation model by training the intermediate model bridging between the speech separation and enhancement model and the speech recognition model comprises: performing back propagation based on the second loss function to train the intermediate model bridging between the speech separation and enhancement model and the speech recognition model to obtain the representation model.
  3. The method according to claim 1, wherein obtaining the target loss function by fusing the first loss function and the second loss function comprises: fusing the first loss function and the second loss function by using a machine learning algorithm.
  4. The method according to claim 1, further comprising: extracting an estimated spectrum and an embedding feature matrix of a sample speech stream based on a first neural network model; determining an attractor corresponding to the sample speech stream according to the embedding feature matrix and a preset ideal masking matrix; obtaining a target masking matrix of the sample speech stream by calculating a similarity between each matrix element in the embedding feature matrix and the attractor; determining an enhanced spectrum corresponding to the sample speech stream according to the target masking matrix; and training the first neural network model based on a mean-square error (MSE) loss between the estimated spectrum and the enhanced spectrum corresponding to the sample speech stream, to obtain the speech separation and enhancement model.
  5. The method according to claim 4, wherein extracting the estimated spectrum and the embedding feature matrix of the sample speech stream based on the first neural network model comprises: performing Fourier transform on the sample speech stream, to obtain a speech spectrum and a speech feature of each audio frame of the sample speech stream; performing speech separation (SS) and speech enhancement (SE) on the speech spectrum based on the first neural network model, to obtain the estimated spectrum; and mapping the speech feature to an embedding space based on the first neural network model, to obtain the embedding feature matrix.
  6. The method according to claim 5, wherein determining the attractor corresponding to the sample speech stream according to the embedding feature matrix and the preset ideal masking matrix comprises: determining an ideal masking matrix according to the speech spectrum and the speech feature; filtering out noise elements in the ideal masking matrix based on a preset binary threshold matrix to obtain the preset ideal masking matrix; and determining the attractor corresponding to the sample speech stream according to the embedding feature matrix and the preset ideal masking matrix.
  7. The method according to claim 1, further comprising: obtaining a second neural network model; performing non-negative constraint processing on the second neural network model, to obtain a non-negative neural network model; obtaining a differential model configured for performing auditory matching on an acoustic feature outputted by the non-negative neural network model; and cascading the differential model and the non-negative neural network model, to obtain the intermediate model.
  8. The method according to claim 7, wherein obtaining the differential model configured for performing auditory matching on the acoustic feature outputted by the non-negative neural network model comprises: obtaining a logarithmic model configured for performing a logarithmic operation on a feature vector corresponding to the acoustic feature; obtaining a difference model configured for performing a difference operation on the feature vector corresponding to the acoustic feature; and constructing the differential model according to the logarithmic model and the difference model.
  9. The method according to claim 1, further comprising: obtaining a sample speech stream and corresponding phoneme categories that are annotated; extracting a depth feature of each audio frame of the sample speech stream by using a third neural network model; determining a center vector of the sample speech stream according to depth features corresponding to audio frames of all the phoneme categories; determining a fusion loss between an inter-class confusion measurement index and an intra-class distance penalty index of each audio frame based on the depth features and the center vector; and training the third neural network model based on the fusion losses, to obtain the speech recognition model.
  10. The method according to claim 9, wherein determining a fusion loss between the inter-class confusion measurement index and the intra-class distance penalty index of each audio frame based on the depth features and the center vector comprises: calculating the inter-class confusion measurement index of each audio frame of the sample speech stream based on the depth features; calculating the intra-class distance penalty index of each audio frame based on the depth features and the center vector; and performing a fusion operation on the inter-class confusion measurement index and the intra-class distance penalty index, to obtain the fusion loss.
  11. The method according to claim 1, wherein jointly training the speech separation and enhancement model, the representation model, and the speech recognition model based on the target loss function comprises: determining a global descent gradient generated by the target loss function; and iteratively updating model parameters corresponding to the speech separation and enhancement model, the representation model, and the speech recognition model according to the global descent gradient, until a minimum loss value of the target loss function is obtained.
  12. The method according to claim 1, further comprising: ending the joint training when a preset convergence condition is met.
  13. A speech recognition method, performed by a computer device, the method comprising: obtaining a target speech stream; extracting an enhanced spectrum of each audio frame of the target speech stream based on a speech separation and enhancement model; performing auditory matching on the enhanced spectrum based on a representation model to obtain a representation feature, wherein the representation model is obtained by training an intermediate model bridging between the speech separation and enhancement model and a speech recognition model; and recognizing the representation feature based on the speech recognition model, to obtain a phoneme corresponding to each audio frame.
  14. The method according to claim 13, wherein the representation model is obtained by performing back propagation based on a first loss function of the speech recognition model to train the intermediate model bridging between the speech separation and enhancement model and the speech recognition model.
  15. The method according to claim 14, wherein the speech separation and enhancement model, the representation model, and the speech recognition model comprise neural networks with network parameters obtained by joint training by iteratively minimizing a fused loss function of the first loss function and a second loss function of the speech separation and enhancement model.
  16. A speech recognition device, comprising a memory for storing computer instructions and a processor for executing the computer instructions to: obtain a first loss function of a speech separation and enhancement model and a second loss function of a speech recognition model; obtain a representation model by training an intermediate model bridging between the speech separation and enhancement model and the speech recognition model; obtain a target loss function by fusing the first loss function and the second loss function; and jointly train the speech separation and enhancement model, the representation model, and the speech recognition model based on the target loss function.
  17. The speech recognition device of claim 16, wherein the processor is further configured to execute the computer instructions to: end the joint training when a preset convergence condition is met.
  18. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, the computer-readable instructions, when executed by the processor, causing the processor to perform the method according to claim 13.
  19. A computer-readable storage medium, storing computer-readable instructions, the computer-readable instructions, when executed by a processor, causing the processor to perform the method according to claim 1.
  20. A computer-readable storage medium, storing computer-readable instructions, the computer-readable instructions, when executed by a processor, causing the processor to perform the method according to claim 13.
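Claims 1, 11, and 16 describe fusing the two loss functions into a target loss and iteratively updating all three models along the global descent gradient until convergence. A minimal numpy sketch of that loop, in which the two toy quadratic losses and the fusion weights `alpha`/`beta` are illustrative assumptions rather than the patent's actual models:

```python
import numpy as np

def target_loss(params, alpha=0.5, beta=0.5):
    """Fused target loss: weighted sum of a front-end (separation/enhancement)
    loss and a back-end (recognition) loss. Both are toy quadratics here."""
    l1 = np.sum((params - 1.0) ** 2)   # stand-in for the first loss function
    l2 = np.sum((params + 1.0) ** 2)   # stand-in for the second loss function
    return alpha * l1 + beta * l2

def grad(params, alpha=0.5, beta=0.5):
    # Global descent gradient of the fused loss w.r.t. the joint parameters.
    return alpha * 2.0 * (params - 1.0) + beta * 2.0 * (params + 1.0)

params = np.full(4, 3.0)   # parameters of all three models, concatenated
lr, prev = 0.1, np.inf
for step in range(1000):
    loss = target_loss(params)
    if abs(prev - loss) < 1e-9:
        break              # preset convergence condition met: end joint training
    prev = loss
    params -= lr * grad(params)
```

With equal weights, the fused loss of this toy problem is minimized at the origin, so the loop settles at `params` near zero; swapping in real network losses changes only the two loss terms and the gradient computation.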
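Claims 4 to 6 outline an attractor-based front end: embeddings are pooled under a thresholded ideal mask to form an attractor, and the target mask is the similarity between each embedding and that attractor. A small numpy sketch under assumed shapes (6 time-frequency bins, 4-dimensional embeddings) and a sigmoid dot-product as the similarity, all of which are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

T_F, D = 6, 4                        # time-frequency bins x embedding dim
V = rng.normal(size=(T_F, D))        # embedding feature matrix (network output)
ideal_mask = rng.uniform(size=T_F)   # ideal masking matrix for the target source
energy = rng.uniform(size=T_F)       # per-bin energy of the mixture spectrum

# Binary threshold matrix: filter out low-energy (noise-dominated) elements.
w = (energy > 0.3).astype(float)
masked = ideal_mask * w              # preset ideal masking matrix

# Attractor: mask-weighted centroid of the embeddings.
attractor = (V * masked[:, None]).sum(axis=0) / (masked.sum() + 1e-8)

# Target masking matrix: similarity of each embedding element to the attractor.
target_mask = 1.0 / (1.0 + np.exp(-(V @ attractor)))   # sigmoid similarity

# Enhanced spectrum: target mask applied to the mixture spectrum.
mixture_spec = rng.uniform(size=T_F)
enhanced_spec = target_mask * mixture_spec
```

In training, the MSE between the network's estimated spectrum and this enhanced spectrum would drive the first neural network model, per claim 4.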
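Claims 7 and 8 cascade a logarithmic operation and a difference operation onto a non-negative network output to perform auditory matching. One plausible reading, sketched below with an assumed first-order time difference and log-plus-delta concatenation (the patent does not fix these exact choices):

```python
import numpy as np

def differential_model(features, eps=1e-6):
    """Auditory matching: log compression followed by a first-order
    difference (delta) along time. `features` must be non-negative,
    e.g. the output of a non-negativity-constrained network."""
    log_feat = np.log(features + eps)        # logarithmic operation
    delta = np.diff(log_feat, axis=0)        # difference operation over frames
    delta = np.vstack([delta[:1], delta])    # pad so the frame count is kept
    return np.concatenate([log_feat, delta], axis=1)

# Non-negative acoustic features: 5 frames x 3 feature dimensions.
acoustic = np.abs(np.random.default_rng(1).normal(size=(5, 3)))
out = differential_model(acoustic)
```

The log step mimics the ear's compressive loudness response, and the delta step captures frame-to-frame dynamics, which is why the two are cascaded rather than used alone.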
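Claims 9 and 10 fuse an inter-class confusion measurement index with an intra-class distance penalty index over depth features and class center vectors. A common instantiation of such a pair is softmax cross-entropy plus a center loss; the sketch below assumes that reading (the patent's exact indices and fusion weight `lam` may differ):

```python
import numpy as np

def fusion_loss(deep_feats, logits, labels, centers, lam=0.1):
    """Fused loss: cross-entropy as the inter-class confusion measure plus a
    center-loss term as the intra-class distance penalty (assumed forms)."""
    # Inter-class confusion: softmax cross-entropy over phoneme categories.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Intra-class distance: squared distance to each frame's class center.
    center = 0.5 * ((deep_feats - centers[labels]) ** 2).sum(axis=1).mean()
    return ce + lam * center

rng = np.random.default_rng(2)
feats = rng.normal(size=(8, 4))               # depth features of 8 audio frames
logits = rng.normal(size=(8, 3))              # scores over 3 phoneme categories
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # annotated phoneme categories
centers = np.stack([feats[labels == c].mean(axis=0) for c in range(3)])
loss = fusion_loss(feats, logits, labels, centers)
```

Minimizing the fused value both separates phoneme classes (cross-entropy) and pulls each frame's depth feature toward its class center, matching the two indices named in claim 10.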

Description

RELATED APPLICATION

This application is a continuation application of and claims the benefit of priority to U.S. patent application Ser. No. 17/583,512 filed on Jan. 25, 2022, which is a continuation of and claims priority to PCT International Application No. PCT/CN2020/128392, entitled “SPEECH RECOGNITION METHOD AND MODEL TRAINING METHOD, APPARATUS, AND COMPUTER-READABLE STORAGE MEDIUM” and filed on Nov. 12, 2020, which claims priority to Chinese Patent Application No. 202010048780.2, entitled “SPEECH RECOGNITION METHOD AND MODEL TRAINING METHOD, APPARATUS, AND COMPUTER-READABLE STORAGE MEDIUM” and filed on Jan. 16, 2020. All of these applications are incorporated by reference in their entireties.

FIELD OF THE TECHNOLOGY

This application relates to the field of speech processing technologies, and in particular, to a speech recognition method and apparatus, and a computer-readable storage medium.

BACKGROUND OF THE DISCLOSURE

The development of speech recognition technologies makes it possible for humans to interact with machines by using natural language. A speech signal may be converted into a text sequence based on the speech recognition technologies. To realize such conversion, front-end processing such as speech separation (SS) and speech enhancement (SE) may be performed on a received speech signal, and then back-end processing such as automatic speech recognition (ASR) may be performed on an acoustic feature obtained through the front-end processing. In conventional technologies, speech separation and speech enhancement are performed on the speech signal by using a speech separation and enhancement model, and then speech recognition is performed by using a speech recognition model. However, such a method usually offers low accuracy in speech recognition.

SUMMARY

Embodiments of this application provide a speech recognition method and apparatus, and a computer-readable storage medium.
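The front end described above starts from a per-frame Fourier transform of the speech signal (as in claim 5). A minimal framing-and-FFT sketch; the 256-sample frame, 128-sample hop, 16 kHz rate, and Hann window are illustrative choices, not values from this application:

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Frame the waveform and take a per-frame Fourier transform, yielding
    the magnitude spectrum each audio frame contributes to the front end."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)

# 0.25 s of a 440 Hz tone at a 16 kHz sampling rate as a test input.
wave = np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
spec = stft_magnitude(wave)
```

The separation/enhancement and recognition stages then operate on such per-frame spectra rather than on the raw waveform.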
A speech recognition method, performed by a computer device, the method including: obtaining a first loss function of a speech separation and enhancement model and a second loss function of a speech recognition model; performing back propagation based on the second loss function to train an intermediate model bridging between the speech separation and enhancement model and the speech recognition model, to obtain a robust representation model; fusing the first loss function and the second loss function, to obtain a target loss function; and jointly training the speech separation and enhancement model, the robust representation model, and the speech recognition model based on the target loss function, and ending the joint training when a preset convergence condition is met.

A speech recognition apparatus, including: an intermediate representation learning module, configured to obtain a first loss function of a speech separation and enhancement model and a second loss function of a speech recognition model, and perform back propagation based on the second loss function to train an intermediate model bridging between the speech separation and enhancement model and the speech recognition model, to obtain a robust representation model; a loss fusion module, configured to fuse the first loss function and the second loss function, to obtain a target loss function; and a joint training module, configured to jointly train the speech separation and enhancement model, the robust representation model, and the speech recognition model based on the target loss function, and end the training when a preset convergence condition is met.
A speech recognition method, performed by a computer device, the method including: obtaining a target speech stream; extracting an enhanced spectrum of each audio frame of the target speech stream based on a speech separation and enhancement model; performing auditory matching on the enhanced spectrum based on a robust representation model to obtain a representation feature; and recognizing the representation feature based on a speech recognition model, to obtain a phoneme corresponding to each audio frame, the speech separation and enhancement model, the robust representation model, and the speech recognition model being obtained by joint training.

A speech recognition apparatus, including: a speech separation and enhancement module, configured to obtain a target speech stream, and extract an enhanced spectrum of each audio frame of the target speech stream based on a speech separation and enhancement model; an intermediate representation transition module, configured to perform auditory matching on the enhanced spectrum based on a robust representation model to obtain a representation feature; and a speech recognition module, configured to recognize the representation feature based on a speech recognition model, to obtain a phoneme corresponding to each audio frame, the speech separation and enhancement model, the robust representation model, and the speech recognition model being obtained by joint training.
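The inference path above chains the three jointly trained models: enhanced spectrum, then auditory-matched representation, then per-frame phoneme. The sketch below wires stub stand-ins together to show the data flow only; the stub bodies (FFT magnitude, log compression, a random classifier head, 40 phoneme classes) are assumptions, not the patent's networks:

```python
import numpy as np

rng = np.random.default_rng(3)

def separation_enhancement(frames):
    """Stub front end: enhanced magnitude spectrum per audio frame."""
    return np.abs(np.fft.rfft(frames, axis=1))

def representation(enhanced, eps=1e-6):
    """Stub robust representation model: auditory matching via log compression."""
    return np.log(enhanced + eps)

def recognition(features, n_phonemes=40):
    """Stub back end: one phoneme id per frame via a random classifier head."""
    proj = rng.normal(size=(features.shape[1], n_phonemes))
    return (features @ proj).argmax(axis=1)

frames = rng.normal(size=(10, 160))   # 10 audio frames of a target speech stream
phonemes = recognition(representation(separation_enhancement(frames)))
```

Because the three stages are trained jointly against the fused target loss, the representation stage learns features that serve recognition rather than merely reconstructing a clean spectrum.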