US-12626687-B2 - Speech recognition method, apparatus and device, and storage medium

US 12626687 B2

Abstract

Provided in the present application are a speech recognition method, apparatus and device, and a storage medium. The method comprises: acquiring a speech feature of a target mixed speech and a speaker feature of a target speaker; extracting, with the direction of tending towards a target speech feature as the extraction direction, a speech feature of the target speaker from the speech feature of the target mixed speech according to the speech feature of the target mixed speech and the speaker feature of the target speaker, so as to obtain an extracted speech feature of the target speaker; and acquiring a speech recognition result of the target speaker according to the extracted speech feature of the target speaker.

Inventors

  • Xin Fang
  • Junhua Liu

Assignees

  • UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA
  • IFLYTEK CO., LTD.

Dates

Publication Date
May 12, 2026
Application Date
Nov. 10, 2021
Priority Date
Sep. 7, 2021

Claims (18)

  1. A speech recognition method, comprising: obtaining a speech feature of a target mixed speech and a speaker feature of a target speaker; extracting, by directing the extraction towards a target speech feature, a speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker, to obtain the extracted speech feature of the target speaker, wherein the target speech feature is a speech feature for obtaining a speech recognition result consistent with real speech content of the target speaker; and obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker; wherein a feature extraction model and a speech recognition model are pre-established, and joint training is performed on the speech recognition model and the feature extraction model; wherein the joint training performed on the speech recognition model and the feature extraction model further comprises: extracting, by using the feature extraction model, a speech feature of a designated speaker from a speech feature of a training mixed speech, to obtain an extracted speech feature of the designated speaker, wherein the training mixed speech corresponds to a speech of the designated speaker; obtaining a speech recognition result of the designated speaker by using the speech recognition model and the extracted speech feature of the designated speaker; obtaining an annotated text of the speech of the designated speaker; obtaining a speech feature of the speech of the designated speaker as a standard speech feature of the designated speaker; determining a first prediction loss based on the extracted speech feature of the designated speaker and the standard speech feature of the designated speaker; determining a second prediction loss based on the speech recognition result of the designated speaker and the annotated text of the speech of the designated speaker; performing a parameter update on the feature extraction model based on the first prediction loss and the second prediction loss; and performing a parameter update on the speech recognition model based on the second prediction loss.
  2. The speech recognition method according to claim 1, wherein the obtaining the speaker feature of the target speaker comprises: obtaining a registered speech of the target speaker; and extracting a short-term voiceprint feature and a long-term voiceprint feature from the registered speech of the target speaker to obtain a multi-scale voiceprint feature as the speaker feature of the target speaker.
  3. The speech recognition method according to claim 1, wherein the extracting, by directing the extraction towards the target speech feature, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker further comprises: extracting, by using the pre-established feature extraction model, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker; wherein the feature extraction model is trained, by using a speech feature of a training mixed speech and a speaker feature of a designated speaker, with a speech recognition result obtained based on an extracted speech feature of the designated speaker as an optimization objective, wherein the speech feature of the training mixed speech comprises a speech of the designated speaker, and the extracted speech feature of the designated speaker is a speech feature of the designated speaker extracted from the speech feature of the training mixed speech.
  4. The speech recognition method according to claim 3, wherein the feature extraction model is trained using both the extracted speech feature of the designated speaker and the speech recognition result obtained based on the extracted speech feature of the designated speaker as the optimization objective.
  5. The speech recognition method according to claim 3, wherein the extracting, by using the pre-established feature extraction model, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker further comprises: inputting the speech feature of the target mixed speech and the speaker feature of the target speaker into the feature extraction model to obtain a feature mask corresponding to the target speaker; and extracting the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the feature mask corresponding to the target speaker.
  6. The speech recognition method according to claim 1, wherein obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker further comprises: obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker and a registered speech feature of the target speaker; wherein the registered speech feature of the target speaker is a speech feature of a registered speech of the target speaker.
  7. The speech recognition method according to claim 3, wherein obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker further comprises: inputting a speech recognition input feature at least comprising the extracted speech feature of the target speaker into the pre-established speech recognition model to obtain the speech recognition result of the target speaker; wherein the speech recognition model is trained, by using the extracted speech feature of the designated speaker, with the speech recognition result obtained based on the extracted speech feature of the designated speaker as an optimization objective.
  8. The speech recognition method according to claim 7, wherein inputting the speech recognition input feature into the speech recognition model to obtain the speech recognition result of the target speaker further comprises: encoding the speech recognition input feature based on an encoder module in the speech recognition model to obtain an encoded result; extracting, from the encoded result, an audio-related feature vector required for decoding at a decoding time instant, based on an attention module in the speech recognition model; and decoding the audio-related feature vector extracted from the encoded result based on a decoder module in the speech recognition model to obtain a recognition result at the decoding time instant.
  9. The speech recognition method according to claim 1, wherein the training mixed speech and the speech of the designated speaker corresponding to the training mixed speech are obtained from a pre-constructed training dataset; and a construction of the training dataset comprises: obtaining a plurality of speeches from a plurality of speakers, wherein each speech is a speech from a single speaker and has an annotated text; using each speech of part or all of the plurality of speeches as the speech of the designated speaker; mixing the speech of the designated speaker with one or more speeches of the plurality of speakers other than the designated speaker among the remainder of the plurality of speeches to obtain the training mixed speech; determining the speech of the designated speaker and the training mixed speech obtained from mixing as a piece of training data; and forming the training dataset with a plurality of obtained pieces of training data.
  10. A speech recognition apparatus, comprising a feature obtaining module, a feature extraction module and a speech recognition module, wherein the feature obtaining module is configured to obtain a speech feature of a target mixed speech and a speaker feature of a target speaker; the feature extraction module is configured to extract, by directing an extraction towards a target speech feature, a speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker, to obtain the extracted speech feature of the target speaker, wherein the target speech feature is a speech feature for obtaining a speech recognition result consistent with real speech content of the target speaker; and the speech recognition module is configured to obtain the speech recognition result of the target speaker based on the extracted speech feature of the target speaker; wherein a feature extraction model and a speech recognition model are pre-established, and joint training is performed on the speech recognition model and the feature extraction model; wherein the joint training performed on the speech recognition model and the feature extraction model further comprises: extracting, by using the feature extraction model, a speech feature of a designated speaker from a speech feature of a training mixed speech, to obtain an extracted speech feature of the designated speaker, wherein the training mixed speech corresponds to a speech of the designated speaker; obtaining a speech recognition result of the designated speaker by using the speech recognition model and the extracted speech feature of the designated speaker; obtaining an annotated text of the speech of the designated speaker; obtaining a speech feature of the speech of the designated speaker as a standard speech feature of the designated speaker; determining a first prediction loss based on the extracted speech feature of the designated speaker and the standard speech feature of the designated speaker; determining a second prediction loss based on the speech recognition result of the designated speaker and the annotated text of the speech of the designated speaker; performing a parameter update on the feature extraction model based on the first prediction loss and the second prediction loss; and performing a parameter update on the speech recognition model based on the second prediction loss.
  11. The speech recognition apparatus according to claim 10, wherein the feature obtaining module comprises: a speaker feature obtaining module, configured to obtain a registered speech of the target speaker; and extract a short-term voiceprint feature and a long-term voiceprint feature from the registered speech of the target speaker to obtain a multi-scale voiceprint feature as the speaker feature of the target speaker.
  12. The speech recognition apparatus according to claim 10, wherein the feature extraction module is further configured to: extract, by using the pre-established feature extraction model, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker; wherein the feature extraction model is trained, by using a speech feature of a training mixed speech and a speaker feature of a designated speaker, with a speech recognition result obtained based on an extracted speech feature of the designated speaker as an optimization objective, wherein the speech feature of the training mixed speech comprises a speech of the designated speaker, and the extracted speech feature of the designated speaker is a speech feature of the designated speaker extracted from the speech feature of the training mixed speech.
  13. The speech recognition apparatus according to claim 10, wherein the speech recognition module is further configured to: obtain the speech recognition result of the target speaker based on the extracted speech feature of the target speaker and a registered speech feature of the target speaker; and wherein the registered speech feature of the target speaker is a speech feature of a registered speech of the target speaker.
  14. The speech recognition apparatus according to claim 12, wherein the speech recognition module is further configured to: input a speech recognition input feature at least comprising the extracted speech feature of the target speaker into the pre-established speech recognition model to obtain the speech recognition result of the target speaker; wherein the speech recognition model is trained, by using the extracted speech feature of the designated speaker, with the speech recognition result obtained based on the extracted speech feature of the designated speaker as an optimization objective.
  15. A speech recognition device, comprising a memory and a processor; wherein the memory is configured to store a program; and the processor is configured to execute the program to implement a speech recognition method, the speech recognition method comprising: obtaining a speech feature of a target mixed speech and a speaker feature of a target speaker; extracting, by directing the extraction towards a target speech feature, a speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker, to obtain the extracted speech feature of the target speaker; wherein the target speech feature is a speech feature for obtaining a speech recognition result consistent with real speech content of the target speaker; and obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker; wherein a feature extraction model and a speech recognition model are pre-established, and joint training is performed on the speech recognition model and the feature extraction model; wherein the joint training performed on the speech recognition model and the feature extraction model further comprises: extracting, by using the feature extraction model, a speech feature of a designated speaker from a speech feature of a training mixed speech, to obtain an extracted speech feature of the designated speaker, wherein the training mixed speech corresponds to a speech of the designated speaker; obtaining a speech recognition result of the designated speaker by using the speech recognition model and the extracted speech feature of the designated speaker; obtaining an annotated text of the speech of the designated speaker; obtaining a speech feature of the speech of the designated speaker as a standard speech feature of the designated speaker; determining a first prediction loss based on the extracted speech feature of the designated speaker and the standard speech feature of the designated speaker; determining a second prediction loss based on the speech recognition result of the designated speaker and the annotated text of the speech of the designated speaker; performing a parameter update on the feature extraction model based on the first prediction loss and the second prediction loss; and performing a parameter update on the speech recognition model based on the second prediction loss.
  16. A computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor, implements the speech recognition method according to claim 1.
  17. The speech recognition method according to claim 4, wherein the process of the extracting, by using the pre-established feature extraction model, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker comprises: inputting the speech feature of the target mixed speech and the speaker feature of the target speaker into the feature extraction model to obtain a feature mask corresponding to the target speaker; and extracting the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the feature mask corresponding to the target speaker.
  18. The speech recognition method according to claim 4, wherein the process of obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker comprises: inputting a speech recognition input feature at least comprising the extracted speech feature of the target speaker into the pre-established speech recognition model to obtain the speech recognition result of the target speaker; wherein the speech recognition model is trained, by using the extracted speech feature of the designated speaker, with the speech recognition result obtained based on the extracted speech feature of the designated speaker as an optimization objective.
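The joint training recited in the claims above combines two losses: a first prediction loss between the extracted speech feature and the standard (clean) speech feature, and a second prediction loss between the recognition result and the annotated text; the extractor is updated on both losses, the recognizer on the second only. A minimal, hypothetical sketch with plain-Python stand-ins (features as lists of floats; `recognition_loss` here is a token-mismatch rate standing in for the usual cross-entropy, which the patent does not specify):

```python
def mse(a, b):
    """First prediction loss: distance between the extracted speech
    feature and the standard (clean) speech feature of the designated
    speaker."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def recognition_loss(predicted, annotated):
    """Second prediction loss: a stand-in for the loss between the
    recognition result and the annotated text (here, the fraction of
    mismatched tokens)."""
    return sum(p != t for p, t in zip(predicted, annotated)) / len(annotated)

def joint_losses(extracted_feat, standard_feat, recognized, annotated):
    """Return (extractor loss, recognizer loss): the feature extraction
    model is updated on the sum of both prediction losses, while the
    speech recognition model is updated on the second loss only."""
    l1 = mse(extracted_feat, standard_feat)
    l2 = recognition_loss(recognized, annotated)
    return l1 + l2, l2
```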

Description

This application is the national phase of International Application No. PCT/CN2021/129733, titled “SPEECH RECOGNITION METHOD, APPARATUS AND DEVICE, AND STORAGE MEDIUM”, filed on Nov. 10, 2021, which claims priority to Chinese Patent Application No. 202111042821.8, titled “SPEECH RECOGNITION METHOD, APPARATUS AND DEVICE, AND STORAGE MEDIUM”, filed on Sep. 7, 2021 with the China National Intellectual Property Administration (CNIPA), both of which are incorporated herein by reference in their entireties.

FIELD

The present disclosure relates to the technical field of speech recognition, and in particular to a speech recognition method, a speech recognition apparatus, a speech recognition device, and a storage medium.

BACKGROUND

With the rapid development of artificial intelligence technology, smart devices are playing an increasingly important role in daily life. Speech interaction, as the most convenient and natural way of human-computer interaction, is the most favored among users. When using a smart device, a user may be in a complex environment with speech from others, in which case the speech collected by the smart device is a mixed speech. In speech interaction, the speech content of a target speaker needs to be recognized from the mixed speech in order to provide a better user experience. How to recognize the speech content of the target speaker from the mixed speech is a problem to be solved.

SUMMARY

In view of this, a speech recognition method, a speech recognition apparatus, a speech recognition device, and a storage medium are provided according to the present disclosure, which can more accurately recognize the speech content of a target speaker from a mixed speech. The technical solutions are described below.
A speech recognition method is provided, including: obtaining a speech feature of a target mixed speech and a speaker feature of a target speaker; extracting, by directing the extraction towards a target speech feature, a speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker, to obtain the extracted speech feature of the target speaker; where the target speech feature is a speech feature for obtaining a speech recognition result consistent with real speech content of the target speaker; and obtaining the speech recognition result of the target speaker based on the extracted speech feature of the target speaker.

In an embodiment, the obtaining the speaker feature of the target speaker includes: obtaining a registered speech of the target speaker; and extracting a short-term voiceprint feature and a long-term voiceprint feature from the registered speech of the target speaker to obtain a multi-scale voiceprint feature as the speaker feature of the target speaker.
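The multi-scale voiceprint feature above combines a short-term and a long-term voiceprint component. The disclosure does not specify how each component is computed; the sketch below assumes, purely for illustration, a short-term component taken as a mean over a small window of frames and a long-term component taken as a mean over the whole registered speech, concatenated into one vector:

```python
def multi_scale_voiceprint(frames, win=3):
    """Hypothetical multi-scale voiceprint: concatenate a short-term
    statistic (mean over the first `win` frames) with a long-term
    statistic (mean over all frames of the registered speech).
    `frames` is a list of equal-length feature vectors (lists of floats)."""
    dim = len(frames[0])
    n_short = min(win, len(frames))
    short = [sum(f[d] for f in frames[:win]) / n_short for d in range(dim)]
    long_ = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return short + long_  # concatenated multi-scale feature
```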
In an embodiment, the process of extracting, by directing the extraction towards the target speech feature, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker includes: extracting, by using a pre-established feature extraction model, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker; where the feature extraction model is trained, by using a speech feature of a training mixed speech and a speaker feature of a designated speaker, with a speech recognition result obtained based on an extracted speech feature of the designated speaker as an optimization objective, where the speech feature of the training mixed speech includes a speech of the designated speaker, and the extracted speech feature of the designated speaker is a speech feature of the designated speaker extracted from the speech feature of the training mixed speech.

In an embodiment, the feature extraction model is trained using both the extracted speech feature of the designated speaker and the speech recognition result obtained based on the extracted speech feature of the designated speaker as the optimization objective.
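The training mixed speech referred to above is, per claim 9, constructed by mixing the designated speaker's speech with one or more speeches from other speakers. A hedged sketch of building one piece of training data, assuming waveforms are plain lists of samples and mixing is sample-wise addition (the actual mixing procedure is not specified in this disclosure):

```python
def make_training_pair(designated_wav, interfering_wavs):
    """Construct one piece of training data: the designated speaker's
    clean speech ('clean') paired with the training mixed speech
    ('mixed'), obtained by sample-wise addition of interfering
    speakers' waveforms. Shorter waveforms are zero-padded."""
    n = max([len(designated_wav)] + [len(w) for w in interfering_wavs])
    mixed = []
    for i in range(n):
        sample = designated_wav[i] if i < len(designated_wav) else 0.0
        for w in interfering_wavs:
            if i < len(w):
                sample += w[i]
        mixed.append(sample)
    return {"clean": designated_wav, "mixed": mixed}
```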
In an embodiment, the process of the extracting, by using the pre-established feature extraction model, the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the speaker feature of the target speaker includes: inputting the speech feature of the target mixed speech and the speaker feature of the target speaker into the feature extraction model to obtain a feature mask corresponding to the target speaker; and extracting the speech feature of the target speaker from the speech feature of the target mixed speech, based on the speech feature of the target mixed speech and the feature mask corresponding to the target speaker. In an embodiment, the process of
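The mask-based extraction described above can be illustrated with a minimal sketch, assuming (the disclosure does not specify the operation) that the feature mask is applied element-wise to the mixed-speech feature:

```python
def apply_feature_mask(mixed_feature, mask):
    """Hypothetical element-wise masking: each mask value in [0, 1]
    scales the corresponding element of the mixed-speech feature,
    keeping the portions attributed to the target speaker and
    suppressing the rest."""
    return [w * m for w, m in zip(mixed_feature, mask)]
```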