CN-116597826-B - Speech recognition method, device, terminal equipment and computer readable storage medium

Abstract

The application belongs to the technical field of terminals, and in particular relates to a voice recognition method, a voice recognition device, a terminal device, and a computer readable storage medium. In the method, after the terminal device acquires the voice to be recognized, it may recognize that voice using an end-to-end voice recognition model to obtain a recognition result. For each decoding of the voice recognition model, the terminal device may determine a first probability corresponding to the candidate decoding result according to an N-gram language model, so that the voice recognition model can determine the candidate decoding result obtained by the next decoding according to that first probability. Because the first probability is determined according to the N-gram language model, the decoding result conforms to grammatical structure, decoding errors caused by inaccurate pronunciation such as accents are reduced, and both the accuracy of voice recognition and the user experience are improved.

Inventors

YANG XIANJIE
HUANG DONGYAN

Assignees

深圳市优必选科技股份有限公司

Dates

Publication Date: 20260508
Application Date: 20230530

Claims (6)

1. A voice recognition method applied to a terminal device, the method comprising: the terminal device acquires voice to be recognized; the terminal device performs voice recognition on the voice to be recognized by using a voice recognition model to obtain a recognition result; when the voice recognition model is used to perform voice recognition on the voice to be recognized, the terminal device determines a first probability corresponding to a candidate decoding result according to an N-gram language model, wherein the voice recognition model is an end-to-end voice recognition model, and the first probability corresponding to the candidate decoding result is used for determining the candidate decoding result obtained by the next decoding of the voice recognition model; wherein the terminal device determining the first probability corresponding to the candidate decoding result according to the N-gram language model comprises: the terminal device acquires a second probability corresponding to the candidate decoding result, wherein the second probability corresponding to the candidate decoding result is determined by the voice recognition model; the terminal device determines a first weight corresponding to the candidate decoding result according to the N-gram language model; the terminal device determines the length of the candidate decoding result and determines a scaling value corresponding to the first weight according to the length of the candidate decoding result; and the terminal device determines the first probability corresponding to the candidate decoding result according to the first weight, the scaling value corresponding to the first weight, and the second probability corresponding to the candidate decoding result.
2. The method of claim 1, wherein the terminal device determining the first weight corresponding to the candidate decoding result according to the N-gram language model comprises: when the candidate decoding result exists in the N-gram language model, the terminal device determines a second weight corresponding to the candidate decoding result in the N-gram language model, and determines the first weight corresponding to the candidate decoding result according to the second weight; and when the candidate decoding result does not exist in the N-gram language model, the terminal device determines that the first weight corresponding to the candidate decoding result is a preset value.
3. The method according to any one of claims 1 to 2, further comprising: the terminal device acquires a target text, wherein the target text and the text content corresponding to the voice to be recognized belong to the same field; and the terminal device trains the N-gram language model with the target text to obtain the trained N-gram language model.
4. A speech recognition apparatus for use in a terminal device, the apparatus comprising: a voice acquisition module, used for acquiring voice to be recognized; a voice recognition module, used for performing voice recognition on the voice to be recognized by using a voice recognition model to obtain a recognition result; wherein the voice recognition module is used for determining a first probability corresponding to a candidate decoding result according to an N-gram language model when the voice recognition module is used for performing voice recognition on the voice to be recognized, wherein the first probability corresponding to the candidate decoding result is used for determining a candidate decoding result obtained by the next decoding of the voice recognition module; and the voice recognition module is further configured to obtain a second probability corresponding to the candidate decoding result, wherein the second probability corresponding to the candidate decoding result is determined by the voice recognition model, determine a first weight corresponding to the candidate decoding result according to the N-gram language model, determine a length of the candidate decoding result, determine a scaling value corresponding to the first weight according to the length of the candidate decoding result, and determine the first probability corresponding to the candidate decoding result according to the first weight, the scaling value corresponding to the first weight, and the second probability corresponding to the candidate decoding result.
5. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any one of claims 1 to 3 when executing the computer program.
6. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 3.
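
For illustration only, the scoring chain recited in claims 1 to 3 might be sketched in Python as follows. The claims do not fix any formula, so everything concrete here is an assumption: the bigram order, the reading of "exists in the N-gram language model" as "every bigram of the candidate was seen in training", the first weight being taken directly from the second weight, the inverse-length scaling, the additive log-domain combination, and the names `PRESET_WEIGHT`, `train_bigram`, `first_weight`, `scaling_value` and `first_probability`.

```python
import math
from collections import Counter

# Hypothetical preset value used when a candidate is absent from the N-gram model.
PRESET_WEIGHT = -10.0


def train_bigram(sentences):
    """Count unigrams and bigrams in whitespace-tokenized target (in-domain) text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def first_weight(candidate, model):
    """Return the bigram log-probability of the candidate (the 'second weight'),
    or the preset value if any of its bigrams was never seen in training."""
    unigrams, bigrams = model
    tokens = ["<s>"] + list(candidate)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        count = bigrams[(prev, cur)]
        if count == 0:
            return PRESET_WEIGHT
        total += math.log(count / unigrams[prev])
    return total


def scaling_value(length, base=0.3):
    """Assumed scaling rule: shrink the language-model weight as the candidate grows."""
    return base / max(length, 1)


def first_probability(candidate, second_log_prob, model):
    """Combine the recognizer's score (second probability, log domain) with the
    scaled N-gram weight to obtain the score used to rank the candidate."""
    weight = first_weight(candidate, model)
    scale = scaling_value(len(candidate))
    return second_log_prob + scale * weight
```

With a model trained on `["turn on the light", "turn off the light"]`, the candidates `("turn", "on")` and `("turn", "of")` given the same recognizer score end up ranked differently: the first is found in the model, while the second falls back to the preset value and is pushed down.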

Description

Speech recognition method, device, terminal equipment and computer readable storage medium

Technical Field

The present application belongs to the technical field of terminals, and in particular relates to a voice recognition method, a voice recognition device, a terminal device, and a computer readable storage medium.

Background

End-to-end speech recognition techniques use a neural network model to convert speech directly into text, for example an attention-based neural network model. An attention-based neural network model generally includes an encoder and an attention-based decoder. The encoder converts the sequence of speech features into a sequence of hidden state vectors. The decoder works autoregressively, using the attention mechanism to focus on part of the hidden state vector sequence, and outputs the decoding result one step at a time. During decoding, beam search is generally used to keep the few words or sentences with the highest probability as candidate decoding results. In other words, the neural network model searches the decoding space based on probability, which introduces uncertainty: when pronunciation is inaccurate, for example because of an accent, the model may output decoding results that do not conform to grammar, which lowers the accuracy of voice recognition.

Disclosure of Invention

The embodiments of the application provide a voice recognition method, a voice recognition device, a terminal device, and a computer readable storage medium, which can solve the problem of low voice recognition accuracy.
In a first aspect, an embodiment of the present application provides a voice recognition method applied to a terminal device, where the method may include:

the terminal device acquires voice to be recognized;

the terminal device performs voice recognition on the voice to be recognized by using a voice recognition model to obtain a recognition result;

when the voice recognition model is used to perform voice recognition on the voice to be recognized, for each decoding of the voice recognition model, the terminal device determines a first probability corresponding to a candidate decoding result according to an N-gram language model, wherein the first probability corresponding to the candidate decoding result is used for determining the candidate decoding result obtained by the next decoding of the voice recognition model.

In the above voice recognition method, after the terminal device acquires the voice to be recognized, it may perform voice recognition on that voice by using the end-to-end voice recognition model to obtain a recognition result. During this recognition, for each decoding of the voice recognition model, the terminal device can determine the first probability corresponding to the candidate decoding result according to the N-gram language model, so that the voice recognition model can determine the candidate decoding result obtained by the next decoding according to that first probability.

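
As a rough sketch of how a per-decoding first probability could steer the next decoding, the toy beam search below rescores every expanded candidate through a pluggable `rescore` callback and keeps only the top-ranked ones for the next step. The function `beam_search`, the callback signature, the fixed per-step token tables in `steps`, and the `penalize_unknown` rule are all hypothetical stand-ins: a real end-to-end decoder would condition each step on the prefix, and a real rescoring function would query the N-gram language model.

```python
import math


def beam_search(step_log_probs, rescore, beam_width=2):
    """Toy beam search.

    step_log_probs: one dict per decoding step, mapping token -> log-probability
        from the recognition model (assumed independent of the prefix for brevity).
    rescore: callable (candidate, second_log_prob) -> first probability (log domain).
    """
    beams = [((), 0.0)]  # (prefix, accumulated recognizer log-probability)
    for log_probs in step_log_probs:
        expanded = []
        for prefix, model_score in beams:
            for token, log_prob in log_probs.items():
                candidate = prefix + (token,)
                second = model_score + log_prob     # score from the recognizer
                first = rescore(candidate, second)  # score after language-model fusion
                expanded.append((first, candidate, second))
        # The first probability decides which candidates survive to the next decoding.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = [(cand, second) for _, cand, second in expanded[:beam_width]]
    return beams[0][0]


# Two decoding steps in which the recognizer slightly prefers the ungrammatical "of".
steps = [
    {"turn": math.log(0.9), "tern": math.log(0.1)},
    {"of": math.log(0.55), "on": math.log(0.45)},
]


def penalize_unknown(candidate, second_log_prob):
    """Hypothetical rescoring rule: penalize tokens outside a small in-domain vocabulary."""
    known = {"turn", "on"}
    penalty = 0.0 if candidate[-1] in known else 2.0
    return second_log_prob - penalty
```

Calling `beam_search(steps, rescore=lambda c, s: s)` ranks purely by the recognizer's scores, whereas `beam_search(steps, rescore=penalize_unknown)` lets the fused score choose the survivors at every step, which is the behaviour the first aspect describes.
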
In the embodiment of the application, when voice recognition is carried out through the end-to-end voice recognition model, the first probability corresponding to the candidate decoding result can be determined according to the N-gram language model, so that the decoding result conforms to grammatical structure, decoding errors caused by inaccurate pronunciation such as accents are reduced, the accuracy of voice recognition is improved, and the user experience is improved.

In a possible implementation manner, the terminal device determining the first probability corresponding to the candidate decoding result according to the N-gram language model may include:

the terminal device acquires a second probability corresponding to the candidate decoding result, wherein the second probability corresponding to the candidate decoding result is determined by the voice recognition model;

and the terminal device determines the first probability corresponding to the candidate decoding result according to the N-gram language model and the second probability corresponding to the candidate decoding result.

Illustratively, the terminal device determining the first probability corresponding to the candidate decoding result according to the N-gram language model and the second probability corresponding to the candidate decoding result may include:

the terminal device determines a first weight corresponding to the candidate decoding result according to the N-gram language model;

and the terminal device determines the first probability corresponding to the candidate decoding result according to the second probability and the first weight corresponding to the candidate decoding result.

Optionally, the