
CN-116434771-B - Speech recognition method, device, system, electronic equipment and readable storage medium

CN116434771B

Abstract

The invention provides a speech recognition method, apparatus, system, electronic device, and readable storage medium. A plurality of audio clips of an audio stream are sequentially acquired at preset time intervals, each audio clip is recognized to obtain a plurality of recognition results, the recognition results are spliced into a plurality of candidate recognition result sequences corresponding to the audio stream, and a target recognition result sequence is selected from the candidate sequences. When the target recognition result sequence is corrected, the candidate recognition result sequences and the audio stream are decoded in a first order and a second order respectively. The first order is left-to-right decoding, so each audio clip is decoded in combination with the information before it; the second order is right-to-left decoding, so each audio clip is decoded in combination with the information after it. The correction can therefore use the complete context of the audio stream, improving the accuracy of the speech recognition model.

Inventors

  • Ying Yile

Assignees

  • Beijing ESWIN Computing Technology Co., Ltd. (北京奕斯伟计算技术股份有限公司)

Dates

Publication Date
2026-05-08
Application Date
2023-04-14

Claims (12)

  1. A speech recognition method, the method comprising: sequentially acquiring a plurality of audio clips of an audio stream at preset time intervals, and generating audio features corresponding to each audio clip; acquiring a plurality of recognition results corresponding to each of the plurality of audio features according to the acquisition time sequence of the audio clips; splicing the plurality of recognition results to obtain a plurality of candidate recognition result sequences corresponding to the audio stream, and selecting a target recognition result sequence from the plurality of candidate recognition result sequences; decoding the plurality of candidate recognition result sequences and the audio stream in a first order to obtain a first recognition result, wherein the first order is the acquisition time order of the audio clips; decoding the plurality of candidate recognition result sequences and the audio stream in a second order to obtain a second recognition result, wherein the second order is opposite to the first order; and correcting the target recognition result sequence according to the first recognition result and the second recognition result, and displaying the corrected target recognition result sequence.
  2. The method of claim 1, wherein acquiring a plurality of recognition results corresponding to each of the plurality of audio features comprises: encoding the plurality of audio features to generate a plurality of speech encoding features; and decoding the plurality of speech encoding features to generate a plurality of recognition results corresponding to each audio clip.
  3. The method of claim 2, wherein splicing the plurality of recognition results to obtain a plurality of candidate recognition result sequences corresponding to the audio stream and selecting a target recognition result sequence from the plurality of candidate recognition result sequences comprises: selecting any one of the plurality of recognition results corresponding to each audio clip as an intermediate recognition result, and splicing the intermediate recognition results corresponding to each audio clip to form a candidate recognition result sequence; and selecting, from all the candidate recognition result sequences, the candidate recognition result sequence with the largest matching value with the audio stream as the target recognition result sequence.
  4. The method of claim 3, wherein selecting, from all the candidate recognition result sequences, the candidate recognition result sequence with the largest matching value with the audio stream as the target recognition result sequence comprises: obtaining matching values between the plurality of recognition results and their corresponding audio clips; selecting the recognition result with the highest matching value as the real-time recognition result of each audio clip; and splicing the real-time recognition results corresponding to each audio clip into a candidate recognition result sequence, which serves as the target recognition result sequence.
  5. The method of claim 1, wherein decoding the plurality of candidate recognition result sequences and the audio stream in a first order to obtain a first recognition result comprises: traversing each recognition result in each candidate recognition result sequence in the first order, and performing a decoding calculation on the traversed recognition result, the other recognition results before it, and the encoding result of the audio stream, to obtain a first recognition result corresponding to each candidate recognition result sequence; and wherein decoding the candidate recognition result sequences and the audio stream in a second order to obtain a second recognition result comprises: traversing each recognition result in each candidate recognition result sequence in the second order, and performing a decoding calculation on the traversed recognition result, the other recognition results after it, and the encoding result of the audio stream, to obtain a second recognition result corresponding to each candidate recognition result sequence.
  6. The method of claim 1, wherein correcting the target recognition result sequence according to the first recognition result and the second recognition result comprises: determining a target weight value for each candidate recognition result sequence according to the candidate recognition result sequence and its first recognition result and second recognition result; and selecting the candidate recognition result sequence with the maximum target weight value as the corrected target recognition result sequence.
  7. The method of claim 6, wherein determining the target weight value for each candidate recognition result sequence according to the candidate recognition result sequence and its first recognition result and second recognition result comprises: for each candidate recognition result sequence, acquiring, based on an attention mechanism, weight information corresponding to the candidate recognition result sequence, each first recognition result, and each second recognition result, together with the matching value between the candidate recognition result sequence and the audio stream; and performing, according to the weight information, a weighted summation of the matching values corresponding to the candidate recognition result sequence, the first recognition result, and the second recognition result, to determine the target weight value of each candidate recognition result sequence.
  8. The method of claim 4, further comprising: displaying each real-time recognition result, and displaying the target recognition result sequence corresponding to the audio stream after the last audio clip of the audio stream has been recognized; and after the target recognition result sequence has been corrected, displaying the corrected target recognition result sequence.
  9. A speech recognition apparatus, the apparatus comprising: an acquisition module for sequentially acquiring a plurality of audio clips of an audio stream at preset time intervals and generating audio features corresponding to each audio clip; a first recognition module for acquiring a plurality of recognition results corresponding to each of the plurality of audio features according to the acquisition time sequence of the audio clips, splicing the plurality of recognition results to obtain a plurality of candidate recognition result sequences corresponding to the audio stream, and selecting a target recognition result sequence from the plurality of candidate recognition result sequences; a second recognition module for decoding the candidate recognition result sequences and the audio stream in a first order to obtain a first recognition result, wherein the first order is the acquisition time order of the audio clips; a third recognition module for decoding the candidate recognition result sequences and the audio stream in a second order to obtain a second recognition result, wherein the second order is opposite to the first order; and a correction module for correcting the target recognition result sequence according to the first recognition result and the second recognition result and displaying the corrected target recognition result sequence.
  10. A speech recognition system, characterized by comprising an acquisition device, a speech recognition model, and a display device, wherein the speech recognition model comprises a first decoder, a second decoder, and a third decoder; the acquisition device is configured to sequentially acquire a plurality of audio clips of an audio stream at preset time intervals; the speech recognition model generates audio features corresponding to each audio clip; the third decoder acquires a plurality of recognition results corresponding to each of the plurality of audio features according to the acquisition time sequence of the audio clips; the speech recognition model splices the plurality of recognition results to obtain a plurality of candidate recognition result sequences corresponding to the audio stream and selects a target recognition result sequence from the plurality of candidate recognition result sequences; the first decoder decodes the plurality of candidate recognition result sequences and the audio stream in a first order to obtain a first recognition result; the second decoder decodes the plurality of candidate recognition result sequences and the audio stream in a second order to obtain a second recognition result; and the speech recognition model corrects the target recognition result sequence according to the first recognition result and the second recognition result and then displays the corrected target recognition result sequence.
  11. An electronic device comprising a processor and a memory, the memory storing a program or instructions executable on the processor, which, when executed by the processor, implement the steps of the method of any one of claims 1 to 8.
  12. A readable storage medium, characterized in that it has stored thereon a program or instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
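The correction step of claims 5 to 7 can be sketched as a weighted combination of three scores per candidate sequence: its streaming matching value, a left-to-right (forward) decoder score, and a right-to-left (backward) decoder score. The sketch below is purely illustrative: the `FWD`/`BWD` tables stand in for the patent's first and second decoders, and the fixed `weights` stand in for the attention-derived weight information; none of these names or numbers come from the patent.

```python
# Toy forward/backward "context models": score of a token given its left or
# right neighbor. These stand in for the patent's two decoders.
FWD = {("speech", "recognition"): 0.9, ("speech", "wrecking"): 0.1}
BWD = {("recognition", "speech"): 0.9, ("wrecking", "speech"): 0.1}

def l2r_score(tokens):
    # First-decoder stand-in: each token scored given the token before it.
    return sum(FWD.get((a, b), 0.0) for a, b in zip(tokens, tokens[1:]))

def r2l_score(tokens):
    # Second-decoder stand-in: each token scored given the token after it.
    rev = list(reversed(tokens))
    return sum(BWD.get((a, b), 0.0) for a, b in zip(rev, rev[1:]))

def corrected_target(candidates, weights=(0.4, 0.3, 0.3)):
    # Weighted sum of streaming matching value, forward score, backward
    # score; the candidate with the maximum total is the corrected target.
    w_ctc, w_l2r, w_r2l = weights
    def total(cand):
        tokens, ctc_match = cand
        return (w_ctc * ctc_match
                + w_l2r * l2r_score(tokens)
                + w_r2l * r2l_score(tokens))
    return max(candidates, key=total)

# Hypothetical candidates: (token list, streaming matching value). The
# second candidate has the higher streaming score, but bidirectional
# context flips the decision.
cands = [(["speech", "recognition"], 0.7), (["speech", "wrecking"], 0.9)]
print(corrected_target(cands)[0])  # ['speech', 'recognition']
```

Note how the candidate with the lower streaming matching value wins once both left and right context are taken into account, which is the accuracy gain the claims describe.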

Description

Speech recognition method, device, system, electronic equipment and readable storage medium

Technical Field

The embodiments of the invention relate to the field of speech recognition, and in particular to a speech recognition method, a speech recognition device, a speech recognition system, an electronic device, and a readable storage medium.

Background

Streaming speech recognition (Streaming ASR), also called online speech recognition (Online ASR), is a recognition approach that produces text results in real time as the input speech data grows. Current mainstream deep-learning streaming speech recognition models are mostly implemented by combining a self-attention-based encoder-decoder structure with a streaming prediction module, such as a connectionist temporal classification (CTC) module. During real-time recognition, the streaming recognition result of each audio clip is output in real time through the CTC module; after a sentence has been fully recognized, the encoder output for the whole sentence and several candidate streaming results from the CTC module are fed into the decoder to obtain the corrected, final non-streaming recognition result.

In the above method, when correcting the streaming recognition result of, for example, a sentence, only the audio information before each audio clip can be used, so the corrected final non-streaming result only references the preceding context, which limits the accuracy of speech recognition.

Disclosure of Invention

In view of the foregoing, embodiments of the present invention aim to provide a speech recognition method, apparatus, system, electronic device, and readable storage medium that overcome, or at least partially solve, the foregoing problems.
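The streaming pass described above, where a CTC module emits a partial result for each audio clip in real time, can be illustrated with a minimal greedy CTC decoder: take the most likely label per frame, merge consecutive repeats, and drop blanks. The label inventory and per-frame probabilities below are hypothetical, and a production system would use prefix beam search rather than the greedy path.

```python
# Toy greedy CTC decoding over one audio chunk's per-frame label
# probabilities. Label 0 is the CTC blank; 1 and 2 are hypothetical
# non-blank labels.
BLANK = 0

def ctc_greedy_decode(frame_probs):
    """Collapse the per-frame argmax path: merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for label in best:
        if label != BLANK and label != prev:
            out.append(label)
        prev = label
    return out

# Hypothetical per-frame distributions over {blank=0, label 1, label 2}.
chunk = [
    [0.1, 0.8, 0.1],    # argmax 1
    [0.2, 0.7, 0.1],    # argmax 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (dropped)
    [0.1, 0.1, 0.8],    # argmax 2
]
print(ctc_greedy_decode(chunk))  # [1, 2]
```

Running this per chunk yields the real-time streaming results that the background describes, which are later rescored by the full-sentence decoder.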
In a first aspect, an embodiment of the present application discloses a speech recognition method, the method comprising: sequentially acquiring a plurality of audio clips of an audio stream at preset time intervals, and generating audio features corresponding to each audio clip; acquiring a plurality of recognition results corresponding to each of the plurality of audio features according to the acquisition time sequence of the audio clips; splicing the plurality of recognition results to obtain a plurality of candidate recognition result sequences corresponding to the audio stream, and selecting a target recognition result sequence from the plurality of candidate recognition result sequences; decoding the plurality of candidate recognition result sequences and the audio stream in a first order to obtain a first recognition result, wherein the first order is the acquisition time order of the audio clips; decoding the plurality of candidate recognition result sequences and the audio stream in a second order to obtain a second recognition result, wherein the second order is opposite to the first order; and correcting the target recognition result sequence according to the first recognition result and the second recognition result, and displaying the corrected target recognition result sequence.
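The splicing step of the first aspect can be sketched as follows: each chunk contributes several scored hypotheses, every combination of per-chunk choices forms one candidate sequence for the whole stream, and the candidate with the largest total matching value is chosen as the target. The hypothesis texts and matching values here are invented for illustration, and summing per-chunk scores is one simple stand-in for the patent's matching value against the audio stream.

```python
# Minimal sketch of candidate-sequence splicing and target selection.
from itertools import product

# (text, matching value) hypotheses per audio chunk -- hypothetical data.
per_chunk = [
    [("he", 0.9), ("the", 0.6)],
    [("llo", 0.8), ("low", 0.5)],
]

# Splice every combination of per-chunk hypotheses into a candidate
# sequence, scoring each splice by the sum of its parts.
candidates = [
    ("".join(text for text, _ in combo), sum(score for _, score in combo))
    for combo in product(*per_chunk)
]

# The candidate with the largest matching value is the target sequence.
target = max(candidates, key=lambda c: c[1])
print(target[0])  # hello
```

With k hypotheses per chunk and n chunks this enumerates k**n splices, so a real system would prune candidates (e.g. keep an n-best list) rather than expand the full product.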
In a second aspect, an embodiment of the present application discloses a speech recognition apparatus, the apparatus comprising: an acquisition module for sequentially acquiring a plurality of audio clips of an audio stream at preset time intervals and generating audio features corresponding to each audio clip; a first recognition module for acquiring a plurality of recognition results corresponding to each of the plurality of audio features according to the acquisition time sequence of the audio clips, splicing the plurality of recognition results to obtain a plurality of candidate recognition result sequences corresponding to the audio stream, and selecting a target recognition result sequence from the plurality of candidate recognition result sequences; a second recognition module for decoding the candidate recognition result sequences and the audio stream in a first order to obtain a first recognition result, wherein the first order is the acquisition time order of the audio clips; a third recognition module for decoding the candidate recognition result sequences and the audio stream in a second order to obtain a second recognition result, wherein the second order is opposite to the first order; and a correction module for correcting the target recognition result sequence according to the first recognition result and the second recognition result and displaying the corrected target recognition result sequence. In a third aspect, an embodiment of the present application further discloses a speech recognition system comprising an acquisition device, a speech recognition model, and a display device, wherein the speech recognition model comprises a first decoder, a second decoder, and a third decoder; the acquisition device