CN-115708155-B - Speech decoder generation, speech decoding method, apparatus, device and readable medium


Abstract

Embodiments of the present disclosure disclose speech decoder generation and speech decoding methods, apparatuses, devices, and readable media. One embodiment of the speech decoding method comprises: obtaining a speech probability vector; determining the time interval between the time point at which the speech probability vector is obtained and a timing start point as a speech input duration; determining, according to the speech input duration, whether a speech decoder to be reset exists in a speech decoder sequence; in response to determining that a speech decoder to be reset exists in the speech decoder sequence, resetting that speech decoder; inputting the speech probability vector into each speech decoder in the speech decoder sequence to perform a decoding operation; and determining a target speech decoding sequence according to the speech decoding sequence and the speech decoding probability value included in at least one piece of speech decoding information. This embodiment requires no pruning: each speech decoder performs a full probability traversal over all offline instructions, which improves the accuracy of the speech decoding result.

Inventors

DONG XICHENG
Dang Yubo
FU JIQIANG

Assignees

杭州灵伴科技有限公司

Dates

Publication Date: 20260512
Application Date: 20210820

Claims (13)

1. A speech decoder generation method, comprising: acquiring a preset offline instruction set; generating an offline instruction graph according to the offline instruction set, wherein generating the offline instruction graph comprises: inserting preset characters into each offline instruction in the offline instruction set to obtain a preprocessed offline instruction set; sequentially taking out, in left-to-right order, the characters at the same position in each preprocessed offline instruction in the preprocessed offline instruction set, performing a de-duplication operation on the taken-out characters to obtain at least one mutually distinct character, and taking the at least one mutually distinct character as nodes of the offline instruction graph to obtain a node set; taking, as edges of the offline instruction graph, a connecting line between an adjacent preset character and instruction word in the node set representing a preprocessed offline instruction and a connecting line between two adjacent instruction words in the node set representing an offline instruction, and taking the graph formed by the node set and the at least one connecting line as the offline instruction graph; and generating a speech decoder according to the offline instruction graph, wherein the speech decoder is configured to update the probability value of each node in the offline instruction graph according to an input speech probability vector, and to output the offline instruction in the offline instruction set that matches the input speech probability vector, so as to determine a target speech decoding sequence.
2. The method of claim 1, wherein generating the speech decoder according to the offline instruction graph comprises: mapping the offline instruction graph into a one-dimensional array to obtain an offline instruction array; and determining the offline instruction array as the speech decoder.
3. A speech decoding method, comprising: acquiring a speech probability vector, and determining the time interval between the time point at which the speech probability vector is acquired and a timing start point as a speech input duration, wherein the timing start point is the time point at which speech decoding starts; determining, according to the speech input duration, whether a speech decoder to be reset exists in a speech decoder sequence, wherein the speech decoders in the speech decoder sequence are generated according to the method of any one of claims 1-2; in response to determining that a speech decoder to be reset exists in the speech decoder sequence, resetting the speech decoder to be reset, and inputting the speech probability vector into each speech decoder in the speech decoder sequence to perform a decoding operation, wherein the decoding operation comprises: determining a probability value of each offline instruction in the offline instruction set by using the speech decoder and each speech probability vector input into the speech decoder; selecting the offline instruction with the maximum probability value from the offline instruction set as an offline instruction to be output; and in response to determining that the probability value of the offline instruction to be output is greater than a preset threshold, determining the offline instruction to be output and its corresponding probability value as a speech decoding sequence and a speech decoding probability value, respectively, to obtain speech decoding information; and in response to at least one piece of speech decoding information being generated by the decoding operations corresponding to the speech decoders in the speech decoder sequence, determining a target speech decoding sequence according to the speech decoding sequence and the speech decoding probability value included in the at least one piece of speech decoding information.
4. The method of claim 3, wherein determining, according to the speech input duration, whether a speech decoder to be reset exists in the speech decoder sequence comprises: in response to determining that the ratio of the speech input duration to a preset time interval is an integer, determining that a speech decoder to be reset exists in the speech decoder sequence; taking the remainder of the speech input duration divided by a preset speech recognition duration to obtain a duration remainder, wherein the preset speech recognition duration is a target number of times the preset time interval; determining the ratio of the duration remainder to the preset time interval as a target decoder sequence number; and determining the i-th speech decoder in the speech decoder sequence as the speech decoder to be reset, wherein i equals the target decoder sequence number, and the number of speech decoders in the speech decoder sequence equals the target number.
5. The method of claim 4, wherein resetting the speech decoder to be reset comprises: clearing each speech probability vector previously input into the speech decoder to be reset.
6. The method of claim 3, further comprising: in response to determining that no speech decoder to be reset exists in the speech decoder sequence, inputting the speech probability vector into each speech decoder in the speech decoder sequence.
7. The method of claim 3, wherein determining the target speech decoding sequence according to the speech decoding sequence and the speech decoding probability value included in the at least one piece of speech decoding information comprises: selecting, from the at least one piece of speech decoding information, the piece whose speech decoding probability value is the maximum among the speech decoding probability values of all the pieces, as target speech decoding information; and determining the speech decoding sequence included in the target speech decoding information as the target speech decoding sequence.
8. The method of claim 3, wherein the speech probability vector is generated by: receiving real-time speech stream data input by a user; framing the real-time speech stream data to obtain a speech data sequence; and inputting a target number of consecutive speech data items in the speech data sequence into a preset speech probability model to obtain the speech probability vector output by the speech probability model.
9. The method of claim 3, further comprising: executing the control operation corresponding to the target speech decoding sequence.
10. A speech decoder generating apparatus, comprising: an acquisition unit configured to acquire a preset offline instruction set; a first generating unit configured to generate an offline instruction graph according to the offline instruction set, wherein generating the offline instruction graph comprises: inserting preset characters into each offline instruction in the offline instruction set to obtain a preprocessed offline instruction set; sequentially taking out, in left-to-right order, the characters at the same position in each preprocessed offline instruction in the preprocessed offline instruction set, performing a de-duplication operation on the taken-out characters to obtain at least one mutually distinct character, and taking the at least one mutually distinct character as nodes of the offline instruction graph to obtain a node set; taking, as edges of the offline instruction graph, a connecting line between an adjacent preset character and instruction word in the node set representing a preprocessed offline instruction and a connecting line between two adjacent instruction words in the node set representing an offline instruction, and taking the graph formed by the node set and the at least one connecting line as the offline instruction graph; and a second generating unit configured to generate a speech decoder according to the offline instruction graph, wherein the speech decoder is configured to update the probability value of each node in the offline instruction graph according to an input speech probability vector, and to output the offline instruction in the offline instruction set that matches the input speech probability vector, so as to determine a target speech decoding sequence.
11. A speech decoding apparatus, comprising: an acquisition and determination unit configured to acquire a speech probability vector, and to determine the time interval between the time point at which the speech probability vector is acquired and a timing start point as a speech input duration, wherein the timing start point is the time point at which speech decoding starts; a first determining unit configured to determine, according to the speech input duration, whether a speech decoder to be reset exists in a speech decoder sequence, wherein the speech decoders in the speech decoder sequence are generated according to the method of any one of claims 1-2; a reset and input unit configured to, in response to determining that a speech decoder to be reset exists in the speech decoder sequence, reset the speech decoder to be reset and input the speech probability vector into each speech decoder in the speech decoder sequence to perform a decoding operation, wherein the decoding operation comprises: determining a probability value of each offline instruction in the offline instruction set by using the speech decoder and each speech probability vector input into the speech decoder; selecting the offline instruction with the maximum probability value from the offline instruction set as an offline instruction to be output; and in response to determining that the probability value of the offline instruction to be output is greater than a preset threshold, determining the offline instruction to be output and its corresponding probability value as a speech decoding sequence and a speech decoding probability value, respectively, to obtain speech decoding information; and a second determining unit configured to, in response to at least one piece of speech decoding information being generated by the decoding operations corresponding to the speech decoders in the speech decoder sequence, determine a target speech decoding sequence according to the speech decoding sequence and the speech decoding probability value included in the at least one piece of speech decoding information.
12. An electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
13. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-9.
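
As an illustration of the graph construction recited in claims 1 and 2, the following Python sketch builds a small offline instruction graph and flattens it into a one-dimensional array. It is only one possible reading of the claim language, not the patented implementation. Several details are assumptions that the claims do not specify: the preset character is taken to be a single blank symbol `_` inserted before every instruction word and after the last one, each character of an instruction is treated as one instruction word, a node is identified by its (position, symbol) pair, and each array entry stores a symbol together with the indices of its predecessor nodes. The names `preprocess`, `build_offline_instruction_graph`, and `graph_to_array` are hypothetical.

```python
from __future__ import annotations

BLANK = "_"  # assumed preset character; the claims do not name it


def preprocess(instruction: str) -> list[str]:
    """Insert the preset character around every instruction word (assumed layout)."""
    symbols = [BLANK]
    for word in instruction:
        symbols += [word, BLANK]
    return symbols


def build_offline_instruction_graph(instructions: list[str]):
    """Return (nodes, edges) for a set of offline instructions."""
    pre = [preprocess(ins) for ins in instructions]

    # Nodes: walk positions left to right and de-duplicate the symbols
    # found at the same position across all preprocessed instructions.
    nodes: list[tuple[int, str]] = []
    seen: set[tuple[int, str]] = set()
    for pos in range(max(len(p) for p in pre)):
        for p in pre:
            if pos < len(p) and (pos, p[pos]) not in seen:
                seen.add((pos, p[pos]))
                nodes.append((pos, p[pos]))

    edges: set[tuple[tuple[int, str], tuple[int, str]]] = set()
    for p in pre:
        # Edges between an adjacent preset character and instruction word.
        for pos in range(len(p) - 1):
            edges.add(((pos, p[pos]), (pos + 1, p[pos + 1])))
        # Edges between two adjacent instruction words (skipping the blank).
        for pos in range(1, len(p) - 2, 2):
            edges.add(((pos, p[pos]), (pos + 2, p[pos + 2])))
    return nodes, edges


def graph_to_array(nodes, edges):
    """Flatten the graph into a 1-D array of (symbol, predecessor indices)."""
    index = {node: i for i, node in enumerate(nodes)}
    preds: list[list[int]] = [[] for _ in nodes]
    for src, dst in sorted(edges):
        preds[index[dst]].append(index[src])
    return [(sym, tuple(p)) for (_, sym), p in zip(nodes, preds)]
```

Under these assumptions, `build_offline_instruction_graph(["ab", "ac"])` yields six nodes, because the two instructions share every position except the one holding `b` or `c`. A decoder built on the flattened array could then update one probability value per array entry for each incoming speech probability vector.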

Description

Speech decoder generation, speech decoding method, apparatus, device and readable medium

Technical Field

Embodiments of the present disclosure relate to the field of computer technology, and in particular to speech decoder generation and speech decoding methods, apparatuses, devices, and readable media.

Background

Speech decoding is a technology that decodes speech information input by a user in real time to obtain a corresponding offline instruction. Currently, speech information input in real time by a user is generally decoded with a CTC (Connectionist Temporal Classification) decoder.
However, this approach often has the following technical problem: the CTC decoder uses the beam search (BeamSearch) algorithm, which prunes candidates during speech decoding. Pruning errors are therefore very likely to occur, which makes the speech decoding result inaccurate.

Disclosure of Invention

This section introduces, in simplified form, concepts that are described further in the detailed description below. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose speech decoder generation and speech decoding methods, apparatuses, devices, and readable media to solve the technical problem mentioned in the Background section above.
In a first aspect, some embodiments of the present disclosure provide a speech decoder generation method, the method including: obtaining a preset offline instruction set; generating an offline instruction graph according to the offline instruction set; and generating a speech decoder according to the offline instruction graph, where the speech decoder is configured to update the probability value of each node in the offline instruction graph according to an input speech probability vector, and to output the offline instruction in the offline instruction set that matches the input speech probability vector, so as to determine a target speech decoding sequence.
In a second aspect, some embodiments of the present disclosure provide a speech decoding method, the method including: obtaining a speech probability vector, and determining the time interval between the time point at which the speech probability vector is obtained and a timing start point as a speech input duration, where the timing start point is the time point at which speech decoding starts; determining, according to the speech input duration, whether a speech decoder to be reset exists in a speech decoder sequence, where the speech decoders in the speech decoder sequence are generated according to the method in the first aspect; in response to determining that a speech decoder to be reset exists in the speech decoder sequence, resetting the speech decoder to be reset, and inputting the speech probability vector into each speech decoder in the speech decoder sequence to perform a decoding operation; and in response to at least one piece of speech decoding information being generated by the decoding operations corresponding to the speech decoders in the speech decoder sequence, determining a target speech decoding sequence according to the speech decoding sequence and the speech decoding probability value included in the at least one piece of speech decoding information.
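
The decoding flow of the second aspect can be sketched in Python as follows, using the reset rule of claim 4 and the selection rules of claims 3 and 7. This is a minimal sketch under stated assumptions, not the disclosed implementation. `StubDecoder` is a hypothetical stand-in for a real speech decoder: it takes a toy "speech probability vector" that maps each offline instruction directly to a probability and scores an instruction by the mean over all vectors received since the last reset. Durations are assumed to be integer milliseconds, and decoder sequence numbers are assumed to be 0-based.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def decoder_to_reset(elapsed_ms: int, interval_ms: int, num_decoders: int) -> int | None:
    """Return the sequence number of the decoder to reset, or None."""
    if elapsed_ms % interval_ms != 0:
        return None  # duration / interval is not an integer: nothing to reset
    window_ms = interval_ms * num_decoders  # preset speech recognition duration
    return (elapsed_ms % window_ms) // interval_ms


@dataclass
class StubDecoder:
    """Hypothetical decoder: averages toy per-instruction probabilities."""
    instructions: list[str]
    history: list[dict[str, float]] = field(default_factory=list)

    def reset(self) -> None:
        # Claim 5: clear every vector previously input into this decoder.
        self.history.clear()

    def decode(self, vector: dict[str, float], threshold: float):
        self.history.append(vector)
        scores = {
            ins: sum(v.get(ins, 0.0) for v in self.history) / len(self.history)
            for ins in self.instructions
        }
        best = max(scores, key=scores.get)
        # Emit decoding information only above the preset threshold.
        return (best, scores[best]) if scores[best] > threshold else None


def decode_step(decoders, vector, elapsed_ms, interval_ms, threshold):
    """Reset one decoder if due, feed all decoders, pick the best result."""
    idx = decoder_to_reset(elapsed_ms, interval_ms, len(decoders))
    if idx is not None:
        decoders[idx].reset()
    infos = [info for d in decoders
             if (info := d.decode(vector, threshold)) is not None]
    if not infos:
        return None  # no decoder produced decoding information
    # Claim 7: the sequence with the maximum decoding probability wins.
    return max(infos, key=lambda info: info[1])[0]
```

With a 500 ms interval and four decoders, for example, `decoder_to_reset(1500, 500, 4)` returns 3 and `decoder_to_reset(2000, 500, 4)` wraps around to 0, so the decoders are reset in rotation and each one covers a differently aligned time window of the input.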
In a third aspect, some embodiments of the present disclosure provide a speech decoder generating apparatus, the apparatus including: an acquiring unit configured to acquire a preset offline instruction set; a first generating unit configured to generate an offline instruction graph according to the offline instruction set; and a second generating unit configured to generate a speech decoder according to the offline instruction graph, where the speech decoder is configured to update the probability value of each node in the offline instruction graph according to an input speech probability vector, and to output the offline instruction in the offline instruction set that matches the input speech probability vector, so as to determine a target speech decoding sequence.
In a fourth aspect, some embodiments of the present disclosure provide a speech decoding apparatus, the apparatus including: an acquisition and determination unit configured to acquire a speech probability vector and to determine the time interval between the time point at which the speech probability vector is acquired and a timing start point as a speech input duration, where the timing start point is the time point at which speech decoding starts; a first determination unit configured to determine, according to the speech input duration, whether a speech decoder to be reset exists in a speech decoder sequence; a reset and input unit configured to reset the speech decoder to be reset in response to det