CN-121999782-A - Speech recognition method, device and medium based on artificial intelligence
Abstract
The invention discloses a speech recognition method, device, and medium based on artificial intelligence, relating to the technical field of speech signal processing. The method comprises: collecting voice data, cutting it into blocks and framing it; constructing an encoder to process the framed voice blocks and output latent variables; constructing an RVQ discretization quantizer to residual-quantize the latent variables into quantized vectors, and defining the commitment loss of the RVQ discretization quantizer; computing the straight-through latent representation of the quantized vectors and constructing index embeddings for splicing; constructing a content encoder to process the spliced features and output speech content features; establishing a recognizer R with a causal Conformer to recognize the speech content features and output text content; training the parameters in stages; and deploying the model after training is completed. The invention achieves low-latency recognition without waveform reconstruction, reduces deployment cost, and is suitable for long-term stable operation on resource-constrained terminals.
Inventors
- WANG PING
- CHEN YU
- GUO RUI
- LIN FEI
Assignees
- 阳光学院 (Yango University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-10
Claims (10)
- 1. A speech recognition method based on artificial intelligence, characterized by comprising the following steps: collecting voice data, cutting it into blocks and framing it, constructing an encoder, and processing the framed voice blocks to output latent variables; initializing a codebook, constructing an RVQ discretization quantizer, carrying out residual quantization on the latent variables to obtain quantized vectors, and defining the commitment loss of the RVQ discretization quantizer; computing the straight-through latent representation of the quantized vectors, constructing index embeddings for splicing, and constructing a content encoder to process the spliced features and output speech content features; establishing a recognizer R with a causal Conformer to recognize the speech content features and output text content; constructing an entropy model that outputs per-codebook probabilities to estimate the total code length and code rate; and training the encoder, the RVQ discretization quantizer, the content encoder, the recognizer, and the entropy model parameters in stages, then deploying the model after training is completed.
- 2. The artificial intelligence-based speech recognition method according to claim 1, wherein cutting the collected speech data into blocks, framing, and constructing an encoder to process the framed speech blocks and output latent variables means: collecting single-channel speech at a unified sampling rate, mapping the audio amplitude to a floating-point sequence, defining a fixed window length, and cutting the audio into blocks by that window; calculating the total downsampling rate of the encoder, defining the streaming frame length, and framing each speech block; constructing an encoder E comprising an entry convolution layer, downsampling convolution blocks, a two-layer LSTM, and an end convolution layer; and feeding the framed speech blocks into the encoder E to output latent variables (a minimal encoder sketch follows the claims).
- 3. The artificial intelligence-based speech recognition method according to claim 2, wherein initializing a codebook, constructing an RVQ discretization quantizer to residual-quantize the latent variables into quantized vectors, and defining the commitment loss of the RVQ discretization quantizer means: constructing an RVQ discretization quantizer Q; selecting a target bandwidth for each training batch and determining the codebook number set from it; running a K-Means clustering algorithm on the latent variables with the number of cluster centers set to the codebook capacity, and taking the resulting cluster centers as the initial codewords of each codebook; performing multi-level residual quantization on the latent variables by initializing the residual and running a nearest-neighbor codeword search at the c-th level; updating the residual and accumulating the codewords to obtain the quantized vectors; and defining the commitment loss of the RVQ discretization quantizer Q (see the RVQ sketch after the claims).
- 4. The artificial intelligence-based speech recognition method according to claim 3, wherein computing the straight-through latent representation of the quantized vectors, constructing index embeddings for splicing, and constructing a content encoder to process the spliced features and output speech content features comprises: outputting a straight-through latent representation of the quantized vectors with a straight-through estimator; establishing an embedding table for each codebook, looking up the index embeddings, and summing them; splicing the straight-through latent representation with the summed index embeddings and linearly projecting the result as the channel input; constructing a content encoder that performs frame-by-frame content extraction and comprises input gating, a causal self-attention layer, a convolution layer, and an output layer; constructing a speaker encoder in parallel, comprising a front-end layer, a pooling layer, and a mapping-normalization layer, to extract a speaker embedding from the straight-through latent representation; after obtaining the speech content features and the speaker embedding, globally pooling the speech content features and applying gradient reversal for adversarial training; and predicting the speaker class with a speaker classifier (see the gradient-reversal sketch after the claims).
- 5. The artificial intelligence-based speech recognition method according to claim 4, wherein establishing a recognizer R with a causal Conformer to recognize the speech content features and output text content means: establishing a recognizer R, stacked from causal Conformer blocks, each comprising input gating, a causal self-attention layer, a convolution layer, and a feed-forward residual layer joined after the convolution layer; feeding the feed-forward output into a CTC classification head to obtain the text probability distribution; obtaining the text content from the text probability distribution; and determining the CTC classification loss from the text probability distribution (see the CTC sketch after the claims).
- 6. The artificial intelligence-based speech recognition method according to claim 5, wherein constructing an entropy model that outputs per-codebook probabilities to estimate the total code length and code rate means: constructing an input vector for each time step t from the discrete indices of the previous time step; mapping the input-vector sequence to hidden states with an entropy model M formed by 5 convolution layers, and outputting a softmax probability for each codebook; estimating the total code length from the negative log probabilities and converting it into a code rate; constructing a rate-overflow penalty from the code rate; and computing a cross-entropy loss from the softmax probability of each codebook (see the rate-estimation sketch after the claims).
- 7. The artificial intelligence-based speech recognition method according to claim 6, wherein training the encoder, RVQ discretization quantizer, content encoder, recognizer, and entropy model parameters in stages refers to training the encoder E, the RVQ discretization quantizer Q, the content encoder, the recognizer R, and the entropy model M in three stages: in the first stage, freezing the entropy model M and training and updating the encoder E, the RVQ discretization quantizer Q, the content encoder, and the recognizer R; in the second stage, freezing the encoder E, the RVQ discretization quantizer Q, the content encoder, and the recognizer R, and training and updating the entropy model M; and in the third stage, freezing the content encoder, the recognizer R, and the entropy model M, and updating the encoder E and the RVQ discretization quantizer Q (see the staged-training sketch after the claims).
- 8. The artificial intelligence-based speech recognition method according to claim 7, wherein deploying the model after training is completed refers to deploying the trained encoder E, RVQ discretization quantizer Q, content encoder, recognizer R, and entropy model M to the user side; guiding the user to read a fixed calibration text for a calibration test; and, after the test passes, collecting the user's speech and outputting the recognized text content.
- 9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the artificial intelligence-based speech recognition method of any one of claims 1 to 8.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the artificial intelligence-based speech recognition method according to any one of claims 1 to 8.
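Claim 2 pairs fixed-window cutting with a conv + LSTM encoder. Below is a minimal PyTorch sketch of that pipeline; the 16 kHz sampling rate, 1 s window, channel widths, and the stride schedule (2, 4, 5, 8) giving a total downsampling rate of 320 (20 ms streaming frames) are illustrative assumptions, since the claim does not disclose concrete values.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16000           # assumed; the claim only says the rate is unified
WINDOW_SAMPLES = SAMPLE_RATE  # assumed 1 s fixed cutting window

def cut_blocks(wave: torch.Tensor):
    """Split a mono float waveform in [-1, 1] into fixed-length blocks
    (the last block may be shorter)."""
    return list(wave.split(WINDOW_SAMPLES))

class Encoder(nn.Module):
    """Entry conv -> downsampling conv blocks -> 2-layer LSTM -> end conv,
    mirroring the encoder E of claim 2."""
    def __init__(self, channels=64, latent_dim=128, strides=(2, 4, 5, 8)):
        super().__init__()
        # total downsampling rate = 2*4*5*8 = 320 -> 50 latent frames/s at 16 kHz
        self.entry = nn.Conv1d(1, channels, kernel_size=7, padding=3)
        down = []
        for s in strides:
            down += [nn.Conv1d(channels, channels, kernel_size=2 * s,
                               stride=s, padding=s // 2), nn.ELU()]
        self.down = nn.Sequential(*down)
        self.lstm = nn.LSTM(channels, channels, num_layers=2, batch_first=True)
        self.end = nn.Conv1d(channels, latent_dim, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, 1, samples) -> latent z: (batch, latent_dim, frames)
        h = self.down(self.entry(x))
        h, _ = self.lstm(h.transpose(1, 2))
        return self.end(h.transpose(1, 2))
```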
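Claims 3 and 4 center on residual vector quantization with a commitment loss and a straight-through estimator. A minimal sketch follows; in the claimed method the codebooks would be initialized from K-Means cluster centers over the latents, which is abbreviated here to random codebooks, and the stage count, capacity, and the weight `beta` are assumptions.

```python
import torch
import torch.nn.functional as F

def rvq(z, codebooks, beta=0.25):
    """z: (batch, frames, dim); codebooks: list of (K, dim) tensors, one per
    stage. Returns the straight-through quantized z, the per-stage indices,
    and the commitment loss."""
    residual, quantized = z, torch.zeros_like(z)
    indices, commit = [], z.new_zeros(())
    for cb in codebooks:
        # nearest-neighbour codeword search at this stage
        d = (residual.unsqueeze(2) - cb).pow(2).sum(-1)     # (batch, frames, K)
        idx = d.argmin(-1)
        q = cb[idx]                                         # chosen codewords
        commit = commit + F.mse_loss(residual, q.detach())  # commitment term
        residual = residual - q                             # update residual
        quantized = quantized + q                           # accumulate codewords
        indices.append(idx)
    # straight-through estimator: forward uses the quantized values,
    # backward passes gradients to z as if quantization were the identity
    z_q = z + (quantized - z).detach()
    return z_q, indices, beta * commit

# usage with assumed sizes: 8 stages, capacity 1024, 128-dim latents
codebooks = [torch.randn(1024, 128) for _ in range(8)]
z_q, idx, commit_loss = rvq(torch.randn(2, 50, 128), codebooks)
```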
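Claim 4 ends with an adversarial branch: the per-codebook index embeddings are summed and spliced with the straight-through representation, and the globally pooled content features pass a gradient reversal layer before a speaker classifier, pushing speaker identity out of the content features. A sketch under assumed dimensions and module names:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward
    pass, so the content features learn to fool the speaker classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def fuse_features(z_q, indices, embeds: nn.ModuleList, proj: nn.Linear):
    """Sum the per-codebook index embeddings, splice with z_q, and project."""
    e = sum(emb(i) for emb, i in zip(embeds, indices))
    return proj(torch.cat([z_q, e], dim=-1))

def speaker_logits(content, classifier: nn.Module, lam=1.0):
    """Global pooling over frames, then adversarial speaker prediction."""
    pooled = content.mean(dim=1)
    return classifier(GradReverse.apply(pooled, lam))
```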
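Claim 5's recognizer keeps streaming latency low with a causal attention mask and trains against a CTC loss. The sketch below abbreviates the Conformer stack (gating, convolution, feed-forward residual) to a single masked attention layer plus the CTC head; the vocabulary size, dimensions, and blank index are assumptions.

```python
import torch
import torch.nn as nn

def causal_mask(t):
    """True above the diagonal = future positions attention may NOT see."""
    return torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)

vocab, blank = 100, 0                           # assumed
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
ctc_head = nn.Linear(128, vocab)
ctc_loss = nn.CTCLoss(blank=blank, zero_infinity=True)

x = torch.randn(2, 50, 128)                     # (batch, frames, dim)
h, _ = attn(x, x, x, attn_mask=causal_mask(50))
log_probs = ctc_head(h).log_softmax(-1).transpose(0, 1)  # (frames, batch, vocab)
targets = torch.randint(1, vocab, (2, 12))      # dummy label sequences
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((2,), 50),
                target_lengths=torch.full((2,), 12))
```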
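Claim 6 turns the entropy model's softmax probabilities into a code-length estimate and a rate penalty: the total code length is the summed negative log2-probability of the indices actually chosen, the code rate divides that by the audio duration, and the penalty activates only above a target rate. The 5-layer convolutional predictor itself is omitted in this sketch; the frame rate and target rate are assumptions.

```python
import torch

def rate_and_penalty(probs, indices, frame_rate_hz=50.0, target_kbps=6.0):
    """probs: (stages, frames, K) softmax outputs of the entropy model M;
    indices: (stages, frames) codeword indices chosen by the quantizer."""
    picked = probs.gather(-1, indices.unsqueeze(-1)).squeeze(-1)
    total_bits = -torch.log2(picked.clamp_min(1e-9)).sum()  # total code length
    seconds = indices.size(1) / frame_rate_hz
    kbps = total_bits / seconds / 1000.0                    # code rate
    penalty = torch.relu(kbps - target_kbps)                # rate-overflow penalty
    return total_bits, kbps, penalty
```

The training objective of claim 6 would additionally apply a cross-entropy loss between these softmax outputs and the chosen indices for each codebook.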
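Claim 7's three-stage schedule amounts to toggling which modules receive gradients. A minimal sketch, assuming the module names of the claims and an Adam optimizer (the claims do not specify one):

```python
import torch

def set_trainable(modules, flag):
    for m in modules:
        for p in m.parameters():
            p.requires_grad_(flag)

def configure_stage(stage, E, Q, C, R, M):
    """E: encoder, Q: RVQ quantizer, C: content encoder,
    R: recognizer, M: entropy model."""
    set_trainable([E, Q, C, R, M], False)
    if stage == 1:
        trained = [E, Q, C, R]      # stage 1: freeze entropy model M
    elif stage == 2:
        trained = [M]               # stage 2: freeze codec and recognizer
    else:
        trained = [E, Q]            # stage 3: refine encoder and quantizer only
    set_trainable(trained, True)
    params = (p for m in trained for p in m.parameters())
    return torch.optim.Adam(params, lr=1e-4)
```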
Description
Speech recognition method, device and medium based on artificial intelligence

Technical Field

The invention relates to the technical field of speech signal processing, in particular to a speech recognition method, device, and medium based on artificial intelligence.

Background

In recent years, voice interaction systems have spread rapidly across mobile, vehicle-mounted, and wearable devices, pushing speech signal processing from traditional codecs with back-end recognition toward end-to-end neural representation learning and real-time inference. On one hand, deep-learning speech recognition has evolved from DNN-HMM to CTC, RNN-T, and end-to-end frameworks based on attention and transducers, markedly improving robustness in complex noise and accent scenarios by learning the acoustic-to-text sequence mapping. On the other hand, neural audio compression builds on the autoencoder: waveforms are mapped to low-dimensional latent variables, and the continuous latent space is discretized by vector quantization into a code-stream representation that can be transmitted, stored, and processed end to end in real time. In this process, the joint optimization of discrete code stream, recognizable representation, and text output is becoming an important direction.

The prior art still has many defects. Many designs split compression and recognition: the compression model is dominated by reconstruction quality or perceptual metrics, so the discrete code stream can reconstruct audio but is not optimal for text discrimination, exhibiting unstable CTC alignment, jittery frame-level probability distributions, and frequent repeated or dropped words. Even when single-point metrics (recognition or compression) meet their targets, system-level engineering results remain hard to reproduce stably in real terminal scenarios.

Disclosure of Invention

The present invention has been made in view of the above problems in the prior art. The invention therefore provides a speech recognition method, device, and medium based on artificial intelligence, which address the problems that text discrimination is not optimal, CTC alignment is unstable, frame-level probability distributions jitter, and repeated or dropped words easily occur in the prior art.
To solve the above technical problems, the invention provides the following technical scheme. In a first aspect, the invention provides an artificial intelligence-based speech recognition method comprising: collecting voice data, cutting it into blocks and framing it, constructing an encoder, and processing the framed voice blocks to output latent variables; initializing a codebook, constructing an RVQ discretization quantizer, carrying out residual quantization on the latent variables to obtain quantized vectors, and defining the commitment loss of the RVQ discretization quantizer; computing the straight-through latent representation of the quantized vectors, constructing index embeddings for splicing, and constructing a content encoder to process the spliced features and output speech content features; establishing a recognizer R with a causal Conformer to recognize the speech content features and output text content; constructing an entropy model that outputs per-codebook probabilities to estimate the total code length and code rate; and training the encoder, the RVQ discretization quantizer, the content encoder, the recognizer, and the entropy model parameters in stages, then deploying the model after training is completed.

Cutting the collected speech data into blocks, framing, and constructing an encoder to process the framed speech blocks and output latent variables means: collecting single-channel speech at a unified sampling rate, mapping the audio amplitude to a floating-point sequence, defining a fixed window length, and cutting the audio into blocks by that window; calculating the total downsampling rate of the encoder, defining the streaming frame length, and framing each speech block; constructing an encoder E comprising an entry convolution layer, downsampling convolution blocks, a two-layer LSTM, and an end convolution layer; and feeding the framed speech blocks into the encoder E to output latent variables.

Initializing a codebook, constructing an RVQ discretization quantizer to residual-quantize the latent variables into quantized vectors, and defining the commitment loss of the RVQ discretization quantizer means: constructing an RVQ discretization quantizer Q, selecting a target bandwidth for each training batch, and determining the codebook number set according to