CN-116524914-B - End-to-end voice recognition method based on multipath convolution network
Abstract
The invention discloses an end-to-end voice recognition method based on a multipath convolution network, designed to improve recognition of the coarse pronunciation units of Chinese in the Chinese speech recognition process. The method adopts the multipath convolution network MCNN to extract features of the voice data along both the time-frame and spectral directions of the pronunciation: an MCNN multipath convolution network is added at the input front end of a Transformer to acquire local features of the speech in advance, these local features are input into the Transformer network for speech recognition training, a CTC structure is added after the encoder layer, and finally CTC serves as an auxiliary objective in a joint training model, which accelerates the convergence of the overall model. Extensive experiments show that the method is more robust and achieves a recognition rate exceeding that of general speech recognition models, and the model can be modularized, providing strong feasibility for later improvement.
Inventors
- QIU YUAN
- XIAO HAO
- WEI JINBO
- LIU ZUO
- LI CONG
- KOU JIAWEI
- ZHAO ZHONGQI
- HE HUSONG
Assignees
- 西安理工大学 (Xi'an University of Technology)
- 广西东信易通科技有限公司 (Guangxi Dongxin Yitong Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-04-19
Claims (3)
- 1. An end-to-end voice recognition method based on a multipath convolutional network, characterized by comprising the following steps (illustrative Python sketches of the main steps follow the claims):
  Step 1: convert the sampling rate of all voice data to a Hz by resampling:
  wav = resample(wav, a)
  Step 2: perform voice preprocessing on the resampled voice data wav. The Fbank output mode is adopted: the voice signal is pre-emphasized and its low frequency band is filtered out by a high-pass filter; the filtered signal is framed, i.e. the audio of indefinite length is segmented into small fixed-length segments; a sliding Hamming window function is applied and a Fourier transform converts the framed signal from the time domain to the frequency domain; finally the Fbank computation is completed by passing the frequency-domain signal through logarithmic Mel-spectrum filtering:
  wav_frame = hamming(wav, 10 ms)
  wav_data = fft(wav_frame)
  wav_Fbank = log_mel(wav_data)
  Step 3: the Fbank processing of Step 2 yields voice data wav_Fbank[batch_size, wav_input, 80] with 80-dimensional spectral features, where batch_size is the data size of each batch and wav_input is the input voice length; all voice data are length-padded so that the wav_input dimension reaches 200:
  wav_data′[batch_size, 200, 80] = padding(wav_data[batch_size, wav_input, 80])
  Step 4: input the voice data obtained in Step 3 into the multipath convolution network MCNN. The data are fed into three channels of identical structure. Each channel first enters a two-dimensional convolution (Conv2D) layer; every convolution layer is followed by batch normalization (BatchNorm2d) and a ReLU activation and has 16 filter groups, each filter kernel of size 3×3 with stride 1. The output of the first Conv2D layer enters a second, identical Conv2D layer, then a two-dimensional max-pooling layer (Maxpool), then a Dropout layer with inactivation rate 0.3, after which the voice data parameters become [1, 100, 40, 16], where the first parameter is batch_size, the second the audio length, the third the feature dimension of the audio and the fourth the filter count of the convolution layer. The data then enter a third Conv2D layer in which, compared with the first layer, the number of filter groups is changed to 32; after the subsequent max pooling and Dropout the parameters become [1, 50, 20, 32], and finally the voice data of each channel are reshaped into [1, 25, 640]:
  wav_data′_i = Maxpool(BatchNorm2d(ReLU(Conv2D(wav_data′)), epsilon = 0.0002), pool_size = [2, 2])
  wav_data_i = reshape(1, 25, 640)(wav_data′_i), where i = 1, 2, 3 indexes the three channels;
  Step 5: merge the voice data output by the three channels along the dimension of the spectral features and feed the result into fully connected (FC) layers: the first FC layer Dense1 has input = 1920 and output = 1024, the second FC layer Dense2 has input = 1024 and output = 512, and the third FC layer Dense3 has input = 512 and output = 320, with batch normalization (BatchNorm1d) and a ReLU activation added between every two FC layers; finally wav_data″[1, 25, 320] is output:
  wav_data″ = FC(Merge(wav_data_1, wav_data_2, wav_data_3))
  Step 6: construct a voice recognition network based on the Transformer model, comprising an encoder and a decoder that are both realized with the multi-head attention mechanism (MHA): each is composed of a multi-head attention module and a position-wise feed-forward network (FFN) module, with a residual connection and layer normalization after each sublayer. The data output by the layers of the multipath convolution network MCNN are passed into the Transformer model: the encoder maps the input voice data wav_data″ = (X_1, X_2, …, X_T) into hidden states (h_1, h_2, …, h_N) through multi-head attention, then the decoder decodes by combining the provided text labels (Y_1, Y_2, …, Y_L) with the hidden states (h_1, h_2, …, h_N) output by the encoder layer, and finally predicts the target sequence pre_label_T(Y_1, Y_2, …, Y_L):
  wav_data‴ = (h_1, h_2, …, h_N) = Encoder(X_1, X_2, …, X_T)
  pre_label_T(Y_1, Y_2, …, Y_L) = Decoder((Y_1, Y_2, …, Y_{L−1}), (h_1, h_2, …, h_N))
  Step 7: input the hidden states (h_1, h_2, …, h_N) obtained by the encoder into a CTC structure and use the forward-backward algorithm of CTC to force monotonic alignment between the voice and the label sequence, where CTC adopts the greedy search method:
  pre_label_C(Y_1, Y_2, …, Y_L) = CTC_greedy_search((Y_1, Y_2, …, Y_{L−1}), (h_1, h_2, …, h_N))
  The model is trained with the joint loss
  L_MTL = λ·L_CTC + (1 − λ)·L_attention
  where λ is a hyper-parameter, L_CTC is the CTC loss function, L_attention is the attention loss function, and L_MTL is the multi-task loss obtained by adding the two.
- 2. The method for end-to-end speech recognition based on a multi-path convolutional network of claim 1, characterized in that said a=16000.
- 3. The method for end-to-end speech recognition based on a multi-path convolutional network according to claim 1, wherein in step 2 the audio of indefinite length is divided into fixed-length segments, each segment being 10-30 ms long.
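As a minimal sketch of Steps 1-3 of claim 1 (resampling, then the pre-emphasis, Hamming-window framing, FFT and log-Mel filtering that make up the Fbank computation, then padding to 200 frames), the following uses torchaudio's Kaldi-style Fbank as a stand-in for the claimed pipeline; a = 16000 follows claim 2, while the 25 ms frame length is an assumption within the 10-30 ms range of claim 3:

```python
# Sketch of claim 1, Steps 1-3: resample -> Fbank (pre-emphasis, Hamming
# framing, FFT, log-Mel) -> pad to 200 frames. Frame length 25 ms is an
# assumption; the claims fix only the 10 ms shift and the 80 Mel bins.
import torch
import torchaudio

def preprocess(wav: torch.Tensor, sr: int, a: int = 16000,
               max_frames: int = 200) -> torch.Tensor:
    # Step 1: resample all voice data to a Hz
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=a)
    # Step 2: the Kaldi-compatible Fbank bundles pre-emphasis, Hamming-window
    # framing, FFT and log-Mel filtering -> [num_frames, 80]
    fbank = torchaudio.compliance.kaldi.fbank(
        wav, sample_frequency=a, num_mel_bins=80,
        frame_length=25.0, frame_shift=10.0,
        preemphasis_coefficient=0.97, window_type="hamming")
    # Step 3: pad (or truncate) the time axis so wav_input reaches 200
    t = fbank.size(0)
    if t < max_frames:
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, max_frames - t))
    else:
        fbank = fbank[:max_frames]
    return fbank.unsqueeze(0)  # [batch_size, 200, 80]

# Usage: feats = preprocess(*torchaudio.load("utterance.wav"))
```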
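The three-channel MCNN of Steps 4-5 might be realized in PyTorch as sketched below. The time-only pooling inserted before the reshape is an assumption, chosen so that the claimed [1, 50, 20, 32] tensor actually reshapes to the claimed [1, 25, 640] per-channel output; the claim does not spell this step out.

```python
# Sketch of the MCNN front-end (claim 1, Steps 4-5). The final MaxPool2d((2, 1))
# is an assumption bridging [B, 32, 50, 20] to the claimed [B, 25, 640].
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    # Conv2D (3x3 kernel, stride 1) + BatchNorm2d + ReLU, as in the claim
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out, eps=2e-4),
        nn.ReLU())

class MCNNChannel(nn.Module):
    def __init__(self, dropout: float = 0.3):
        super().__init__()
        self.layers = nn.Sequential(
            conv_block(1, 16), conv_block(16, 16),   # layers 1-2: 16 filters
            nn.MaxPool2d(2), nn.Dropout(dropout),    # -> [B, 16, 100, 40]
            conv_block(16, 32), conv_block(32, 32),  # layers 3-4: 32 filters
            nn.MaxPool2d(2), nn.Dropout(dropout),    # -> [B, 32, 50, 20]
            nn.MaxPool2d((2, 1)))                    # -> [B, 32, 25, 20] (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.layers(x.unsqueeze(1))              # x: [B, 200, 80] Fbank
        return y.permute(0, 2, 3, 1).reshape(x.size(0), 25, 640)

class MCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.channels = nn.ModuleList(MCNNChannel() for _ in range(3))
        self.d1, self.b1 = nn.Linear(1920, 1024), nn.BatchNorm1d(1024)
        self.d2, self.b2 = nn.Linear(1024, 512), nn.BatchNorm1d(512)
        self.d3 = nn.Linear(512, 320)                # Dense1-Dense3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.cat([c(x) for c in self.channels], dim=-1)  # [B, 25, 1920]
        # BatchNorm1d normalizes the feature dim, hence the transposes
        h = torch.relu(self.b1(self.d1(h).transpose(1, 2)).transpose(1, 2))
        h = torch.relu(self.b2(self.d2(h).transpose(1, 2)).transpose(1, 2))
        return self.d3(h)                                     # [B, 25, 320]
```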
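For Steps 6-7, the joint objective L_MTL = λ·L_CTC + (1 − λ)·L_attention can be sketched with the stock PyTorch losses; the encoder-decoder is abbreviated to nn.Transformer rather than the patent's exact configuration, and λ = 0.3, blank index 0, the shared output head and the label-padding conventions are illustrative assumptions.

```python
# Sketch of the joint CTC/attention training objective (claim 1, Steps 6-7).
# nn.Transformer stands in for the patent's encoder-decoder; lambda, blank
# index and label padding (-1) are assumptions, not claim values.
import torch
import torch.nn as nn

class MCNNTransformerCTC(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 320, lam: float = 0.3):
        super().__init__()
        self.frontend = MCNN()                     # MCNN sketch from above
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)  # shared CTC/decoder head
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.att = nn.CrossEntropyLoss(ignore_index=-1)
        self.lam = lam

    def forward(self, feats, labels, label_lens):
        # feats: [B, 200, 80]; labels: [B, L], padded with -1
        x = self.frontend(feats)                   # [B, 25, 320]
        h = self.transformer.encoder(x)            # hidden states h_1..h_N
        # CTC branch on the encoder output (forced monotonic alignment)
        log_probs = self.out(h).log_softmax(-1).transpose(0, 1)  # [N, B, V]
        in_lens = torch.full((feats.size(0),), h.size(1), dtype=torch.long)
        l_ctc = self.ctc(log_probs, labels.clamp(min=0), in_lens, label_lens)
        # Attention branch: teacher-forced decoding (the causal mask is
        # omitted here for brevity; a real model would supply tgt_mask)
        dec = self.transformer.decoder(self.embed(labels[:, :-1].clamp(min=0)), h)
        l_att = self.att(self.out(dec).flatten(0, 1), labels[:, 1:].flatten())
        return self.lam * l_ctc + (1 - self.lam) * l_att        # L_MTL
```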
Description
End-to-end voice recognition method based on multipath convolution network
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to an end-to-end voice recognition method.
Background
An important approach to human-computer interaction is to use automatic speech recognition (ASR) to "learn" human speech and "extract" the text information contained in it. With its continuous development, speech recognition has become a comprehensive technology involving multiple disciplines such as acoustics, linguistics, digital signal processing and statistical pattern recognition, and it has evolved further in the age of deep learning. Conventional speech recognition models typically comprise three parts: an acoustic model (AM), a pronunciation dictionary (lexicon) and a language model (LM). Each part needs independent training, so establishing an end-to-end learning mechanism that lets the model dispense with the pronunciation dictionary and the language model and transcribe speech directly into text (Speech-to-Text) is a current research hotspot. There are two main ideas for achieving end-to-end learning at present: one is connectionist temporal classification (CTC) and the other is the sequence-to-sequence model based on the attention mechanism.
For Chinese speech recognition, the pronunciation granularity of each character is relatively coarse, so recognition methods that use fine-grained pronunciation units perform poorly on Chinese. Moreover, a large amount of speech data collected in real environments is noisy; the Transformer model handles such noisy speech data poorly, whereas CTC, because it forces alignment of the sequence data, handles real noisy data better.
MCNN (multipath convolutional neural network) is mainly used to expand the width of the traditional CNN. It adopts a three-channel design and uses stacked convolution layers to extract speech information, fully extracting local features along both the time-frame and spectral dimensions of the speech, so that additional detailed features can be extracted in the width direction. MCNN alone, however, can only extract features from each frame of speech individually and cannot capture the correlations between frames, so a Transformer is required to make up for this deficiency.
The Transformer, proposed by the Google machine translation team in 2017, is an end-to-end framework based on the self-attention mechanism. It discards the traditional CNN and RNN: the whole network is composed entirely of self-attention modules, and a trainable neural network can be built by stacking Transformer layers in an encoder-decoder structure; a standard Transformer architecture stacks 12 such layers (six encoder and six decoder layers), offering a more efficient scheme for the seq2seq problem.
The CTC model is a temporal classification method proposed by Graves et al. that avoids manual alignment between inputs and outputs, and it is well suited to applications such as speech recognition and OCR. Given an input sequence X = [x_1, x_2, …, x_T] and label data Y = [y_1, y_2, …, y_U] (for example a speech file and its text transcript in speech recognition), the goal is to find a mapping from X to Y; an algorithm that classifies such time-series data is called temporal classification. CTC, however, is only a loss function and does not model the correlations between frames of speech, so it needs to cooperate with other neural networks (see the decoding sketch at the end of this description).
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an end-to-end voice recognition method based on a multipath convolution network, designed to improve recognition of the coarse pronunciation units of Chinese in the Chinese speech recognition process. The method adopts the multipath convolution network MCNN to extract features of the voice data along both the time-frame and spectral directions of the pronunciation: an MCNN multipath convolution network is added at the input front end of a Transformer so as to acquire the local features of the speech in advance, these local features are input into the Transformer network for speech recognition training, a CTC structure is added after the encoder layers, and finally CTC is used as an auxiliary objective in a joint training model, which accelerates the convergence of the overall model. Extensive experiments show that the method is more robust and achieves a recognition rate exceeding that of general speech recognition models, and the model can be modularized, providing strong feasibility for later improvement.
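The CTC_greedy_search referenced in claim 1 corresponds to standard best-path decoding: take the most probable label at each frame, collapse consecutive repeats, then drop blanks. A minimal sketch, assuming blank index 0:

```python
# Minimal CTC greedy (best-path) decoding; blank index 0 is an assumption.
import torch

def ctc_greedy_search(log_probs: torch.Tensor, blank: int = 0) -> list:
    # log_probs: [N, vocab] frame-level posteriors from the CTC head
    path = log_probs.argmax(dim=-1).tolist()  # best label per frame
    decoded, prev = [], blank
    for p in path:
        if p != blank and p != prev:          # collapse repeats, skip blanks
            decoded.append(p)
        prev = p
    return decoded
```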