
CN-120388575-B - Speech recognition method and system based on artificial intelligence

CN120388575B

Abstract

The invention provides a speech recognition method and system based on artificial intelligence, relating to the technical field of speech recognition. A voice signal is collected in real time and denoised; speech features are then extracted as Mel-frequency cepstral coefficients, and these coefficients are combined with their first-order and second-order differences to form a speech feature vector. Meanwhile, a lip-movement image is collected as a visual signal; after grayscale processing, an image feature vector is generated by computing the LBP value of each pixel in the image. A cross-modal attention mechanism dynamically adjusts the weights of the speech feature vector and the image feature vector to generate a fusion weight matrix, where different fusion weight matrices correspond to different voice instructions. A deep learning network model is trained with the raw voice signals and raw visual images as the training set and the voice instructions corresponding to the fusion weight matrices as labels. Finally, real-time speech recognition is performed by feeding data acquired in real time into the trained model.

Inventors

  • DONG YIFEI

Assignees

  • 嘀拍信息科技南通有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-04-17

Claims (7)

  1. A speech recognition method based on artificial intelligence, characterized by comprising the following specific steps:
     Step 1, collect a raw voice signal and perform noise reduction on it: decompose the voice signal into a plurality of sub-bands of different frequencies through wavelet transformation, apply threshold processing to the high-frequency sub-bands, which correspond to the signal's information on different time scales, to weaken noise components, and recombine the processed sub-bands to obtain a denoised voice signal;
     Step 2, pre-emphasize the denoised voice signal through a high-pass filter, divide the pre-emphasized signal into short-time frames of 20-40 ms with 50% overlap between adjacent frames, window each frame, convert the windowed frame from the time domain to the frequency domain by fast Fourier transform, pass the frequency-domain signal through a Mel filter to generate Mel cepstral coefficients, and combine the Mel cepstral coefficients with their first-order and second-order differences into a speech feature vector;
     Step 3, collect lip-movement information of the speaker in real time as the raw visual image, and gray the image to generate a first identification image;
     Step 4, normalize the pixels of the first identification image; for each pixel, select the 8 neighboring pixels around it, compare the gray value of the central pixel with those of the neighboring pixels to generate a binary number, and convert the binary number to a decimal number to obtain the LBP value of the central pixel; calculate the LBP values of all pixels in the first identification image to form a new image, namely a second identification image;
     Step 5, divide the second identification image into non-overlapping cells of 8 × 8 pixels, directly discarding cells that do not meet this size; count the distribution of LBP values of all pixels within each cell to generate the cell's histogram; and concatenate the histograms of all cells end to end in cell order to generate the image feature vector as the final feature representation of the visual signal;
     Step 6, convert the speech feature vector and the image feature vector into matrix representations, dynamically adjust the weights of the speech features and the image features through a cross-modal attention mechanism, and generate a fusion weight matrix;
     Step 7, construct a deep learning network, take the raw voice signals and raw visual images as the training set and the voice instructions corresponding to the fusion weight matrices as labels, and input them into the deep learning network for training;
     Step 8, input the voice signal and visual image acquired in real time into the trained deep learning network, obtain the current fusion weight matrix, and determine the current voice instruction from the fusion weight matrix;
     the histogram of a cell is generated on the following basis: the distribution of LBP values of all pixels in the cell is counted, the LBP values are divided into 256 bins over the range [0, 255], each bin corresponding to one LBP value, and a histogram vector of length 256 is generated, each element of which is the number of pixels whose LBP value falls in the corresponding bin; the histograms of all cells are concatenated in order, and the feature vector is generated according to the formula $V = [H_1(0), \ldots, H_1(255), \ldots, H_E(0), \ldots, H_E(255)]$, where $V$ is the final image feature vector, $H_e(j)$ is the number of LBP values in bin $j$ of the histogram of the $e$-th cell, and $E$ is the number of cells;
     the fusion weight matrix is generated on the following basis: the speech feature vector is treated as a 1 × 3L matrix, which can be expressed as $A = [c_1, \ldots, c_L, \Delta c_1, \ldots, \Delta c_L, \Delta^2 c_1, \ldots, \Delta^2 c_L]$, where $A$ is the speech feature matrix, $c_l$ is the $l$-th Mel-frequency cepstral coefficient, $\Delta c_l$ is the first-order difference of the $l$-th Mel-frequency cepstral coefficient, $\Delta^2 c_l$ is the second-order difference of the $l$-th Mel-frequency cepstral coefficient, and $L$ is the number of Mel-frequency cepstral coefficients; the image feature vector is treated as a 1 × 256E matrix, which can be expressed as $B = [H_1(0), \ldots, H_1(255), \ldots, H_E(0), \ldots, H_E(255)]$, where $B$ is the image feature matrix, $H_e(j)$ is the number of LBP values in bin $j$ of the histogram of the $e$-th cell, and $E$ is the number of cells; the speech feature matrix and the image feature matrix take the same number of elements, and their similarity is generated according to the formula $s_{ij} = a_i\,b_j$, where $s_{ij}$ is the similarity of the corresponding speech and image feature matrix elements, $a_i$ is the element in column $i$ of the speech feature matrix, and $b_j$ is the element in column $j$ of the image feature matrix; all calculated similarity values are arranged in order to form a matrix $S$, which is the correlation matrix between the speech feature matrix and the image feature matrix; the attention weights are calculated from the correlation matrix according to the formulas $\alpha = \mathrm{softmax}(S)$ and $\beta = \mathrm{softmax}(S^{T})$, where $\alpha$ is the attention weight of the speech features, $\beta$ is the attention weight of the image features, $S$ is the correlation matrix, and $S^{T}$ is the transpose of the correlation matrix; feature fusion is carried out with the calculated weights, and the fusion weight matrix is generated according to the formula $F = \alpha A + \beta B$, where $F$ is the fusion weight matrix, $\alpha$ is the attention weight of the speech features, $A$ is the speech feature matrix, $\beta$ is the attention weight of the image features, and $B$ is the image feature matrix (a code sketch of this fusion follows the claims).
  2. The speech recognition method based on artificial intelligence according to claim 1, wherein the range of the high-frequency sub-bands in step 1 is judged on the following basis: $f_j \in \left[\frac{f_s}{2^{j+1}}, \frac{f_s}{2^{j}}\right]$, where $f_j$ is the frequency of the high-frequency sub-band, $f_s$ is the sampling rate of the original signal, and $j$ is the corresponding decomposition level of the wavelet transform (the denoising sketch after the claims prints these band limits).
  3. The speech recognition method based on artificial intelligence according to claim 1, wherein in step 1 the speech signal is decomposed into a plurality of sub-bands of different frequencies by wavelet transform according to the formulas $\psi(t) = \begin{cases} 1, & 0 \le t < 1/2 \\ -1, & 1/2 \le t < 1 \\ 0, & \text{otherwise} \end{cases}$; $\psi_{j,k}(t) = 2^{j/2}\,\psi(2^{j}t - k)$; $d_{j,k} = \sum_{t} x(t)\,\psi_{j,k}(t)$; where $\psi(t)$ is the Haar wavelet basis function, $t$ is the time variable, $d_{j,k}$ are the wavelet coefficients of the high-frequency sub-bands, $x(t)$ is the original signal, $\psi_{j,k}(t)$ is the result of dilating and translating the wavelet basis function $\psi(t)$, $j$ is the dilation scale parameter, and $k$ is the translation scale parameter; a noise threshold is set, and the sub-band coefficients are generated from the noise threshold and the wavelet coefficients according to the formula $\hat{d}_{j,k} = \operatorname{sgn}(d_{j,k}) \cdot \max(|d_{j,k}| - \lambda,\, 0)$, where $\hat{d}_{j,k}$ are the processed sub-band coefficients and $\lambda$ is the noise threshold; inverse wavelet transformation is performed on the processed sub-band coefficients to generate the denoised signal according to the formula $\hat{x}(t) = \sum_{j}\sum_{k} \hat{d}_{j,k}\,\psi_{j,k}(t)$, where $\hat{x}(t)$ is the denoised signal, $\hat{d}_{j,k}$ are the sub-band coefficients, $\psi_{j,k}(t)$ is the result of dilating and translating the wavelet basis function $\psi(t)$, $j$ is the dilation scale parameter, and $k$ is the translation scale parameter (see the denoising sketch after the claims).
  4. The speech recognition method based on artificial intelligence according to claim 1, wherein the Mel-frequency cepstral coefficients in step 2 are generated on the following basis: first, the denoised signal is pre-emphasized according to the formula $y(t) = x(t) - \alpha\,x(t-1)$, where $y(t)$ is the value of the pre-emphasized signal at time $t$, $x(t)$ is the value of the denoised signal at time $t$, $x(t-1)$ is the value of the denoised signal at time $t-1$, $\alpha$ is the pre-emphasis coefficient of the signal, and $t$ is the time variable; after the pre-emphasized signal is divided into short-time frames, each frame is windowed according to the formulas $x_w(n) = x(n)\,w(n)$ and $w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)$, where $x_w(n)$ is the result of windowing the segmented frame signal, $x(n)$ is the segmented frame signal, $w(n)$ is the Hamming window function, $n$ is the time index, and $N$ is the frame length; the windowed frame signal is converted from a time-domain signal into a frequency-domain signal according to the formula $X(k) = \sum_{n=0}^{N-1} x_w(n)\,e^{-j2\pi kn/N}$, where $X(k)$ is the frequency-domain signal, $x_w(n)$ is the time-domain signal, $N$ is the frame length, $j$ is the imaginary unit, $k$ is the frequency index with value range [0, N-1], and $n$ is the time index with value range [0, N-1]; the energy output of each filter is based on the formula $H_m(k) = \begin{cases} 0, & k < f_{m-1} \\ \frac{k - f_{m-1}}{f_m - f_{m-1}}, & f_{m-1} \le k \le f_m \\ \frac{f_{m+1} - k}{f_{m+1} - f_m}, & f_m < k \le f_{m+1} \\ 0, & k > f_{m+1} \end{cases}$, where $H_m(k)$ is the value of the $m$-th Mel filter, $k$ is the frequency index, and $f_m$ is the frequency of the $m$-th Mel filter; the logarithm of the energy output of each filter is taken to generate the logarithmic energy, and discrete cosine transform is performed on the logarithmic energy to generate the Mel-frequency cepstral coefficients according to the formulas $E_m = \log\!\left(\sum_{k=0}^{N-1} |X(k)|^2\,H_m(k)\right)$ and $c_l = \sum_{m=1}^{M} E_m \cos\!\left(\frac{\pi l (m - 0.5)}{M}\right)$, where $E_m$ is the logarithmic energy of the $m$-th filter, $X(k)$ is the frequency-domain signal, $H_m(k)$ is the value of the $m$-th Mel filter, $k$ is the frequency index, $N$ is the frame length, $c_l$ is the $l$-th Mel-frequency cepstral coefficient, $M$ is the number of Mel filters, $m$ is the index of the Mel filter, and $l$ is the index of the Mel-frequency cepstral coefficients; the first-order and second-order differences of the Mel-frequency cepstral coefficients are calculated according to the formulas $\Delta c_l(t) = \frac{\sum_{d=1}^{D} d\,[c_l(t+d) - c_l(t-d)]}{2\sum_{d=1}^{D} d^2}$ and $\Delta^2 c_l(t) = \frac{\sum_{d=1}^{D} d\,[\Delta c_l(t+d) - \Delta c_l(t-d)]}{2\sum_{d=1}^{D} d^2}$, where $\Delta c_l$ is the first-order difference of the $l$-th Mel-frequency cepstral coefficient, $\Delta^2 c_l$ is the second-order difference of the $l$-th Mel-frequency cepstral coefficient, and $d$ is the frame offset, determined by the size $D$ of the difference window, whose specific value is determined by the actual application scenario; the Mel-frequency cepstral coefficients, the first-order differences, and the second-order differences are combined into one feature vector on the following basis: $A = [c_1, \ldots, c_L, \Delta c_1, \ldots, \Delta c_L, \Delta^2 c_1, \ldots, \Delta^2 c_L]$, where $A$ is the resulting speech feature vector, $c_l$ is the $l$-th Mel-frequency cepstral coefficient, $\Delta c_l$ is the first-order difference of the $l$-th Mel-frequency cepstral coefficient, $\Delta^2 c_l$ is the second-order difference of the $l$-th Mel-frequency cepstral coefficient, and $L$ is the number of Mel-frequency cepstral coefficients (see the feature-extraction sketch after the claims).
  5. The speech recognition method based on artificial intelligence according to claim 1, wherein the graying of the original image in step 3 is based on the formula $\mathrm{Gray} = 0.299\,R + 0.587\,G + 0.114\,B$, where $\mathrm{Gray}$ is the gray value of the pixel, $R$ is the red channel value of the original image pixel, $G$ is the green channel value of the original image pixel, and $B$ is the blue channel value of the original image pixel.
  6. The speech recognition method based on artificial intelligence according to claim 1, wherein in step 4 the gray values of the central pixel and the neighboring pixels are compared and a binary number is generated on the following basis: for a central pixel with 8 pixels in its neighborhood, the neighborhood can be expressed as $\{g_0, g_1, \ldots, g_7\}$; the value of each binary digit is judged according to the formula $s(g_p - g_c) = \begin{cases} 1, & g_p \ge g_c \\ 0, & g_p < g_c \end{cases}$, where $g_c$ is the gray value of the central pixel, $g_p$ is the gray value of the neighboring pixel, and $s(\cdot)$ is the comparison function that judges according to the size relation between the central pixel and the neighboring pixels; the binary number is converted into the LBP value of the pixel according to the formula $\mathrm{LBP} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^{p}$, where $\mathrm{LBP}$ is the LBP value of the central pixel, $P$ is the number of neighboring pixels, and $s(\cdot)$ is the comparison function that judges according to the size relation between the central pixel and the neighboring pixels (see the LBP sketch after the claims).
  7. An artificial-intelligence-based speech recognition system, characterized in that the system is adapted to implement the speech recognition method based on artificial intelligence according to any one of claims 1-6, and comprises:
     a voice acquisition module, which collects a raw voice signal and performs noise reduction on it: the voice signal is decomposed into a plurality of sub-bands of different frequencies through wavelet transformation, threshold processing is applied to the high-frequency sub-bands, which correspond to the signal's information on different time scales, to weaken noise components, and the processed sub-bands are recombined to obtain a denoised voice signal;
     a voice feature extraction module, which pre-emphasizes the denoised voice signal through a high-pass filter, divides the pre-emphasized signal into short-time frames of 20-40 ms with 50% overlap between adjacent frames, windows each frame, converts the windowed frame from the time domain to the frequency domain by fast Fourier transform, passes the frequency-domain signal through a Mel filter to generate Mel cepstral coefficients, and combines the Mel cepstral coefficients with their first-order and second-order differences into a speech feature vector;
     an image acquisition module, which collects lip-movement information of the speaker in real time as the raw visual image and grays the image to generate a first identification image;
     an image processing module, which normalizes the pixels of the first identification image, selects the 8 neighboring pixels around each pixel, compares the gray value of the central pixel with those of the neighboring pixels to generate a binary number, converts the binary number to a decimal number to obtain the LBP value of the central pixel, and calculates the LBP values of all pixels in the first identification image to form a new image, called a second identification image;
     an image feature extraction module, which divides the second identification image into non-overlapping cells of 8 × 8 pixels, directly discarding cells that do not meet this size, counts the distribution of LBP values of all pixels within each cell to generate the cell's histogram, and concatenates the histograms of all cells end to end in cell order to generate the image feature vector as the final feature representation of the visual signal;
     a feature fusion module, which converts the speech feature vector and the image feature vector into matrix representations, dynamically adjusts the weights of the speech features and the image features through a cross-modal attention mechanism, and generates a fusion weight matrix;
     a model construction module, which constructs a deep learning network model, takes the raw voice signals and raw visual images as the training set and the voice instructions corresponding to the fusion weight matrices as labels, and inputs them into the deep learning network model for training;
     and a voice recognition module, which inputs the voice signal and visual image acquired in real time into the trained deep learning network model, obtains the current fusion weight matrix, and determines the current voice instruction from the fusion weight matrix (a minimal training sketch follows the claims).
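
The sketches below illustrate the main computations in the claims. First, a minimal sketch of the wavelet denoising of step 1 and claims 2-3, written with the PyWavelets library; the decomposition depth, the Haar wavelet, and the universal soft threshold are illustrative assumptions, since the claims fix the structure but not these parameter choices.

```python
# Minimal sketch of the wavelet denoising in step 1 / claims 2-3.
# Assumptions (not fixed by the claims): 3 decomposition levels, a
# MAD noise estimate, and soft thresholding via PyWavelets.
import numpy as np
import pywt

def wavelet_denoise(signal: np.ndarray, level: int = 3) -> np.ndarray:
    # Decompose into one approximation band and `level` detail (high-frequency) bands.
    coeffs = pywt.wavedec(signal, "haar", level=level)
    # Estimate the noise scale from the finest detail band (median absolute deviation).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(len(signal)))  # universal threshold
    # Soft-threshold only the detail (high-frequency) sub-bands.
    denoised = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    # Recombine the processed sub-bands (inverse wavelet transform).
    return pywt.waverec(denoised, "haar")[: len(signal)]

# The dyadic band covered by detail level j (claim 2): [fs / 2**(j+1), fs / 2**j].
fs = 16000
for j in range(1, 4):
    print(f"level {j}: {fs / 2**(j + 1):.0f}-{fs / 2**j:.0f} Hz")
```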
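Next, a sketch of the speech feature vector of step 2 and claim 4, combining MFCCs with their first-order and second-order differences. It leans on librosa for the Hamming window, FFT, Mel filter bank, logarithm, and DCT; the 16 kHz sample rate, 25 ms frames, and 13 coefficients are illustrative assumptions.

```python
# Minimal sketch of the speech feature vector of step 2 / claim 4:
# MFCCs plus first- and second-order differences per frame (1 x 3L).
import numpy as np
import librosa

def speech_features(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])   # pre-emphasis, alpha = 0.97
    n_fft = int(0.025 * sr)                       # 25 ms frames (within 20-40 ms)
    hop = n_fft // 2                              # 50% frame overlap
    # Hamming window, FFT, Mel filtering, log, and DCT are handled by librosa.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop,
                                window="hamming")
    d1 = librosa.feature.delta(mfcc)              # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)     # second-order difference
    # Per frame: [c_1..c_L, delta c_1..delta c_L, delta^2 c_1..delta^2 c_L].
    return np.concatenate([mfcc, d1, d2], axis=0).T
```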
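A sketch of the visual feature path of steps 3-5 and claims 5-6: grayscale conversion, per-pixel LBP codes, and concatenated 256-bin cell histograms. The 0.299/0.587/0.114 luminance weights are the usual choice and an assumption here; the loops are kept naive for clarity.

```python
# Minimal sketch of the visual features of steps 3-5 / claims 5-6.
import numpy as np

def to_gray(rgb: np.ndarray) -> np.ndarray:
    # Luminance conversion (claim 5): Gray = 0.299 R + 0.587 G + 0.114 B.
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def lbp_image(gray: np.ndarray) -> np.ndarray:
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint8)
    # 8-neighbourhood offsets, ordered so bit p corresponds to neighbour p.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c, code = gray[y, x], 0
            for p, (dy, dx) in enumerate(offsets):
                if gray[y + dy, x + dx] >= c:   # s(g_p - g_c) = 1 if g_p >= g_c
                    code |= 1 << p              # LBP = sum of s(.) * 2**p
            out[y, x] = code
    return out

def image_features(lbp: np.ndarray, cell: int = 8) -> np.ndarray:
    h, w = lbp.shape
    feats = []
    # Non-overlapping 8 x 8 cells; undersized border cells are discarded.
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            hist, _ = np.histogram(lbp[y:y + cell, x:x + cell],
                                   bins=256, range=(0, 256))
            feats.append(hist)                  # length-256 histogram per cell
    return np.concatenate(feats)                # 1 x 256E image feature vector
```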
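A sketch of the cross-modal fusion of step 6 in claim 1. The claim defines an element-wise similarity $s_{ij} = a_i b_j$, softmax attention weights over the correlation matrix $S$ and its transpose, and a fusion $F = \alpha A + \beta B$; how the products $\alpha A$ and $\beta B$ are realized is not fixed by the claim, so the row-wise broadcast used here is one plausible reading.

```python
# Minimal sketch of the cross-modal attention fusion (claim 1, step 6).
# The feature vectors are truncated to a common length, as the claim
# requires equal element counts; row-wise weighting is an assumption.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    n = min(a.size, b.size)                  # same number of elements
    a, b = a.ravel()[:n], b.ravel()[:n]
    s = np.outer(a, b)                       # correlation matrix, s_ij = a_i * b_j
    alpha = softmax(s)                       # attention weights for speech features
    beta = softmax(s.T)                      # attention weights for image features
    return alpha * a[None, :] + beta * b[None, :]   # F = alpha A + beta B

speech = np.random.randn(1, 39)              # 1 x 3L speech feature matrix (L = 13)
image = np.random.randn(1, 256)              # 1 x 256E image feature matrix (E = 1)
F = fuse(speech, image)                      # fusion weight matrix, here 39 x 39
```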
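Finally, a sketch of the two-branch model of steps 7-8 and claim 7. The patent does not fix an architecture, so the linear encoders, the multi-head attention layer, the layer sizes, and the ten-command label space are all illustrative assumptions; only the overall wiring (speech branch, visual branch, cross-modal attention, instruction classifier) follows the claims.

```python
# Minimal training sketch for steps 7-8 / claim 7. Architecture and
# hyperparameters are illustrative assumptions, not the patent's model.
import torch
import torch.nn as nn

class AVSpeechNet(nn.Module):
    def __init__(self, speech_dim: int, image_dim: int,
                 hidden: int = 128, n_commands: int = 10):
        super().__init__()
        self.speech_enc = nn.Linear(speech_dim, hidden)   # speech branch
        self.image_enc = nn.Linear(image_dim, hidden)     # visual (lip) branch
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, n_commands)         # voice-instruction label

    def forward(self, speech, image):
        a = self.speech_enc(speech).unsqueeze(1)          # (B, 1, H)
        b = self.image_enc(image).unsqueeze(1)            # (B, 1, H)
        # Cross-modal attention: speech queries attend to visual keys/values.
        fused, _ = self.attn(a, b, b)
        return self.head(fused.squeeze(1))                # logits over instructions

model = AVSpeechNet(speech_dim=39, image_dim=256 * 16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
speech = torch.randn(8, 39)                  # batch of speech feature vectors
image = torch.randn(8, 256 * 16)             # batch of image feature vectors
labels = torch.randint(0, 10, (8,))          # voice-instruction labels
opt.zero_grad()
loss = loss_fn(model(speech, image), labels)
loss.backward()
opt.step()
```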

Description

Speech recognition method and system based on artificial intelligence

Technical Field

The invention relates to the technical field of speech recognition, in particular to a speech recognition method and system based on artificial intelligence.

Background

Currently, speech recognition technology is widely used in fields such as intelligent assistants, automatic translation, and accessibility technology. However, conventional speech recognition systems rely mainly on the audio signal, are susceptible to environmental noise, and consequently suffer reduced recognition accuracy and robustness. Single-audio input performs particularly poorly in complex-noise or multi-speaker conversation scenarios. A multi-modal speech recognition system, which uses audio and visual information simultaneously, can effectively address this problem and improve recognition in noisy environments; the primary challenge is how to integrate the two kinds of information effectively and adjust their contributions during recognition.

In the prior art, publication number CN116580706B discloses a speech recognition method based on artificial intelligence, which collects speech audio recorded by a user, converts the audio into an audio spectrogram, obtains a plurality of audio frames from the spectrogram, extracts feature information from each frame, associates the feature information of the frames to obtain data to be recognized, inputs that data into a trained speech recognition model to determine the speech content corresponding to the audio, checks the content, and obtains and outputs the speech recognition result. The main problems with that method are that it relies entirely on audio information, so recognition accuracy suffers in complex noise environments; different recording devices and conditions cause differences in audio quality and introduce errors when the spectrogram is generated; and feature extraction through the audio spectrogram is complex, requiring substantial computing resources and time, so it cannot meet the demands of real-time speech recognition.

The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure, and it may therefore include information that does not form the prior art already known to a person of ordinary skill in the art.

Disclosure of Invention

The invention aims to provide a speech recognition method and system based on artificial intelligence that solve the problems set out in the background above.
In order to achieve the above purpose, the present invention provides the following technical solution: a speech recognition method based on artificial intelligence, comprising the following specific steps: Step 1, collect a raw voice signal and perform noise reduction on it: decompose the voice signal into a plurality of sub-bands of different frequencies through wavelet transformation, apply threshold processing to the high-frequency sub-bands, which correspond to the signal's information on different time scales, to weaken noise components, and recombine the processed sub-bands to obtain a denoised voice signal; Step 2, pre-emphasize the denoised voice signal through a high-pass filter, divide the pre-emphasized signal into short-time frames of 20-40 ms with 50% overlap between adjacent frames, window each frame, convert the windowed frame from the time domain to the frequency domain by fast Fourier transform, pass the frequency-domain signal through a Mel filter to generate Mel cepstral coefficients, and combine the Mel cepstral coefficients with their first-order and second-order differences into a speech feature vector; Step 3, collect lip-movement information of the speaker in real time as the raw visual image, and gray the image to generate a first identification image; Step 4, normalize the pixels of the first identification image; for each pixel, select the 8 neighboring pixels around it, compare the gray value of the central pixel with those of the neighboring pixels to generate a binary number, and convert the binary number to a decimal number to obtain the LBP value of the central pixel; calculate the LBP values of all pixels in the first identification image to form a new image, namely a second identification image; Step 5, divide the second identification image into non-overlapping cells according to 8 × 8 pixels