CN-122024712-A - Laboratory strong noise environment voice recognition method based on deep learning


Abstract

The invention relates to the technical field of voice signal processing, and in particular to a laboratory strong noise environment voice recognition method based on deep learning. The method comprises: obtaining a voice signal to be recognized; adjusting its energy envelope by gain control, compensating the gain in real time according to the background sound pressure level, and outputting a laboratory heterogeneous signal; inputting that signal into a dual-branch feature extractor, in which a Transformer layer captures long-range semantic dependency features and a convolutional neural network layer captures local spectral features; fusing the two feature streams with dynamic weights to construct a feature matrix; denoising the feature matrix with a feature space mapping function to eliminate the offset generated by mechanical vibration; and finally inputting the result into a back-end engine, with a joint optimization loss function constraining both the front-end and back-end parameters. The invention effectively suppresses the composite interference from the laboratory ventilation system and instruments and solves the problem of feature space offset.

Inventors

LV XIAOLONG
HOU WENGAO
LU JIAJIA
ZHANG JUNSHENG

Assignees

派尔实验装备有限公司

Dates

Publication Date: 20260512
Application Date: 20260413

Claims (7)

1. A laboratory strong noise environment voice recognition method based on deep learning, characterized by comprising the following steps: acquiring a voice signal to be recognized in a laboratory environment, performing energy envelope adjustment on the voice signal to be recognized through a gain control method, and compensating the signal gain in real time according to dynamic changes of the current background sound pressure level to output a laboratory heterogeneous signal, wherein, during the energy envelope adjustment, an instantaneous energy envelope of the voice signal to be recognized is extracted and, for an instantaneous energy envelope lower than a preset sensitivity threshold, a gain increment is calculated in combination with a logarithmic weight that takes the current average background sound pressure level as its base; inputting the laboratory heterogeneous signal simultaneously into a dual-branch feature extractor, capturing long-range semantic dependency features in the laboratory heterogeneous signal through a self-attention mechanism by using a Transformer layer in a global feature extraction branch, and capturing local spectral features in the laboratory heterogeneous signal through a feature convolution operator by using a convolutional neural network layer in a local feature extraction branch; obtaining a dynamic fusion weight vector through calculation by a weight branch network, performing element-wise multiplication on the long-range semantic dependency features with the dynamic fusion weight vector, performing element-wise multiplication on the local spectral features with the complement (reverse remainder) of the dynamic fusion weight vector, performing linear weighted summation on the resulting enhanced semantic features and enhanced acoustic features to construct a dual-branch feature matrix, and performing noise reduction on the dual-branch feature matrix with a feature space mapping function to eliminate the feature space offset generated by laboratory mechanical vibration; and inputting the noise-reduced dual-branch feature matrix into a back-end recognition engine, simultaneously constraining front-end enhancement parameters and back-end recognition classification parameters with a joint optimization loss function, calculating the distribution distance between the dual-branch feature matrix and a preset clean voice feature space with a feature distribution consistency regularization term in the joint optimization loss function, adjusting the step value of a gain compensation coefficient according to the distribution distance, and outputting a target recognition result.
2. The laboratory strong noise environment voice recognition method based on deep learning according to claim 1, wherein the gain control method comprises: collecting, in real time with a sliding window, the leading non-voice segment of the voice signal to be recognized, and calculating the average background sound pressure level of the current laboratory environment as the reference value for gain adjustment; extracting the instantaneous energy envelope of the voice signal to be recognized with a time-domain smoothing filter, and comparing the instantaneous energy envelope with a preset upper limit of the linear dynamic range; calculating an adaptive gain compensation coefficient according to the fluctuation amplitude of the average background sound pressure level, namely: when the instantaneous energy envelope is lower than a preset sensitivity threshold, raising the gain compensation coefficient according to a preset first mapping function, and when the difference between the instantaneous energy envelope and the upper limit of the linear dynamic range is smaller than a preset safety margin threshold, lowering the gain compensation coefficient according to a preset second mapping function; and applying the calculated gain compensation coefficient to each sampling point of the voice signal to be recognized, and mapping the adjusted signal to a standard amplitude range through an energy normalization function to obtain the laboratory heterogeneous signal.
3. The laboratory strong noise environment voice recognition method based on deep learning according to claim 2, characterized in that the average background sound pressure level is calculated by: sampling the leading non-voice segment with a sliding time window, extracting the root-mean-square amplitude of each frame, calculating the instantaneous sound pressure level of each frame through a logarithmic conversion function based on a preset reference sound pressure value, and taking a time-weighted average of all instantaneous sound pressure levels within the sliding time window to obtain the average background sound pressure level; and wherein the first mapping function is a gain enhancement function and the second mapping function is a nonlinear compression function.
4. The laboratory strong noise environment voice recognition method based on deep learning according to claim 1, wherein the feature extraction performed by the dual-branch feature extractor comprises: using the Transformer layer in the global feature extraction branch to calculate, through a multi-head self-attention mechanism, the association weights of the laboratory heterogeneous signal over different time spans, and extracting long-range semantic dependency features that reflect the contextual logic of the voice and are used to compensate for the continuous broadband masking noise generated by the laboratory ventilation system; using the convolutional neural network layer in the local feature extraction branch to scan the spectrum of the laboratory heterogeneous signal locally through a feature convolution operator with a preset stride, and extracting local spectral features that reflect instantaneous acoustic changes and are used to identify and eliminate the periodic electromagnetic interference pulses generated by laboratory instruments; and applying layer normalization to the long-range semantic dependency features and the local spectral features, and mapping the two feature streams into a space of the same dimension through a linear projection operator.
5. The laboratory strong noise environment voice recognition method based on deep learning, characterized in that the construction of the dual-branch feature matrix comprises: concatenating the aligned long-range semantic dependency features with the local spectral features and inputting them into a preset weight branch network; calculating, through global average pooling and a fully connected mapping, a dynamic fusion weight vector matched to the feature dimension; multiplying the long-range semantic dependency features element-wise by the dynamic fusion weight vector to obtain enhanced semantic features; multiplying the local spectral features element-wise by the complement (reverse remainder) of the dynamic fusion weight vector to obtain enhanced acoustic features; and performing linear weighted summation on the enhanced semantic features and the enhanced acoustic features to obtain the dual-branch feature matrix.
6. The laboratory strong noise environment voice recognition method based on deep learning, characterized in that the back-end recognition engine comprises an acoustic modeling unit and a connectionist temporal classification (CTC) decoder, and the recognition process comprises: performing, by the acoustic modeling unit, a nonlinear transformation on the input dual-branch feature matrix and mapping it to a posterior probability distribution of each time frame over a preset set of modeling units; correcting the parameters of the acoustic modeling unit by gradient back-propagation through the acoustic model loss term in the joint optimization loss function; calculating the distribution distance between the dual-branch feature matrix and a preset clean voice feature space with the feature distribution consistency regularization term in the joint optimization loss function, and adjusting the step value of the gain compensation coefficient according to the distribution distance; and performing path search and de-duplication on the posterior probability distribution with the CTC decoder, and outputting the final target recognition result in combination with a preset laboratory dictionary and a language model.
7. The laboratory strong noise environment voice recognition method based on deep learning according to claim 6, wherein the decoding process of the CTC decoder comprises: introducing a blank placeholder label into the preset set of modeling units; constructing, from the posterior probability distribution, an expanded state path space containing the blank placeholder label, so as to align the length difference between the dual-branch feature matrix and the target text sequence; searching all possible decoding paths in the expanded state path space with a forward-backward algorithm, and accumulating and merging the probabilities of paths that collapse to the same text sequence; extracting original candidate text sequences by deleting consecutively repeated labels and the blank placeholder labels; pruning the original candidate text sequences with a beam search algorithm, and introducing a preset dictionary of laboratory technical terms for word bias correction; and adaptively adjusting the search width of the beam search algorithm according to the current value of the average background sound pressure level.
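
The gain control steps of claims 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims leave the reference sound pressure, both mapping functions, all thresholds and the energy normalization function as "preset", so the 20 µPa reference, the `log10(1 + SPL)` weight, the compression curve, peak normalization and every numeric default below are hypothetical choices made only to keep the example runnable.

```python
import math

P_REF = 20e-6  # assumed reference sound pressure in Pa; the claims only say "preset"


def frame_spl(frame, p_ref=P_REF):
    """Instantaneous sound pressure level (dB) of one frame from its RMS amplitude."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12) / p_ref)


def average_background_spl(frames, weights=None):
    """Time-weighted average of the per-frame levels inside the sliding window."""
    levels = [frame_spl(f) for f in frames]
    weights = weights or [1.0] * len(levels)
    return sum(w * l for w, l in zip(weights, levels)) / sum(weights)


def gain_coefficient(envelope, background_spl,
                     sensitivity=0.05, upper_limit=1.0, safety_margin=0.1):
    """Hypothetical first and second mapping functions for the gain coefficient."""
    if envelope < sensitivity:
        # First mapping (gain enhancement): the boost grows with a log weight of
        # the background level and with the distance below the threshold.
        log_weight = math.log10(1.0 + max(background_spl, 0.0))
        return 1.0 + log_weight * (sensitivity - envelope) / sensitivity
    headroom = max(upper_limit - envelope, 0.0)
    if headroom < safety_margin:
        # Second mapping (nonlinear compression): the gain falls from 1.0 toward
        # 0.5 as the envelope approaches the upper limit of the linear range.
        return 1.0 / (1.0 + (safety_margin - headroom) / safety_margin)
    return 1.0


def apply_gain_and_normalize(samples, gain):
    """Apply the gain to every sample, then peak-normalize into [-1, 1] (assumed)."""
    scaled = [gain * s for s in samples]
    peak = max(abs(s) for s in scaled)
    return [s / peak for s in scaled] if peak > 0 else scaled
```

In this sketch a quiet envelope in a loud room is boosted, a mid-range envelope passes through with unit gain, and an envelope within the safety margin of the upper limit is attenuated, which mirrors the three cases described in claim 2.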
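
The dynamic weight fusion of claim 5 can be illustrated with a small pure-Python sketch. It assumes that the "reverse remainder" of the weight vector means its complement `1 - w`, and that the fully connected mapping ends in a sigmoid so each weight lies between 0 and 1; neither detail is stated in the claims. The names `fusion_weights`, `fuse`, `fc_weight` and `fc_bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fusion_weights(semantic, acoustic, fc_weight, fc_bias):
    """Compute the dynamic fusion weight vector (length D).

    semantic, acoustic: T x D feature matrices already projected to the same
    dimension. fc_weight is D x 2D and fc_bias has length D (assumed shapes).
    """
    frames = len(semantic)
    # Concatenate the two branches along the feature axis: T x 2D.
    concat = [s_row + a_row for s_row, a_row in zip(semantic, acoustic)]
    # Global average pooling over time: one value per concatenated channel.
    pooled = [sum(row[j] for row in concat) / frames for j in range(len(concat[0]))]
    # Fully connected mapping followed by a sigmoid (the sigmoid is an assumption).
    return [
        sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
        for w_row, b in zip(fc_weight, fc_bias)
    ]


def fuse(semantic, acoustic, w):
    """Element-wise w * semantic + (1 - w) * acoustic for every time frame."""
    return [
        [wi * s + (1.0 - wi) * a for wi, s, a in zip(w, s_row, a_row)]
        for s_row, a_row in zip(semantic, acoustic)
    ]
```

With all-zero `fc_weight` and `fc_bias` every weight is 0.5, so the fused matrix is the plain average of the two branches; a large positive bias pushes the result toward the semantic branch and a large negative bias toward the acoustic branch.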
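
The path-merging idea in claim 7, where consecutively repeated labels and blank placeholders are deleted and the probabilities of all paths that collapse to the same text are accumulated, can be shown on a toy example. The sketch below enumerates every path by brute force, which is only feasible for a handful of frames; it is a stand-in for, not an implementation of, the forward-backward search and beam pruning named in the claim. The blank symbol `"_"` and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

BLANK = "_"  # hypothetical symbol for the blank placeholder label


def collapse(path):
    """Delete consecutively repeated labels, then delete blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


def merged_sequence_probs(posteriors, labels):
    """Sum the probability of every frame-level path per collapsed text.

    posteriors: T rows, each a list of per-label probabilities aligned with
    `labels`. Brute force over len(labels) ** T paths, for illustration only.
    """
    totals = defaultdict(float)
    for path in product(range(len(labels)), repeat=len(posteriors)):
        prob = 1.0
        for t, k in enumerate(path):
            prob *= posteriors[t][k]
        totals[collapse(labels[k] for k in path)] += prob
    return dict(totals)
```

For two frames over the labels `["_", "a"]` with posteriors `[[0.4, 0.6], [0.5, 0.5]]`, the paths `_a`, `a_` and `aa` all collapse to `"a"`, so their probabilities are merged, while only `__` collapses to the empty string.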

Description

Laboratory strong noise environment voice recognition method based on deep learning

Technical Field

The invention relates to the technical field of voice signal processing, and in particular to a laboratory strong noise environment voice recognition method based on deep learning.

Background

With the rising level of laboratory automation, contactless voice control has become a core technology for keeping researchers safe in hands-busy, high-risk scenarios such as chemical synthesis and biological experiments. The laboratory environment is acoustically heterogeneous, and its complex noise field places stringent demands on the robustness of voice recognition systems.

Current voice recognition technology is mainly built on deep learning architectures and converts signals into text using an acoustic model and a language model. To handle environmental noise, the prior art generally applies fixed gain adjustment or generic spectral subtraction as front-end preprocessing, combined with a convolutional neural network or a recurrent neural network for feature extraction.

In laboratory-specific scenarios, however, existing solutions have significant drawbacks. First, generic gain control cannot recognize and compensate for the severe sound pressure level fluctuations caused by the starting and stopping of high-power equipment such as centrifuges and fume hoods, so the voice signal is prone to clipping distortion or is drowned out by background noise. Second, a conventional single-branch feature extraction architecture struggles to decouple the continuous broadband masking noise generated by the laboratory ventilation system from the local periodic electromagnetic interference pulses generated by precision instruments, which limits its feature representation capability. Third, because the front-end enhancement module and the back-end recognition engine are usually optimized independently, the system lacks an overall coupling constraint, so the feature space offset introduced during enhancement cannot be effectively corrected at the recognition stage.

The problem addressed by the invention is how to achieve adaptive, accurate gain compensation, parallel decoupled extraction of global and local features, and linked optimization of the front-end and back-end models in the extremely low signal-to-noise-ratio environment peculiar to laboratories, where high-frequency turbulence, broadband vibration and periodic electromagnetic interference are interwoven, so as to construct a highly robust voice recognition control method that accurately captures experimental instructions. To this end, a laboratory strong noise environment voice recognition method based on deep learning is provided.

Disclosure of Invention

The invention aims to provide a laboratory strong noise environment voice recognition method based on deep learning that improves recognition robustness in environments with an extremely low signal-to-noise ratio through adaptive gain compensation, dual-branch decoupled feature extraction, and joint optimization of the front end and back end.
To achieve the above purpose, the invention provides the following technical solution: a laboratory strong noise environment voice recognition method based on deep learning, comprising the following steps: acquiring a voice signal to be recognized in a laboratory environment, performing energy envelope adjustment on the voice signal to be recognized through a gain control method, and compensating the signal gain in real time according to dynamic changes of the current background sound pressure level to output a laboratory heterogeneous signal, wherein, during the energy envelope adjustment, an instantaneous energy envelope of the voice signal to be recognized is extracted and, for an instantaneous energy envelope lower than a preset sensitivity threshold, a gain increment is calculated in combination with a logarithmic weight that takes the current average background sound pressure level as its base; inputting the laboratory heterogeneous signal simultaneously into a dual-branch feature extractor, capturing long-range semantic dependency features in the laboratory heterogeneous signal through a self-attention mechanism by using a Transformer layer in a global feature extraction branch, and capturing local spectral features in the laboratory heterogeneous signal through a feature convolution operator by using a convolutional neural network layer in a local feature extraction branch; obtaining a dynamic fusion weight vector through calculation by a weight branch network, performing element-wise multiplication on the long-range semantic dependency features with the dynamic fusion weight vector, performing element-wise multiplication on the