CN-122024751-A - Mine scene voice enhancement method and equipment based on multi-modal noise feature learning
Abstract
The application relates to the technical field of voice signal processing and industrial communication, and discloses a mine scene voice enhancement method and equipment based on multi-modal noise feature learning, comprising the following steps: synchronously collecting acoustic, vibration and acceleration signals at a mine site, and extracting multidimensional features after detrending and standardization preprocessing; dynamically fusing the vibration and acceleration features, which characterize the physical working conditions of pure noise sources, with the acoustic features by using a self-attention mechanism; inputting the fused features into a CNN-LSTM deep network to infer an ideal ratio mask vector; and correcting the amplitude spectrum with the mask vector and reconstructing the enhanced voice in combination with the original phase. By introducing non-acoustic physical modalities as auxiliary information and enabling the deep network to exploit the causal relation between mechanical vibration and acoustic noise, the application effectively solves the problem that non-stationary impact noise is difficult to suppress at low signal-to-noise ratios, and improves the clarity and intelligibility of voice communication in complex mine environments.
Inventors
- ZOU HONGTAO
- NING ZHENXING
- BAI XUYANG
- WANG CHAO
- TANG HAOWEN
- LI MENG
Assignees
- 中煤科工集团信息技术有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251226
Claims (10)
- 1. The mine scene voice enhancement method based on multi-modal noise feature learning is characterized by comprising the following steps: S1, acquiring an air conduction signal, a solid conduction vibration signal and an acceleration signal of a mine operation site through a synchronous acquisition module, and performing time-synchronous calibration on the signals to generate an original discrete signal sequence; S2, performing detrending and standardization on the original discrete signal sequence, and performing framing and windowing operations to generate a preprocessed discrete frame sequence; S3, extracting, from the discrete frame sequence, acoustic feature vectors representing the mixed state of voice and noise, and vibration feature vectors and acceleration feature vectors representing the physical working condition of a pure noise source; S4, structurally splicing the acoustic feature vector, the vibration feature vector and the acceleration feature vector, and performing weighted fusion on the spliced features by using a self-attention mechanism to generate a multi-modal fusion feature vector; S5, inputting the multi-modal fusion feature vector into a pre-constructed CNN-LSTM deep enhancement network for inference, the network outputting a spectrum masking vector for the current time frame; S6, correcting the frequency-domain amplitude spectrum of the air conduction signal generated in step S2 by using the spectrum masking vector, and reconstructing an enhanced voice signal in combination with the original phase spectrum.
- 2. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 1, wherein step S1 specifically comprises: collecting a mixed sound wave containing target voice and background noise as the air conduction signal by using a high-dynamic-range microphone; collecting mechanical vibration waves as the solid conduction vibration signal by using a piezoelectric vibration sensor rigidly connected to the surface of noise source equipment; collecting spatial impact characteristics as the acceleration signal by using a triaxial accelerometer arranged on the surface of the noise source equipment; and triggering the sampling of the air conduction signal, the solid conduction vibration signal and the acceleration signal from a unified clock source through a multichannel parallel analog-to-digital converter, so as to ensure that all channel data are strictly aligned in the time domain.
- 3. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 1, wherein step S2 specifically comprises: calculating the arithmetic mean of the signal in the current processing window, and subtracting it point by point from the original discrete signal sequence to eliminate direct-current bias; mapping each modal signal to a standard normal distribution space by Z-score standardization; and splitting the normalized signal into a sequence of overlapping short-time frames using a Hamming window function.
- 4. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 1, wherein in step S3, the specific process of extracting features comprises: performing a short-time Fourier transform on the air conduction signal to generate an amplitude spectrum, calculating short-time energy, short-time zero-crossing rate, spectral centroid and Mel-frequency cepstral coefficients based on the amplitude spectrum, and constructing the acoustic feature vector; calculating the mean, variance, peak value and kurtosis of the solid conduction vibration signal and the acceleration signal respectively, and performing a discrete wavelet transform to calculate multi-scale wavelet coefficient energies, so as to construct the vibration feature vector and the acceleration feature vector; wherein the kurtosis characterizes the tail thickness of the signal's probability density function and identifies impact noise components.
- 5. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 1, wherein in step S4, the specific process of weighted fusion by the self-attention mechanism comprises: splicing the acoustic feature vector, the vibration feature vector and the acceleration feature vector into a joint feature vector according to a preset ordering strategy; projecting the joint feature vector through three independent learnable linear transformation matrices to generate a query vector, a key vector and a value vector; calculating the dot product of the query vector and the transpose of the key vector, dividing the result by a scaling factor, and normalizing with a Softmax function to obtain an attention weight matrix; and performing a weighted summation of the value vectors using the attention weight matrix to output the multi-modal fusion feature vector.
- 6. The mine scene voice enhancement method based on multi-modal noise feature learning according to claim 1, wherein the CNN-LSTM deep enhancement network comprises a local feature encoder, a time-sequence correlation modeler and a spectrum masking estimator connected in sequence, and step S5 specifically comprises: performing convolution and rectified linear unit (ReLU) activation on the input multi-modal fusion feature vector using the local feature encoder, configured as a one-dimensional convolutional neural network, to extract a local feature representation; processing the local feature representation using the time-sequence correlation modeler, configured as a bidirectional long short-term memory (BiLSTM) network, updating the cell state via the gating mechanisms of an input gate, a forget gate and an output gate, and outputting a time-sequence feature vector containing context information; and mapping the time-sequence feature vector to an ideal ratio mask vector over the corresponding frequency bins using the spectrum masking estimator, configured as a fully connected layer.
- 7. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 6, wherein the CNN-LSTM deep enhancement network is pre-trained by: constructing a training set of paired data, each pair comprising a multi-modal fusion feature vector sequence as input and a pure voice amplitude spectrum as label; defining a mean square error loss function that measures the difference between the enhanced voice amplitude spectrum predicted by the network and the pure voice amplitude spectrum; and updating the convolution kernel parameters of the one-dimensional convolutional neural network, the gating weights of the bidirectional long short-term memory network and the parameters of the fully connected layer with an Adam optimizer, according to the backpropagated gradient of the loss function.
- 8. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 1, wherein step S6 specifically comprises: acquiring the original noisy speech amplitude spectrum and the original noisy speech phase spectrum obtained by the short-time Fourier transform of the air conduction signal; calculating the Hadamard product of the spectrum masking vector and the original noisy speech amplitude spectrum to obtain an enhanced amplitude spectrum; synthesizing the enhanced amplitude spectrum and the original noisy speech phase spectrum into a complex spectrum via Euler's formula; and performing an inverse short-time Fourier transform on the complex spectrum to obtain time-domain frame signals, eliminating inter-frame discontinuities by the overlap-add method, and outputting the final enhanced voice signal.
- 9. The mine scene speech enhancement method based on multi-modal noise feature learning of claim 8, wherein the overlap-add method comprises: aligning and superposing adjacent time-domain frame signals on the time axis according to the frame-shift step length set during framing; and dividing the superposed signal by the accumulated square of the window function to normalize it, thereby eliminating periodic amplitude modulation and smoothing the frame boundaries.
- 10. A mine scene speech enhancement device based on multi-modal noise feature learning, characterized in that it applies the mine scene speech enhancement method based on multi-modal noise feature learning of any one of claims 1 to 9, comprising: a multi-modal data acquisition module, configured to connect with an external acquisition unit and synchronously receive an air conduction signal, a solid conduction vibration signal and an acceleration signal of the mine operation site; a signal preprocessing module, configured to perform detrending, standardization, and framing and windowing on the signals received by the multi-modal data acquisition module; a multi-dimensional feature extraction module, configured to extract Mel-frequency cepstral coefficients from the acoustic modality, and kurtosis and wavelet energy features from the vibration and acceleration modalities; an attention feature fusion module, configured to calculate attention weights of the physical features with respect to the acoustic features through a self-attention mechanism and generate a weighted multi-modal fusion feature vector; a deep network inference module, configured to load a pre-trained CNN-LSTM network and infer an ideal ratio mask vector from the multi-modal fusion feature vector; and a signal reconstruction output module, configured to correct the original noisy amplitude spectrum using the ideal ratio mask vector and reuse the original phase spectrum to synthesize the enhanced voice signal.
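As a concrete illustration of the preprocessing in claims 1–3 (step S2), the following NumPy sketch removes the mean, applies Z-score standardization, and frames the signal with a Hamming window. This is a simplification (mean removal here is global rather than per processing window), and the frame length and hop values are assumptions, not taken from the patent.

```python
import numpy as np

def preprocess(x, frame_len=256, hop=128):
    """Detrend, Z-score standardize, then split into overlapping Hamming-windowed frames."""
    x = x - np.mean(x)                       # eliminate direct-current bias (simplified detrending)
    x = x / (np.std(x) + 1e-12)              # Z-score: map toward a standard normal distribution
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i*hop : i*hop + frame_len] * win for i in range(n_frames)])

rng = np.random.default_rng(0)
sig = rng.standard_normal(1024) + 5.0        # synthetic signal with a DC offset
frames = preprocess(sig)
print(frames.shape)                          # (7, 256)
```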
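The frame-level features of claim 4 can be sketched as follows (MFCCs and wavelet energies are omitted for brevity; the 16 kHz sampling rate is a hypothetical choice). Note how the kurtosis of a sparse impulsive frame greatly exceeds that of a smooth tonal frame, which is why claim 4 uses it to flag impact noise.

```python
import numpy as np

def acoustic_features(frame, sr=16000):
    """Short-time energy, zero-crossing rate and spectral centroid of one frame."""
    energy = np.sum(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    return np.array([energy, zcr, centroid])

def stat_features(x):
    """Mean, variance, peak and kurtosis of a vibration/acceleration frame."""
    mu, var = np.mean(x), np.var(x)
    kurt = np.mean((x - mu) ** 4) / (var ** 2 + 1e-12)   # heavy tails -> impact component
    return np.array([mu, var, np.max(np.abs(x)), kurt])

t = np.arange(512)
tonal = np.sin(2 * np.pi * 0.01 * t)          # smooth tonal frame: kurtosis near 1.5
impact = np.zeros(512); impact[::64] = 8.0    # sparse impulsive frame: kurtosis >> 3
print(stat_features(tonal)[3] < 3 < stat_features(impact)[3])   # True
```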
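The weighted fusion of claim 5 is standard scaled dot-product self-attention. A minimal NumPy sketch, with toy dimensions and random projection matrices standing in for the learnable transformations:

```python
import numpy as np

def self_attention_fuse(joint, Wq, Wk, Wv):
    """Scaled dot-product self-attention over spliced feature vectors.
    joint: (T, d) rows, e.g. acoustic / vibration / acceleration features."""
    Q, K, V = joint @ Wq, joint @ Wk, joint @ Wv          # three learnable linear projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # dot product divided by scaling factor
    scores -= scores.max(axis=-1, keepdims=True)          # shift for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # Softmax -> attention weight matrix
    return attn @ V                                       # weighted sum of the value vectors

rng = np.random.default_rng(1)
joint = rng.standard_normal((3, 8))                       # toy: three 8-dim modality features
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
fused = self_attention_fuse(joint, Wq, Wk, Wv)
print(fused.shape)                                        # (3, 8)
```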
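For the gating mechanism named in claim 6, here is a minimal single-direction, single-time-step LSTM cell in NumPy. The claim actually specifies a bidirectional network preceded by a 1-D CNN encoder, normally built with a deep-learning framework; this sketch only shows how the input, forget and output gates update the cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) biases, stacked in gate order i, f, o, g."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])            # input gate: how much new information enters
    f = sigmoid(z[H:2*H])          # forget gate: how much old cell state is kept
    o = sigmoid(z[2*H:3*H])        # output gate: how much cell state is exposed
    g = np.tanh(z[3*H:4*H])        # candidate cell update
    c_new = f * c + i * g          # gated cell-state update
    h_new = o * np.tanh(c_new)     # gated hidden output
    return h_new, c_new

rng = np.random.default_rng(2)
H, D = 4, 6
W, U, b = rng.standard_normal((4*H, D)), rng.standard_normal((4*H, H)), np.zeros(4*H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape)                     # (4,)
```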
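The spectral correction of claim 8 reduces to an element-wise (Hadamard) product followed by re-attaching the noisy phase via Euler's formula, M·e^{jφ} = M(cos φ + j sin φ). A toy single-frame sketch; the uniform 0.5 mask is hypothetical, standing in for the network's output:

```python
import numpy as np

def apply_mask(mask, noisy_mag, noisy_phase):
    """Hadamard product of mask and amplitude spectrum, then rebuild the complex spectrum."""
    return (mask * noisy_mag) * np.exp(1j * noisy_phase)  # M * |X| * e^{j*phase}

x0 = np.hamming(256) * np.sin(2 * np.pi * 0.05 * np.arange(256))  # one windowed frame
spec = np.fft.rfft(x0)
mag, phase = np.abs(spec), np.angle(spec)
mask = np.full_like(mag, 0.5)                              # hypothetical ideal-ratio-mask values
frame = np.fft.irfft(apply_mask(mask, mag, phase), n=256)  # inverse transform of the frame
print(np.allclose(frame, 0.5 * x0))                        # True: uniform mask halves the frame
```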
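Claim 9's normalization by the accumulated squared window can be sketched as follows: dividing the superposed signal by the sum of squared, shifted windows cancels the analysis-plus-synthesis windowing wherever the window is nonzero, which is what removes the periodic amplitude modulation. The frame length and hop below are illustrative.

```python
import numpy as np

def overlap_add(frames, hop, win):
    """Overlap-add synthesis: re-window, sum at the frame shift, then divide by
    the accumulated squared window to cancel the amplitude modulation."""
    frame_len = frames.shape[1]
    n = hop * (len(frames) - 1) + frame_len
    out, norm = np.zeros(n), np.zeros(n)
    for i, fr in enumerate(frames):
        s = i * hop
        out[s:s + frame_len] += fr * win        # synthesis windowing + superposition
        norm[s:s + frame_len] += win ** 2       # accumulated square of the window
    return out / np.maximum(norm, 1e-8)         # normalize to smooth frame boundaries

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)
L, hop = 256, 128
win = np.hamming(L)
frames = np.stack([x[i*hop : i*hop + L] * win for i in range(1 + (len(x) - L)//hop)])
print(np.allclose(overlap_add(frames, hop, win), x))   # True: analysis/synthesis round trip
```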
Description
Mine scene voice enhancement method and equipment based on multi-modal noise feature learning
Technical Field
The invention relates to the technical field of voice signal processing and industrial communication, in particular to a mine scene voice enhancement method and equipment based on multi-modal noise feature learning.
Background
Mining environments are typically confined underground spaces or complex open pits, accompanied by high-intensity operation of large mechanical equipment such as drilling rigs, roadheaders, local ventilators and transport vehicles. During operation these devices generate background noise with extremely high sound pressure levels, and the noise types are highly complex, including broadband steady-state fan noise, non-stationary impact noise when a drill penetrates rock, and electromagnetic noise from running motors. In such high-risk, high-noise environments, maintaining clear and continuous voice communication between operators, and between operators and the ground command center, is vital for guaranteeing production safety, scheduling operation flows and directing emergency rescue. To extract a target speech signal from noisy speech, the prior art generally employs single-channel speech enhancement algorithms based on statistical models, such as spectral subtraction, Wiener filtering and minimum mean square error estimators. Most of these conventional algorithms rest on the assumption that the noise is short-time stationary, i.e. that its statistical properties change only slowly over a short time.
However, in actual mining scenes, noise generated by drill-rod impacts, falling ore and the like is strongly non-stationary and bursty, and its energy jumps abruptly in the time-frequency domain, so the traditional algorithms struggle to track changes in the noise spectrum accurately; the processed voice often retains obvious musical noise or suffers audible distortion, and practical communication requirements are hard to meet. With the development of deep learning, speech enhancement methods based on deep neural networks (DNN), convolutional neural networks (CNN) or recurrent neural networks (RNN) are becoming mainstream. These methods learn the nonlinear mapping from noisy spectra to clean spectra through supervised training on large numbers of noisy and pure speech pairs, and improve the enhancement effect under non-stationary noise to a certain extent. However, most existing deep-learning speech enhancement schemes rely only on a single acoustic modality as input (i.e. signals acquired solely by a microphone). In mine environments with very low signal-to-noise ratios (e.g. -5 dB or even lower), high-intensity mechanical noise tends to completely mask the formant structure of the speech signal in the frequency domain, and the spectral texture of the mechanical noise is sometimes very similar to that of speech. Lacking external auxiliary information, a deep neural network relying on acoustic features alone can hardly separate voice components from noise components under such severe spectral aliasing, and is prone either to damaging the target voice through over-suppression or to leaving strong noise residue through under-suppression.
In addition, although some technologies attempt to suppress noise by beamforming with a microphone array, underground mine roadways are narrow, their wall surfaces are rough, and sound wave reflection and reverberation are severe, so the spatial phase information required for sound source localization is largely destroyed and array processing performance is limited. More importantly, all of the above acoustic-sensor-based methods essentially process a signal already mixed with noise, without exploiting the source physical information that generates the noise. Mining mechanical noise is mainly caused by the mechanical vibration of equipment; the vibration signal is highly causally correlated with the acoustic noise, contains no target voice component at all, and is therefore an ideal noise reference source. However, the prior art has established no effective mechanism to deeply fuse non-acoustic physical modalities, such as the vibration and acceleration of equipment surfaces, with acoustic signals, and so cannot fully exploit the complementary potential of multi-modal information in complex industrial speech enhancement tasks.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a mine scene voice enhancement method and equipment based on multi-modal noise feature learning, which solves the problems that the traditional single-channel voice enhancement technology cannot dist