CN-119559940-B - End-to-end voice recognition method for empty pipe instruction under high noise condition

CN119559940BCN 119559940 BCN119559940 BCN 119559940BCN-119559940-B

Abstract

The invention relates to the technical field of air traffic control and voice recognition, in particular to an air traffic control instruction end-to-end voice recognition method under the high noise condition, which comprises the steps of preprocessing voice to be recognized and extracting to obtain original voice frequency characteristics; the self-adaptive attention noise reduction module is adopted to carry out noise reduction intensity control, frequency band gain control and pitch tracking processing on the original sound frequency characteristics, so as to obtain noise reduction voice; preprocessing the noise-reduced voice, and extracting to obtain noise-reduced voice frequency characteristics; the invention can improve the voice recognition precision of the empty pipe instruction under the high noise condition.

Inventors

YANG YANG
ZHU YANBO
JIANG YUTONG
HUI YI
CAI KAIQUAN

Assignees

北京航空航天大学

Dates

Publication Date: 20260508
Application Date: 20241126

Claims (7)

1. The end-to-end voice recognition method for the empty pipe instruction under the high noise condition is characterized by adopting a trained end-to-end voice recognition model to perform voice recognition and comprises the following steps of: s1, preprocessing voice to be recognized, and extracting to obtain original voice frequency characteristics; s2, adopting a self-adaptive attention noise reduction module to perform noise reduction intensity control, frequency band gain control and pitch tracking processing on the original voice frequency characteristics to obtain noise reduction voice; s3, preprocessing the noise-reduced voice, and extracting to obtain noise-reduced voice frequency characteristics; s4, adopting a voice recognition module to encode the noise-reduced voice frequency characteristics by a shared encoder, decode by a connection time sequence classification decoder and decode by an attention-based encoder-decoder to obtain an output voice recognition transcription text; The step S2 specifically includes: S2-1, adopting a self-adaptive controller to process the primitive voice frequency characteristics by a convolutional neural network and a gating circulation unit to obtain noise reduction force; s2-2, processing the original voice frequency characteristics by adopting an RNN-GRU network, and obtaining the band gains of all the frequency bands of the audio frequency domain signals through smooth interpolation; S2-3, detecting a pitch period in the voice signal, and determining a gain signal through a comb filter based on the pitch period; S2-4, the gain signal is subjected to the noise reduction force and the control of the frequency band gain to obtain a noise-reduced frequency domain signal, and the noise-reduced voice is obtained through IFFT (inverse fast Fourier transform); In step S2-1, the expression of the convolutional neural network and the gating cycle unit processing procedure is as follows: Wherein, the Is FBank feature vectors of the input original speech, A convolution operation is represented and is performed, A batch normalization is shown and is performed, The maximum pooling is indicated and the maximum pool is indicated, Is a gate-controlled circulation unit, Is a full-connection layer, and is formed by a plurality of layers, Is an intermediate feature vector of the network of adaptive controllers, Representing noise reduction intensity parameters; in step S2-2, the expression of the processing procedure of the RNN-GRU network is as follows: Wherein, the Representing the speech signal at the first The energy of the frequency band is used to determine, Representing noisy speech signal at the first The energy of the frequency band is used to determine, Representing dynamically adjusted noise reduction intensity; Step S2-3 comprises obtaining band-dependent filter coefficients based on a comb filter of pitch period Wherein the filter coefficients The expression is: Wherein, the The representation takes the minimum value of the value, For signals in the frequency band Is used to determine the complex number of the (c), Is a frequency band Is used for the complex conjugate of the pitch signal, Is the frequency band Is used to determine the power of the spectral values of (c), Is the frequency band Is used for the power of the pitch signal, Is an index of the frequency band and, Is the total frequency band number; training the end-to-end voice recognition model, and using the trained end-to-end voice recognition model for air management instruction voice recognition, wherein the training steps specifically comprise: S5, constructing an empty pipe environmental noise-clean voice simulation data set and an empty pipe instruction voice-text data set; And S6, pre-training the self-adaptive attention noise reduction module by adopting the empty pipe environmental noise-clean voice simulation data set, independently training the voice recognition module by adopting an empty pipe instruction voice-text data set, and jointly training the end-to-end voice recognition model by adopting an empty pipe instruction voice-text data set.
2. The method for end-to-end voice recognition of empty pipe command under high noise condition according to claim 1, wherein the step S1 specifically comprises: s1-1, respectively carrying out pre-emphasis, framing and windowing, fast Fourier transform and frequency band division processing on voice to be recognized to obtain filter bank characteristics; S1-2, acquiring a pitch period of a voice to be recognized and a spectrum non-stationarity measure; And step S1-3, taking the first and second derivatives of the filter bank characteristic, the pitch period and the spectrum non-stationarity measure as the primitive voice frequency characteristic.
3. The method for end-to-end voice recognition of empty pipe instruction under high noise condition according to claim 2, wherein in step S1: the expression of the pre-emphasis processing mode is as follows: Wherein, the Representing speech signal No The values of the individual sample points are used, Representing speech signal No The sampled value of the points, 0.97 being the coefficient of the high-pass filter, determines the proportion of high-frequency compensation, Representing pre-emphasis processed speech signal Values of the sampling points; the window function of the framing and windowing has the expression: Wherein, the Representing the first in a frame The window function value of each sampling point, Is the total sampling point number in each frame; the computational expression of the filter bank features is: Wherein, the Is the first The filter bank characteristics of the individual frequency bands, Is the first voice signal The energy of the individual frequency bands is used, Is a constant of a small value, which is, Is a frequency band At the frequency of The amplitude of the light at the location(s), Representing frequency The amplitude of the corresponding spectral value.
4. The method for end-to-end voice recognition of empty pipe command under high noise condition according to claim 3, wherein said step S4 specifically comprises: s4-1, inputting the noise reduction voice frequency characteristics into the shared encoder to obtain a high-dimensional representation; s4-2, inputting the high-dimensional representation to a time sequence classification decoder to obtain an activation value; And S4-3, re-scoring the activation value by the attention-based encoder-decoder to generate a final voice recognition transcription text.
5. The method for end-to-end speech recognition of empty pipe command under high noise condition according to claim 4, wherein in step S4-1, the shared encoder is composed of a plurality of Conformer layers, and the Conformer layers have the structural expression: Wherein, the Is the FBank feature vector of the input noise-reduced speech, For a fully connected feed forward network, In order to be a multi-headed self-attention mechanism, In order for the convolution operation to be performed, For the layer normalization operation, Is the intermediate feature vector of the Conformer layers, A high-dimensional representation representing the output; in step S4-2, the expression of the connection time sequence classification decoder calculation process is: Wherein, the For a given input feature sequence Generating an output sequence Is a function of the probability of (1), In order to align the paths of the optical device, Representation generation tag sequences Is a set of all possible paths of the (c), Representing an alignment path Probability of (2); in step S4-3, the expression of the attention-based encoder-decoder calculation process is: Wherein, the Is the hidden state of the encoder and, Is the output of the encoder, and It is the input of the decoder and, Weighting the number of output vectors of the encoder Representing the current time step of the decoder and the encoder Correlations between the hidden states, thereby helping the model to generate a more accurate output sequence, Is the first And context vectors.
6. The method for end-to-end voice recognition of empty pipe command under high noise condition according to claim 1, wherein the step S5 specifically comprises: S5-1, respectively forming a noise data set and a clean voice data set in the empty pipe environmental noise-clean voice simulation data set by adopting an open source noise data set and an open source clean voice Chinese data set; s5-2, using air traffic control system operation production data as original data, and processing the original data to form an air traffic control instruction voice-text data set; The step S5-2 specifically comprises the following steps: s5-2-1, dividing the original data into voice fragments, marking the range of the voice-containing fragments according to the audio frequency spectrum change, and identifying the voice-containing fragments by using an existing automatic voice identification model to obtain transcribed text of each voice-containing fragment; S5-2-2, manually labeling the voice-containing fragment and the corresponding transcribed text; And S5-2-3, performing text screening on the manual labeling result, removing a part with unclear labeling, dividing the part to form voice of a single sentence instruction and corresponding text, and forming the voice-text data set of the air management instruction.
7. The method for end-to-end voice recognition of empty pipe command under high noise condition as defined in claim 6, wherein said step S6 specifically comprises: Determining a loss function of the adaptive attention noise reduction module: Wherein, the For the loss function of the adaptive attention noise reduction module, Is the first Gain estimation, index for individual bands Is a perception parameter and controls the degree of noise suppression; Determining a loss function of the speech recognition module: Wherein, the As a loss function of the speech recognition module, Respectively input voice characteristics and target text sequences, The losses of the CTC encoder and AED encoder respectively, Is a first balance weight coefficient; determining a loss function of the end-to-end model: Wherein, the As a loss function of the end-to-end model, In order to identify the word error rate of the result, Is a second plane Heng Quanchong coefficient; And aiming at minimizing the loss functions of the self-adaptive attention noise reduction module, the voice recognition module and the end-to-end model, the pre-training of the self-adaptive attention noise reduction module, the independent training of the voice recognition module and the joint training of the end-to-end voice recognition model are completed.

Description

End-to-end voice recognition method for empty pipe instruction under high noise condition Technical Field The invention relates to the technical field of air traffic control and voice recognition, in particular to an air traffic control instruction end-to-end voice recognition method under a high noise condition. Background An empty pipe command is a specific operation command issued by the controller to the pilot. Through the air traffic control instruction, the controller provides situation information about the scene and the airspace of the aircraft to the pilot in real time, instructs the pilot to complete specific operations, and the pilot completes corresponding operations such as taking off, landing, changing the altitude, speed or direction of flight and the like according to the instruction, so that the air traffic order is maintained, collision among the aircraft is prevented, and the aircraft is ensured to fly according to the specified route and speed. Air traffic management primarily communicates instructions and information through voice communications between pilots and air traffic controllers, but the process relies on manual identification, and there is a "forget-to-leak" phenomenon, resulting in frequent unsafe events. Therefore, it is necessary to introduce a voice recognition system to recognize commands and reply voices of controllers and pilots in real time, thereby reducing understanding ambiguity, forgetfulness, and the like. Because the communication environment of the pilot is a cabin, the problem of serious noise interference exists, and the empty pipe voice is transmitted through the ground-air very high frequency communication, so that the conditions of high frequency signal loss and amplitude interception in the audio are caused. The complex communication environment characteristics exacerbate the recognition difficulty, and particularly when the problems of multi-source noise superposition, unstable frequency bands and the like are faced, the accurate voice recognition result is difficult to obtain under the condition in the prior art. The current air traffic control instruction voice recognition method is commonly used for a mixed voice recognition model mainly trained by each module independently, and has the defects of insufficient model collaborative optimization, limited information sharing among modules and easy occurrence of local optimal solution in the training process, so that the overall optimal effect is difficult to achieve. In summary, the prior art lacks noise reduction optimization design for complex multi-source noise and communication environment characteristics, and especially lacks targeted processing for factors such as cabin noise, very high frequency communication distortion and the like. Meanwhile, the parts of the hybrid recognition model cannot realize joint optimization, so that the accuracy and the robustness of the voice recognition in a complex environment are insufficient, and the actual requirements are difficult to meet. Disclosure of Invention In view of the above problems, the invention provides a blank pipe instruction end-to-end voice recognition method under a high noise condition, which solves the problem of inaccurate recognition caused by the difficulty in joint optimization of models under a multi-source high noise condition in the prior art. The invention provides a method for identifying end-to-end voice of an empty pipe instruction under a high noise condition, which adopts a trained end-to-end voice identification model to carry out voice identification and comprises the following steps: s1, preprocessing voice to be recognized, and extracting to obtain original voice frequency characteristics; s2, adopting a self-adaptive attention noise reduction module to perform noise reduction intensity control, frequency band gain control and pitch tracking processing on the original voice frequency characteristics to obtain noise reduction voice; s3, preprocessing the noise-reduced voice, and extracting to obtain noise-reduced voice frequency characteristics; And S4, adopting a voice recognition module to encode the noise-reduced voice frequency characteristics by a shared encoder, decode by a connection time sequence classification decoder and decode by an attention-based encoder-decoder, so as to obtain an output voice recognition transcription text. Preferably, the step S1 specifically includes: s1-1, respectively carrying out pre-emphasis, framing and windowing, fast Fourier transform and frequency band division processing on voice to be recognized to obtain filter bank characteristics; S1-2, acquiring a pitch period of a voice to be recognized and a spectrum non-stationarity measure; And step S1-3, taking the first and second derivatives of the filter bank characteristic, the pitch period and the frequency spectrum non-stationarity measure as the original voice frequency characteristic. Preferably, in the step S1: the expression of the