CN-121983073-A - Remote voice enhancement method, system, medium and terminal based on knowledge distillation
Abstract
The application provides a remote voice enhancement method, system, medium, and terminal based on knowledge distillation. The method comprises: acquiring and preprocessing a target speech signal and an ultrasonic echo signal to obtain noisy speech features and ultrasonic lip-movement features; inputting the noisy speech features and ultrasonic lip-movement features into a student model and a pre-trained teacher model, each of which outputs an enhanced speech signal and fusion features; constructing a total loss function from the student-enhanced speech signal, the student fusion features, the teacher-enhanced speech signal, and the teacher fusion features; back-propagating the total loss to optimize the student model; and iteratively training the student model until convergence to obtain a far-field speech enhancement model that generates an enhanced speech signal from the currently input audio to be enhanced. The application can significantly improve the practicality and robustness of speech enhancement based on ultrasonic lip-movement sensing in real-world complex environments.
Inventors
- WANG DONG
- YANG YIFEI
- ZHANG QIAN
Assignees
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-31
Claims (10)
- 1. A method of remote speech enhancement based on knowledge distillation, comprising: acquiring a target speech signal captured by a microphone in a remote scene and an ultrasonic echo signal formed by ultrasonic waves emitted from a loudspeaker and reflected off the user's lips, and preprocessing the target speech signal and the ultrasonic echo signal respectively to obtain noisy speech features and ultrasonic lip-movement features; inputting the preprocessed noisy speech features and ultrasonic lip-movement features into a student model and a pre-trained teacher model, wherein the student model outputs a student-enhanced speech signal and student fusion features, and the teacher model outputs a teacher-enhanced speech signal and teacher fusion features; constructing a total loss function from the student-enhanced speech signal and student fusion features output by the student model and the teacher-enhanced speech signal and teacher fusion features output by the teacher model; performing back-propagation optimization of the student model based on the constructed total loss function to update its parameters; iteratively training the student model until convergence to obtain a far-field speech enhancement model; and deploying the far-field speech enhancement model to generate an enhanced speech signal from the currently input audio to be enhanced.
- 2. The knowledge distillation based remote speech enhancement method according to claim 1, wherein the network architecture of the student model comprises: a speech encoder for extracting speech features from the input noisy speech features; an ultrasonic encoder for extracting real ultrasonic features from the input ultrasonic lip-movement features; a multi-modal attention fusion module for reconstructing pseudo-ultrasonic features from the input speech features based on a predefined learnable memory bank, and generating cross-modal fusion features from the reconstructed pseudo-ultrasonic features based on a gated cross-attention mechanism; a decoder for generating a time-frequency mask from the input cross-modal fusion features; and a spectrum enhancement module for multiplying the time-frequency mask generated by the decoder with the corresponding noisy speech features to obtain the enhanced speech signal.
- 3. The knowledge distillation based remote speech enhancement method according to claim 2, wherein generating cross-modal fusion features from the reconstructed pseudo-ultrasonic features with the gated cross-attention mechanism comprises: fusing the reconstructed pseudo-ultrasonic features with the input real ultrasonic features, and using the fused features as keys and values; performing multi-head cross-attention computation with the speech features as queries, and concatenating and linearly projecting the outputs of all attention heads to obtain attention features; and generating a dynamic gating vector from the reconstructed pseudo-ultrasonic features, and performing cross-modal fusion of the speech features and the attention features according to the dynamic gating vector via a gated residual connection to obtain the cross-modal fusion features.
- 4. The knowledge distillation based remote speech enhancement method according to claim 1, wherein the teacher model is trained on target speech signals and ultrasonic echo signals acquired in a short-distance scene, and the network architecture of the teacher model comprises an encoder, a multi-modal attention fusion module, a decoder, and a spectrum enhancement module.
- 5. The knowledge distillation based remote speech enhancement method according to claim 1, wherein the far-field speech enhancement model is fine-tuned on data collected in mixed near, medium, and remote scenes, and the distance domain is predicted by a lightweight distance-domain discriminator.
- 6. The knowledge distillation based remote speech enhancement method according to claim 1, wherein preprocessing the target speech signal to obtain the noisy speech features comprises: filtering high-frequency ultrasonic components and other high-frequency noise out of the target speech signal with a low-pass filter to obtain the main speech band; linearly superposing an acquired noise signal on the resampled main audio segment at a preset signal-to-noise ratio to obtain a synthesized noisy speech signal; and applying a short-time Fourier transform to the synthesized noisy speech signal to convert it to the time-frequency domain, and extracting the magnitude spectrum to obtain the noisy speech features.
- 7. The knowledge distillation based remote speech enhancement method according to claim 1, wherein preprocessing the ultrasonic echo signal to obtain the ultrasonic lip-movement features comprises: separating the mixed signal containing the ultrasonic echo with a high-frequency band-pass filter to obtain the ultrasonic echo signal in the frequency-modulated continuous wave (FMCW) band; multiplying the in-band ultrasonic echo signal by a local copy of the reference transmit signal and low-pass filtering the product to obtain a baseband complex signal containing a complex-valued sequence; performing a discrete Fourier transform on the complex sequence of each frequency-modulation period along the fast-time dimension to obtain a range spectrum with a plurality of range bins for each period; extracting from each selected range bin the time series of complex spectral values varying with slow time, and applying first-order differencing or high-pass filtering to the time series of each range bin; and decomposing the processed complex sequence into amplitude and phase channels and combining them to obtain the ultrasonic lip-movement features.
- 8. A knowledge distillation based remote speech enhancement system, comprising: an acquisition module for acquiring a target speech signal captured by a microphone in a remote scene and an ultrasonic echo signal formed by ultrasonic waves emitted from a loudspeaker and reflected off the user's lips, and for preprocessing the target speech signal and the ultrasonic echo signal respectively to obtain noisy speech features and ultrasonic lip-movement features; a student and teacher model output module for inputting the preprocessed noisy speech features and ultrasonic lip-movement features into a student model and a pre-trained teacher model, wherein the student model outputs a student-enhanced speech signal and student fusion features, and the teacher model outputs a teacher-enhanced speech signal and teacher fusion features; a total loss function construction module for constructing a total loss function from the student-enhanced speech signal and student fusion features output by the student model and the teacher-enhanced speech signal and teacher fusion features output by the teacher model; a model optimization module for performing back-propagation optimization of the student model based on the constructed total loss function to update its parameters; and a model deployment module for deploying the far-field speech enhancement model and generating an enhanced speech signal from the currently input audio to be enhanced.
- 9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
- 10. An electronic terminal comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the method according to any one of claims 1 to 7.
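The speech-side preprocessing of claim 6 (mixing noise at a preset signal-to-noise ratio, then taking the STFT magnitude spectrum as the noisy speech feature) can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the low-pass filtering stage is omitted, and the frame size, hop, and window are assumed values.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Linearly superpose a noise signal on clean speech at a preset SNR (dB)."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram via a windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.fft.rfft(np.stack(frames), axis=-1)
    return np.abs(spec)

# toy usage: a 1 s, 16 kHz tone mixed with white noise at 5 dB SNR
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
feat = stft_mag(noisy)   # (frames, freq_bins) noisy speech feature
```

The time-frequency mask of claim 2 would later be multiplied elementwise with `feat` to suppress noise-dominated bins.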
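The FMCW echo processing of claim 7 (dechirp by multiplying with the reference copy, range-dimension DFT along fast time, first-order differencing along slow time, then amplitude/phase channels) can be sketched as below. This is a simplified NumPy illustration under assumed parameters: the band-pass and low-pass filter stages are omitted, a tone stands in for the chirp waveform, and `keep_bins` is an assumed range-bin selection.

```python
import numpy as np

def fmcw_lip_features(echo, ref, n_chirps, samples_per_chirp, keep_bins=8):
    """Ultrasonic lip-movement features from an FMCW echo (claim 7 sketch)."""
    # dechirp: multiply the echo by the local reference transmit copy
    baseband = echo * ref
    # reshape into a (slow time, fast time) matrix, one row per chirp period
    mat = baseband[:n_chirps * samples_per_chirp].reshape(n_chirps,
                                                          samples_per_chirp)
    # range-dimension DFT along fast time -> one range profile per period
    ranges = np.fft.rfft(mat, axis=1)[:, :keep_bins]
    # first-order difference along slow time suppresses static reflections
    dyn = np.diff(ranges, axis=0)
    # decompose into amplitude and phase channels and stack them
    return np.stack([np.abs(dyn), np.angle(dyn)], axis=-1)

# toy usage: a 19 kHz tone as a stand-in transmit signal, echo delayed 5 samples
fs, n_chirps, spc = 48000, 40, 480
t = np.arange(n_chirps * spc) / fs
ref = np.cos(2 * np.pi * 19000 * t)
echo = 0.3 * np.roll(ref, 5)               # attenuated, delayed echo
feat = fmcw_lip_features(echo, ref, n_chirps, spc)
```

The resulting `(slow time, range bin, {amplitude, phase})` tensor is what the ultrasonic encoder of claim 2 would consume.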
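The gated cross-attention fusion of claim 3 can be sketched as follows: speech features act as queries, the fused pseudo/real ultrasonic features act as keys and values, head outputs are concatenated and projected, and a sigmoid gate computed from the pseudo-ultrasonic features controls a gated residual mix. This is a NumPy sketch only; the projection and gate weights are random stand-ins for learned parameters, and averaging pseudo and real features is an assumed fusion rule.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(speech, pseudo_us, real_us, n_heads=4):
    """Gated cross-modal fusion (claim 3 sketch). All weights are random."""
    T, d = speech.shape
    rng = np.random.default_rng(0)
    kv = 0.5 * (pseudo_us + real_us)       # fuse pseudo + real ultrasound
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d)
                      for _ in range(4))

    def heads(x):                          # (T, d) -> (n_heads, T, d_head)
        return x.reshape(T, n_heads, d // n_heads).transpose(1, 0, 2)

    q, k, v = heads(speech @ Wq), heads(kv @ Wk), heads(kv @ Wv)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // n_heads)) @ v
    # concatenate all heads and linearly project -> attention features
    att = att.transpose(1, 0, 2).reshape(T, d) @ Wo
    # dynamic gating vector from the reconstructed pseudo-ultrasonic features
    Wg = rng.standard_normal((d, d)) / np.sqrt(d)
    gate = 1.0 / (1.0 + np.exp(-(pseudo_us @ Wg)))
    return speech + gate * att             # gated residual connection

# toy usage: 10 frames of 16-dim features per modality
rng = np.random.default_rng(1)
sp, ps, re = (rng.standard_normal((10, 16)) for _ in range(3))
fused = gated_cross_attention(sp, ps, re)
```

The gate lets the model fall back to the speech features alone when the reconstructed ultrasonic evidence is unreliable, which matches the claim's motivation for far-field robustness.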
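Claim 1 builds the total loss from the student and teacher enhanced signals and fusion features. A minimal sketch of such a distillation objective is below; the use of mean-squared error and the weight `lam` are assumptions for illustration, as the patent does not fix the loss terms here.

```python
import numpy as np

def distillation_loss(s_out, s_feat, t_out, t_feat, lam=0.5):
    """Total loss sketch: output-level plus feature-level distillation.

    s_out/t_out:   student/teacher enhanced speech (e.g. magnitude spectra)
    s_feat/t_feat: student/teacher cross-modal fusion features
    lam:           assumed weight balancing the two terms
    """
    out_loss = np.mean((s_out - t_out) ** 2)     # match teacher-enhanced speech
    feat_loss = np.mean((s_feat - t_feat) ** 2)  # match teacher fusion features
    return out_loss + lam * feat_loss

# usage: identical student and teacher outputs give zero loss
s = np.ones((3, 4))
zero_loss = distillation_loss(s, s, s, s)        # -> 0.0
```

In training, this scalar would be back-propagated through the student model only, the teacher being frozen after its near-field pre-training (claim 4).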
Description
Remote voice enhancement method, system, medium and terminal based on knowledge distillation

Technical Field

The application relates to the technical field of voice interaction, and in particular to a remote speech enhancement method, system, medium, and terminal based on knowledge distillation.

Background

Voice interaction has become a key component of human-computer interaction, with application scenarios spanning remote work, online education, intelligent in-vehicle systems, and everyday communication. In these applications, the quality and clarity of the user's speech input are fundamental to smooth interaction and reliable system control. However, speech signals in real environments are highly vulnerable to various acoustic interferences, including background environmental noise (e.g., keyboard strokes, vehicle driving sounds) and interfering speech from non-target speakers. Such noise pollution causes problems such as reduced automatic speech recognition accuracy, degraded communication quality, and false identity authentication, severely limiting the efficiency and user experience of voice applications. To cope with these challenges, a key objective of speech enhancement is to recover the target speech component as cleanly as possible from the contaminated mixed signal, thereby effectively suppressing environmental noise and interfering voices and improving the robustness of downstream speech processing modules. In recent years, multi-modal speech enhancement techniques based on acoustic sensing have attracted attention; their core idea is to use high-frequency ultrasonic signals (typically above 17 kHz) to sense lip movement.
Specifically, ultrasonic signals inaudible to the human ear are emitted through a loudspeaker, and the echoes reflected by articulatory organs such as the lips and lower jaw are received by a microphone. By processing these echo signals, features highly sensitive to tiny lip displacements, such as Doppler shift and phase change, can be extracted; these features effectively reflect lip-movement patterns and are essentially unaffected by environmental noise in the audible band. This acoustic sensing scheme requires no additional imaging equipment; the ultrasound operates in an inaudible band, still works under conditions where vision fails (e.g., dim light or a side-facing user), and has natural advantages for privacy protection. Although multi-modal speech enhancement based on ultrasonic sensing shows good promise, existing solutions still have significant limitations that restrict their application in real complex scenes. The design and experiments of most current research rest on a "close distance" assumption: the distance between the loudspeaker/microphone and the user's lips is generally required to be within 20 cm to 40 cm, a range in which the signal-to-noise ratio of the ultrasonic echo is high and the lip-movement features are pronounced. However, in many typical practical scenarios, such as a user holding a video conference on a laptop, interacting in front of a smart television, or using a voice assistant in a car, the interaction distance between the user and the device often reaches 0.8 m or even more than 1.2 m.
As the distance increases, the ultrasonic signal suffers severe energy attenuation during propagation and becomes more susceptible to multipath reflection and environmental interference, so that the quality of the extracted lip-movement features drops sharply, their signal-to-noise ratio and discriminability decrease markedly, and the enhancement performance of the prior art degrades severely under long-distance conditions. Accordingly, there is a need for a method, system, medium, and terminal for remote speech enhancement based on knowledge distillation to solve the above problems in the prior art.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present application is to provide a method, system, medium, and terminal for remote speech enhancement based on knowledge distillation, to solve the technical problem that the prior art is limited in speech enhancement for remote scenes. To achieve the above and other related objects, a first aspect of the present application provides a method for remote speech enhancement based on knowledge distillation, comprising: acquiring a target voice signal under a remote scene acquired by a microphone and an ultrasonic echo signal reflected by a lip of a user by ultrasonic waves em