
CN-121983078-A - Multi-mode voice separation method and system based on audio-visual feature fusion


Abstract

The invention discloses a multimodal speech separation method and system based on audio-visual feature fusion. The method comprises a speech and video acquisition step, a short-time Fourier transform step, a face video processing step, a speech separation step, and an inverse short-time Fourier transform step. The short-time Fourier transform step converts the time-domain signal of the mixed speech into a complex spectrum; the face video processing step preprocesses the face video with a face recognition model and extracts a video frame sequence of a single speaker's lip region. The system comprises a speech preprocessing module, a video preprocessing module, an audio-visual speech separation module, and a speech post-processing module. The method and system reduce the parameter count and amount of computation in speech processing, exploit time-domain and frequency-domain features simultaneously to increase the information available during separation, and thereby improve separation accuracy.

Inventors

  • Fan Cunhang
  • Jiang Yu
  • Lv Zhao

Assignees

  • Anhui University (安徽大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-09

Claims (10)

  1. A multimodal speech separation method based on audio-visual feature fusion, characterized by comprising the following steps: Step 1, speech and video acquisition: simultaneously collecting the mixed speech of a plurality of speakers and the facial video of each speaker; Step 2, speech conversion: converting the time-domain signal of the mixed speech into a complex spectrum by short-time Fourier transform; Step 3, face video processing: preprocessing the facial video with a face recognition model and extracting a video frame sequence of a single speaker's lip region; Step 4, speech separation: inputting the complex spectrum and the single speaker's video frame sequence into a multimodal speech separation model, which outputs the complex spectrum X tgt of the target speaker's speech; Step 5, inverse speech transform: applying the inverse short-time Fourier transform to the complex spectrum X tgt to obtain the target speaker's speech separated from the mixed speech.
  2. The multimodal speech separation method based on audio-visual feature fusion according to claim 1, wherein the multimodal speech separation model in step 4 is a pre-trained multimodal speech separation model.
  3. The multimodal speech separation method based on audio-visual feature fusion according to claim 2, wherein the multimodal speech separation model comprises a speech encoder, a video encoder, a speech feature compressor, a visual feature compressor, an audio-visual feature generator, a speech feature filter and a speech decoder; the speech encoder is used for initially extracting the mixed speech feature M src from the complex spectrum of the mixed speech; the video encoder is used for obtaining a visual feature vector sequence V src from the video frame sequence of a single speaker's lip region; the speech feature compressor is used for compressing the channel dimension of the mixed speech feature M src to obtain a compressed mixed speech feature M sqz; the visual feature compressor is used for compressing the channel dimension of the visual feature vector sequence V src to obtain a compressed visual feature vector sequence V sqz; the audio-visual feature generator is used for fusing the compressed mixed speech feature M sqz and the compressed visual feature vector sequence V sqz to obtain an enhanced fused audio-visual feature M ave; the speech feature filter is used for generating a mask from the enhanced audio-visual feature M ave and applying it to the mixed speech feature M src output by the speech encoder to obtain the complex spectrum feature M tgt of the target speaker's speech; the speech decoder is used for decoding the complex spectrum feature M tgt and restoring the complex spectrum X tgt of the target speaker's speech.
  4. The multimodal speech separation method based on audio-visual feature fusion according to claim 3, wherein the audio-visual feature generator comprises an audio-visual feature alignment module, an audio-visual feature fusion module and a fused feature enhancement module.
  5. The multimodal speech separation method based on audio-visual feature fusion according to claim 4, wherein the audio-visual feature fusion module comprises multiple layers of dual-branch feature fusion modules.
  6. The method of claim 5, wherein each dual-branch feature fusion module comprises a speech feature extraction module and a visual feature extraction module.
  7. The multimodal speech separation method based on audio-visual feature fusion according to claim 6, wherein the visual feature extraction module comprises a global visual feature extraction module and a local visual feature extraction module.
  8. The multimodal speech separation method based on audio-visual feature fusion according to claim 1, wherein in step 4 the training process of the multimodal speech separation model comprises the following steps: Step 41, selecting an existing audio-visual speech dataset for training the multimodal speech separation model, performing data partitioning, preprocessing and speech mixing, and generating a training set, a validation set and a test set; Step 42, selecting the scale-invariant signal-to-noise ratio as the loss function for training the multimodal speech separation model; Step 43, training the multimodal speech separation model based on audio-visual feature fusion on the training set, optimizing the model parameters, verifying the training effect on the validation set, and selecting the best model; Step 44, saving the parameters of the model with the best validation performance and comprehensively evaluating its separation performance on the test set.
  9. A multimodal speech separation system based on audio-visual feature fusion, characterized by comprising a speech preprocessing module, a video preprocessing module, an audio-visual speech separation module and a speech post-processing module; the speech preprocessing module is used for acquiring a mixed speech signal containing the overlapping speech of a plurality of speakers and converting the time-domain signal into a complex spectrum by short-time Fourier transform; the video preprocessing module is used for acquiring the facial video of each speaker in synchronization with the speech, preprocessing the facial video with a pre-trained face recognition model, and extracting a video frame sequence of the speaker's lip region; the audio-visual speech separation module is used for performing speech separation, processing the complex spectrum of the mixed speech and the synchronized lip-region video frame sequence with a trained multimodal speech separation model to generate the clean speech spectrum of a single speaker; the speech post-processing module is used for reconstructing the clean time-domain speech signal of a single speaker, converting the complex spectrum of the clean speech into a time-domain signal by inverse short-time Fourier transform.
  10. The system of claim 9, wherein the multimodal speech separation model of the audio-visual speech separation module comprises a speech encoder, a video encoder, a speech feature compressor, a visual feature compressor, an audio-visual feature generator, a speech feature filter, and a speech decoder.
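Claim 8 selects the scale-invariant signal-to-noise ratio (SI-SNR) as the training loss but does not spell out its definition. As an illustration only, the commonly used formulation projects the estimate onto the reference and measures the residual energy; the sketch below is a minimal NumPy version, with the test signals being hypothetical stand-ins rather than anything from the patent.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    Both signals are zero-meaned first, and the estimate is projected onto
    the reference, so a global gain on the estimate does not change the score.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # "Target" component: orthogonal projection of the estimate onto the reference.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

# Illustrative signals: one second of "clean" speech plus mild noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
score = si_snr(noisy, clean)  # higher is better; training minimises -SI-SNR
```

Because of the projection step, scaling the estimate leaves the score unchanged, which is why this loss is preferred over plain SNR when the separation network may output speech at an arbitrary gain.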

Description

Multi-mode voice separation method and system based on audio-visual feature fusion

Technical Field

The invention relates to speech processing technology, and in particular to a multimodal speech separation method and system based on audio-visual feature fusion.

Background

Speech separation aims to mimic the human auditory perception system by separating the clean speech of a single speaker from a mixed signal containing the speech of multiple speakers, so as to improve the quality and intelligibility of the speech signal. Humans can concentrate on the speech of a particular speaker in a noisy environment, a phenomenon known as the "cocktail party effect". However, some hearing-impaired groups lack this selective perception. In addition, some speech signal processing systems, such as automatic speech recognition systems, are significantly affected by the quality of the input speech. Speech separation technology is therefore widely applied in many fields, and the demands on its performance and efficiency keep increasing. Existing speech separation systems introduce deep learning, training a neural network in a data-driven manner to learn the mapping from mixed speech to single-speaker speech, which greatly improves separation accuracy and robustness. Some systems rely solely on the inherent feature differences within the mixed speech to perform separation; however, such systems have difficulty separating speech signals with very low signal-to-noise ratios. Speech perception studies have shown that human speech perception is in fact multimodal: visual information synchronized with the speech, such as the speaker's lip movements, can significantly affect how accurately a listener perceives the speech content.
Inspired by these studies, visual cues have been introduced into speech separation systems to enhance separation performance, a technique known as audio-visual speech separation (AVSS): separating the target speaker's speech from mixed audio by combining visual information from the video, such as the speaker's facial movements, with the audio signal. According to the domain in which the speech signal is processed, audio-visual speech separation methods can be divided into time-domain methods and time-frequency-domain methods. Time-domain methods perform end-to-end speech mapping directly in the time domain, using a convolutional encoder to obtain high-dimensional speech features; they can capture instantaneous changes of the speech signal at fine granularity, but this inflates model size and significantly increases computational cost. Time-frequency-domain methods take the spectrum as the processing object and can exploit temporal and cross-band features simultaneously, capturing richer speech information than time-domain methods. However, existing time-frequency-domain methods generally build models from large LSTM and Transformer blocks, resulting in high parameter counts and computational complexity that limit their application in resource-constrained scenarios. In addition, because the spectral phase is difficult to predict, some methods predict only the magnitude spectrum of the clean speech and reuse the phase of the mixture unchanged, which caps the achievable performance of time-frequency-domain models.
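The phase limitation described above can be made concrete with a small SciPy experiment: reconstructing a target signal from its true magnitude but the mixture's phase (the magnitude-only strategy) is measurably worse than reconstructing from the full complex spectrum. The signals, frequencies, and STFT settings below are illustrative choices, not taken from the patent.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs                        # one second of audio
s1 = np.sin(2 * np.pi * 440 * t)              # stand-in for the target speaker
s2 = np.sin(2 * np.pi * 450 * t)              # nearby interferer: shares STFT bins
mix = s1 + s2

_, _, S1 = stft(s1, fs=fs, nperseg=512)       # true complex spectrum of the target
_, _, M = stft(mix, fs=fs, nperseg=512)       # complex spectrum of the mixture

# Magnitude-only strategy: oracle target magnitude, mixture phase left unchanged.
mag_only = np.abs(S1) * np.exp(1j * np.angle(M))

_, rec_complex = istft(S1, fs=fs)             # from the true complex spectrum
_, rec_magonly = istft(mag_only, fs=fs)       # from magnitude + mixture phase

n = len(s1)
err_complex = float(np.mean((rec_complex[:n] - s1) ** 2))
err_magonly = float(np.mean((rec_magonly[:n] - s1) ** 2))
# err_complex is near machine precision; err_magonly is orders of magnitude larger.
```

Even with the target's exact magnitude, the borrowed mixture phase corrupts the reconstruction, which is why methods that predict the full complex spectrum (as the claims here do) have a higher performance ceiling.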
Disclosure of Invention

The invention provides a multimodal speech separation method and system based on audio-visual feature fusion that avoid the defects of the prior art: the parameter count and amount of computation in speech processing are reduced, while both time-domain and frequency-domain features are exploited to increase the information available during separation. The invention adopts the following technical scheme to solve the technical problem. The disclosed multimodal speech separation method based on audio-visual feature fusion comprises the following steps: Step 1, speech and video acquisition: simultaneously collecting the mixed speech of a plurality of speakers and the facial video of each speaker; Step 2, speech conversion: converting the time-domain signal of the mixed speech into a complex spectrum by short-time Fourier transform; Step 3, face video processing: preprocessing the facial video with a face recognition model and extracting a video frame sequence of a single speaker's lip region; Step 4, speech separation: inputting the complex spectrum and the single speaker's video frame sequence into a multimodal speech separation model, which outputs the complex spectrum X tgt of the target speaker's speech; Step 5, inverse speech transform: applying the inverse short-time Fourier transform to the complex spectrum X tgt to obtain the target speaker's speech separated from the mixed speech.
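The signal-level skeleton of the claimed pipeline (steps 2, 4 and 5) can be sketched as below. The trained multimodal network of step 4 is replaced by a placeholder callable, since the patent does not disclose its internals here; the sampling rate and STFT settings are likewise illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(mix: np.ndarray, lip_frames, model, fs: int = 16000) -> np.ndarray:
    """Steps 2, 4 and 5 of the claimed method.

    `model` is a placeholder for the trained multimodal separation network:
    it maps (mixture complex spectrum, lip-region frame sequence) to the
    target complex spectrum X tgt. The real network is not reproduced here.
    """
    _, _, X_mix = stft(mix, fs=fs, nperseg=512)   # step 2: time domain -> complex spectrum
    X_tgt = model(X_mix, lip_frames)              # step 4: multimodal separation
    _, target = istft(X_tgt, fs=fs)               # step 5: inverse STFT back to time domain
    return target[: len(mix)]

# With an identity "model" the pipeline reduces to an STFT round trip,
# which checks that the transform pair is lossless.
fs = 16000
mix = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)
out = separate(mix, lip_frames=None, model=lambda X, v: X, fs=fs)
```

In a real system the placeholder would be the trained model of claims 3 and 10 (encoders, compressors, audio-visual feature generator, mask-based filter and decoder), and `lip_frames` would be the lip-region frame sequence produced by the face video processing of step 3.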