CN-115910038-B - Voice signal extraction method and device, readable storage medium and electronic equipment

Abstract

The embodiments of the disclosure provide a method and an apparatus for extracting a speech signal, a computer-readable storage medium, and an electronic device. The method comprises: acquiring a multichannel mixed audio signal and an image sequence acquired in a target area; determining a target user in the target area; determining a lip region image sequence of the target user based on the image sequence; determining lip state feature data based on the lip region image sequence; determining spatial position feature data of the lips of the target user and a microphone array; determining audio feature data based on the multichannel mixed audio signal; and extracting the speech signal of the target user from the multichannel mixed audio signal based on the lip state feature data, the audio feature data, and the spatial position feature data. By combining the multichannel mixed audio signal with the spatial position feature data to perform multimodal speech separation, the embodiments improve the stability and accuracy of speech signal extraction.

Inventors

  • GONG YICHEN
  • LI WENPENG

Assignees

  • 北京地平线机器人技术研发有限公司 (Beijing Horizon Robotics Technology Research and Development Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2022-09-27

Claims (10)

  1. A method of extracting a speech signal, comprising: acquiring a multichannel mixed audio signal and an image sequence acquired in a target area; determining a target user in the target area; determining a lip region image sequence of the target user based on the image sequence; determining lip state feature data based on the lip region image sequence; determining audio feature data based on the multichannel mixed audio signal; determining spatial position feature data of the lips of the target user and a microphone array based on the lip region image sequence; and fusing the lip state feature data, the audio feature data, and the spatial position feature data to obtain fused feature data, and extracting the speech signal of the target user from the multichannel mixed audio signal based on the fused feature data; wherein determining the spatial position feature data of the lips of the target user and the microphone array based on the lip region image sequence includes: determining lip position information representing a spatial position of the lips of the target user based on the lip region image sequence and preset parameters of a camera used to acquire the image sequence; determining an angle between a target straight line passing through the lip position of the target user and a reference line of the microphone array based on the lip position information and preset position information of the microphone array; determining angle feature data between the lip position of the target user and the microphone array based on the angle; and determining the spatial position feature data based on the angle feature data.
  2. The method of claim 1, further comprising: determining phase difference feature data representing the multichannel mixed audio signals; wherein determining the angle feature data between the lip position of the target user and the microphone array based on the angle includes: determining the angle feature data based on the angle and the phase difference feature data.
  3. The method of claim 2, wherein determining the spatial position feature data based on the angle feature data comprises: determining the spatial position feature data based on the angle feature data and the phase difference feature data.
  4. The method of claim 1, wherein fusing the lip state feature data, the audio feature data, and the spatial position feature data to obtain fused feature data, and extracting the speech signal of the target user from the multichannel mixed audio signal based on the fused feature data, comprises: fusing the lip state feature data, the audio feature data, and the spatial position feature data using a fusion network of a pre-trained neural network model to obtain the fused feature data; decoding the fused feature data using a decoding network of the neural network model to obtain mask data; and extracting the speech signal of the target user from the multichannel mixed audio signal based on the mask data.
  5. The method of claim 4, wherein extracting the speech signal of the target user from the multichannel mixed audio signal based on the mask data comprises: compressing the mask data using a preset activation function to obtain compressed data; and extracting the speech signal of the target user from the multichannel mixed audio signal based on the compressed data.
  6. The method of claim 4, wherein fusing the lip state feature data, the audio feature data, and the spatial position feature data using the fusion network of the pre-trained neural network model to obtain the fused feature data comprises: performing first fusion processing on the audio feature data and the spatial position feature data using a first fusion sub-network included in the fusion network to obtain fused audio feature data; and performing second fusion processing on the fused audio feature data and the lip state feature data using a second fusion sub-network included in the fusion network to obtain the fused feature data.
  7. The method of claim 1, wherein determining the audio feature data based on the multichannel mixed audio signal comprises: performing frequency-domain conversion on the multichannel mixed audio signal to obtain frequency-domain data; compressing the frequency-domain data to obtain compressed frequency-domain data; and encoding the compressed frequency-domain data using an audio encoding network of a pre-trained neural network model to obtain the audio feature data.
  8. An apparatus for extracting a speech signal, comprising: an acquisition module configured to acquire a multichannel mixed audio signal and an image sequence acquired in a target area; a first determining module configured to determine a target user in the target area; a second determining module configured to determine a lip region image sequence of the target user based on the image sequence; a third determining module configured to determine lip state feature data based on the lip region image sequence; a fourth determining module configured to determine audio feature data based on the multichannel mixed audio signal; a fifth determining module configured to determine spatial position feature data of the lips of the target user and a microphone array based on the lip region image sequence; and an extraction module configured to fuse the lip state feature data, the audio feature data, and the spatial position feature data to obtain fused feature data, and to extract the speech signal of the target user from the multichannel mixed audio signal based on the fused feature data; wherein the fifth determining module includes: a first determining unit configured to determine lip position information representing a spatial position of the lips of the target user based on the lip region image sequence and preset parameters of a camera used to acquire the image sequence; a second determining unit configured to determine an angle between a target straight line passing through the lip position of the target user and a reference line of the microphone array based on the lip position information and preset position information of the microphone array; a third determining unit configured to determine angle feature data between the lip position of the target user and the microphone array based on the angle; and a fourth determining unit configured to determine the spatial position feature data based on the angle feature data.
  9. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of any one of claims 1-7.
  10. An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-7.
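The spatial-position steps of claim 1 amount to a small geometry computation: given a lip position recovered from the lip-region images and the camera parameters, find the angle between the line through the lip position and the microphone array's reference line. A minimal sketch, assuming the lip position has already been mapped into the array's coordinate frame (the coordinates, names, and array layout below are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def lip_to_array_angle(lip_pos, array_center, array_axis):
    """Angle (radians) between the lip direction and the array reference line.

    lip_pos      -- 3-D lip position in the array's coordinate frame
    array_center -- 3-D position of the microphone array center
    array_axis   -- vector along the array's reference line
    """
    direction = lip_pos - array_center
    direction = direction / np.linalg.norm(direction)
    axis = array_axis / np.linalg.norm(array_axis)
    # Clip to guard against rounding outside arccos's domain.
    cos_angle = np.clip(np.dot(direction, axis), -1.0, 1.0)
    return np.arccos(cos_angle)

# Example: a lip position directly broadside to a linear array lying
# along the x-axis gives an angle of 90 degrees.
angle = lip_to_array_angle(
    lip_pos=np.array([0.0, 1.0, 0.0]),
    array_center=np.zeros(3),
    array_axis=np.array([1.0, 0.0, 0.0]),
)
print(np.degrees(angle))  # 90.0
```

The resulting angle would then be encoded into the angle feature data; how the patent's networks encode it is not specified here.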
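Claim 2 adds phase difference feature data of the multichannel mixture. One common way to realize such a feature (a sketch under assumptions; the patent does not fix frame or FFT sizes) is the per-frame, per-bin inter-channel phase difference taken from the cross-spectrum of two channels:

```python
import numpy as np

def phase_difference_features(ch1, ch2, n_fft=512, hop=256):
    """Per-frame, per-bin phase difference between two channels."""
    def stft(x):
        window = np.hanning(n_fft)
        frames = [np.fft.rfft(window * x[s:s + n_fft])
                  for s in range(0, len(x) - n_fft + 1, hop)]
        return np.array(frames)  # (frames, bins), complex

    X1, X2 = stft(ch1), stft(ch2)
    # Phase of the cross-spectrum equals phase(X1) - phase(X2).
    return np.angle(X1 * np.conj(X2))

# A source arriving earlier at one channel produces a frequency-dependent
# phase ramp; identical channels give an all-zero feature map.
t = np.arange(2048)
sig = np.sin(2 * np.pi * 0.05 * t)
ipd = phase_difference_features(sig, sig)
print(ipd.shape, bool(np.allclose(ipd, 0.0)))
```

Because the phase difference depends on the direction of arrival, it complements the camera-derived angle when forming the angle feature data of claims 2-3.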
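Claims 4-5 describe decoding the fused features into mask data, compressing it with a preset activation function, and applying the result to the mixture. A minimal sketch, assuming the activation function is a sigmoid and the mask is applied multiplicatively to a spectrogram (both common choices, but assumptions here; the decoder network itself is omitted and `raw_mask` stands in for its output):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_mask(mixture_spec, raw_mask):
    """Compress unbounded mask data to (0, 1) and apply it to the mixture."""
    mask = sigmoid(raw_mask)      # compressed data, as in claim 5
    return mask * mixture_spec    # masked (extracted) spectrogram

mixture = np.array([[1.0 + 1.0j, 2.0], [0.5, 3.0 - 1.0j]])
raw = np.array([[10.0, -10.0], [0.0, 10.0]])  # hypothetical decoder output
out = apply_mask(mixture, raw)
# Large positive values keep a time-frequency bin (mask near 1);
# large negative values suppress it (mask near 0).
```

An inverse STFT of the masked spectrogram would then yield the extracted time-domain speech signal.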
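Claim 7's audio front end has three stages: frequency-domain conversion, compression, and encoding by an audio encoding network. A sketch of the first two stages, assuming an STFT and power-law magnitude compression (the exponent 0.3 and all sizes are illustrative assumptions; the patent only names the stages, and the encoder network is left out):

```python
import numpy as np

def audio_frontend(mixture, n_fft=512, hop=256, power=0.3):
    """mixture: (channels, samples) multichannel mixed audio signal."""
    window = np.hanning(n_fft)
    channels = []
    for ch in mixture:
        frames = [np.fft.rfft(window * ch[s:s + n_fft])
                  for s in range(0, len(ch) - n_fft + 1, hop)]
        channels.append(np.array(frames))
    spec = np.stack(channels)            # frequency-domain data: (ch, frames, bins)
    compressed = np.abs(spec) ** power   # compressed frequency-domain data
    return spec, compressed              # an encoding network would consume these

mix = np.random.default_rng(1).standard_normal((2, 4096))
spec, compressed = audio_frontend(mix)
print(spec.shape, compressed.shape)
```

Power-law compression reduces the dynamic range of the spectrogram, which is a common motivation for a compression step before a learned encoder.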

Description

Voice signal extraction method and device, readable storage medium and electronic equipment

Technical Field

The disclosure relates to the field of computer technology, and in particular to a method and an apparatus for extracting a speech signal, a computer-readable storage medium, and an electronic device.

Background

With the continuous development of human-machine interaction, the efficiency, accuracy, and convenience of interaction are goals of research in related fields. Multimodal speech separation is currently widely studied and applied as a form of human-machine interaction. Multimodal speech separation combines audio and images, using a neural network to perform multimodal fusion of audio and visual signals to solve the sound-source separation problem. By training the model to learn audio and image features simultaneously, with images serving as an aid, the model can better learn the speech information of different speakers in the audio. However, existing multimodal speech separation methods place high demands on the quality of the speaker's lip images: lip occlusion or unclear lip images significantly degrade the separation result.

Disclosure of Invention

The present disclosure has been made to solve the above technical problems. Embodiments of the present disclosure provide a method, an apparatus, a computer-readable storage medium, and an electronic device for extracting a speech signal.
According to one aspect of the embodiments of the present disclosure, there is provided a method for extracting a speech signal, comprising: acquiring a multichannel mixed audio signal and an image sequence acquired in a target area; determining a target user in the target area; determining a lip region image sequence of the target user based on the image sequence; determining lip state feature data based on the lip region image sequence; determining audio feature data based on the multichannel mixed audio signal; determining spatial position feature data of the lips of the target user and a microphone array based on the lip region image sequence; and extracting the speech signal of the target user from the multichannel mixed audio signal based on the lip state feature data, the audio feature data, and the spatial position feature data.

According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for extracting a speech signal, comprising: an acquisition module for acquiring a multichannel mixed audio signal and an image sequence acquired in a target area; a first determining module for determining a target user in the target area; a second determining module for determining a lip region image sequence of the target user based on the image sequence; a third determining module for determining lip state feature data based on the lip region image sequence; a fourth determining module for determining audio feature data based on the multichannel mixed audio signal; a fifth determining module for determining spatial position feature data of the lips of the target user and a microphone array based on the lip region image sequence; and an extraction module for extracting the speech signal of the target user from the multichannel mixed audio signal based on the lip state feature data, the audio feature data, and the spatial position feature data.
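Claim 6 refines the extraction step above into two fusion stages: audio and spatial features are fused first, and the result is then fused with the lip state features. A toy sketch of that data flow, assuming simple concatenate-and-project fusion sub-networks with fixed random weights (real systems would use learned layers; all dimensions and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(a, b, out_dim):
    """Toy fusion sub-network: concatenate features, project linearly."""
    w = rng.standard_normal((a.shape[-1] + b.shape[-1], out_dim))
    return np.concatenate([a, b], axis=-1) @ w

T = 100                                       # time frames
audio_feat = rng.standard_normal((T, 256))    # from the audio encoder
spatial_feat = rng.standard_normal((T, 32))   # angle / phase-difference features
lip_feat = rng.standard_normal((T, 128))      # from the lip-region images

fused_audio = fuse(audio_feat, spatial_feat, 256)   # first fusion sub-network
fused = fuse(fused_audio, lip_feat, 256)            # second fusion sub-network
print(fused.shape)  # (100, 256)
```

Fusing the spatial cues into the audio stream first lets the second stage weigh the visual lip-state evidence against an already direction-aware audio representation, which matches the patent's aim of staying robust when the lip images are occluded or unclear.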
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above method for extracting a speech signal. According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including a processor and a memory for storing executable instructions of the processor, the processor being configured to read the executable instructions from the memory and execute them to implement the method for extracting a speech signal.

According to the method, apparatus, computer-readable storage medium, and electronic device provided by the embodiments of the present disclosure, a multichannel mixed audio signal and an image sequence acquired in the target area are acquired; lip state feature data and spatial position feature data of the lips of the target user and the microphone array are determined based on the lip region image sequence; audio feature data are determined based on the multichannel mixed audio signal; and finally the speech signal of the target user is extracted from the multichannel mixed audio signal based on the lip state feature data, the audio feature data, and the spatial position feature data. The embodiments of the present disclosure realize multimodal speech separation by combining the multichannel mixed audio signal with the spatial position feature data, improving the stability and accuracy of speech signal extraction.