
US-12620405-B2 - Signal processing method and electronic device

US12620405B2

Abstract

Example signal processing methods and example electronic devices are disclosed. One example method is applied to an electronic device, where the electronic device includes a microphone array and a camera. The example method includes performing sound source localization on a first audio signal obtained by using the microphone array, to obtain sound source direction information. A first video obtained by using the camera is processed to obtain user direction information. A target sound source direction is determined based on the sound source direction information and the user direction information. A user lip video is obtained in the target sound source direction by using the camera. A second audio signal is obtained by using the microphone array. A third audio signal is obtained based on the second audio signal and the user lip video by using a voice quality enhancement model.
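The sound source localization step in this pipeline is not tied to any particular algorithm in the document. As a minimal illustration only (all function and parameter names here are assumptions, not part of the patent), a two-microphone time-difference-of-arrival estimate of the arrival angle could look like this:

```python
import numpy as np

def tdoa_direction_deg(sig_left, sig_right, mic_spacing_m, fs=16000, c=343.0):
    """Estimate a source arrival angle from the time difference between two
    microphones, using the peak of the full cross-correlation.
    A positive lag means the left channel lags (arrives later than) the right."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_right) - 1)  # delay in samples
    tau = lag / fs                                     # delay in seconds
    # Far-field plane-wave geometry: tau = spacing * sin(theta) / c
    s = np.clip(tau * c / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

A real microphone array would use more channels and a more robust estimator (for example, a GCC-PHAT variant), but the geometry above is the core of the idea.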

Inventors

  • Guangzhao BAO
  • Liwen CHEN
  • Lei Huang

Assignees

  • HUAWEI TECHNOLOGIES CO., LTD.

Dates

Publication Date
2026-05-05
Application Date
2021-09-17
Priority Date
2020-09-30

Claims (20)

  1. A signal processing method, applied to an electronic device, wherein the electronic device comprises a microphone array and a camera, and the method comprises: performing sound source localization on a first audio signal obtained by using the microphone array, wherein performing the sound source localization is used to obtain sound source direction information; processing a first video obtained by using the camera, wherein processing the first video is used to obtain user direction information, the user direction information comprising at least one direction related to a user with respect to the camera; determining a target sound source direction based on the sound source direction information and the user direction information; obtaining a user lip video in the target sound source direction by using the camera; obtaining a second audio signal by using the microphone array; and obtaining a third audio signal based on the second audio signal and the user lip video by using a voice quality enhancement model, wherein the voice quality enhancement model comprises a correspondence between a semantic meaning and a lip shape, wherein the user direction information comprises at least one of the following types of directions: a first type of direction, wherein the first type of direction comprises at least one direction in which lips in a moving state are located; a second type of direction, wherein the second type of direction comprises at least one direction in which a user looking at the electronic device is located; or a third type of direction, wherein the third type of direction comprises at least one direction in which a user is located, wherein the sound source direction information comprises at least one sound source direction, and wherein determining the target sound source direction based on the sound source direction information and the user direction information comprises: combining the at least one sound source direction and the at least one type of
direction to obtain at least one combined direction; and determining the target sound source direction from the at least one combined direction based on at least one parameter, wherein the at least one parameter comprises total frequency at which each of the at least one combined direction is detected in the sound source direction and the at least one type of direction.
  2. The method according to claim 1, wherein the electronic device further comprises a directional microphone, and the method further comprises: obtaining a fourth audio signal in the target sound source direction by using the directional microphone, wherein the obtaining a third audio signal based on the second audio signal and the user lip video in the target sound source direction by using a voice quality enhancement model comprises: obtaining the third audio signal based on the second audio signal, the fourth audio signal, and the user lip video by using the voice quality enhancement model.
  3. The method according to claim 1, wherein the determining the target sound source direction from the at least one combined direction comprises: determining the target sound source direction from the at least one combined direction based on at least one parameter, wherein the at least one parameter further comprises at least one of: a parameter indicating whether the electronic device has successfully performed speech interaction with a user within a preset time period and a preset angle range corresponding to each combined direction, wherein the preset time period is a time period between a current time and a historical time; or an included angle between each combined direction and a direction perpendicular to a display of the electronic device.
  4. The method according to claim 3, wherein the determining the target sound source direction from the at least one combined direction based on at least one parameter comprises: determining a confidence of each combined direction based on the at least one parameter; and determining a direction corresponding to a maximum confidence value in the at least one combined direction as the target sound source direction.
  5. The method according to claim 1, wherein the obtaining a second audio signal by using the microphone array comprises: obtaining the second audio signal in the target sound source direction by using the microphone array based on a beamforming technology.
  6. The method according to claim 1, wherein the first audio signal is a wake-up signal.
  7. An electronic device, comprising a microphone array, a camera, at least one processor, and one or more memories coupled to the at least one processor and storing programming instructions for execution by the at least one processor to: perform sound source localization on a first audio signal obtained by using the microphone array, wherein performing the sound source localization is used to obtain sound source direction information; process a first video obtained by using the camera, wherein processing the first video is used to obtain user direction information, the user direction information comprising at least one direction related to a user with respect to the camera; determine a target sound source direction based on the sound source direction information and the user direction information; obtain a user lip video in the target sound source direction by using the camera; obtain a second audio signal by using the microphone array; and obtain a third audio signal based on the second audio signal and the user lip video by using a voice quality enhancement model, wherein the voice quality enhancement model comprises a correspondence between a semantic meaning and a lip shape, wherein the user direction information comprises at least one of the following types of directions: a first type of direction, wherein the first type of direction comprises at least one direction in which lips in a moving state are located; a second type of direction, wherein the second type of direction comprises at least one direction in which a user looking at the electronic device is located; or a third type of direction, wherein the third type of direction comprises at least one direction in which a user is located, wherein the sound source direction information comprises at least one sound source direction, and wherein the programming instructions are for execution by the at least one processor to: combine the at least one sound source direction and the at least one type of
direction to obtain at least one combined direction; and determine the target sound source direction from the at least one combined direction based on at least one parameter, wherein the at least one parameter comprises total frequency at which each of the at least one combined direction is detected in the sound source direction and the at least one type of direction.
  8. The electronic device according to claim 7, wherein the electronic device further comprises a directional microphone, and the programming instructions are for execution by the at least one processor to: obtain a fourth audio signal in the target sound source direction by using the directional microphone, wherein the programming instructions are for execution by the at least one processor to: obtain the third audio signal based on the second audio signal, the fourth audio signal, and the user lip video by using the voice quality enhancement model.
  9. The electronic device according to claim 8, wherein the directional microphone is fastened to the camera.
  10. The electronic device according to claim 7, wherein the programming instructions are for execution by the at least one processor to: determine the target sound source direction from the at least one combined direction based on at least one parameter, wherein the at least one parameter further comprises at least one of: a parameter indicating whether the electronic device has successfully performed speech interaction with a user within a preset time period and a preset angle range corresponding to each combined direction, wherein the preset time period is a time period between a current time and a historical time; or an included angle between each combined direction and a direction perpendicular to a display of the electronic device.
  11. The electronic device according to claim 10, wherein the programming instructions are for execution by the at least one processor to: determine a confidence of each combined direction based on the at least one parameter; and determine a direction corresponding to a maximum confidence value in the at least one combined direction as the target sound source direction.
  12. The electronic device according to claim 7, wherein the programming instructions are for execution by the at least one processor to: obtain the second audio signal in the target sound source direction by using the microphone array based on a beamforming technology.
  13. The electronic device according to claim 7, wherein the first audio signal is a wake-up signal.
  14. The electronic device according to claim 7, wherein the electronic device is a smart television.
  15. A non-transitory computer-readable storage medium applied to an electronic device, wherein the electronic device comprises a microphone array and a camera, and wherein the non-transitory computer-readable storage medium stores programming instructions for execution by at least one processor, that when executed by the at least one processor, cause a computer to perform operations comprising: performing sound source localization on a first audio signal obtained by using the microphone array, wherein performing the sound source localization is used to obtain sound source direction information; processing a first video obtained by using a camera, wherein processing the first video is used to obtain user direction information, the user direction information comprising at least one direction related to a user with respect to the camera; determining a target sound source direction based on the sound source direction information and the user direction information; obtaining a user lip video in the target sound source direction by using the camera; obtaining a second audio signal by using the microphone array; and obtaining a third audio signal based on the second audio signal and the user lip video by using a voice quality enhancement model, wherein the voice quality enhancement model comprises a correspondence between a semantic meaning and a lip shape, wherein the user direction information comprises at least one of the following types of directions: a first type of direction, wherein the first type of direction comprises at least one direction in which lips in a moving state are located; a second type of direction, wherein the second type of direction comprises at least one direction in which a user looking at the electronic device is located; or a third type of direction, wherein the third type of direction comprises at least one direction in which a user is located, wherein the sound source direction information comprises at least one sound source
direction, and wherein determining the target sound source direction based on the sound source direction information and the user direction information comprises: combining the at least one sound source direction and the at least one type of direction to obtain at least one combined direction; and determining the target sound source direction from the at least one combined direction based on at least one parameter, wherein the at least one parameter comprises total frequency at which each of the at least one combined direction is detected in the sound source direction and the at least one type of direction.
  16. The non-transitory computer-readable storage medium according to claim 15, wherein the electronic device further comprises a directional microphone, and the operations further comprise: obtaining a fourth audio signal in the target sound source direction by using the directional microphone, wherein the obtaining a third audio signal based on the second audio signal and the user lip video in the target sound source direction by using a voice quality enhancement model comprises: obtaining the third audio signal based on the second audio signal, the fourth audio signal, and the user lip video by using the voice quality enhancement model.
  17. The non-transitory computer-readable storage medium according to claim 15, wherein the determining the target sound source direction from the at least one combined direction comprises: determining the target sound source direction from the at least one combined direction based on at least one parameter, wherein the at least one parameter further comprises at least one of: a parameter indicating whether the electronic device has successfully performed speech interaction with a user within a preset time period and a preset angle range corresponding to each combined direction, wherein the preset time period is a time period between a current time and a historical time; or an included angle between each combined direction and a direction perpendicular to a display of the electronic device.
  18. The non-transitory computer-readable storage medium according to claim 17, wherein the determining the target sound source direction from the at least one combined direction based on at least one parameter comprises: determining a confidence of each combined direction based on the at least one parameter; and determining a direction corresponding to a maximum confidence value in the at least one combined direction as the target sound source direction.
  19. The non-transitory computer-readable storage medium according to claim 15, wherein the obtaining a second audio signal by using the microphone array comprises: obtaining the second audio signal in the target sound source direction by using the microphone array based on a beamforming technology.
  20. The non-transitory computer-readable storage medium according to claim 15, wherein the first audio signal is a wake-up signal.
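Claims 1, 3, and 4 together describe combining acoustically and visually detected directions, scoring each combined direction, and selecting the one with the maximum confidence. The sketch below is a loose illustration of that logic only; the angular tolerance, the weights, and every identifier are assumptions rather than anything recited in the claims:

```python
def combined_directions(sound_dirs, user_dir_types, tol_deg=10.0):
    """A sound source direction counts as 'combined' when at least one
    vision-derived direction (moving lips, gaze, or user presence)
    corroborates it within an angular tolerance."""
    user_dirs = [d for dirs in user_dir_types for d in dirs]
    combined = [s for s in sound_dirs
                if any(abs(s - u) <= tol_deg for u in user_dirs)]
    return combined, user_dirs

def confidence(direction, sound_dirs, user_dirs, recent_success_dirs,
               tol_deg=10.0, display_normal_deg=0.0,
               w_freq=1.0, w_history=2.0, w_angle=0.5):
    """Illustrative confidence: total detection frequency across all sources
    (claim 1), a bonus for a recent successful speech interaction within the
    preset angle range (claim 3), and a penalty for the included angle to the
    display normal (claim 3)."""
    freq = sum(1 for d in sound_dirs + user_dirs if abs(direction - d) <= tol_deg)
    history = any(abs(direction - d) <= tol_deg for d in recent_success_dirs)
    off_axis = abs(direction - display_normal_deg) / 90.0
    return w_freq * freq + w_history * history - w_angle * off_axis

def target_direction(sound_dirs, user_dir_types, recent_success_dirs=()):
    """Claim 4: the combined direction with the maximum confidence wins."""
    combined, user_dirs = combined_directions(sound_dirs, user_dir_types)
    if not combined:
        return None
    return max(combined, key=lambda c: confidence(c, sound_dirs, user_dirs,
                                                  list(recent_success_dirs)))
```

For example, with sound sources detected at 30° and 120° but lip movement and gaze detected only near 30°, the 30° direction accumulates more supporting detections and is selected.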

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a national stage of International Application No. PCT/CN2021/118948, filed on Sep. 17, 2021, which claims priority to Chinese Patent Application No. 202011065346.1, filed on Sep. 30, 2020. Both of the aforementioned applications are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

Embodiments of this application relate to the acoustics field, and more specifically, to a signal processing method and an electronic device.

BACKGROUND

Currently, an intelligent device such as a smart television, a smart speaker, or a smart electric light can perform far-field sound pickup. For example, a user utters an instruction of "turning off a light" from 5 meters away, and the intelligent device picks up the speech, recognizes it, and controls the light to perform a corresponding turn-off action. In a common far-field sound pickup technology, an audio signal is picked up by using a microphone array, and ambient noise and echo are suppressed by using a beamforming technology and an echo cancellation algorithm, to obtain a clear audio signal.

However, there may be various types of noise and interference in an actual environment, for example, noise from cooking and dish washing in a kitchen, noise from a television program, and interference noise from family chatting. In addition, rooms of some families are large and open, or walls are decorated by using materials with a large acoustic reflection coefficient. As a result, reverberation is severe, and sound is likely to be unclear. All these adverse factors cause a great reduction in definition of sound picked up by using the microphone array, greatly reducing a speech recognition rate. Therefore, a technology needs to be provided to greatly improve speech recognition efficiency.

SUMMARY

Embodiments of this application provide a signal processing method and an electronic device.
A target sound source direction in which a user performing speech interaction with an electronic device is located is determined by using an audio signal and based on a video obtained by using a camera. Further, based on a user lip video obtained in the target sound source direction by using the camera and a preset voice quality enhancement model, voice quality enhancement is performed on a picked-up audio signal to obtain or restore a clear audio signal, so that speech recognition efficiency can be greatly improved.

According to a first aspect, a signal processing method is provided, applied to an electronic device. The electronic device includes a microphone array and a camera, and the method includes: performing sound source localization on a first audio signal obtained by using the microphone array, to obtain sound source direction information; processing a first video obtained by using the camera, to obtain user direction information; determining a target sound source direction based on the sound source direction information and the user direction information; obtaining a user lip video in the target sound source direction by using the camera; obtaining a second audio signal by using the microphone array; and obtaining a third audio signal based on the second audio signal and the user lip video by using a voice quality enhancement model, where the voice quality enhancement model includes a correspondence between a semantic meaning and a lip shape.

The sound source direction information includes at least one sound source direction, and the at least one sound source direction includes the target sound source direction. The user direction information includes some directions related to a user, for example, includes at least one type of direction related to the user. The target sound source direction is a direction in which a target user performing speech interaction with the electronic device is located, that is, a source direction of sound made by the target user.
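Claim 5 recites obtaining the second audio signal in the target sound source direction by beamforming. A minimal delay-and-sum beamformer for a linear array might look like the following sketch, under simplifying assumptions (far-field source, integer-sample delays, and hypothetical names throughout):

```python
import numpy as np

def delay_and_sum(signals, mic_x_m, target_dir_deg, fs=16000, c=343.0):
    """Steer a linear microphone array toward target_dir_deg by delaying each
    channel so that sound arriving from that direction adds coherently.
    signals: (n_mics, n_samples) array; mic_x_m: microphone x-coordinates.
    The angle is measured from the array axis."""
    theta = np.deg2rad(target_dir_deg)
    # Per-microphone arrival-time offsets for a far-field plane wave.
    delays_s = np.asarray(mic_x_m) * np.cos(theta) / c
    delays = np.round((delays_s - delays_s.min()) * fs).astype(int)
    n = signals.shape[1]
    out = np.zeros(n)
    for sig, d in zip(signals, delays):
        out[: n - d] += sig[d:]  # advance each channel by its delay
    return out / len(signals)
```

Signals from the steered direction add in phase while off-axis interference adds incoherently, which is one concrete way to "obtain the second audio signal in the target sound source direction."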
The user lip video records a plurality of lip shapes during speech of the user. There is a correspondence between a lip shape and a semantic meaning, that is, one lip shape may correspond to one or more semantic meanings. When the user is not speaking, lips are in a still state. Actually, the user lip video in the target sound source direction may also be understood as a lip video of the target user. The voice quality enhancement model performs sound pickup enhancement on an audio signal, to enhance an audio signal in the target sound source direction, and suppresses or cancels an audio signal that is in another direction and that is produced by a speaker or background noise, so as to obtain or restore a clear audio signal. The voice quality enhancement model in this embodiment of this application integrates audio and video information, and integrates a correspondence between a semantic meaning and a lip shape, that is, one or more semantic meanings may correspond to one lip shape. For example, the camera is a rotatable camera. After the target sound sourc
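The audio-visual fusion inside the voice quality enhancement model can be caricatured as projecting an audio feature and a lip-shape feature into a shared space and predicting a per-frequency gain mask. This is purely a toy illustration: the document specifies no network architecture, and every shape, weight, and name below is an assumption:

```python
import numpy as np

def av_enhance(audio_feat, lip_feat, w_audio, w_lip, w_out):
    """Toy audio-visual enhancement step.
    audio_feat: (F,) noisy spectral feature for one frame
    lip_feat:   (L,) lip-shape embedding for the same frame
    Returns the masked (enhanced) spectral feature."""
    h = np.tanh(audio_feat @ w_audio + lip_feat @ w_lip)  # fused embedding
    mask = 1.0 / (1.0 + np.exp(-(h @ w_out)))             # sigmoid gain in (0, 1)
    return audio_feat * mask
```

Because the mask lies in (0, 1), the output never exceeds the input magnitude in any frequency bin; a trained model would learn the weights from paired audio and lip-video data so that energy inconsistent with the observed lip shapes is attenuated.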