US-20260128053-A1 - SYSTEMS AND METHODS FOR ENHANCING SPEECH AUDIO SIGNALS

US 20260128053 A1

Abstract

A method and device for enhancing speech audio signals of an individual in a noisy environment based on a user's gaze and a captured image of the user's environment. A direction of the user's gaze is determined using image sensors configured to capture an orientation of the user's eyes, and an image of the user's environment is captured. Spatial audio is captured and analyzed, along with the gaze direction and the image of the environment, to enhance audio of an active speaker.

Inventors

  • Ning Xu
  • Zhiyun Li

Assignees

  • ADEIA GUIDES INC.

Dates

Publication Date
2026-05-07
Application Date
2025-12-30

Claims (20)

  1. A method performed by an extended reality (XR) device being worn by a user, the method comprising: providing a view of an environment of the user, the environment comprising at least a first person and a second person, wherein the user is distinct from the first person and the second person; determining, based at least in part on a signal received by the XR device, that the first person comprises an active speaker; processing an audio originating from the active speaker to generate text corresponding to the audio; and modifying the view of the environment to include the text displayed proximate to a portion of the view of the active speaker.
  2. The method of claim 1, wherein: the XR device comprises at least one camera configured to capture images of the environment, to provide the view of the environment; the signal received by the XR device comprises the captured images; and the captured images include at least one of the first person or the second person.
  3. The method of claim 2, further comprising: inputting the captured images to a machine learning model configured to output data indicative of the active speaker.
  4. The method of claim 2, wherein determining that the first person comprises the active speaker comprises: processing the captured images to identify facial characteristics of the first person; and determining, based at least in part on the facial characteristics, that the first person is actively speaking.
  5. The method of claim 2, wherein the text of the audio is displayed below a portion of the view corresponding to the active speaker.
  6. The method of claim 1, wherein: processing the audio comprises performing at least one of an automatic speech recognition process or a speech-to-text technique to generate the text of the audio; and displaying the text of the audio comprises projecting the text into a field of vision of the user of the XR device.
  7. The method of claim 1, wherein the audio originating from the first person comprises first audio, the method further comprising: processing the first audio to provide enhanced audio of the first audio, wherein the enhanced audio separates the first audio from background noise of the environment and from second audio originating from the second person.
  8. The method of claim 1, wherein determining, based at least in part on the signal received by the XR device, that the first person comprises the active speaker comprises: determining a direction of a gaze of the user, based at least in part on one or more images of the user captured by a first camera of the XR device; determining a location of the first person based at least in part on one or more images of the environment captured by a second camera of the XR device; and determining that the location of the first person corresponds to the gaze direction (see the sketch after this list).
  9. The method of claim 1, wherein the text is positioned closer to the portion of the environment comprising the first person than to another portion of the environment comprising the second person.
  10. The method of claim 1, wherein the environment is a real-world environment, and the text comprises an augmented reality object or mixed reality object overlaid on the view of the real-world environment.
  11. An extended reality (XR) device being worn by a user, the XR device comprising: at least one sensor; and control circuitry configured to: provide a view of an environment of the user, the environment comprising at least a first person and a second person, wherein the user is distinct from the first person and the second person; determine, based at least in part on a signal received by the XR device, that the first person comprises an active speaker; process an audio originating from the active speaker to generate text corresponding to the audio; and modify the view of the environment to include the text displayed proximate to a portion of the view of the active speaker.
  12. The XR device of claim 11, further comprising: at least one camera configured to capture images of the environment, to provide the view of the environment; wherein the signal received by the XR device comprises the captured images; and wherein the captured images include at least one of the first person or the second person.
  13. The XR device of claim 12, wherein the control circuitry is further configured to: input the captured images to a machine learning model configured to output data indicative of the active speaker.
  14. The XR device of claim 12, wherein the control circuitry is further configured to determine that the first person comprises the active speaker by: processing the captured images to identify facial characteristics of the first person; and determining, based at least in part on the facial characteristics, that the first person is actively speaking.
  15. The XR device of claim 12, wherein the control circuitry is further configured to display the text of the audio below a portion of the view corresponding to the active speaker.
  16. The XR device of claim 11, wherein the control circuitry is further configured to: process the audio by performing at least one of an automatic speech recognition process or a speech-to-text technique to generate the text of the audio; and display the text of the audio by projecting the text into a field of vision of the user of the XR device.
  17. The XR device of claim 11, wherein the audio originating from the first person comprises first audio, and the control circuitry is further configured to: process the first audio to provide enhanced audio of the first audio, wherein the enhanced audio separates the first audio from background noise of the environment and from second audio originating from the second person.
  18. The XR device of claim 17, wherein the control circuitry is further configured to determine, based at least in part on the signal received by the XR device, that the first person comprises the active speaker by: determining a direction of a gaze of the user, based at least in part on one or more images of the user captured by a first camera of the XR device; determining a location of the first person based at least in part on one or more images of the environment captured by a second camera of the XR device; and determining that the location of the first person corresponds to the gaze direction.
  19. The XR device of claim 11, wherein the text is positioned closer to the portion of the environment comprising the first person than to another portion of the environment comprising the second person.
  20. The XR device of claim 11, wherein the environment is a real-world environment, and the text comprises an augmented reality object or mixed reality object overlaid on the view of the real-world environment.
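Claims 8 and 18 recite matching a detected person's location against the user's gaze direction to select the active speaker. As a minimal, hypothetical sketch (the claims do not specify an algorithm), assuming the gaze and each detected face have already been reduced to horizontal bearings in the device's coordinate frame, the correspondence test might look like the following; the function name and tolerance are illustrative assumptions:

```python
import numpy as np

def match_gaze_to_person(gaze_angle_deg, person_angles_deg, tolerance_deg=10.0):
    """Return the index of the detected person whose bearing best matches
    the user's gaze, or None if no person falls within the tolerance.

    gaze_angle_deg: horizontal gaze direction relative to the device.
    person_angles_deg: bearing of each detected face, same reference frame.
    """
    person_angles = np.asarray(person_angles_deg, dtype=float)
    errors = np.abs(person_angles - gaze_angle_deg)
    best = int(np.argmin(errors))
    return best if errors[best] <= tolerance_deg else None
```

For example, with a gaze bearing of 12 degrees and detected faces at -30, 10, and 45 degrees, the function returns index 1, i.e., the middle person.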

Description

CROSS-REFERENCE TO RELATED APPLICATION

This is a continuation of U.S. application Ser. No. 18/228,466, filed Jul. 31, 2023, the disclosure of which is hereby incorporated by reference herein in its entirety.

BACKGROUND

This disclosure relates to enhancing audio speech signals of a speaker in a noisy environment. In particular, techniques are disclosed for identifying and enhancing audio signals of a speaker based on a user's determined direction of gaze and image analysis of the user's environment.

SUMMARY

It can be challenging for many individuals to hear conversational speech in crowded and noisy environments, such as social gatherings in confined spaces, loud restaurants, and the like. In particular, individuals with hearing loss or impairments often struggle to make out voices in conversations that take place against loud background environmental noise. Focusing on the speech of a particular individual is difficult in settings with multiple speakers talking simultaneously or with significant background noise. This obstacle, which affects those with normal hearing as well as individuals with hearing loss, is known as the "cocktail party effect": a person's auditory processing ability is limited when attempting to focus on a single voice while filtering out other voices and environmental sounds.

A number of technological solutions have been suggested. Electronic hearing aids amplify surrounding voices and sounds, but are not designed to identify, distinguish, or enhance one voice out of many. In some solutions, a wireless connection between a listener's headphones and a microphone placed close to a speaker can prove helpful. However, this requires that the microphone or a similar recording device be placed physically close to a first speaker, which may be cumbersome or impractical. Additionally, if the conversation shifts to a second speaker in a different location, the microphone must be physically relocated close to the second speaker to continue receiving high-quality speech audio. Another solution involves a microphone array configured to use beamforming techniques to focus on a specific audio source from a distance. In practice, however, implementing a sufficiently narrow audio pickup angle is difficult for most microphones, and microphones capable of very narrow pickup angles, such as the shotgun microphones often used in video production, are large and cumbersome. In other solutions, the orientation of a user is determined and used to identify a source of audio. However, when capturing audio in a confined space with many speakers, e.g., a conversation among many people at a restaurant table, it is challenging to differentiate between adjacent speakers based on orientation alone. Additionally, capturing an image of a user environment and applying image analysis to identify an active speaker can require significant processing power to efficiently single out one active speaker from a larger group.

This disclosure addresses these shortcomings. In the disclosed embodiments, the direction of a user's gaze is determined and used to identify an active speaker, and audio signals from the identified active speaker are focused on, e.g., using beamforming algorithms, and enhanced. Image sensors, e.g., cameras mounted on the interior of a pair of glasses, capture images of the user's eyes to determine the gaze direction.
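The disclosure does not prescribe a particular gaze-estimation algorithm. As a minimal sketch, assuming a dark-pupil approach on a grayscale image from an inward-facing eye camera, the horizontal gaze angle might be estimated as follows; the thresholding scheme and the calibration constant are assumptions, not the patented method:

```python
import numpy as np

def estimate_gaze_direction(eye_image: np.ndarray) -> float:
    """Estimate a horizontal gaze angle (degrees) from a grayscale eye image.

    Assumes the pupil is the darkest region of the image and maps its
    horizontal offset from the image center to an angle. The calibration
    constant is a placeholder that a real device would measure per user.
    """
    # Isolate the darkest 5% of pixels as a crude pupil mask.
    threshold = np.percentile(eye_image, 5)
    ys, xs = np.nonzero(eye_image <= threshold)
    if xs.size == 0:
        return 0.0  # no pupil found; assume straight ahead
    pupil_x = xs.mean()
    # Normalized horizontal offset in [-1, 1] from the image center.
    half_width = eye_image.shape[1] / 2
    offset = (pupil_x - half_width) / half_width
    MAX_EYE_ANGLE_DEG = 35.0  # assumed range of horizontal eye rotation
    return offset * MAX_EYE_ANGLE_DEG
```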
Additional cameras pointed away from the user capture sequential images and/or video of the environment in front of the user, and, based on the gaze direction and the captured images or video, a current active speaker is determined. Spatial audio is captured using a microphone or microphone array. Based on the gaze direction and the captured images or video, audio of the active speaker is focused on, e.g., by adjusting microphone sensitivity using beamforming algorithms. The audio can be identified as speech of an active speaker and is presented to the user in an enhanced format. In an embodiment, speech enhancement is performed on audio signals received from the active speaker, for example using a machine learning model. When audio of the active speaker is identified, the spatial audio is played back to the user, e.g., using headphones or speakers, with the volume of environmental audio, such as background noise, reduced and/or the volume of the active speaker audio increased. In a further embodiment, video images of the active speaker are captured and analyzed to perform voice separation and generate a refined voice signal. The refined voice signal is used as input to an automatic speech recognition function to produce more accurate text output of the active speaker's speech. In some embodiments, machine learning models are implemented to enhance the text output.
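The summary refers to beamforming to steer microphone-array sensitivity toward the speaker. A minimal frequency-domain delay-and-sum beamformer, assuming far-field (plane-wave) arrival and known microphone geometry, could look like the sketch below; it is illustrative, not the embodiment's algorithm:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals, mic_positions, direction, fs):
    """Steer a microphone array toward `direction` (unit vector to the source).

    signals: (num_mics, num_samples) array of synchronized recordings.
    mic_positions: (num_mics, 3) microphone coordinates in meters.
    fs: sample rate in Hz. Returns the beamformed mono signal.
    """
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    # A plane wave from `direction` reaches mics with larger p.d earlier;
    # delaying each mic by p.d / c time-aligns the speaker's wavefront.
    delays = mic_positions @ direction / SPEED_OF_SOUND
    delays -= delays.min()  # keep all delays nonnegative
    num_mics, n = signals.shape
    out = np.zeros(n)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    for m in range(num_mics):
        # Apply a fractional-sample delay as a phase ramp in the frequency domain.
        spectrum = np.fft.rfft(signals[m])
        spectrum *= np.exp(-2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spectrum, n)
    return out / num_mics
```

Summing the aligned channels reinforces sound from the gaze direction while off-axis sources add incoherently, which is the sensitivity adjustment the summary describes.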
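Similarly, the described playback step (reducing background volume and/or boosting the active speaker) amounts to a simple gain mix once the speaker's signal has been separated. A sketch, with arbitrary placeholder gains:

```python
import numpy as np

def emphasize_speaker(speaker_audio, background_audio,
                      speaker_gain=1.5, background_gain=0.2):
    """Mix separated speaker audio with attenuated background audio.

    Both inputs are float arrays at the same sample rate; the result is
    clipped to [-1, 1] to avoid overflow on playback.
    """
    mixed = (speaker_gain * np.asarray(speaker_audio)
             + background_gain * np.asarray(background_audio))
    return np.clip(mixed, -1.0, 1.0)
```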