
US-12621625-B2 - Spatial audio communication between devices with speaker array and/or microphone array

US 12621625 B2

Abstract

The technology generally relates to spatial audio communication between devices. For example, a first device and a second device may be connected via a communication link. The first device may capture audio signals in an environment through two or more microphones. The first device may encode the captured audio with direction information. The first device may transmit the encoded audio via the communication link to the second device. The second device may decode the encoded audio to be output by one or more speakers of the second device. The second device may output the decoded audio to recreate positions of the captured audio signals.
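As a rough sketch of the flow described above, in which the first device encodes captured audio together with direction information and the second device decodes it for output, consider the following. The packet layout, field names, and use of JSON are illustrative assumptions, not details from the patent.

```python
# Illustrative sketch of the device-to-device flow described in the abstract:
# capture on the first device, encode with direction information, transmit,
# and decode on the second device. All field names are hypothetical.
import json

def encode_audio_packet(samples, direction_deg, capture_time_s):
    """First device: bundle captured samples with the estimated direction."""
    return json.dumps({
        "direction_deg": direction_deg,
        "capture_time_s": capture_time_s,
        "samples": samples,
    }).encode("utf-8")

def decode_audio_packet(packet):
    """Second device: recover the samples and the direction metadata so the
    output stage can recreate the position of the captured signal."""
    data = json.loads(packet.decode("utf-8"))
    return data["samples"], data["direction_deg"], data["capture_time_s"]

# Example round trip over a hypothetical communication link.
packet = encode_audio_packet(samples=[0.0, 0.1, 0.2], direction_deg=30.0,
                             capture_time_s=12.345)
samples, direction_deg, capture_time_s = decode_audio_packet(packet)
print(direction_deg)  # 30.0
```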

Inventors

  • Jian Guo
  • Frances Maria Hui Hong Kwee

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-05
Application Date
2023-03-21

Claims (20)

  1. A method comprising: receiving a first audio signal sensed by a first audio sensor, the received first audio signal sensed from first sound waves emitted by a source emitter, the first audio sensor oriented in a first direction with respect to the source emitter; receiving a second audio signal sensed by a second audio sensor, the received second audio signal sensed from second sound waves emitted by the source emitter, the second audio sensor oriented in a second direction with respect to the source emitter; determining, based on the received first and second audio signals, a combined direction, the combined direction related to the first direction and the second direction; generating audio data, the audio data: configured for output by an output device; and comprising an output audio signal associated with the first and second sound waves emitted by the source emitter; and encoding the audio data with direction information, the direction information based on the combined direction.
  2. The method of claim 1, wherein the received first audio signal and the received second audio signal are based on first and second sound waves, respectively, emitted from the source emitter at a same time.
  3. The method of claim 1, wherein the first and second audio sensors are first and second microphones, respectively, arranged around a recording device, the method being performed by the recording device, the recording device being a mobile computing device, a smartphone, a smart watch, true wireless earbuds, hearing aids, an AR/VR headset, a smart helmet, a computer, a laptop, a tablet, or a home assistant device.
  4. The method of claim 1, wherein the output audio signal is separated into multiple channel audio signals, each of the multiple channel audio signals associated with one of the audio sensors.
  5. The method of claim 1, wherein the determination of the combined direction is based at least in part on comparing a first timestamp for the received first audio signal and a second timestamp for the received second audio signal, wherein the first and second timestamps indicate a time of receipt of the first and second sound waves from the source emitter at the first and second audio sensors, respectively.
  6. The method of claim 5, wherein the direction information comprises the first timestamp and the second timestamp.
  7. The method of claim 1, wherein the configuring of the audio data for output by an output device comprises: determining an output time for each of two or more speakers; or determining an output volume for each of the two or more speakers.
  8. The method of claim 7, further comprising outputting, based on the direction information, the output audio signal to the two or more speakers, wherein the output of the output audio signal arrives at a fixed point with a same audio composition as if the signal had come from a source emitter in a fixed-point direction, the fixed-point direction being relative to the fixed point.
  9. A device, comprising: a first audio sensor; a second audio sensor; and one or more processors, the one or more processors configured to: receive, by the first audio sensor, a first audio signal, the first audio signal sensed from first sound waves emitted by a source emitter, the first audio sensor oriented in a first direction with respect to the source emitter; receive, by the second audio sensor, a second audio signal, the second audio signal sensed from second sound waves emitted by the source emitter, the second audio sensor oriented in a second direction with respect to the source emitter; determine, based on the first and second audio signals, a combined direction, the combined direction related to the first direction and the second direction; generate audio data, the audio data: configured for output by an output device; and comprising an output audio signal associated with the first and second sound waves emitted by the source emitter; and encode the audio data with direction information, the direction information based on the combined direction.
  10. The device of claim 9, wherein the first audio signal and the second audio signal are based on first and second sound waves, respectively, emitted from the source emitter at a same time.
  11. The device of claim 9, wherein the first and second audio sensors are first and second microphones, respectively, arranged around the device, the device being a mobile computing device, a smartphone, a smart watch, true wireless earbuds, hearing aids, an AR/VR headset, a smart helmet, a computer, a laptop, a tablet, or a home assistant device.
  12. The device of claim 9, wherein the determination of the combined direction is based at least in part on comparing a first timestamp for the received first audio signal and a second timestamp for the received second audio signal, wherein the first and second timestamps indicate a time of receipt of the first and second sound waves from the source emitter at the first and second audio sensors, respectively.
  13. The device of claim 9, wherein the determination of the combined direction is based at least in part on comparing a first signal strength for the received first audio signal and a second signal strength for the received second audio signal.
  14. The device of claim 9, wherein the output audio signal is separated into multiple channel audio signals, each of the multiple channel audio signals associated with one of the audio sensors.
  15. A non-transitory computer-readable medium storing instructions, which when executed by one or more processors cause the one or more processors to: receive a first audio signal sensed by a first audio sensor, the first audio signal sensed from first sound waves emitted by a source emitter, the first audio sensor oriented in a first direction with respect to the source emitter; receive a second audio signal sensed by a second audio sensor, the second audio signal sensed from second sound waves emitted by the source emitter, the second audio sensor oriented in a second direction with respect to the source emitter; determine, based on the received first and second audio signals, a combined direction, the combined direction related to the first direction and the second direction; generate audio data, the audio data: configured for output by an output device; and comprising an output audio signal associated with the first and second sound waves emitted by the source emitter; and encode the audio data with direction information, the direction information based on the combined direction.
  16. The non-transitory computer-readable medium of claim 15, wherein the first and second audio sensors are first and second microphones, respectively, arranged around a recording device, the recording device comprising the one or more processors and being a mobile computing device, a smartphone, a smart watch, true wireless earbuds, hearing aids, an AR/VR headset, a smart helmet, a computer, a laptop, a tablet, or a home assistant device.
  17. The non-transitory computer-readable medium of claim 15, wherein the determination of the combined direction is based at least in part on comparing a first timestamp for the received first audio signal and a second timestamp for the received second audio signal, wherein the first and second timestamps indicate a time of receipt of the first and second sound waves from the source emitter at the first and second audio sensors, respectively.
  18. The non-transitory computer-readable medium of claim 15, wherein the determination of the combined direction is based at least in part on comparing a first signal strength for the received first audio signal and a second signal strength for the received second audio signal.
  19. The non-transitory computer-readable medium of claim 15, wherein the received first audio signal and the received second audio signal are based on first and second sound waves, respectively, emitted from the source emitter at a same time.
  20. The non-transitory computer-readable medium of claim 15, wherein the output audio signal is separated into multiple channel audio signals, each of the multiple channel audio signals associated with one of the audio sensors.
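
Claims 5, 12, and 17 determine the combined direction by comparing timestamps of receipt of the sound waves at the two audio sensors, and claims 13 and 18 by comparing signal strengths. The following is a rough illustration of what such comparisons could look like for a simple two-microphone case; the far-field arrival-angle formula, the speed-of-sound constant, and the microphone spacing are assumptions for illustration, not details taken from the patent.

```python
# Hypothetical sketch of the timestamp comparison in claims 5, 12, and 17:
# estimating a combined arrival direction from the time difference between
# two microphones, plus the signal-strength comparison of claims 13 and 18.
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly at room temperature (assumed constant)

def combined_direction(timestamp_1, timestamp_2, mic_spacing_m):
    """Return an arrival angle in degrees relative to the array broadside.

    timestamp_1, timestamp_2: times (seconds) the same wavefront reached
    microphone 1 and microphone 2, respectively.
    """
    dt = timestamp_2 - timestamp_1
    # Clamp so rounding error cannot push the argument outside asin's domain.
    sin_theta = max(-1.0, min(1.0, SPEED_OF_SOUND * dt / mic_spacing_m))
    return math.degrees(math.asin(sin_theta))

def stronger_side(rms_1, rms_2):
    """Coarse left/right cue from comparing signal strengths."""
    return "toward mic 1" if rms_1 > rms_2 else "toward mic 2"

# Example: mics 15 cm apart, wavefront hits mic 2 about 0.2 ms after mic 1.
print(combined_direction(0.0, 0.0002, 0.15))  # roughly 27 degrees off broadside
print(stronger_side(0.8, 0.5))
```

Claim 6 notes that the direction information may simply comprise the two timestamps themselves, in which case a receiving device could perform a computation like this one on its own side.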

Description

BACKGROUND

Devices may be used for communication between two or more users when the users are separated by a distance, such as for teleconferencing, video conferencing, phone calls, etc. Each device may have a microphone and speaker array. A microphone of a first device may capture audio signals, such as speech of a first user. The captured audio may be transmitted, via a communication link, to a second device for output by speakers of the second device. The transmitted audio and the output audio may be mono audio, thereby lacking spatial cues. A second user listening to the output audio may, therefore, have a dull listening experience, as, without spatial cues, the second user may not have an indication of where the first user was positioned relative to the first device. Moreover, mono audio may prevent the user from having an immersive experience as the speakers of the second device may output the audio equally, thereby failing to provide spatial cues.

SUMMARY

The technology generally relates to spatial audio communication between devices. For example, a first device and a second device may be connected via a communication link. The first device may capture audio signals in an environment through two or more microphones. The first device may encode the captured audio with location information. The first device may transmit the encoded audio via the communication link to the second device. The second device may decode the encoded audio to be output by one or more speakers of the second device. The second device may output the decoded audio to recreate positions of the captured audio signals.

A first aspect of this disclosure generally relates to a device comprising one or more processors. The one or more processors may be configured to receive, from two or more microphones, audio input, determine, based on the received audio input, a location of a source of the audio input relative to the device, and encode audio data associated with the audio input and the determined location. The one or more processors may be further configured to encode the audio data and the determined location with a timestamp, wherein the timestamp indicates a time the two or more microphones received the audio input. When determining the location of the source, the one or more processors may be further configured to triangulate the location based on a time each of the two or more microphones received the audio input. The one or more processors may be configured to receive encoded audio from a second device. The one or more processors may be further configured to decode the received encoded audio. The device may further comprise two or more speakers. When decoding the received encoded audio, the one or more processors may be configured to decode the received encoded audio based on the two or more speakers. The one or more processors may be further configured to output the received encoded audio based on the one or more speakers.

Another aspect of this disclosure generally relates to a method comprising the following: receiving, by one or more processors from a device including two or more microphones, audio input; determining, by the one or more processors and based on the received audio input, a location of a source of the audio input relative to the device; and encoding, by the one or more processors, audio data associated with the audio input and the determined location.
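
The aspects above describe triangulating the location of the source based on a time each of the two or more microphones received the audio input. Below is a minimal sketch of one way such a triangulation could be carried out, using a brute-force grid search over candidate positions; the solver, the microphone layout, and the numeric values are assumptions for illustration only, not part of the disclosure.

```python
# Illustrative triangulation of a source location from per-microphone arrival
# times, via a brute-force search over a 2D grid of candidate positions.
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def locate_source(mic_positions, arrival_times, grid_step=0.05, extent=3.0):
    """Return the (x, y) grid point whose predicted arrival-time differences
    best match the observed ones, relative to the first microphone."""
    best_point, best_err = None, float("inf")
    steps = int(extent / grid_step)
    for ix in range(-steps, steps + 1):
        for iy in range(-steps, steps + 1):
            x, y = ix * grid_step, iy * grid_step
            dists = [math.hypot(x - mx, y - my) for mx, my in mic_positions]
            # Predicted delay of each mic relative to mic 0.
            pred = [(d - dists[0]) / SPEED_OF_SOUND for d in dists]
            obs = [t - arrival_times[0] for t in arrival_times]
            err = sum((p - o) ** 2 for p, o in zip(pred, obs))
            if err < best_err:
                best_point, best_err = (x, y), err
    return best_point

# Three microphones on a small device (positions in meters, assumed layout).
mics = [(0.0, 0.0), (0.15, 0.0), (0.0, 0.15)]
# Synthetic arrival times for a source at roughly (1.0, 0.5), for demonstration.
src = (1.0, 0.5)
times = [math.hypot(src[0] - mx, src[1] - my) / SPEED_OF_SOUND for mx, my in mics]
print(locate_source(mics, times))  # approximately (1.0, 0.5)
```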
Yet another aspect of this disclosure generally relates to a non-transitory computer-readable medium storing instructions, which when executed by one or more processors cause the one or more processors to receive, from two or more microphones, audio input, determine, based on the received audio input, a location of a source of the audio input relative to the device, and encode audio data associated with the audio input and the determined location.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a functional block diagram of an example system in accordance with aspects of the disclosure.
FIGS. 2A and 2B illustrate example environments for capturing audio signals in accordance with aspects of the disclosure.
FIGS. 3A and 3B illustrate example environments for outputting audio signals in accordance with aspects of the disclosure.
FIG. 4 is a flow diagram illustrating an example method of encoding audio data with audio input according to aspects of the disclosure.

DETAILED DESCRIPTION

The technology generally relates to spatial audio communication between devices. For example, two or more devices may be connected via a communication link such that audio may be transmitted from one device to be output by another. A first device may capture audio signals in an environment through two or more microphones, the audio signals based on sound waves emitted from a source emitter. The two or more microphones may be arranged around the device and may be integrated or non-integrated with the device. The captured audio signals may be encoded with information on a direction of the source emitter.
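
Claims 7 and 8, and the output stage described above, involve determining an output time and an output volume for each of two or more speakers so that the rendered signal arrives at a fixed point as if it had come from the original direction. The delay-and-gain scheme below is one illustrative way such values could be derived; the speaker positions and the rendering math are assumptions for illustration, not details from the patent.

```python
# Hypothetical per-speaker delay/gain rendering: place a virtual source at the
# desired direction and distance from the listening point, then delay and
# attenuate each speaker according to its distance from that virtual source.
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def speaker_delays_and_gains(speaker_positions, source_direction_deg,
                             source_distance_m=2.0):
    """Per-speaker (delay_seconds, gain) for a virtual source placed at the
    given direction and distance from the listening point at the origin."""
    theta = math.radians(source_direction_deg)
    vx = source_distance_m * math.sin(theta)
    vy = source_distance_m * math.cos(theta)
    dists = [math.hypot(vx - sx, vy - sy) for sx, sy in speaker_positions]
    nearest = min(dists)
    # Later arrival and lower level for speakers farther from the virtual source.
    return [((d - nearest) / SPEED_OF_SOUND, nearest / d) for d in dists]

# Two speakers one meter to the left and right of the listening point (assumed).
speakers = [(-1.0, 0.0), (1.0, 0.0)]
for delay, gain in speaker_delays_and_gains(speakers, source_direction_deg=30.0):
    print(f"delay={delay * 1000:.2f} ms  gain={gain:.2f}")
```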