US-12626707-B2 - Low latency audio for immersive group communication sessions


Abstract

Aspects of the subject technology may provide low latency audio for group communication sessions. Low latency audio may be provided, in some examples, by an electronic device using a lowest audio block size that is lower than a lowest audio block size that is available to one or more other electronic devices in a group communication session.
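As an informal illustration of why a smaller audio block size lowers latency (this sketch is not part of the patent text), the time needed to fill one block equals the block size divided by the sampling rate. The function name, the block sizes of 240 and 960 samples, and the 48 kHz rate below are hypothetical values chosen only for the arithmetic.

```python
def block_duration_ms(block_size_samples: int, sample_rate_hz: int) -> float:
    """Time needed to capture one audio block, in milliseconds."""
    return 1000.0 * block_size_samples / sample_rate_hz


# Hypothetical values: two block sizes captured at the same 48 kHz rate.
low = block_duration_ms(240, 48_000)
high = block_duration_ms(960, 48_000)
print(f"{low:.1f} ms vs {high:.1f} ms")  # prints "5.0 ms vs 20.0 ms"
```

Under these assumed numbers, a device limited to the larger block must wait four times as long before a block can be sent or played.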

Inventors

Karthick Santhanam
Christopher M. GARRIDO
David L. Biderman
Luciano M. Verger
Patrick Miauton
Sachin Abhyankar

Assignees

APPLE INC.

Dates

Publication Date: 2026-05-12
Application Date: 2024-02-07

Claims (20)

1. A method, comprising: receiving, by a first device from a second device during a communication session between the first device and the second device, a first incoming audio stream including incoming audio blocks having a first audio block size, wherein the first audio block size is a lowest audio block size that is available at the first device and the second device; processing the first incoming audio stream through a first buffer having a first buffer size that is based on the first audio block size; determining, by the first device, that a third device has joined the communication session, wherein a lowest audio block size that is available at the third device is larger than the lowest audio block size that is available at the first device and the second device; receiving, by the first device from the third device, a second incoming audio stream including incoming audio blocks having a second audio block size corresponding to the lowest audio block size that is available at the third device; and processing the first incoming audio stream through a second buffer having a second buffer size that is based on the second audio block size.
2. The method of claim 1, wherein: the second device and the third device are associated with a user; the third device replaces the second device in the communication session; the second incoming audio stream replaces the first incoming audio stream; and processing the first incoming audio stream through the second buffer having the second buffer size that is based on the second audio block size comprises increasing the first buffer size of the first buffer to form the second buffer.
3. The method of claim 1, wherein the second device remains in the communication session with the first device and the third device, the method further comprising: obtaining, by the first device, a plurality of audio samples; sending, by the first device to the second device during the communication session, a first outgoing audio stream including the audio samples in first outgoing audio blocks having the first audio block size; and sending, by the first device to the third device during the communication session, a second outgoing audio stream including the audio samples in second outgoing audio blocks having the second audio block size.
4. The method of claim 3, wherein obtaining the plurality of audio samples comprises obtaining the plurality of audio samples at a first sampling rate that is faster than a second sampling rate at which the third device is capable of obtaining audio samples, and wherein the method further comprises: obtaining a time stamp corresponding to at least one of the second outgoing audio blocks; modifying the time stamp based on a ratio of the first sampling rate and the second sampling rate to generate a modified time stamp; and providing the modified time stamp, with the at least one of the second outgoing audio blocks, to the third device.
5. The method of claim 4, wherein the second device is capable of obtaining audio samples at the first sampling rate, and wherein the method further comprises: obtaining a time stamp corresponding to at least one of the first outgoing audio blocks; modifying the time stamp corresponding to the at least one of the first outgoing audio blocks based on the ratio of the first sampling rate and the second sampling rate to generate an additional modified time stamp; and providing the additional modified time stamp, with the at least one of the first outgoing audio blocks, to the second device.
6. The method of claim 4, further comprising: receiving a first incoming time stamp with the first incoming audio stream; modifying the first incoming time stamp based on the ratio of the first sampling rate and the second sampling rate to generate a modified first incoming time stamp; and processing the first incoming audio stream based at least in part on the modified first incoming time stamp.
7. The method of claim 6, further comprising: receiving a second incoming time stamp with the second incoming audio stream; and processing the second incoming audio stream based at least in part on the second incoming time stamp without modification to the second incoming time stamp.
8. The method of claim 1, wherein: processing the first incoming audio stream comprises generating a first audio output from the first device, the first audio output corresponding to avatar information received from the second device, and processing the second incoming audio stream comprises generating a second audio output from the first device, the second audio output corresponding to video information received from the third device.
9. The method of claim 1, further comprising, by the first device, opting to receive, in place of the first incoming audio stream including the incoming audio blocks having the first audio block size from the second device, a third incoming audio stream including incoming audio blocks having the second audio block size from the second device.
10. A non-transitory machine readable medium comprising instructions which, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving, by a first device from a second device during a communication session between the first device and the second device, a first incoming audio stream including incoming audio blocks having a first audio block size, wherein the first audio block size is a lowest audio block size that is available at the first device and the second device; processing the first incoming audio stream through a first buffer having a first buffer size that is based on the first audio block size; determining, by the first device, that a third device has joined the communication session, wherein a lowest audio block size that is available at the third device is larger than the lowest audio block size that is available at the first device and the second device; receiving, by the first device from the third device, a second incoming audio stream including incoming audio blocks having a second audio block size corresponding to the lowest audio block size that is available at the third device; and processing the first incoming audio stream through a second buffer having a second buffer size that is based on the second audio block size.
11. The non-transitory machine readable medium of claim 10, wherein: the second device and the third device are associated with a user; the third device replaces the second device in the communication session; the second incoming audio stream replaces the first incoming audio stream; and processing the first incoming audio stream through the second buffer having the second buffer size that is based on the second audio block size comprises increasing the first buffer size of the first buffer to form the second buffer.
12. The non-transitory machine readable medium of claim 10, wherein the second device remains in the communication session with the first device and the third device, the operations further comprising: obtaining, by the first device, a plurality of audio samples; sending, by the first device to the second device during the communication session, a first outgoing audio stream including the audio samples in first outgoing audio blocks having the first audio block size; and sending, by the first device to the third device during the communication session, a second outgoing audio stream including the audio samples in second outgoing audio blocks having the second audio block size.
13. The non-transitory machine readable medium of claim 12, wherein obtaining the plurality of audio samples comprises obtaining the plurality of audio samples at a first sampling rate that is faster than a second sampling rate at which the third device is capable of obtaining audio samples, and wherein the operations further comprise: obtaining a time stamp corresponding to at least one of the second outgoing audio blocks; modifying the time stamp based on a ratio of the first sampling rate and the second sampling rate to generate a modified time stamp; and providing the modified time stamp, with the at least one of the second outgoing audio blocks, to the third device.
14. The non-transitory machine readable medium of claim 13, wherein the second device is capable of obtaining audio samples at the first sampling rate, and wherein the operations further comprise: obtaining a time stamp corresponding to at least one of the first outgoing audio blocks; modifying the time stamp corresponding to the at least one of the first outgoing audio blocks based on the ratio of the first sampling rate and the second sampling rate to generate an additional modified time stamp; and providing the additional modified time stamp, with the at least one of the first outgoing audio blocks, to the second device.
15. The non-transitory machine readable medium of claim 13, the operations further comprising: receiving a first incoming time stamp with the first incoming audio stream; modifying the first incoming time stamp based on the ratio of the first sampling rate and the second sampling rate to generate a modified first incoming time stamp; and processing the first incoming audio stream based at least in part on the modified first incoming time stamp.
16. The non-transitory machine readable medium of claim 15, the operations further comprising: receiving a second incoming time stamp with the second incoming audio stream; and processing the second incoming audio stream based at least in part on the second incoming time stamp without modification to the second incoming time stamp.
17. A first device comprising: a memory; and at least one processor configured to: receive, from a second device during a communication session between the first device and the second device, a first incoming audio stream including incoming audio blocks having a first audio block size, wherein the first audio block size is a lowest audio block size that is available at the first device and the second device; process the first incoming audio stream through a first buffer having a first buffer size that is based on the first audio block size; determine that a third device has joined the communication session, wherein a lowest audio block size that is available at the third device is larger than the lowest audio block size that is available at the first device and the second device; receive, from the third device, a second incoming audio stream including incoming audio blocks having a second audio block size corresponding to the lowest audio block size that is available at the third device; and process the first incoming audio stream through a second buffer having a second buffer size that is based on the second audio block size.
18. The first device of claim 17, wherein the at least one processor is further configured to: obtain a plurality of audio samples; send, to the second device during the communication session, a first outgoing audio stream including the audio samples in first outgoing audio blocks having the first audio block size; and send, to the third device during the communication session, a second outgoing audio stream including the audio samples in second outgoing audio blocks having the second audio block size.
19. The first device of claim 18, wherein the at least one processor is configured to obtain the plurality of audio samples at a first sampling rate that is faster than a second sampling rate at which the third device is capable of obtaining audio samples, and is further configured to: obtain a time stamp corresponding to at least one of the second outgoing audio blocks; modify the time stamp based on a ratio of the first sampling rate and the second sampling rate to generate a modified time stamp; and provide the modified time stamp, with the at least one of the second outgoing audio blocks, to the third device.
20. The first device of claim 19, wherein the second device is capable of obtaining audio samples at the first sampling rate, and wherein the at least one processor is further configured to: obtain a time stamp corresponding to at least one of the first outgoing audio blocks; modify the time stamp corresponding to the at least one of the first outgoing audio blocks based on the ratio of the first sampling rate and the second sampling rate to generate an additional modified time stamp; and provide the additional modified time stamp, with the at least one of the first outgoing audio blocks, to the second device.
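
The operations recited in claims 1, 3, and 4 can be pictured with a minimal Python sketch. This is an editorial illustration only, not part of the claims and not the patented implementation. All function names are hypothetical, as are the numeric values (block sizes of 240 and 480 samples, rates of 48 kHz and 24 kHz). Two further assumptions are made: the buffer holds two blocks (the claims only say the buffer size is "based on" the block size), and a time stamp is a sample count that is scaled by the ratio of the two sampling rates.

```python
from fractions import Fraction


def buffer_size_for(block_size: int, blocks_held: int = 2) -> int:
    """Return a buffer size (in samples) derived from an audio block size.

    Holding two blocks is an assumption made for this sketch.
    """
    return block_size * blocks_held


def reblock(samples: list, block_size: int) -> list:
    """Split one run of captured samples into consecutive blocks."""
    return [samples[i:i + block_size] for i in range(0, len(samples), block_size)]


def rescale_timestamp(timestamp: int, from_rate_hz: int, to_rate_hz: int) -> int:
    """Scale a sample-count time stamp by the ratio of two sampling rates.

    Treating the time stamp as a sample count is an assumption.
    """
    return int(Fraction(timestamp) * Fraction(to_rate_hz, from_rate_hz))


# Hypothetical values: the same captured samples are packed twice,
# once per outgoing stream, using each receiver's block size.
samples = list(range(960))
small_blocks = reblock(samples, 240)  # stream for the second device
large_blocks = reblock(samples, 480)  # stream for the third device

# A larger incoming block size leads to a larger receive buffer.
first_buffer = buffer_size_for(240)
second_buffer = buffer_size_for(480)

# A time stamp on a 48 kHz clock expressed on a 24 kHz clock.
modified = rescale_timestamp(96_000, 48_000, 24_000)
```

Using exact fractions for the rate ratio avoids floating-point drift when time stamps grow large; a real system might use a different representation.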

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/470,956, entitled “Low Latency Audio for Immersive Group Communication Sessions”, filed on Jun. 4, 2023, and U.S. Provisional Patent Application No. 63/457,798, entitled “Low Latency Audio for Immersive Group Communication Sessions”, filed on Apr. 7, 2023, the disclosure of each of which is hereby incorporated herein in its entirety.

TECHNICAL FIELD

The present description relates generally to electronic communications, including, for example, low latency audio for immersive group communication sessions.

BACKGROUND

Audio content is often transmitted between electronic devices during calls or video conferences between the electronic devices.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain features of the subject technology are set forth in the appended claims. However, for purpose of explanation, several embodiments of the subject technology are set forth in the following figures.

FIG. 1 illustrates an example system architecture including various electronic devices that may implement the subject system in accordance with one or more implementations.

FIG. 2 illustrates a block diagram of example features of an electronic device in accordance with one or more implementations.

FIG. 3 illustrates an example transmission of audio data including redundant audio data during a communication session in accordance with one or more implementations.

FIG. 4 illustrates an example packet of audio data in accordance with one or more implementations.

FIG. 5 illustrates an example process that may be performed for providing low latency audio for group communication sessions, in accordance with one or more implementations.

FIG. 6 illustrates an example of a group communication session including multiple devices providing multiple respective audio streams with multiple corresponding audio block sizes, in accordance with one or more implementations.

FIG. 7 illustrates an example of an electronic device participating in a group communication session and opting into a higher audio block size audio stream, in accordance with one or more implementations.

FIG. 8 is a diagram illustrating an electronic device participating in a group communication session and modifying a buffer size responsive to a change in an audio block size in an incoming audio stream, in accordance with one or more implementations.

FIG. 9 illustrates an example process that may be performed for providing group communication sessions for devices using various audio block sizes, in accordance with one or more implementations.

FIG. 10 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.

DETAILED DESCRIPTION

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and can be practiced using one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.

A physical environment refers to a physical world that people can sense and/or interact with without aid of electronic devices. The physical environment may include physical features such as a physical surface or a physical object. For example, the physical environment corresponds to a physical park that includes physical trees, physical buildings, and physical people. People can directly sense and/or interact with the physical environment such as through sight, touch, hearing, taste, and smell.

In contrast, an extended reality (XR) environment refers to a wholly or partially simulated environment that people sense and/or interact with via an electronic device. For example, the XR environment may include augmented reality (AR) content, mixed reality (MR) content, virtual reality (VR) content, and/or the like. With an XR system, a subset of a person's physical motions, or representations thereof, are tracked, and, in response, one or more characteristics of one or more virtual objects simulated in the XR environment are adjusted in a manner that comports with at least one law of physics. As one example, the XR system may detect head movement and, in response, adjust graphical content and an acoustic field presented to the person in a manner similar to how such views and sounds would change in a physical environment. As another example, the XR system may detect movement of the electronic device presenting the XR environmen