US-12620182-B2 - Method and device for presenting an audio and synthesized reality experience

US 12620182 B2

Abstract

In various implementations, methods of presenting an audio/SR experience are disclosed. In one embodiment, while playing an audio file in an environment, in response to determining that the respective temporal criterion and the respective environmental criterion of an SR content event are met, the SR content event is displayed in association with the environment. In one embodiment, SR content is obtained and displayed in association with an environment based on an audio file and a 3D point cloud of the environment. In one embodiment, SR content is obtained and displayed in association with an environment based on spoken words of a real sound of the environment.
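The first embodiment above, triggering an SR content event only when both its temporal criterion (a window in the audio playback) and its environmental criterion (something detected in the surroundings) are satisfied, can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation; the `SRContentEvent` fields, the example events, and the object labels are all invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class SRContentEvent:
    """Hypothetical SR content event keyed to an audio file.

    start_s/end_s form the temporal criterion (a playback window);
    required_object is the environmental criterion (an object that
    must be detected in the environment before the event displays).
    """
    content: str
    start_s: float
    end_s: float
    required_object: str

def events_to_display(events, playback_pos_s, detected_objects):
    """Return the events whose temporal and environmental criteria
    are both met at the current audio playback position."""
    return [
        e for e in events
        if e.start_s <= playback_pos_s <= e.end_s     # temporal criterion
        and e.required_object in detected_objects     # environmental criterion
    ]

events = [
    SRContentEvent("falling-leaves animation", 10.0, 20.0, "tree"),
    SRContentEvent("dancing robot", 15.0, 30.0, "table"),
]
# At 12 s into the audio, with only a tree detected in the room:
print([e.content for e in events_to_display(events, 12.0, {"tree"})])
```

In a real system the `detected_objects` set would come from scene understanding of the environment and the check would run on every playback tick, so events appear and disappear as the user moves and the audio advances.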

Inventors

  • Ian M. Richter

Assignees

  • APPLE INC.

Dates

Publication Date
2026-05-05
Application Date
2023-08-24

Claims (20)

  1. A method comprising: at a device including one or more processors, a non-transitory memory, a microphone, and a display: recording, via the microphone, a sound produced in an environment while displaying, on the display, a volumetric environment based on the environment; performing object detection to identify in the volumetric environment representations of physical objects in the environment; detecting, using the one or more processors, one or more spoken words in the sound; obtaining, based on the one or more spoken words, mixed reality (MR) content; and modifying the volumetric environment dynamically while the one or more spoken words are playing, including adding the MR content corresponding to the one or more spoken words and concurrently presenting supplemental content selected based on at least one of a tempo, a volume dynamic, or a frequency dynamic of audio data played in the volumetric environment, wherein the MR content is displayed on a portion of the display that is selected based on a location of one of the detected representations as indicated by the one or more spoken words.
  2. The method of claim 1, wherein obtaining, based on the one or more spoken words, the MR content includes detecting, in the one or more spoken words, a trigger word and obtaining the MR content based on the trigger word.
  3. The method of claim 2, wherein obtaining, based on the one or more spoken words, the MR content includes detecting, in the one or more spoken words, a modifier word associated with the trigger word and obtaining the MR content based on the modifier word.
  4. The method of claim 1, wherein obtaining, based on the one or more spoken words, the MR content includes selecting the MR content from a library of labeled MR content elements based on at least one of the one or more spoken words.
  5. The method of claim 1, further comprising: playing, via a speaker, an audio file associated with the MR content.
  6. The method of claim 1, wherein obtaining the MR content is further based on one or more spatial characteristics of the environment.
  7. The method of claim 1, wherein obtaining the MR content is based on an environmental class of the environment.
  8. The method of claim 1, wherein obtaining the MR content is based on an object of a particular shape detected in the environment.
  9. The method of claim 1, wherein obtaining the MR content is based on an object of a particular type detected in the environment.
  10. A device comprising: one or more processors; a non-transitory memory; a microphone; a display; and one or more programs stored in the non-transitory memory, which, when executed by the one or more processors, cause the device to: record, via the microphone, a sound produced in an environment while displaying, on the display, a volumetric environment based on the environment; perform object detection to identify in the volumetric environment representations of physical objects in the environment; detect, using the one or more processors, one or more spoken words in the sound; obtain, based on the one or more spoken words, mixed reality (MR) content; and modify the volumetric environment dynamically while the one or more spoken words are playing, including adding the MR content corresponding to the one or more spoken words and concurrently presenting supplemental content selected based on at least one of a tempo, a volume dynamic, or a frequency dynamic of audio data played in the volumetric environment, wherein the MR content is displayed on a portion of the display that is selected based on a location of one of the detected representations as indicated by the one or more spoken words.
  11. The device of claim 10, wherein obtaining, based on the one or more spoken words, the MR content includes detecting, in the one or more spoken words, a trigger word and obtaining the MR content based on the trigger word.
  12. The device of claim 11, wherein obtaining, based on the one or more spoken words, the MR content includes detecting, in the one or more spoken words, a modifier word associated with the trigger word and obtaining the MR content based on the modifier word.
  13. The device of claim 10, wherein obtaining, based on the one or more spoken words, the MR content includes selecting the MR content from a library of labeled MR content elements based on at least one of the one or more spoken words.
  14. The device of claim 10, wherein the one or more programs, when executed by the one or more processors, further cause the device to play, via a speaker, an audio file associated with the MR content.
  15. The device of claim 10, wherein obtaining the MR content is further based on one or more spatial characteristics of the environment.
  16. The device of claim 10, wherein obtaining the MR content is based on an environmental class of the environment.
  17. The device of claim 10, wherein obtaining the MR content is based on an object of a particular shape detected in the environment.
  18. The device of claim 10, wherein obtaining the MR content is based on an object of a particular type detected in the environment.
  19. A non-transitory memory storing one or more programs, which, when executed by one or more processors of a device with a microphone and a display, cause the device to: record, via the microphone, a sound produced in an environment while displaying, on the display, a volumetric environment based on the environment; perform object detection to identify in the volumetric environment representations of physical objects in the environment; detect, using the one or more processors, one or more spoken words in the sound; obtain, based on the one or more spoken words, mixed reality (MR) content; and modify the volumetric environment dynamically while the one or more spoken words are playing, including adding the MR content corresponding to the one or more spoken words and concurrently presenting supplemental content selected based on at least one of a tempo, a volume dynamic, or a frequency dynamic of audio data played in the volumetric environment, wherein the MR content is displayed on a portion of the display that is selected based on a location of one of the detected representations as indicated by the one or more spoken words.
  20. The non-transitory memory of claim 19, wherein obtaining, based on the one or more spoken words, the MR content includes detecting, in the one or more spoken words, a trigger word and obtaining the MR content based on the trigger word.
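The trigger-word and modifier-word mechanism of claims 2 through 4, detecting a trigger word in the transcribed speech, refining it with an associated modifier word, and selecting from a library of labeled MR content elements, can be sketched as follows. This is a hedged illustration only: the library contents, word lists, and file names are invented, and a real system would use actual speech recognition rather than pre-split word lists.

```python
# Hypothetical library of labeled MR content elements, keyed by
# (trigger word, modifier word); None means "no modifier".
MR_LIBRARY = {
    ("dragon", None): "generic-dragon.model",
    ("dragon", "red"): "red-dragon.model",
    ("castle", None): "castle.model",
}
TRIGGER_WORDS = {"dragon", "castle"}
MODIFIER_WORDS = {"red", "green"}

def select_mr_content(spoken_words):
    """Scan detected spoken words for a trigger word; use an
    immediately preceding modifier word, if any, to pick a more
    specific MR content element from the labeled library."""
    for i, word in enumerate(spoken_words):
        if word in TRIGGER_WORDS:
            modifier = (
                spoken_words[i - 1]
                if i > 0 and spoken_words[i - 1] in MODIFIER_WORDS
                else None
            )
            # Fall back to the unmodified element when no labeled
            # (trigger, modifier) pair exists in the library.
            return MR_LIBRARY.get((word, modifier),
                                  MR_LIBRARY.get((word, None)))
    return None  # no trigger word detected

print(select_mr_content("a huge red dragon appeared".split()))
```

Under the claims, the selected element would then be placed on a portion of the display chosen from the detected object representation that the spoken words indicate (e.g., "on the table"), which this sketch omits.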

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of and claims priority to U.S. patent application Ser. No. 17/053,676, filed on Nov. 6, 2020, which is a national phase entry of International patent application number PCT/US2019/034324, filed on May 29, 2019, which claims priority to U.S. patent application No. 62/677,904, filed on May 30, 2018, which are hereby incorporated by reference in their entirety.

TECHNICAL FIELD

The present disclosure generally relates to audio and synthesized reality experiences, and in particular, to systems, methods, and devices for presenting a synthesized reality experience to accompany audio.

BACKGROUND

A physical setting refers to a world that individuals can sense and/or with which individuals can interact without assistance of electronic systems. Physical settings (e.g., a physical forest) include physical elements (e.g., physical trees, physical structures, and physical animals). Individuals can directly interact with and/or sense the physical setting, such as through touch, sight, smell, hearing, and taste.

In contrast, a synthesized reality (SR) setting refers to an entirely or partly computer-created setting that individuals can sense and/or with which individuals can interact via an electronic system. In SR, a subset of an individual's movements is monitored, and, responsive thereto, one or more attributes of one or more virtual objects in the SR setting are changed in a manner that conforms with one or more physical laws. For example, an SR system may detect an individual walking a few paces forward and, responsive thereto, adjust graphics and audio presented to the individual in a manner similar to how such scenery and sounds would change in a physical setting. Modifications to attribute(s) of virtual object(s) in an SR setting also may be made responsive to representations of movement (e.g., audio instructions).
An individual may interact with and/or sense an SR object using any one of his senses, including touch, smell, sight, taste, and sound. For example, an individual may interact with and/or sense aural objects that create a multi-dimensional (e.g., three-dimensional) or spatial aural setting, and/or enable aural transparency. Multi-dimensional or spatial aural settings provide an individual with a perception of discrete aural sources in multi-dimensional space. Aural transparency selectively incorporates sounds from the physical setting, either with or without computer-created audio. In some SR settings, an individual may interact with and/or sense only aural objects.

One example of SR is virtual reality (VR). A VR setting refers to a simulated setting that is designed only to include computer-created sensory inputs for at least one of the senses. A VR setting includes multiple virtual objects with which an individual may interact and/or sense. An individual may interact with and/or sense virtual objects in the VR setting through a simulation of a subset of the individual's actions within the computer-created setting, and/or through a simulation of the individual or his presence within the computer-created setting.

Another example of SR is mixed reality (MR). An MR setting refers to a simulated setting that is designed to integrate computer-created sensory inputs (e.g., virtual objects) with sensory inputs from the physical setting, or a representation thereof. On a reality spectrum, a mixed reality setting is between, and does not include, a VR setting at one end and an entirely physical setting at the other end. In some MR settings, computer-created sensory inputs may adapt to changes in sensory inputs from the physical setting.
Also, some electronic systems for presenting MR settings may monitor orientation and/or location with respect to the physical setting to enable interaction between virtual objects and real objects (which are physical elements from the physical setting or representations thereof). For example, a system may monitor movements so that a virtual plant appears stationary with respect to a physical building.

One example of mixed reality is augmented reality (AR). An AR setting refers to a simulated setting in which at least one virtual object is superimposed over a physical setting, or a representation thereof. For example, an electronic system may have an opaque display and at least one imaging sensor for capturing images or video of the physical setting, which are representations of the physical setting. The system combines the images or video with virtual objects, and displays the combination on the opaque display. An individual, using the system, views the physical setting indirectly via the images or video of the physical setting, and observes the virtual objects superimposed over the physical setting. When a system uses image sensor(s) to capture images of the physical setting, and presents the AR setting on the opaque display using those images, the displayed images are called a video pass-through. Alter