US-12627856-B1 - Enhancing video content with attention-based complementary virtual sound objects
Abstract
The present disclosure is directed to systems and methods for incorporating additional sound cues based on an attention level. In some embodiments, the systems and methods generate for output a content item comprising visual and audio components. In some embodiments, the systems and methods determine an attention level respective to the content item. In some embodiments, the systems and methods identify an object depicted in the content item. In some embodiments, the systems and methods, based on determining the attention level is below a threshold, and based on determining the audio component lacks sound attributable to the object, generate an additional audio component for the object. In some embodiments, the systems and methods generate for output the additional audio component with the audio component. In some embodiments, the systems and methods, based on determining the attention level is above the threshold, modify the output to cease playing the additional audio component.
Inventors
- Zhiyun Li
- Ning Xu
Assignees
- ADEIA GUIDES INC.
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-12-20
Claims (20)
- 1. A method comprising: generating for output, on a device comprising a display, a content item, the content item comprising a visual component and an audio component; determining, using at least one sensor, an attention level of a user of the device respective to at least a portion of the visual component of the content item; identifying at least one object depicted in the visual component of the content item; based on determining that the attention level of the user at a first time is below a threshold attention level: based on determining that the audio component lacks sound attributable to the at least one object, generating at least one additional audio component for the at least one object; and generating for output the at least one additional audio component simultaneously with the audio component; and based on determining that the attention level of the user is above the threshold attention level at a second time after the first time, modifying the output of the content item to cease playing the at least one additional audio component.
- 2. The method of claim 1, wherein the generating for output the at least one additional audio component is performed gradually creating at least one of fade-in effect or fade-out effect.
- 3. The method of claim 1, further comprising monitoring the attention level of the user while generating for output the content item using one of eye tracking or activity recognition.
- 4. The method of claim 1, further comprising selecting, from a list of additional audio components, the at least one additional audio component based on the determining that the audio component lacks sound attributable to the at least one object, wherein the selected at least one additional audio component is related to a context of the content item.
- 5. The method of claim 4, further comprising selecting, from the list of additional audio components, the at least one additional audio component based on a storyline of the content item.
- 6. The method of claim 1, wherein: the determining the attention level of the user of the device comprises determining that the attention level is above a threshold for a first region of the visual component, and that the attention level is below the threshold for a second region of the visual component; and the identifying the at least one object depicted in the visual component of the content item comprises identifying an object in the second region of the visual component.
- 7. The method of claim 1, further comprising based on determining that the attention level of the user at the first time is below the threshold attention level: analyzing the audio component of the content item to identify at least one background audio profile of the audio component; selecting at least one background audio profile based on an importance of the selected at least one background audio profile; and modifying a volume of sound associated with the selected at least one background audio profile in the audio component.
- 8. The method of claim 7, wherein the importance of the at least one background audio profile is based on at least a storyline of the content item, metadata of the content item, or scene analysis.
- 9. The method of claim 1, further comprising, based on determining that the attention level of the user at the first time is below the threshold attention level, reducing a bitrate of the visual component of the content item.
- 10. The method of claim 9, further comprising, based on determining that the attention level of the user at the first time is below the threshold attention level, modifying volume.
- 11. The method of claim 1, wherein determining that the audio component lacks sound attributable to the at least one object is based on metadata of the content item or analyzing the audio component of the content item.
- 12. The method of claim 1, wherein the generating for output at least one additional audio component simultaneously with the audio component comprises generating for output the at least one additional audio component at a first timepoint in the content item prior to a second timepoint in the content item, wherein the at least one object is depicted in the visual component of the content item at the second timepoint.
- 13. The method of claim 1, wherein the identifying at least one object depicted in the visual component of the video content item is performed using a computer vision algorithm.
- 14. The method of claim 1, wherein the generating at least one additional audio component for the at least one object is based on a storyline, a context, or metadata of the content item.
- 15. The method of claim 1, further comprising: determining a position of the at least one object; and wherein the generating for output the at least one additional audio component simultaneously with the audio component comprises generating for output the at least one additional audio component according to the determined position.
- 16. A system comprising: processing circuitry configured to: generate for output, on a device comprising a display, a content item, the content item comprising a visual component and an audio component; determine, using at least one sensor, an attention level of a user of the device respective to at least a portion of the visual component of the content item; identify at least one object depicted in the visual component of the content item; based on determining that the attention level of the user at a first time is below a threshold attention level: based on determining that the audio component lacks sound attributable to the at least one object, generate at least one additional audio component for the at least one object; and generate for output the at least one additional audio component simultaneously with the audio component; and based on determining that the attention level of the user is above the threshold attention level at a second time after the first time, modify the output of the content item to cease playing the at least one additional audio component.
- 17. The system of claim 16, wherein the generating for output the at least one additional audio component is performed gradually creating at least one of fade-in effect or fade-out effect.
- 18. The system of claim 16, the processing circuitry further configured to monitor the attention level of the user while generating for output the content item using one of eye tracking or activity recognition.
- 19. The system of claim 16, the processing circuitry further configured to select, from a list of additional audio components, the at least one additional audio component based on the determining that the audio component lacks sound attributable to the at least one object, wherein the selected at least one additional audio component is related to a context of the content item.
- 20. The system of claim 19, the processing circuitry further configured to select, from the list of additional audio components, the at least one additional audio component based on a storyline of the content item.
Description
BACKGROUND
The present disclosure is related to systems and techniques for enhancing video content with supplemental audio.
SUMMARY
The present disclosure relates to presenting additional audio for visual cues in a content item. In some embodiments, the described systems monitor the attention (e.g., by detecting interaction with another device) or gaze of a viewer (e.g., using a camera) to determine whether additional sounds would enhance display of the content item. The additional sounds may be generated or prerecorded sounds associated with objects in a scene of the content item. In some embodiments, playing the additional sounds provides context for settings or events. In some embodiments, the system analyzes a scene in a content item to determine which sounds will provide information important to the context or storyline of the content item.
In today's fast-paced world, viewers have access to continuous streams of information, ideas, and connections via many devices. With constant access to new information, viewers often watch video content or listen to audio content while also consuming other information, such as social media, via the same or another device. As a result, the attention of a viewer is often split between two or more applications or content streams. In such a scenario, a viewer gives the content on one device only partial attention while also consuming or interacting with content on another device (e.g., a phone) at the same time. In some circumstances, a viewer might not be looking at the screen on which the content stream is presented. For example, if the viewer places a smartphone playing content in his or her pocket, or if the viewer is working in another window that covers the playing content, the viewer cannot see the visual component of the content stream. Even in these scenarios, however, the viewer is often listening to the content's audio component.
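The monitor-and-threshold behavior described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions: the sensor signals (`gaze_on_screen`, `device_interaction`), the score weighting, and the threshold value are all hypothetical stand-ins, not values taken from the disclosure.

```python
from dataclasses import dataclass

# Illustrative threshold; the disclosure does not specify a value.
ATTENTION_THRESHOLD = 0.5


@dataclass
class PlaybackState:
    supplemental_audio_active: bool = False


def estimate_attention(gaze_on_screen: bool, device_interaction: bool) -> float:
    """Combine simple sensor signals into a 0..1 attention score.

    A real system might use eye tracking or activity recognition;
    this weighting is an assumption for the sketch.
    """
    score = 0.0
    if gaze_on_screen:
        score += 0.7
    if not device_interaction:
        score += 0.3
    return score


def update_playback(state: PlaybackState,
                    gaze_on_screen: bool,
                    device_interaction: bool) -> PlaybackState:
    """Start additional sound cues when attention drops below the
    threshold, and cease them when attention recovers."""
    attention = estimate_attention(gaze_on_screen, device_interaction)
    if attention < ATTENTION_THRESHOLD:
        state.supplemental_audio_active = True
    else:
        state.supplemental_audio_active = False
    return state
```

For instance, a viewer looking away while tapping on a phone scores below the threshold, activating the additional audio; looking back at the screen deactivates it again.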
Still, as a result of the limited attention level, the viewer will likely miss key details of the displayed content stream. In particular, the viewer is most likely to miss details that the content stream conveys through visual cues. This disconnect causes content streams to convey information ineffectively, only partially reaching the viewer. Further, as a result of this inefficiency, a viewer may also request to replay the content. This places additional stress on the system, which must spend limited resources, such as network bandwidth and computing power, to replay already presented content.
In one approach, a system tracks the gaze of the viewer to approximate an attention level and pauses the content if it detects that the viewer is looking away. Such an approach, however, can be disruptive and may unexpectedly stop output of the content. For example, a system may pause the content when a viewer looks away despite the viewer still being engaged and listening. In another approach, a system provides textual descriptions, composed in advance, of visual components of a scene. These descriptions provide information to members of the audience who might benefit from additional information, such as the visually impaired. In many scenarios, the system reads the textual descriptions aloud so that audience members may receive this information audibly rather than visually. While these descriptions can, in certain situations, provide useful details to viewers, the spoken details may also disrupt the natural flow of a content presentation, obscure dialogue and sounds, and distract audience members. This approach further requires considerable analysis and output resources, which can strain the system.
The present disclosure describes systems and techniques for unobtrusively providing supplemental information in a content presentation, where the supplemental information conveys to the audience information that previously was available only through the visual component of the content. For example, the system may add sounds of footsteps to indicate, through sound rather than visuals, that a character has entered a room. In some embodiments, the system adds this supplemental information upon detecting that an audience attention level is low. Unlike some approaches, the described techniques continue playing the content. In some embodiments, the system plays the supplemental information with the original, or near-original, content presentation. This approach limits disruptions to the content stream while effectively conveying information to distracted viewers. In some embodiments, the systems and techniques monitor (e.g., using a sensor) the attention level of one or more viewers, and supplement the content stream with the additional sounds when one or more attention levels are below a threshold. In some embodiments, the systems and techniques use computer vision algorithms to analyze the visual characteristics of a scene to determine which valuable information a viewer is likely to miss when not giving full attention. For example, a video scene understanding algorithm
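The supplementation logic described in this summary can be sketched as follows. This is a minimal illustration under stated assumptions: it assumes a computer vision pipeline supplies labels of depicted objects and that audio analysis or content metadata supplies labels of already audible sources; the function names and the linear fade envelope are illustrative, not taken from the disclosure.

```python
def objects_lacking_sound(depicted_objects, audible_sources):
    """Return objects shown in the visual component that have no sound
    attributable to them in the audio component.

    `depicted_objects` is assumed to come from upstream computer vision;
    `audible_sources` from audio analysis or content metadata. The order
    of the visual detections is preserved.
    """
    audible = set(audible_sources)
    return [obj for obj in depicted_objects if obj not in audible]


def fade_in_gain(elapsed_s: float, fade_duration_s: float = 1.5) -> float:
    """Linear fade-in envelope in [0, 1], so an additional audio
    component is introduced gradually rather than abruptly."""
    if fade_duration_s <= 0:
        return 1.0
    return min(1.0, max(0.0, elapsed_s / fade_duration_s))
```

For example, if a scene depicts a dog, rain, and a car but only the car is audible, `objects_lacking_sound(["dog", "rain", "car"], ["car"])` yields `["dog", "rain"]` as candidates for additional sound cues, each ramped in via `fade_in_gain`.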