
EP-3874737-B1 - SCENE ANNOTATION USING MACHINE LEARNING


Inventors

  • ADAMS, Justice
  • JATI, Arindam
  • OMOTE, Masanori
  • ZHENG, Jian
  • KRISHNAMURTHY, Sudha

Dates

Publication Date
2026-05-13
Application Date
2019-09-30

Claims (9)

  1. A system (100) for enhancing the accessibility of Audio Visual content, the system comprising: a scene annotation module (120) configured to classify scene elements from an image frame received from a host system (102) and generate a caption describing the scene elements; wherein the scene annotation module includes: a first neural network configured to generate a feature vector from the image frame, and a second neural network configured to generate a caption describing elements within the image frame from the feature vector; an acoustic effect annotation module (150) configured to classify one or more primary, most important, acoustic effects occurring within an audio segment corresponding to the image frame and generate captions for the primary acoustic effects; wherein the acoustic effect annotation module includes: a first neural network trained to predict which of the sounds occurring in the audio segment are the most important, wherein one or more of the most important sounds are selected as the primary acoustic effects, and a second neural network trained to classify the primary acoustic effects; and a controller (101) coupled to the host system, the scene annotation module, and the acoustic effect annotation module, wherein the controller is configured to synchronize the output of the scene annotation module with outputs of one or more neural network modules, wherein the one or more neural network modules include the acoustic effect annotation module; wherein the controller is configured to activate the scene annotation module in response to an input from a user; and wherein the controller is configured to combine the captions generated by the scene annotation module and the acoustic effect annotation module with the image frame for presentation to a user on an output device.
  2. The system of claim 1, wherein the caption describing elements within the image frame is a sentence predicted by the second neural network of the scene annotation module.
  3. The system of claim 1, further comprising a text to speech synthesis module coupled to the scene annotation module, wherein the text to speech synthesis module is configured to convert the caption to synthesized speech data describing the scene elements within the image frame.
  4. The system of claim 1, wherein the image frame data is video game frame data.
  5. A method for enhancing the accessibility of Audio Visual content, comprising: classifying scene elements from an image frame received from a host system with a scene annotation module and generating a caption describing the scene elements with the scene annotation module; wherein the scene annotation module includes: a first neural network configured to generate a feature vector from the image frame, and a second neural network configured to generate a caption describing elements within the image frame from the feature vector; classifying one or more primary, most important, acoustic effects occurring within an audio segment corresponding to the image frame with an acoustic effect annotation module (150), and generating captions for the primary acoustic effects with the acoustic effect annotation module; wherein the acoustic effect annotation module includes: a first neural network trained to predict which of the sounds occurring in the audio segment are the most important, wherein one or more of the most important sounds are selected as the primary acoustic effects, and a second neural network trained to classify the primary acoustic effects; synchronizing the output of the scene annotation module with outputs of one or more neural network modules, wherein the one or more neural network modules include the acoustic effect annotation module, with a controller (101); wherein the controller is coupled to the host system, the scene annotation module, and the acoustic effect annotation module; activating the scene annotation module in response to an input from a user with the controller; and combining, with the controller, the captions generated by the scene annotation module and the acoustic effect annotation module with the image frame for presentation to a user on an output device.
  6. The method of claim 5, wherein the caption describing elements within the image frame is a sentence predicted by the second neural network of the scene annotation module.
  7. The method of claim 5, further comprising converting the caption, generated by the scene annotation module, to synthesized speech data describing the scene elements within the image frame with a speech synthesis module coupled to the scene annotation module.
  8. The method of claim 5, wherein the image frame data is video game frame data.
  9. A non-transitory computer-readable medium having computer-readable instructions embodied therein, the computer-readable instructions being configured, when executed, to implement the method of claim 5.
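For illustration only, and not as part of the claims, the two-stage scene annotation module recited in claims 1 and 5 (a first neural network that produces a feature vector from an image frame, and a second neural network that decodes a caption from that vector) could be sketched as follows. This is a minimal sketch assuming PyTorch, a ResNet-18 backbone, and an LSTM decoder; the class name, dimensions, and vocabulary handling are all hypothetical.

```python
# Minimal sketch of the claimed two-stage captioner; all names and sizes
# are illustrative assumptions, not the disclosed implementation.
import torch
import torch.nn as nn
import torchvision.models as models

class SceneAnnotator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # First neural network: generates a feature vector from the image frame.
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.project = nn.Linear(512, embed_dim)
        # Second neural network: generates a caption from the feature vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame, caption_tokens):
        # frame: (B, 3, H, W); caption_tokens: (B, T) ground-truth caption prefix.
        feat = self.project(self.encoder(frame).flatten(1))             # (B, embed_dim)
        inputs = torch.cat([feat.unsqueeze(1), self.embed(caption_tokens)], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)  # (B, T+1, vocab_size) per-step word logits
```

At inference time the decoder would be run autoregressively, feeding each predicted word back in until an end-of-sentence token is produced, yielding the sentence-form caption of claims 2 and 6.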

Description

FIELD OF THE DISCLOSURE

The present disclosure relates to the field of audio-visual media enhancement, specifically the addition of content to existing audio-visual media to improve accessibility for impaired persons.

BACKGROUND OF THE INVENTION

Not all audio-visual media, e.g., videogames, are accessible to disabled persons. While it is increasingly common for videogames to have captioned voice acting for the hearing impaired, other impairments, such as vision impairments, receive no accommodation. Additionally, older movies and games did not include captioning. The combined interactive audio-visual nature of videogames means that simply going through scenes and describing them is impossible. Many videogames today include open-world components in which the user has a multitude of options, meaning that no two action sequences in the game are identical. Additionally, customizing color palettes for the colorblind is impossible for many video games and movies due to the sheer number of scenes and colors within each scene. Finally, many videogames and movies already exist that do not have accommodations for disabled people, and adding such accommodations is time-consuming and labor-intensive.

Previously proposed arrangements are disclosed in US 2017/200065 A1, US 2014/114643 A1, US 2007/011012 A1, and US 2017/132821 A1. It is within this context that embodiments of the present invention arise.

BRIEF DESCRIPTION OF THE DRAWINGS

The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic diagram of an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2A is a simplified node diagram of a recurrent neural network for use in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2B is a simplified node diagram of an unfolded recurrent neural network for use in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2C is a simplified diagram of a convolutional neural network for use in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2D is a block diagram of a method for training a neural network in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 3 is a block diagram showing the process of operation of the Action Description component system according to aspects of the present disclosure.
FIG. 4 is a diagram that depicts an image frame with tagged scene elements according to aspects of the present disclosure.
FIG. 5 is a block diagram of the training method for the Scene Annotation component system encoder-decoder according to aspects of the present disclosure.
FIG. 6 is a block diagram showing the process of operation for the Color Accommodation component system according to aspects of the present disclosure.
FIG. 7 is a block diagram depicting the training of the Graphical Style Modification component system according to aspects of the present disclosure.
FIG. 8 is a block diagram showing the process of operation of the Acoustic Effect Annotation component system according to aspects of the present disclosure.
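FIG. 8 and the claims describe the acoustic effect annotation module as two networks: one trained to predict which sounds in an audio segment are most important, and one trained to classify the selected primary effects. A minimal sketch of that split, assuming per-event spectral features; the names, feature shapes, and class counts are all hypothetical, not the disclosed implementation.

```python
# Minimal sketch of the two-network acoustic effect annotator; feature
# shapes, class counts, and names are illustrative assumptions.
import torch
import torch.nn as nn

class AcousticEffectAnnotator(nn.Module):
    def __init__(self, n_feats=64, n_classes=50, top_k=2):
        super().__init__()
        self.top_k = top_k  # how many "primary" acoustic effects to keep
        # First neural network: scores each candidate sound by importance.
        self.importance = nn.Sequential(
            nn.Linear(n_feats, 128), nn.ReLU(), nn.Linear(128, 1))
        # Second neural network: classifies the selected primary effects.
        self.classifier = nn.Sequential(
            nn.Linear(n_feats, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, events):
        # events: (B, n_events, n_feats) features for sounds in one audio segment.
        scores = self.importance(events).squeeze(-1)       # (B, n_events)
        idx = scores.topk(self.top_k, dim=1).indices       # most important sounds
        primary = torch.gather(
            events, 1, idx.unsqueeze(-1).expand(-1, -1, events.size(-1)))
        return self.classifier(primary)                    # (B, top_k, n_classes)
```

The predicted class labels would then be rendered as short captions and handed to the controller, which synchronizes them with the scene annotation output for the corresponding image frame.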
DESCRIPTION OF THE SPECIFIC EMBODIMENTS

Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the invention, which is defined by the appended claims. Accordingly, examples of embodiments of the invention described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed invention. While numerous specific details are set forth in order to provide a thorough understanding of embodiments of the invention, those skilled in the art will understand that other embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure aspects of the present disclosure.

Some portions of the description herein are presented in terms of algorithms and symbolic representations of operations on data bits or binary digital signals within a computer memory. These algorithmic descriptions and representations may be the techniques used by those skilled in the data processing arts to convey the substance of their work to others skilled in the art. An algorithm, as used herein, is a self-consistent sequence of actions or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these