
EP-4742059-A2 - SCENE ANNOTATION USING MACHINE LEARNING


Abstract

A system enhances existing audio-visual content with audio describing the setting of the visual content. A scene annotation module classifies scene elements from an image frame received from a host system and generates a caption describing the scene elements. A text-to-speech synthesis module may then convert the caption to synthesized speech data describing the scene elements within the image frame.
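
The abstract outlines a three-step pipeline: classify scene elements in a frame, compose a caption from them, and synthesize speech from the caption. The Python sketch below is only an illustration of that flow; every function name and interface here is a hypothetical stand-in (the patent does not specify these), and a real system would substitute its trained scene classifier, captioning model, and text-to-speech backend.

    from typing import List

    def classify_scene_elements(image_frame) -> List[str]:
        # Hypothetical stand-in for the scene annotation module's classifier,
        # which labels elements in an image frame received from the host system.
        return ["castle", "drawbridge", "night sky"]

    def generate_caption(elements: List[str]) -> str:
        # Hypothetical stand-in for caption generation; a trained model would
        # produce fluent text rather than a joined list.
        return "The scene shows " + ", ".join(elements) + "."

    def synthesize_speech(caption: str) -> bytes:
        # Hypothetical stand-in for the text-to-speech synthesis module; returns
        # audio data the host system can play alongside the visual content.
        return b""  # plug in any TTS backend here

    def annotate_frame(image_frame) -> bytes:
        elements = classify_scene_elements(image_frame)
        caption = generate_caption(elements)
        return synthesize_speech(caption)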

Inventors

  • ADAMS, Justice
  • JATI, Arindam
  • OMOTE, Masanori
  • ZHENG, Jian
  • KRISHNAMURTHY, Sudha

Assignees

  • Sony Interactive Entertainment Inc.

Dates

Publication Date
2026-05-13
Application Date
2019-09-30

Claims (15)

  1. A method comprising: receiving a video stream that comprises multiple image frames; providing the multiple image frames to a machine learning model that is trained to output data identifying one or more actions occurring within input image frames; receiving, from the machine learning model, data indicating one or more particular actions that are identified as occurring within the multiple image frames; modifying, based at least on the one or more actions that are identified as occurring within the multiple image frames, one or more of the multiple image frames to include a representation of the one or more particular actions that are identified as occurring within the multiple image frames; and providing the modified image frames for output.
  2. The method of claim 1, wherein providing the multiple image frames to the machine learning model that is trained to output data identifying one or more actions occurring within input image frames comprises: providing the multiple image frames to a model comprising multiple, separately trained, neural networks configured to process the multiple image frames in series.
  3. The method of claim 2, wherein receiving, from the machine learning model, the data indicating the one or more particular actions that are identified as occurring within the multiple image frames, comprises: receiving, as output from a final neural network of the multiple neural networks in series, a text description of the one or more particular actions identified as occurring within the multiple image frames.
  4. The method of claim 1, comprising: detecting, based on the multiple image frames, movement; determining that the detected movement satisfies a threshold level of movement; and in response to determining that the detected movement satisfies the threshold level of movement, providing the multiple image frames to the machine learning model that is trained to output data identifying one or more actions occurring within input image frames.
  5. The method of claim 4, wherein detecting movement based on the multiple image frames comprises: providing the multiple image frames to a motion detection encoder trained to detect movement; and receiving, as a detection of movement within the multiple image frames, output from the motion detection encoder.
  6. The method of claim 1, comprising: providing the multiple image frames to a first neural network of the machine learning model trained to output a feature vector for each image frame of input image frames.
  7. The method of claim 6, comprising: receiving one or more feature vectors for each image frame of the multiple image frames as first output data from the first neural network; and providing the first output data comprising the feature vectors to a second neural network trained to output feature data representing a window of time that comprises multiple image frames.
  8. The method of claim 7, comprising: receiving feature data representing the window of time that comprises the multiple image frames of the video stream as second output data from the second neural network; and providing the second output data to a third neural network trained to classify input feature data according to action occurring in associated image frames, wherein receiving the data indicating the one or more particular actions that are identified as occurring within the multiple image frames comprises: receiving classification output from the third neural network processing the provided second output data.
  9. The method of claim 8, wherein the third neural network is a recurrent neural network (RNN) and both the first neural network and the second neural network are convolutional neural networks (CNNs).
  10. The method of claim 1, wherein modifying the one or more of the multiple image frames to include the representation of the one or more particular actions that are identified as occurring within the multiple image frames comprises: modifying the one or more of the multiple image frames to include a textual description of the one or more particular actions that are identified as occurring within the multiple image frames.
  11. One or more non-transitory computer storage media encoded with computer program instructions that when executed by one or more computers cause the one or more computers to perform the method of any preceding claim.
  12. A system comprising: one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving a video stream that comprises multiple image frames; providing the multiple image frames to a machine learning model that is trained to output data identifying one or more actions occurring within input image frames; receiving, from the machine learning model, data indicating one or more particular actions that are identified as occurring within the multiple image frames; modifying, based at least on the one or more actions that are identified as occurring within the multiple image frames, one or more of the multiple image frames to include a representation of the one or more particular actions that are identified as occurring within the multiple image frames; and providing the modified image frames for output.
  13. The system of claim 12, wherein providing the multiple image frames to the machine learning model that is trained to output data identifying one or more actions occurring within input image frames comprises: providing the multiple image frames to a model comprising multiple, separately trained, neural networks configured to process the multiple image frames in series.
  14. The system of claim 13, wherein there are three separately trained neural networks, and the third neural network is a recurrent neural network (RNN) and both the first neural network and the second neural network are convolutional neural networks (CNNs).
  15. The system of any one of claims 12-14, wherein modifying the one or more of the multiple image frames to include the representation of the one or more particular actions that are identified as occurring within the multiple image frames comprises: modifying the one or more of the multiple image frames to include a textual description of the one or more particular actions that are identified as occurring within the multiple image frames.
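
Claims 6 through 9 describe a model built from three separately trained networks run in series: a first (convolutional) network that produces a feature vector per frame, a second (convolutional) network that aggregates those vectors over a window of time, and a third, recurrent network that classifies the action. The PyTorch sketch below is one plausible reading of that structure under assumed layer sizes, an assumed action vocabulary, and assumed tensor shapes; none of the dimensions, layer choices, or names come from the patent.

    import torch
    import torch.nn as nn

    class FrameEncoder(nn.Module):
        """First network: outputs a feature vector for each input image frame (claim 6)."""
        def __init__(self, feature_dim: int = 256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, feature_dim)

        def forward(self, frames):            # frames: (T, 3, H, W)
            x = self.conv(frames).flatten(1)  # (T, 64)
            return self.fc(x)                 # (T, feature_dim)

    class WindowEncoder(nn.Module):
        """Second network: aggregates per-frame features over a window of time (claim 7)."""
        def __init__(self, feature_dim: int = 256, window_dim: int = 256):
            super().__init__()
            self.temporal = nn.Conv1d(feature_dim, window_dim, kernel_size=3, padding=1)

        def forward(self, frame_features):          # (T, feature_dim)
            x = frame_features.t().unsqueeze(0)     # (1, feature_dim, T)
            return self.temporal(x).squeeze(0).t()  # (T, window_dim)

    class ActionClassifier(nn.Module):
        """Third network: a recurrent classifier over the windowed features (claims 8-9)."""
        def __init__(self, window_dim: int = 256, num_actions: int = 10):
            super().__init__()
            self.rnn = nn.GRU(window_dim, 128, batch_first=True)
            self.head = nn.Linear(128, num_actions)

        def forward(self, window_features):                # (T, window_dim)
            _, h = self.rnn(window_features.unsqueeze(0))  # h: (1, 1, 128)
            return self.head(h[-1])                        # (1, num_actions)

    def describe_actions(frames, encoder, window_enc, classifier, action_names):
        """Runs the three networks in series and returns a text label (claim 3)."""
        feats = encoder(frames)
        windowed = window_enc(feats)
        logits = classifier(windowed)
        return action_names[int(logits.argmax(dim=-1))]

    # Illustrative usage: 16 RGB frames and an assumed four-action vocabulary.
    frames = torch.randn(16, 3, 224, 224)
    label = describe_actions(frames, FrameEncoder(), WindowEncoder(),
                             ActionClassifier(num_actions=4),
                             ["idle", "running", "jumping", "fighting"])

Claims 4 and 5 add a gating step in front of this model: frames are only submitted when a motion-detection encoder reports movement above a threshold. Claim 10 then overlays the returned text description onto the output frames.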

Description

FIELD OF THE DISCLOSURE

The present disclosure relates to the field of audio-visual media enhancement, specifically the addition of content to existing audio-visual media to improve accessibility for impaired persons.

BACKGROUND OF THE INVENTION

Not all audio-visual media, e.g., videogames, are accessible to disabled persons. While it is increasingly common for videogames to have captioned voice acting for the hearing impaired, other impairments, such as vision impairments, receive no accommodation. Additionally, older movies and games did not include captioning. The combined interactive audio-visual nature of videogames means that simply going through scenes and describing them in advance is impossible. Many videogames today include open-world components where the user has a multitude of options, meaning that no two action sequences in the game are identical. Additionally, customizing color palettes for the colorblind is impossible for many videogames and movies due to the sheer number of scenes and colors within each scene. Finally, many existing videogames and movies lack accommodations for disabled people, and adding such accommodations is time consuming and labor intensive. It is within this context that embodiments of the present invention arise.

BRIEF DESCRIPTION OF THE DRAWINGS

The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic diagram of an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2A is a simplified node diagram of a recurrent neural network for use in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2B is a simplified node diagram of an unfolded recurrent neural network for use in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2C is a simplified diagram of a convolutional neural network for use in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 2D is a block diagram of a method for training a neural network in an On-Demand Accessibility System according to aspects of the present disclosure.
FIG. 3 is a block diagram showing the process of operation of the Action Description component system according to aspects of the present disclosure.
FIG. 4 is a diagram that depicts an image frame with tagged scene elements according to aspects of the present disclosure.
FIG. 5 is a block diagram of the training method for the Scene Annotation component system encoder-decoder according to aspects of the present disclosure.
FIG. 6 is a block diagram showing the process of operation for the Color Accommodation component system according to aspects of the present disclosure.
FIG. 7 is a block diagram depicting the training of the Graphical Style Modification component system according to aspects of the present disclosure.
FIG. 8 is a block diagram showing the process of operation of the Acoustic Effect Annotation component system according to aspects of the present disclosure.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

Although the following detailed description contains many specific details for the purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to the following details are within the scope of the invention.
Accordingly, examples of embodiments of the invention described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed invention. While numerous specific details are set forth in order to provide a thorough understanding of embodiments of the invention, those skilled in the art will understand that other embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure aspects of the present disclosure. Some portions of the description herein are presented in terms of algorithms and symbolic representations of operations on data bits or binary digital signals within a computer memory. These algorithmic descriptions and representations may be the techniques used by those skilled in the data processing arts to convey the substance of their work to others skilled in the art. An algorithm, as used herein, is a self-consistent sequence of actions or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. Unless specifically stated otherwise or as apparent from the following discussion, it