US-20260128034-A1 - IMAGE DESCRIPTION GENERATION FOR SCREEN READERS
Abstract
In some implementations, a browser extension may receive a setting indicating a level of verbosity and may receive an image and a set of words associated with the image. The browser extension may identify a foreground of the image and a background of the image and may identify, within the foreground of the image, a set of objects. The browser extension may rank the set of objects based on one or more properties of the set of objects and the set of words and may select a subset of objects from the set of objects based on the setting and the ranking. Accordingly, the browser extension may generate descriptions of the selected subset of objects based on the setting and may input the generated descriptions to a text-to-speech algorithm.
Inventors
- Michael Mossoba
- Abdelkader M'Hamed Benkreira
- Noel Lyles
- Joshua Edwards
Assignees
- CAPITAL ONE SERVICES, LLC
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-29
Claims (20)
- 1. A system for image description generation, the system comprising: one or more memories; and one or more processors, communicatively coupled to the one or more memories, configured to: receive, from an input device, a setting indicating a level of verbosity; identify, within an image, a set of objects, wherein a set of words is associated with the image; rank the set of objects based on one or more properties of the set of objects and the set of words; select a subset of objects from the set of objects based on the setting and the ranking; and generate a description of the selected subset of objects based on the setting, wherein the one or more processors, to generate the description of the selected subset of objects, are configured to select the description, from a database, based on the level of verbosity, wherein the description satisfies a length threshold for the level of verbosity based on selection from the database and trimming by a natural language processing model.
- 2. The system of claim 1, wherein the one or more processors are further configured to: receive an indication of a webpage; and transmit, to a remote server associated with the webpage, a request for content indexed to the webpage, wherein the image and the set of words are received from the remote server in response to the request.
- 3. The system of claim 1, wherein the one or more processors, to receive the setting indicating the level of verbosity, are configured to: receive a voice command indicating the level of verbosity.
- 4. The system of claim 1, wherein the one or more processors are configured to: apply a computer vision model to identify the set of objects and a set of bounding boxes corresponding to the set of objects, wherein the set of objects are ranked based on the set of bounding boxes.
- 5. The system of claim 4, wherein the set of objects are ranked further based on sizes and locations of the set of bounding boxes.
- 6. The system of claim 1, wherein the one or more processors are configured to: rank the set of objects; and determine if one or more objects in the set of objects are mentioned in the set of words.
- 7. The system of claim 1, wherein the one or more processors are configured to: input the generated description to an application programming interface (API) associated with a text-to-speech algorithm and provided by an operating system.
- 8. A method of image description generation, comprising: receiving, from an input device, a setting indicating a level of verbosity; identifying, within an image, a set of objects, wherein a set of words is associated with the image; ranking the set of objects based on one or more properties of the set of objects and the set of words; selecting a subset of objects from the set of objects based on the setting and the ranking; and generating a description of the selected subset of objects based on the setting, wherein generating the description of the selected subset of objects comprises selecting the description, from a database, based on the level of verbosity, wherein the description satisfies a length threshold for the level of verbosity based on selection from the database and trimming by a natural language processing model.
- 9. The method of claim 8, further comprising: determining that a background of the image is more important than a foreground of the image based on the set of objects not being included in the set of words.
- 10. The method of claim 9, wherein determining that the background is more important than the foreground comprises: determining at least one object included in the foreground is included in the set of words at a distance from the image that satisfies a distance threshold, wherein the distance is a distance in characters within a source code.
- 11. The method of claim 9, wherein determining that the background is more important than the foreground comprises: determining that the background is included in the set of words.
- 12. The method of claim 9, wherein the background is included in the set of words at a distance from the image that satisfies a distance threshold.
- 13. The method of claim 8, wherein a length of a description of a background of the image is based on the setting.
- 14. The method of claim 8, further comprising: transmitting, to a remote server, a request for content, wherein the image and the set of words are received from the remote server in response to the request.
- 15. A non-transitory computer-readable medium storing a set of instructions for image description generation, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the device to: receive, from an input device, a setting indicating a level of verbosity; identify, within an image, a set of objects, wherein a set of words is associated with the image; rank the set of objects based on one or more properties of the set of objects and the set of words; select a subset of objects from the set of objects based on the setting and the ranking; and generate a description of the selected subset of objects based on the setting, wherein the one or more instructions, to cause the device to generate the description of the selected subset of objects, cause the device to select the description, from a database, based on the level of verbosity, wherein the description satisfies a length threshold for the level of verbosity based on selection from the database and trimming by a natural language processing model.
- 16. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, when executed by the one or more processors, further cause the device to: apply a machine learning model to rank the set of objects.
- 17. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, when executed by the one or more processors, further cause the device to: generate a narrative using a narrative model, wherein the narrative includes the description of the selected subset of objects, and wherein a plurality of connecting phrases of the narrative are selected pseudo-randomly.
- 18. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions further cause the device to: use a background mixture model to distinguish background pixels of the image from foreground pixels of the image based on a corresponding mixture of Gaussian functions representing each pixel.
- 19. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions, that cause the device to generate the description, cause the device to: select, using the natural language processing model, a length of the description, based on a voice command indicating the level of verbosity.
- 20. The non-transitory computer-readable medium of claim 15, wherein the one or more instructions further cause the device to: transmit, to a remote server, a request for content, wherein the image and the set of words are received from the remote server in response to the request.
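Claims 1 and 4-6 recite ranking detected objects by properties such as bounding-box size and whether an object's label is mentioned in the words associated with the image, then selecting a subset according to the verbosity setting. One possible reading of that ranking can be sketched as follows; the scoring formula, the mention bonus, the verbosity-to-count mapping, and all names below are illustrative assumptions, not the claimed implementation.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str
    box: tuple[int, int, int, int]  # (x, y, width, height) bounding box

# Hypothetical mapping from verbosity level to how many objects to describe.
VERBOSITY_TO_COUNT = {"low": 1, "medium": 3, "high": 5}

def rank_objects(objects, words, image_area):
    """Rank objects: larger bounding boxes score higher, and an object
    whose label appears in the image's associated words gets a bonus."""
    mentioned = {word.lower() for word in words}

    def score(obj):
        _, _, width, height = obj.box
        size_score = (width * height) / image_area      # fraction of image covered
        mention_bonus = 1.0 if obj.label.lower() in mentioned else 0.0
        return size_score + mention_bonus

    return sorted(objects, key=score, reverse=True)

def select_subset(ranked_objects, verbosity):
    """Keep only as many top-ranked objects as the verbosity setting allows."""
    return ranked_objects[: VERBOSITY_TO_COUNT[verbosity]]
```

Here a small object mentioned in the accompanying text outranks a large unmentioned one, because the mention bonus dominates the size term; a real implementation would tune those weights.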
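Claims 9-12 measure whether a term (a background or foreground object) is mentioned "near" the image, where nearness is a distance in characters within the page source. A rough sketch of that heuristic, assuming the source is plain HTML text and using a hypothetical 300-character threshold:

```python
def char_distance(source, img_tag, term):
    """Smallest character distance between any occurrence of `term` and
    the image tag in the page source; None if either is absent."""
    img_pos = source.find(img_tag)
    if img_pos == -1:
        return None
    src, needle = source.lower(), term.lower()
    best = None
    start = src.find(needle)
    while start != -1:
        distance = abs(start - img_pos)
        if best is None or distance < best:
            best = distance
        start = src.find(needle, start + 1)
    return best

def background_more_important(source, img_tag, background_term,
                              foreground_terms, threshold=300):
    """One reading of claims 9-12: the background wins when it is
    mentioned within the threshold distance of the image while no
    foreground object is."""
    bg = char_distance(source, img_tag, background_term)
    fg = [char_distance(source, img_tag, term) for term in foreground_terms]
    bg_near = bg is not None and bg <= threshold
    fg_near = any(d is not None and d <= threshold for d in fg)
    return bg_near and not fg_near
</test>
```

The threshold value and the exact combination rule are assumptions; the claims only require that the distance "satisfies a distance threshold".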
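Claim 17 recites combining the object descriptions into a narrative whose connecting phrases are selected pseudo-randomly. A minimal sketch (the phrase list is invented for illustration; a seeded generator makes the pseudo-random selection reproducible):

```python
import random

# Hypothetical connecting phrases; the patent does not enumerate them.
CONNECTING_PHRASES = ["Next to it,", "In the scene,", "Nearby,", "Also visible:"]

def build_narrative(descriptions, seed=None):
    """Join object descriptions into one narrative, inserting a
    pseudo-randomly chosen connecting phrase before each description
    after the first."""
    if not descriptions:
        return ""
    rng = random.Random(seed)  # seeded for reproducible pseudo-randomness
    parts = [descriptions[0]]
    for desc in descriptions[1:]:
        parts.append(f"{rng.choice(CONNECTING_PHRASES)} {desc}")
    return " ".join(parts)
```

The resulting narrative string would then be handed to the text-to-speech step (claim 7 suggests via an operating-system API).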
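Claim 18 recites distinguishing background from foreground pixels with a mixture of Gaussian functions per pixel, in the spirit of classic mixture-of-Gaussians background subtraction, which fits a small Gaussian mixture to each pixel over a frame history. As a deliberately simplified single-image stand-in, the sketch below clusters pixel intensities into two groups with a tiny two-means loop and treats the larger cluster as background; it substitutes clustering for the claimed probabilistic mixture and is illustrative only.

```python
def background_mask(pixels, iterations=10):
    """Split pixel intensities into two clusters (a two-means stand-in
    for a two-component Gaussian mixture) and return a boolean list
    where True marks background, taken to be the larger cluster."""
    c0, c1 = float(min(pixels)), float(max(pixels))  # initial cluster centres
    for _ in range(iterations):
        group0 = [p for p in pixels if abs(p - c0) <= abs(p - c1)]
        group1 = [p for p in pixels if abs(p - c0) > abs(p - c1)]
        if group0:
            c0 = sum(group0) / len(group0)
        if group1:
            c1 = sum(group1) / len(group1)
    assigned_to_c0 = [abs(p - c0) <= abs(p - c1) for p in pixels]
    background_is_c0 = sum(assigned_to_c0) >= len(pixels) - sum(assigned_to_c0)
    return [a == background_is_c0 for a in assigned_to_c0]
```

A production system would instead fit per-pixel Gaussian mixtures (e.g. the Stauffer-Grimson approach) so that multimodal backgrounds, such as swaying foliage, are modeled correctly.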
Description
RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 17/810,765, filed Jul. 5, 2022, which is incorporated herein by reference in its entirety.

BACKGROUND

Users with visual impairments often use screen readers to generate audio based on content displayed on a screen. For example, a visually impaired user may navigate to a webpage, using a user device, and use a text-to-speech algorithm to generate audio based on content of the webpage.

SUMMARY

Some implementations described herein relate to a system for image description generation for screen readers. The system may include one or more memories and one or more processors communicatively coupled to the one or more memories. The one or more processors may be configured to receive a setting indicating a level of verbosity. The one or more processors may be configured to receive an image and a set of words associated with the image. The one or more processors may be configured to identify a foreground of the image and a background of the image. The one or more processors may be configured to identify, within the foreground of the image, a set of objects. The one or more processors may be configured to rank the set of objects based on one or more properties of the set of objects and the set of words. The one or more processors may be configured to select a subset of objects from the set of objects based on the setting and the ranking. The one or more processors may be configured to generate descriptions of the selected subset of objects based on the setting. The one or more processors may be configured to input the generated descriptions to a text-to-speech algorithm.

Some implementations described herein relate to a method of image description generation for screen readers. The method may include receiving an image and a set of words associated with the image. The method may include identifying a foreground of the image and a background of the image. The method may include determining, based on the set of words, that the background is more important than the foreground. The method may include generating a description of the background. The method may include inputting the generated description to a text-to-speech algorithm.

Some implementations described herein relate to a non-transitory computer-readable medium that stores a set of instructions for image description generation for screen readers for a device. The set of instructions, when executed by one or more processors of the device, may cause the device to receive a setting indicating a level of verbosity. The set of instructions, when executed by one or more processors of the device, may cause the device to receive an image and a set of words associated with the image. The set of instructions, when executed by one or more processors of the device, may cause the device to identify a foreground of the image and a background of the image. The set of instructions, when executed by one or more processors of the device, may cause the device to identify, within the foreground of the image, a set of objects. The set of instructions, when executed by one or more processors of the device, may cause the device to rank the set of objects based on one or more properties of the set of objects and the set of words. The set of instructions, when executed by one or more processors of the device, may cause the device to select a subset of objects from the set of objects based on the setting and the ranking. The set of instructions, when executed by one or more processors of the device, may cause the device to generate descriptions of the selected subset of objects based on the setting. The set of instructions, when executed by one or more processors of the device, may cause the device to combine the generated descriptions using a plurality of connecting phrases into a narrative. The set of instructions, when executed by one or more processors of the device, may cause the device to input the narrative to a text-to-speech algorithm.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1D are diagrams of an example implementation relating to image description generation for screen readers, in accordance with some embodiments of the present disclosure.

FIGS. 2A-2B are diagrams of an example of training and using a machine learning model, in accordance with some embodiments of the present disclosure.

FIG. 3 is a diagram of an example environment in which systems and/or methods described herein may be implemented, in accordance with some embodiments of the present disclosure.

FIG. 4 is a diagram of example components of one or more devices of FIG. 3, in accordance with some embodiments of the present disclosure.

FIG. 5 is a flowchart of an example process relating to image description generation for screen readers, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION

The following detailed description of example implementations refers to the accompanying drawings. The s