
US-20260127369-A1 - UTILIZING A MULTI-ENCODER MULTIMODAL LANGUAGE MODEL ARCHITECTURE TO ENHANCE READING ABILITY IN GENERATING QUERY RESPONSES FROM TEXTUAL CONTENT IN DIGITAL IMAGES


Abstract

The present disclosure relates to systems, non-transitory computer-readable media, and methods for reading text within digital images utilizing multimodal language models. In particular, in some embodiments, the disclosed systems generate, utilizing a first visual encoder, a first set of visual features of a digital image comprising text. In addition, in some embodiments, the disclosed systems generate, utilizing a second visual encoder, a second set of visual features of the digital image. Moreover, in some embodiments, the disclosed systems determine, utilizing a visual-text encoder, a text string corresponding to the text of the digital image. Furthermore, in some embodiments, the disclosed systems generate, for a query directed to the text of the digital image, a response from the first set of visual features, the second set of visual features, and the text string utilizing a large language model.
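
The architecture summarized in the abstract can be pictured as a single forward pass over two visual-feature streams and a text stream. The sketch below is an illustrative reconstruction only: the module names, tensor shapes, choice of PyTorch, and the assumption that both visual encoders emit features of the same width are not taken from the disclosure.

```python
import torch
from torch import nn


class DualEncoderReader(nn.Module):
    """Illustrative dual-encoder multimodal reader (not the patented implementation)."""

    def __init__(self, hi_encoder, lo_encoder, text_embedder, llm, vis_dim, llm_dim):
        super().__init__()
        self.hi_encoder = hi_encoder        # first (high-resolution) visual encoder
        self.lo_encoder = lo_encoder        # second (low-resolution) visual encoder
        self.text_embedder = text_embedder  # maps token ids to LLM input embeddings
        self.llm = llm                      # large language model over input embeddings
        self.projection = nn.Linear(vis_dim, llm_dim)  # projection layer -> visual tokens

    def forward(self, image_hi, image_lo, text_ids, query_ids):
        # 1) Two sets of visual features, combined along the sequence axis
        #    (assumes both encoders return (batch, patches, vis_dim)).
        combined = torch.cat([self.hi_encoder(image_hi), self.lo_encoder(image_lo)], dim=1)
        visual_tokens = self.projection(combined)

        # 2) Text tokens for the extracted text string (from the visual-text
        #    encoder / OCR stage) and for the user query.
        text_tokens = self.text_embedder(text_ids)
        query_tokens = self.text_embedder(query_ids)

        # 3) Prompt the language model with visual, text, and query tokens.
        prompt = torch.cat([visual_tokens, text_tokens, query_tokens], dim=1)
        return self.llm(prompt)
```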

Inventors

  • Ruiyi Zhang
  • Yufan Zhou
  • Jian Chen
  • Jiuxiang Gu
  • Tong Sun

Assignees

  • ADOBE INC.

Dates

Publication Date
2026-05-07
Application Date
2024-11-07

Claims (20)

  1. A computer-implemented method comprising: generating, utilizing a first visual encoder, a first set of visual features of a digital image comprising text; generating, utilizing a second visual encoder, a second set of visual features of the digital image; determining, utilizing a visual-text encoder, a text string corresponding to the text of the digital image; and generating, for a query directed to the text of the digital image, a response from the first set of visual features, the second set of visual features, and the text string utilizing a large language model.
  2. The computer-implemented method of claim 1, wherein generating the second set of visual features comprises generating visual features that have a lower resolution than the first set of visual features.
  3. The computer-implemented method of claim 1, further comprising: determining, utilizing the visual-text encoder, text location information for the text string within the digital image; and generating the response from the first set of visual features, the second set of visual features, the text string, and the text location information utilizing the large language model.
  4. The computer-implemented method of claim 1, wherein generating the response comprises prompting the large language model with tokens for the first set of visual features, the second set of visual features, the text string, and the query.
  5. The computer-implemented method of claim 1, further comprising: combining the first set of visual features and the second set of visual features into a set of combined visual features for the digital image; and generating, utilizing a projection layer to transform the set of combined visual features, visual tokens for the digital image.
  6. The computer-implemented method of claim 5, further comprising: generating, utilizing a text tokenizer, text tokens for the text of the digital image; and generating, utilizing the text tokenizer, query tokens for the query directed to the text of the digital image.
  7. The computer-implemented method of claim 6, wherein generating the response comprises prompting the large language model with the visual tokens, the text tokens, and the query tokens to generate the response for the query.
  8. A system comprising: a memory component; and one or more processing devices coupled to the memory component, the one or more processing devices to perform operations comprising: generating, utilizing a first visual encoder, low-resolution visual features of a digital image; generating, utilizing a second visual encoder, high-resolution visual features of the digital image, wherein the high-resolution visual features have a higher resolution than the low-resolution visual features; combining the high-resolution visual features and the low-resolution visual features into a set of combined visual features for the digital image; generating, utilizing a projection layer, visual tokens from the set of combined visual features for the digital image; and generating, for a query directed to text within the digital image, a response based on the visual tokens.
  9. The system of claim 8, wherein the operations further comprise: generating a prompt comprising instructions to determine a text string corresponding to the text within the digital image; generating the text string from the prompt utilizing a large language model; and adjusting parameters of the projection layer to reduce a measure of loss determined by comparing the text string with a ground truth text string for the text within the digital image.
  10. The system of claim 8, wherein the operations further comprise: generating a prompt comprising instructions to determine text location information for the text within the digital image; generating the text location information from the prompt utilizing a large language model; and adjusting parameters of the projection layer to reduce a measure of loss determined by comparing the text location information with ground truth text location information for the text within the digital image.
  11. The system of claim 8, wherein the operations further comprise: generating a prompt comprising instructions to determine plain text and text location information for the text within the digital image; parsing the digital image to generate the plain text and the text location information from the prompt utilizing a large language model; and adjusting parameters of the projection layer to reduce a measure of loss determined by: comparing the plain text with ground truth text for the text within the digital image; and comparing the text location information with ground truth text location information for the text within the digital image.
  12. The system of claim 8, wherein the operations further comprise: generating a prompt comprising instructions to reconstruct a layout of the text within the digital image; generating a textual layout of the text within the digital image from the prompt utilizing a large language model; and adjusting parameters of the projection layer to reduce a measure of loss determined by comparing the textual layout with a ground truth layout for the text within the digital image.
  13. The system of claim 8, wherein the operations further comprise: determining, utilizing a visual-text encoder, a text string and text location information corresponding to the text within the digital image; and generating the response based on the visual tokens, the text string, the text location information, and the query.
  14. The system of claim 8, wherein generating the response comprises prompting a large language model with the visual tokens and tokens for the query.
  15. A non-transitory computer-readable medium storing executable instructions that, when executed by a processing device, cause the processing device to perform operations comprising: generating, utilizing a high-resolution visual encoder and a low-resolution visual encoder, a set of visual features for a digital image; generating, utilizing a projection layer, visual tokens from the set of visual features for the digital image; determining, utilizing a visual-text encoder to extract text information from the digital image, a text string identifying text within the digital image; and generating, utilizing a large language model, a response for a query directed to the text based on the visual tokens.
  16. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: generating the response for the query utilizing the large language model from the visual tokens and tokens for the query; and adjusting parameters of the large language model to reduce a measure of loss determined by comparing the response with a ground truth response for the query.
  17. The non-transitory computer-readable medium of claim 16, wherein the operations further comprise: generating, from the text information, text tokens for the text string identifying the text within the digital image; and generating, utilizing the large language model, the response for the query from the text tokens.
  18. The non-transitory computer-readable medium of claim 16, wherein the operations further comprise adjusting parameters of the projection layer to reduce the measure of loss determined by comparing the response with the ground truth response for the query.
  19. The non-transitory computer-readable medium of claim 15, wherein generating the set of visual features comprises: utilizing the high-resolution visual encoder to generate high-resolution visual features for the digital image; utilizing the low-resolution visual encoder to generate low-resolution visual features for the digital image at a lower resolution than the high-resolution visual features; and combining the high-resolution visual features and the low-resolution visual features into a set of combined visual features for the digital image.
  20. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise: determining, utilizing the visual-text encoder, text location information for the text string identifying the text within the digital image; and generating, utilizing the large language model, the response for the query from text tokens for the text string and the text location information.
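
Claims 9 through 12 describe pretraining in which only the projection layer is adjusted so that a large language model can reproduce ground-truth text, text location information, or layout from the visual tokens. Below is a minimal sketch of one such feature-alignment update, reusing the illustrative DualEncoderReader module from the earlier sketch; the token-level cross-entropy objective, the batch keys, and the omission of label masking and tokenization details are assumptions, not the claimed method.

```python
import torch
from torch import nn


def alignment_pretrain_step(model, batch, optimizer):
    """One hypothetical feature-alignment update: only the projection layer
    is trainable; the loss compares generated tokens against ground truth."""
    # Freeze everything except the projection layer.
    model.requires_grad_(False)
    model.projection.requires_grad_(True)

    # The prompt carries instructions such as "return the plain text and its
    # location" (in the spirit of claims 9-12); the batch keys are illustrative.
    logits = model(batch["image_hi"], batch["image_lo"],
                   batch["instruction_ids"], batch["query_ids"])

    # Token-level loss against ground-truth text / layout tokens
    # (assumes logits and labels are already aligned and padded).
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["target_ids"].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```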

Description

BACKGROUND

Recent years have seen developments in hardware and software platforms implementing vision models for reading text within digital images. For example, existing systems utilize large language models to understand and manipulate digital images. Despite these developments, existing systems suffer from a number of technical deficiencies, including inaccuracy and inefficiency. Indeed, many existing systems struggle to comprehend intensive textual content embedded within images, primarily due to the limited text-recognition and layout-understanding abilities of the implementing models.

BRIEF SUMMARY

Embodiments of the present disclosure provide benefits and/or solve one or more problems in the art with systems, non-transitory computer-readable media, and methods for utilizing a multi-encoder multimodal language model architecture to enhance model reading ability in generating query responses from textual content in digital images. In particular, in some embodiments, the disclosed systems utilize a multimodal large language model with dual visual encoders along with a visual-text encoder that enables efficient extraction of visual text. For example, the disclosed systems generate a first set of visual features of a digital image utilizing a high-resolution visual encoder and a second set of visual features of the digital image utilizing a low-resolution visual encoder. Additionally, in some implementations, the disclosed systems determine, utilizing a visual-text encoder, text strings corresponding to text depicted in the digital image. Moreover, in some embodiments, the disclosed systems tokenize the visual features and the text strings, as well as a user query directed to the text. Furthermore, in some implementations, the disclosed systems prompt a large language model with the tokens for the visual features, text strings, and user query to generate a response to the user query.

In addition, in some embodiments, the disclosed systems train one or more machine learning models used to generate the responses to the queries. For instance, in some implementations, the disclosed systems pretrain a projection layer that tokenizes the visual features according to one or more feature-alignment tasks. Moreover, in some embodiments, the disclosed systems finetune the projection layer and the large language model on prompt instructions to enhance the accuracy of response generation. By utilizing a multi-encoder multimodal large language model architecture and/or layout-aware pretraining and instruction finetuning, the disclosed systems demonstrate substantial enhancements in text-rich image understanding, surpassing multiple baselines on public benchmarks.

The following description sets forth additional features and advantages of one or more embodiments of the disclosed methods, non-transitory computer-readable media, and systems. In some cases, such features and advantages are evident to a skilled artisan having the benefit of this disclosure, or may be learned by the practice of the disclosed embodiments.
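
The summary above distinguishes an alignment pretraining stage, in which only the projection layer is updated, from an instruction-finetuning stage, in which the projection layer and the large language model are both updated. A hedged sketch of how the trainable parameters could be selected for the finetuning stage follows, again reusing the illustrative module names from the earlier sketch; the optimizer choice and learning rate are assumptions.

```python
from torch import optim


def build_finetune_optimizer(model, lr=2e-5):
    """Instruction-finetuning stage (sketch): the visual encoders stay frozen,
    while the projection layer and the language model are updated."""
    model.requires_grad_(False)
    model.projection.requires_grad_(True)
    model.llm.requires_grad_(True)

    trainable = [p for p in model.parameters() if p.requires_grad]
    return optim.AdamW(trainable, lr=lr)
```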
BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description provides one or more embodiments with additional specificity and detail through the use of the accompanying drawings, as briefly described below.

FIG. 1 illustrates a diagram of an environment in which a multimodal reading system operates in accordance with one or more embodiments.

FIG. 2 illustrates the multimodal reading system parsing text from a digital image and responding to a query directed to the text in accordance with one or more embodiments.

FIG. 3 illustrates the multimodal reading system extracting textual information from a digital image and generating a query response about the textual information in accordance with one or more embodiments.

FIGS. 4A-4C illustrate the multimodal reading system pretraining a projection layer in accordance with one or more embodiments.

FIG. 5 illustrates the multimodal reading system finetuning a projection layer and a large language model in accordance with one or more embodiments.

FIG. 6 illustrates the multimodal reading system reading a text-rich digital image and answering a question directed to the text of the digital image in accordance with one or more embodiments.

FIG. 7 illustrates experimental results of the multimodal reading system, with comparisons to existing systems, in accordance with one or more embodiments.

FIG. 8 illustrates a diagram of an example architecture of the multimodal reading system in accordance with one or more embodiments.

FIG. 9 illustrates a flowchart of a series of acts for reading text within a digital image and generating a response to a query directed to the text within the digital image in accordance with one or more embodiments.

FIG. 10 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.

DETAILED DESCRIPTION

This disclosure describes one or more embodiments of a multimodal reading system