US-12626433-B2 - Generating supplemental text and image content in multimodal digital content items via machine learning


Abstract

The present disclosure relates to systems, non-transitory computer-readable media, and methods for expanding a digital document including a sequence of informational data via supplemental multimodal digital content. In particular, the system expands digital documents with multimodal granular details to dynamically integrate supplemental in-depth information to the digital document. For example, in response to a selection of a specific portion of a digital document, the system generates expanded multimodal content (e.g., text and image content) for the selected portion of the digital document from external text and image sources. Indeed, the system uses existing content from the digital document to select images and combine the selected images with text into image-text pairs that are textually and visually consistent with the digital document. Moreover, the system expands the digital document by inserting the image-text pairs in connection with the selected portion of the digital document.
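
The flow summarized in the abstract (choose images that are visually consistent with the document, pair them with text, and insert the pairs next to the selected portion) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `ContentItem` type, the `pick_consistent_image` and `expand_document` helpers, and the use of plain cosine similarity over precomputed feature vectors are hypothetical and are not taken from the disclosure.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class ContentItem:
    """Hypothetical content item: a text block with an optional image reference."""
    text: str
    image: Optional[str] = None


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_consistent_image(candidates: Dict[str, Sequence[float]],
                          document_features: List[Sequence[float]]) -> str:
    """Return the candidate whose features are, on average, closest to the
    features of images already in the document (an assumed selection rule)."""
    if not document_features:
        return next(iter(candidates))  # nothing to compare against

    def score(name: str) -> float:
        sims = [cosine(candidates[name], f) for f in document_features]
        return sum(sims) / len(sims)

    return max(candidates, key=score)


def expand_document(items: List[ContentItem], selected_index: int,
                    pairs: List[ContentItem], replace: bool = False) -> List[ContentItem]:
    """Insert image-text pairs after the selected item, or in place of it."""
    if not 0 <= selected_index < len(items):
        raise IndexError("selected_index is outside the document")
    head = items[:selected_index] if replace else items[:selected_index + 1]
    return head + list(pairs) + items[selected_index + 1:]


doc = [ContentItem("Mix the dough"), ContentItem("Bake the bread"), ContentItem("Serve")]
image = pick_consistent_image(
    {"photo.png": [1.0, 0.0], "sketch.png": [0.0, 1.0]},  # made-up features
    [[0.9, 0.1]],                                          # made-up document features
)
pairs = [ContentItem("Preheat the oven", image)]
expanded = expand_document(doc, 1, pairs)
```

The `replace` flag mirrors the two insertion variants that appear in the claims (inserting in connection with the selected item versus replacing it); in a real system the feature vectors would come from an image model rather than hand-written lists.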

Inventors

Anant Shankhdhar
Samyak Sanjay Mehta
Shreya Singh
K V Vikram
Tripti Shukla
Srikrishna Karanam
Balaji Vasan Srinivasan
Vishwa Vinay
Niyati Himanshu Chhaya

Assignees

ADOBE INC.

Dates

Publication Date: 2026-05-12
Application Date: 2023-05-24

Claims (20)

1. A computer-implemented method comprising: generating, by at least one processor and for a selected content item from a plurality of content items of a selected digital document, a plurality of extracted text content items by extracting text from one or more digital documents of a plurality of ranked digital documents of a digital document repository, wherein the plurality of extracted text content items correspond to supplemental textual content comprising granular detail related to the selected content item; retrieving, by the at least one processor and from an image repository, a plurality of selected digital images based on the plurality of extracted text content items; and modifying, by the at least one processor, the selected digital document by inserting image-text pairs comprising the plurality of extracted text content items and the plurality of selected digital images in connection with the selected content item.
2. The computer-implemented method of claim 1, further comprising detecting the selected content item from the plurality of content items by detecting a selection of an image-text pair within the selected digital document.
3. The computer-implemented method of claim 1, wherein determining the plurality of ranked digital documents from the digital document repository comprises selecting one or more digital documents from the digital document repository based on a similarity of textual content within the one or more digital documents to the selected content item.
4. The computer-implemented method of claim 1, wherein generating the plurality of extracted text content items comprises: generating text content dependency graphs for the extracted text from the one or more digital documents; generating a content item dependency graph for the selected content item; and selecting a subset of the extracted text from the one or more digital documents based on the text content dependency graphs and the content item dependency graph.
5. The computer-implemented method of claim 4, wherein modifying the selected digital document comprises: determining a text content order for the plurality of extracted text content items based on an order of the extracted text in the one or more digital documents; and inserting the plurality of extracted text content items and the plurality of selected digital images in the selected digital document based on the text content order.
6. The computer-implemented method of claim 1, wherein modifying the selected digital document comprises replacing the selected content item with the image-text pairs.
7. The computer-implemented method of claim 1, wherein retrieving the plurality of selected digital images comprises: parsing the plurality of extracted text content items to obtain key phrases associated with the selected content item; and retrieving, from the image repository, the plurality of selected digital images based on the key phrases.
8. The computer-implemented method of claim 7, wherein: parsing the plurality of extracted text content items to obtain the key phrases comprises: extracting a plurality of keywords from the plurality of extracted text content items; and generating a set of queries comprising the key phrases based on the plurality of keywords or one or more combinations of the plurality of keywords; and retrieving the plurality of selected digital images comprises performing digital image searches based on the set of queries comprising the key phrases.
9. The computer-implemented method of claim 1, wherein determining the plurality of selected digital images comprises: extracting first image features from the plurality of selected digital images; extracting second image features from one or more digital images in the selected digital document; and selecting the plurality of selected digital images based on the first image features and the second image features.
10. A system comprising: one or more memory devices comprising a digital document; and one or more processors configured to cause the system to: determine in response to an indication of a selected content item from a plurality of ordered content items of a selected digital document, a plurality of ranked digital documents from a digital document repository; generate, for the selected content item and utilizing a natural language processing model, a plurality of text content items by comparing text of the selected content item to text extracted from one or more documents of the plurality of ranked digital documents, wherein the text extracted from the one or more documents corresponds to supplemental textual content comprising granular detail related to the selected content item; select, from an image repository, a plurality of selected digital images based on one or more queries generated from the plurality of text content items; and modify the selected digital document by inserting digital content comprising the plurality of text content items and the plurality of selected digital images into the plurality of ordered content items of the selected digital document.
11. The system of claim 10, where the one or more processors are further configured to cause the system to determine the indication of the selected content item from the plurality of ordered content items by detecting a selection of an image-text pair within the selected digital document.
12. The system of claim 10, where the one or more processors are further configured to cause the system to generate the plurality of text content items by: generating text content dependency graphs for the extracted text from the plurality of ranked digital documents; generating a content item dependency graph for the selected content item; and selecting a subset of the extracted text from the plurality of ranked digital documents based on the text content dependency graphs and the content item dependency graph.
13. The system of claim 12, where the one or more processors are further configured to cause the system to modify the selected digital document by: determining a text content order for the plurality of text content items based on an order of the extracted text in the plurality of ranked digital documents; and inserting the plurality of text content items and the plurality of selected digital images in the selected digital document based on the text content order.
14. The system of claim 10, where the one or more processors are further configured to cause the system to modify the selected digital document by inserting digital content comprising the plurality of text content items and the plurality of selected digital images into the plurality of ordered content items adjacent to the selected content item.
15. The system of claim 10, where the one or more processors are further configured to cause the system to retrieve the plurality of selected digital images by: parsing the plurality of text content items to obtain key phrases associated with the selected content item; and retrieving, from the image repository, the plurality of selected digital images based on the one or more queries comprising the key phrases.
16. The system of claim 10, where the one or more processors are further configured to cause the system to: determining, in response to an indication of a second selected content item from the plurality of ordered content items of the digital document, a second plurality of ranked digital documents from the digital document repository; generating, for the selected content item, a second plurality of text content items by extracting text from one or more documents of the second plurality of ranked digital documents; retrieving, from the image repository, a second plurality of selected digital images based on the plurality of text content items; and modifying, the selected digital document by inserting second digital content comprising the second plurality of text content items and the second plurality of selected digital images in connection with the second selected content item.
17. A non-transitory computer readable medium storing executable instructions which, when executed by a processing device, cause the processing device to perform operations comprising: determining, in response to an indication of a selected content item from a plurality of content items of a selected digital document, a plurality of ranked digital documents from a digital document repository; generating, for the selected content item, a plurality of text content items by extracting text from one or more documents of the plurality of ranked digital documents, wherein the text from the one or more documents corresponds to supplemental textual content comprising granular detail related to the selected content item; retrieving, from an image repository, a plurality of selected digital images based on the plurality of text content items; and modifying, the selected digital document by inserting digital content comprising the plurality of text content items and the plurality of selected digital images in connection with the selected content item.
18. The non-transitory computer readable medium of claim 17, wherein determining the plurality of selected digital images comprises: extracting first image features from the plurality of selected digital images; extracting second image features from one or more digital images in the selected digital document; and selecting the plurality of selected digital images based on the first image features and the second image features.
19. The non-transitory computer readable medium of claim 17, wherein retrieving the plurality of selected digital images comprises: parsing the plurality of text content items to obtain key phrases associated with the selected content item; generating an abbreviated text content item from the plurality of text content items by removing stop-words from the plurality of text content items; and retrieving the plurality of selected digital images comprises performing digital image searches based on queries comprising the key phrases and the abbreviated text content item.
20. The non-transitory computer readable medium of claim 17, wherein the executable instructions cause the processing device to perform operations further comprising: determining, in response to an indication of a second selected content item from the plurality of content items of the selected digital document, a second plurality of ranked digital documents from the digital document repository; generating, for the selected content item, a second plurality of text content items by extracting text from one or more documents of the second plurality of ranked digital documents; retrieving, from the image repository, a second plurality of selected digital images based on the plurality of text content items; and modifying, the selected digital document by inserting second digital content comprising the second plurality of text content items and the second plurality of selected digital images in connection with the second selected content item.
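
Several steps recited in the claims above (ranking repository documents by textual similarity, removing stop-words to form an abbreviated text, extracting keywords, and building image-search queries from keywords and their combinations) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: bag-of-words cosine similarity stands in for whatever similarity measure or natural language processing model an actual embodiment uses, the tiny `STOP_WORDS` set is made up, and every function name is hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import List

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "on",
                        "for", "with", "is", "are", "it", "then", "until"})


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(text: str) -> str:
    """Build an abbreviated text by dropping stop-words."""
    return " ".join(t for t in tokenize(text) if t not in STOP_WORDS)


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in similarity measure)."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def rank_documents(selected: str, repository: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k repository documents most similar to the selected item."""
    ranked = sorted(repository,
                    key=lambda doc: cosine_similarity(selected, doc),
                    reverse=True)
    return ranked[:top_k]


def extract_keywords(texts: List[str], max_keywords: int = 3) -> List[str]:
    """Pick the most frequent non-stop-word tokens (ties broken alphabetically)."""
    counts = Counter(t for text in texts for t in tokenize(text)
                     if t not in STOP_WORDS)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_keywords]]


def build_queries(keywords: List[str]) -> List[str]:
    """Queries from single keywords plus every pairwise combination."""
    return list(keywords) + [" ".join(pair) for pair in combinations(keywords, 2)]


repository = ["bake the bread in an oven", "knead dough until smooth", "wash the car"]
top = rank_documents("knead the dough", repository, top_k=1)
keywords = extract_keywords(["knead dough", "dough rises", "knead dough again"], 2)
queries = build_queries(keywords)
```

Frequency-based keyword selection is chosen here only because it needs no external libraries; the resulting `queries` list is what would be passed to an image search, which is left out because the claims do not name a specific search service.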

Description

BACKGROUND

Recent years have seen significant improvements in hardware and software platforms for generating and distributing digital documentation, resulting in an increased prevalence of digital documentation for many different subjects. For example, many entities and systems utilize digital documentation including procedural information (e.g., instructions or steps for performing specific processes) or non-procedural information (e.g., travelogs or descriptions of specific topics) with text and/or images to provide users with understanding of different concepts. Because some types of digital content (e.g., text, images) are better for describing/illustrating certain types of content and/or various display environments than others, generating digital documents that utilize the various modalities of communication to accurately and intuitively provide relevant information on various topics can be challenging. Conventional systems, consequently, have a number of shortcomings with regard to flexibility and accuracy in providing digital documentation with multimodal content for providing accurate and efficient understanding of specific concepts.

SUMMARY

Embodiments of the present disclosure solve one or more of the foregoing or other problems in the art with systems, non-transitory computer-readable media, and methods for expanding a digital document including a sequence of informational data via supplemental multimodal digital content. In particular, the system expands informational digital documents with multimodal granular details to dynamically integrate supplemental in-depth information to the digital document. For example, in response to a selection of a specific portion of a digital document, the system generates expanded multimodal informational content (e.g., text and image content) for the selected portion of the digital document from external text and image sources. Indeed, the system uses existing content from the digital document to select images and combine the selected images with text into image-text pairs that are textually and visually consistent with the digital document. Moreover, the system expands the digital document by inserting the image-text pairs in connection with the selected portion of the digital document. The system thus provides flexible and accurate expansion of digital documents with visual and contextual coherence according to the content of the digital documents.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description provides one or more embodiments with additional specificity and detail through the use of the accompanying drawings, as briefly described below.

FIG. 1 illustrates a diagram of an environment in which a document expansion system operates in accordance with one or more embodiments.
FIG. 2 illustrates an overview diagram of the document expansion system expanding a digital document in accordance with one or more embodiments.
FIG. 3 illustrates an overview of a document expansion system inserting a series of image-text pairs into a digital document in accordance with one or more embodiments.
FIG. 4 illustrates an overview of a document expansion system retrieving a plurality of ranked documents based on user selected content in accordance with one or more embodiments.
FIG. 5 illustrates an example of expanding a content item from a digital document utilizing dependency graphs to determine a plurality of supplemental text content items in accordance with one or more embodiments.
FIGS. 6A-6B illustrate the document expansion system generating image-text pairs comprising text content items and retrieved digital images in accordance with one or more embodiments.
FIG. 7 illustrates selecting contextually consistent digital images for a digital document in accordance with one or more embodiments.
FIGS. 8A-8C illustrate inserting supplemental image-text pairs into a digital document in accordance with one or more embodiments.
FIGS. 9A-9B illustrate similarity score results for the document expansion system in accordance with one or more embodiments.
FIG. 10 illustrates a schematic diagram of a document expansion system in accordance with one or more embodiments.
FIG. 11 illustrates a flowchart of a series of acts for inserting supplemental multimodal content into a digital document utilizing machine-learning models in accordance with one or more embodiments.
FIG. 12 illustrates a block diagram of an example computing device in accordance with one or more embodiments.

DETAILED DESCRIPTION

This disclosure describes one or more embodiments of a document expansion system for expanding a portion of a digital document including a sequence of informational data to provide additional details corresponding to the portion. In particular, the document expansion system expands informational digital documents with multimodal granular details to dynamically integrate supplemental in-depth information to the digital document. For example, in some embodiments, the document expansio