US-12619814-B2 - Method, device, and computer program product for data augmentation
Abstract
Embodiments of the present disclosure relate to a method, a device, and a computer program product for data augmentation. The method includes generating an image embedding based on an image in an unstructured document, and generating a text embedding based on text in the unstructured document and associated with the image. The method further includes acquiring descriptive information from a storage library based on the generated image embedding and text embedding. The method further includes adding the acquired descriptive information into the unstructured document. In this way, the unstructured document can not only be understood and analyzed across modalities, but also enriched with a characterization of its multimodal data, thus increasing the amount and diversity of the data.
Inventors
- Jiacheng Ni
- Bin He
- Zijia Wang
- Zhen Jia
Assignees
- DELL PRODUCTS L.P.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-11-06
- Priority Date: 2023-10-13
Claims (20)
- 1 . A method for data augmentation, comprising: generating an image embedding based on an image in an unstructured document, wherein the image embedding comprises a first vector generated by applying at least one object detected in the image to an image encoder; generating a text embedding based on text in the unstructured document and associated with the image, wherein the text embedding comprises a second vector generated by applying at least a portion of the text to a text encoder; generating a multimodal embedding based on the first vector and the second vector; acquiring descriptive information from a storage library based on the multimodal embedding generated using the first vector and second vector corresponding to the respective image embedding and text embedding; adding the acquired descriptive information into the unstructured document, wherein adding the acquired descriptive information into the unstructured document comprises inserting the acquired descriptive information into the unstructured document to generate an updated version of the unstructured document, the updated version including the acquired descriptive information in addition to the image and the text; and storing the image embedding comprising the first vector, the text embedding comprising the second vector, and the acquired descriptive information as related informational elements in a multi-element unified format to provide image-text pair indexing of the acquired descriptive information, wherein the multi-element unified format comprises a file of a particular file type having an ordered arrangement of the related informational elements, with the image embedding, the text embedding and the acquired descriptive information being stored as respective ones of the related informational elements in the ordered arrangement of the file, and the file being stored as part of the storage library for utilization in subsequent data augmentation operations relating to one or more additional unstructured documents.
- 2 . The method according to claim 1 , wherein generating the image embedding comprises: extracting the image from the unstructured document; detecting an object entity in the extracted image; and encoding the detected object entity as the image embedding.
- 3 . The method according to claim 2 , wherein generating the text embedding comprises: extracting the text from the unstructured document; recognizing a tag entity in the extracted text; and encoding the recognized tag entity as the text embedding.
- 4 . The method according to claim 3 , wherein encoding for the object entity is performed by an image encoder of a plurality of image encoders and corresponding to the type of the object entity, and encoding for the tag entity is performed by a text encoder of a plurality of text encoders and corresponding to the type of the tag entity, wherein the plurality of image encoders and the plurality of text encoders are preliminarily co-trained based on training data comprising an image-text pair, and wherein the image-text pair comprises an image and text associated with each other.
- 5 . The method according to claim 1 , wherein the storage library comprises: a data storage library, the data storage library being configured to store an image-text pair comprising an image and text associated with each other and descriptive information corresponding to one of the image and the text associated with each other comprised in the image-text pair, wherein the descriptive information indicates the image and the text associated with each other in the image-text pair; a model storage library, the model storage library being configured to store a plurality of image encoders and a plurality of text encoders, each image encoder of the plurality of image encoders corresponding to a respective type of an object entity of the image, and each text encoder of the plurality of text encoders corresponding to a respective type of a tag entity of the text; and a feature storage library, the feature storage library being configured to store a plurality of multimodal embeddings, each multimodal embedding of the plurality of multimodal embeddings corresponding to a respective image-text pair.
- 6 . The method according to claim 5 , further comprising: filtering out, based on the text embedding, a first predetermined number of multimodal embeddings from the plurality of multimodal embeddings stored in the feature storage library.
- 7 . The method according to claim 6 , further comprising: determining similarity of the generated multimodal embedding to each multimodal embedding of the first predetermined number of multimodal embeddings by comparing the multimodal embedding with the first predetermined number of multimodal embeddings; and determining, as further multimodal embeddings associated with the multimodal embedding, a second predetermined number of multimodal embeddings among the first predetermined number of multimodal embeddings having similarity to the multimodal embedding higher than a predetermined similarity threshold, wherein the second predetermined number is less than the first predetermined number.
- 8 . The method according to claim 7 , wherein acquiring the descriptive information from the storage library comprises: determining an image-text pair corresponding to the further multimodal embeddings; and determining the descriptive information corresponding to the determined image-text pair.
- 9 . The method according to claim 7 , further comprising: storing, in the data storage library, the image and the text in the unstructured document in association with the descriptive information; and storing the multimodal embedding in the feature storage library.
- 10 . The method according to claim 1 , wherein the added descriptive information is editable, and the added descriptive information comprises structured information.
- 11 . An electronic device, comprising: at least one processor; and memory coupled to the at least one processor and storing instructions, wherein the instructions, when executed by the at least one processor, cause the electronic device to perform actions comprising: generating an image embedding based on an image in an unstructured document, wherein the image embedding comprises a first vector generated by applying at least one object detected in the image to an image encoder; generating a text embedding based on text in the unstructured document and associated with the image, wherein the text embedding comprises a second vector generated by applying at least a portion of the text to a text encoder; generating a multimodal embedding based on the first vector and the second vector; acquiring descriptive information from a storage library based on the multimodal embedding generated using the first vector and second vector corresponding to the respective image embedding and text embedding; adding the acquired descriptive information into the unstructured document, wherein adding the acquired descriptive information into the unstructured document comprises inserting the acquired descriptive information into the unstructured document to generate an updated version of the unstructured document, the updated version including the acquired descriptive information in addition to the image and the text; and storing the image embedding comprising the first vector, the text embedding comprising the second vector, and the acquired descriptive information as related informational elements in a multi-element unified format to provide image-text pair indexing of the acquired descriptive information, wherein the multi-element unified format comprises a file of a particular file type having an ordered arrangement of the related informational elements, with the image embedding, the text embedding and the acquired descriptive information being stored as respective ones of the related informational elements in the ordered arrangement of the file, and the file being stored as part of the storage library for utilization in subsequent data augmentation operations relating to one or more additional unstructured documents.
- 12 . The electronic device according to claim 11 , wherein generating the image embedding comprises: extracting the image from the unstructured document; detecting an object entity in the extracted image; and encoding the detected object entity as the image embedding.
- 13 . The electronic device according to claim 12 , wherein generating the text embedding comprises: extracting the text from the unstructured document; recognizing a tag entity in the extracted text; and encoding the recognized tag entity as the text embedding.
- 14 . The electronic device according to claim 13 , wherein encoding for the object entity is performed by an image encoder of a plurality of image encoders and corresponding to the type of the object entity, and encoding for the tag entity is performed by a text encoder of a plurality of text encoders and corresponding to the type of the tag entity, wherein the plurality of image encoders and the plurality of text encoders are preliminarily co-trained based on training data comprising an image-text pair, and wherein the image-text pair comprises an image and text associated with each other.
- 15 . The electronic device according to claim 11 , wherein the storage library comprises: a data storage library, the data storage library being configured to store an image-text pair comprising an image and text associated with each other and descriptive information corresponding to one of the image and the text associated with each other comprised in the image-text pair, wherein the descriptive information indicates the image and the text associated with each other in the image-text pair; a model storage library, the model storage library being configured to store a plurality of image encoders and a plurality of text encoders, each image encoder of the plurality of image encoders corresponding to a respective type of an object entity of the image, and each text encoder of the plurality of text encoders corresponding to a respective type of a tag entity of the text; and a feature storage library, the feature storage library being configured to store a plurality of multimodal embeddings, each multimodal embedding of the plurality of multimodal embeddings corresponding to a respective image-text pair.
- 16 . The electronic device according to claim 15 , wherein the actions further comprise: filtering out, based on the text embedding, a first predetermined number of multimodal embeddings from the plurality of multimodal embeddings stored in the feature storage library.
- 17 . The electronic device according to claim 16 , wherein the actions further comprise: determining similarity of the generated multimodal embedding to each multimodal embedding of the first predetermined number of multimodal embeddings by comparing the multimodal embedding with the first predetermined number of multimodal embeddings; and determining, as further multimodal embeddings associated with the multimodal embedding, a second predetermined number of multimodal embeddings among the first predetermined number of multimodal embeddings having similarity to the multimodal embedding higher than a predetermined similarity threshold, wherein the second predetermined number is less than the first predetermined number.
- 18 . The electronic device according to claim 17 , wherein acquiring the descriptive information from the storage library comprises: determining an image-text pair corresponding to the further multimodal embeddings; and determining the descriptive information corresponding to the determined image-text pair.
- 19 . The electronic device according to claim 17 , wherein the actions further comprise: storing, in the data storage library, the image and the text in the unstructured document in association with the descriptive information; and storing the multimodal embedding in the feature storage library.
- 20 . A computer program product comprising a non-transitory computer-readable medium having computer-executable instructions stored therein, wherein the computer-executable instructions, when executed by a computer, cause the computer to perform actions comprising: generating an image embedding based on an image in an unstructured document, wherein the image embedding comprises a first vector generated by applying at least one object detected in the image to an image encoder; generating a text embedding based on text in the unstructured document and associated with the image, wherein the text embedding comprises a second vector generated by applying at least a portion of the text to a text encoder; generating a multimodal embedding based on the first vector and the second vector; acquiring descriptive information from a storage library based on the multimodal embedding generated using the first vector and second vector corresponding to the respective image embedding and text embedding; adding the acquired descriptive information into the unstructured document, wherein adding the acquired descriptive information into the unstructured document comprises inserting the acquired descriptive information into the unstructured document to generate an updated version of the unstructured document, the updated version including the acquired descriptive information in addition to the image and the text; and storing the image embedding comprising the first vector, the text embedding comprising the second vector, and the acquired descriptive information as related informational elements in a multi-element unified format to provide image-text pair indexing of the acquired descriptive information, wherein the multi-element unified format comprises a file of a particular file type having an ordered arrangement of the related informational elements, with the image embedding, the text embedding and the acquired descriptive information being stored as respective ones of the related informational elements in the ordered arrangement of the file, and the file being stored as part of the storage library for utilization in subsequent data augmentation operations relating to one or more additional unstructured documents.
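The two-stage retrieval recited in claims 6 through 8 — a coarse filter over the stored multimodal embeddings using only the text embedding, followed by a similarity-thresholded selection of a smaller number of candidates — can be sketched as follows. This is an illustrative reading only: the claims do not mandate cosine similarity, a list-backed feature storage library, or the field names (`text_emb`, `mm_emb`, `description`) assumed here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve_descriptions(query_mm, query_text, library, k1=100, k2=5, threshold=0.8):
    """Two-stage lookup over a feature storage library.

    Stage 1: filter out a first predetermined number (k1) of stored
    multimodal embeddings using only the text embedding (claim 6).
    Stage 2: compare the generated multimodal embedding with each of the
    k1 candidates and keep at most a second predetermined number (k2 < k1)
    whose similarity exceeds the threshold (claims 7 and 8).
    """
    # Stage 1: coarse filter by text-embedding similarity.
    candidates = sorted(library,
                        key=lambda e: cosine(query_text, e["text_emb"]),
                        reverse=True)[:k1]

    # Stage 2: fine ranking by full multimodal-embedding similarity.
    scored = [(cosine(query_mm, e["mm_emb"]), e) for e in candidates]
    hits = sorted((t for t in scored if t[0] > threshold),
                  key=lambda t: t[0], reverse=True)[:k2]

    # Return the descriptive information of the matching image-text pairs.
    return [e["description"] for _, e in hits]
```

A production system would replace the linear scans with an approximate-nearest-neighbor index; the sketch only illustrates the claimed two-stage structure.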
Description
RELATED APPLICATION
The present application claims priority to Chinese Patent Application No. 202311332977.9, filed Oct. 13, 2023, and entitled “Method, Device, and Computer Program Product for Data Augmentation,” which is incorporated by reference herein in its entirety.
FIELD
Embodiments of the present disclosure generally relate to the field of computers, and in particular to a method, a device, and a computer program product for data augmentation.
BACKGROUND
Unstructured data refers to data that has an irregular or incomplete structure, has no predefined data model, or cannot easily be represented in the two-dimensional logical tables of a database. Like other types of data, unstructured data has experienced explosive growth. It includes various forms of information, such as text, images, audio, and video. Because it takes multiple forms and is not limited to fixed data models, unstructured data is more diverse and flexible than formatted data. It also contains a large amount of information, and valuable information may be extracted from it for decision-making, judgment, and the like through data mining and related techniques. To better process and analyze unstructured data, new technologies and methods must be continually explored.
SUMMARY
Embodiments of the present disclosure provide a solution for data augmentation. With this solution, the unstructured document can not only be understood and analyzed across modalities, but also enriched with a characterization of its multimodal data, thus increasing the amount and diversity of the data.
In a first aspect of the present disclosure, a method for data augmentation is provided, the method including generating an image embedding based on an image in an unstructured document, and generating a text embedding based on text in the unstructured document and associated with the image.
The method further includes acquiring descriptive information from a storage library based on the generated image embedding and text embedding, and adding the acquired descriptive information into the unstructured document.
In another aspect of the present disclosure, an electronic device is provided. The device includes a processor and a memory, the memory being coupled to the processor and storing instructions that, when executed by the processor, cause the electronic device to perform actions including generating an image embedding based on an image in an unstructured document, and generating a text embedding based on text in the unstructured document and associated with the image. The actions further include acquiring descriptive information from a storage library based on the generated image embedding and text embedding, and adding the acquired descriptive information into the unstructured document.
In still another aspect of the present disclosure, a computer program product is provided. The computer program product is tangibly stored on a non-transitory computer-readable storage medium and includes computer-executable instructions that, when executed by a computer, cause the computer to perform the method or process according to embodiments of the present disclosure.
It should be noted that this Summary is provided to introduce a series of concepts in a simplified manner; these concepts are further described in the Detailed Description below. The Summary is neither intended to identify key features or necessary features of the present disclosure, nor intended to limit the scope of the present disclosure.
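The flow described in the Summary — embed the image and its associated text, fuse the two vectors, acquire descriptive information from a storage library, and add it to the document — can be illustrated with a minimal sketch. The encoder callables, the concatenation-based fusion, and the `lookup` interface are assumptions made for illustration, not the disclosed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class UnstructuredDocument:
    text: str
    image: bytes
    annotations: list = field(default_factory=list)  # editable descriptive information

def augment(doc, image_encoder, text_encoder, storage_library):
    """Generate embeddings, fetch descriptive information, enrich the document."""
    image_emb = image_encoder(doc.image)   # first vector, from the image
    text_emb = text_encoder(doc.text)      # second vector, from the associated text
    multimodal_emb = image_emb + text_emb  # fuse by concatenating the two lists

    # Acquire descriptive information from the storage library, using the
    # multimodal embedding as the query key.
    description = storage_library.lookup(multimodal_emb)

    # Add the acquired descriptive information into the unstructured document,
    # yielding an updated version that retains the original image and text.
    doc.annotations.append(description)
    return doc
```

In practice the encoders would be pretrained image and text models returning fixed-length vectors, and the library would index stored image-text pairs; the sketch only shows the order of the claimed steps.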
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objectives, features, and advantages of the present disclosure will become more apparent from the following detailed description of embodiments, provided with reference to the accompanying drawings, in which:
- FIG. 1 is a block diagram of an example environment in which a method and/or a process according to an embodiment of the present disclosure may be implemented;
- FIG. 2 is a flow chart of a method for data augmentation according to an embodiment of the present disclosure;
- FIG. 3 illustrates a data augmentation process for an unstructured document according to an embodiment of the present disclosure;
- FIG. 4 illustrates an embedding generation process for multimodal data according to an embodiment of the present disclosure;
- FIG. 5 illustrates an index building process according to an embodiment of the present disclosure;
- FIG. 6 illustrates an example of a storage library according to an embodiment of the present disclosure;
- FIG. 7 illustrates an interaction of multiple unified sets of class block storage containers according to an embodiment of the present disclosure; and
- FIG. 8 is a block diagram of an example device that ma