
KR-20260063647-A - METHOD FOR LEARNING MODEL FOR GENERATING EMBEDDINGS OF DOCUMENT AND SYSTEM THEREFOR


Abstract

A method for training a document embedding generation model, and a system therefor, are provided. A method according to an embodiment of the present disclosure is performed by a computing device and comprises: generating a content token containing context information of a first document and a layout token containing format information of the first document; inputting the content token, a text embedding vector representing text information included in the first document, and an image embedding vector representing image information of the first document into an encoder, and obtaining a first content embedding vector representing the context information of the first document; inputting the layout token, the text embedding vector, and the image embedding vector into an encoder, and obtaining a first layout embedding vector representing the format information of the first document; inputting the first content embedding vector into a decoder and predicting the context of a masked first document based on the first content embedding vector; and inputting the first layout embedding vector into a decoder and predicting the format of the first document based on the first layout embedding vector.
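The two-branch pipeline described above can be sketched at a high level as follows. This is a minimal illustration only: the embedding dimension, the random linear maps standing in for the actual encoder and decoder, and all variable names are hypothetical and do not appear in the source.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (hypothetical)

W_enc = rng.standard_normal((D, D))  # stand-in for the encoder
W_dec = rng.standard_normal((D, D))  # stand-in for the decoder

def encode(tokens: np.ndarray) -> np.ndarray:
    """Stand-in encoder: mean-pool the token sequence, then project."""
    return tokens.mean(axis=0) @ W_enc

# Hypothetical inputs for a first document.
content_token = rng.standard_normal((1, D))  # token carrying context information
layout_token = rng.standard_normal((1, D))   # token carrying format information
text_emb = rng.standard_normal((8, D))       # text embedding vectors
image_emb = rng.standard_normal((4, D))      # image embedding vectors

# Content branch: content token + text + image embeddings -> content embedding.
content_embedding = encode(np.concatenate([content_token, text_emb, image_emb]))

# Layout branch: layout token + text + image embeddings -> layout embedding.
layout_embedding = encode(np.concatenate([layout_token, text_emb, image_emb]))

# Decoder side: each branch's embedding is decoded to predict the masked
# document content or the document format, respectively.
predicted_content = content_embedding @ W_dec
predicted_layout = layout_embedding @ W_dec
```

In a real implementation the two encoder calls would share a trained transformer rather than a fixed linear map; the sketch only shows how the content and layout branches consume the same text and image embeddings but different task tokens.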

Inventors

  • 서지현
  • 손형관
  • 유지아
  • 조현철

Assignees

  • Samsung SDS Co., Ltd. (삼성에스디에스 주식회사)

Dates

Publication Date
2026-05-07
Application Date
2024-10-30

Claims (10)

  1. A method for training a document embedding generation model, performed by a computing device, the method comprising: generating a content token containing context information of a first document and a layout token containing format information of the first document; inputting the content token, a text embedding vector representing text information included in the first document, and an image embedding vector representing image information of the first document into an encoder, and obtaining a first content embedding vector representing the context information of the first document; inputting the layout token, the text embedding vector, and the image embedding vector into an encoder, and obtaining a first layout embedding vector representing the format information of the first document; inputting the first content embedding vector into a decoder and predicting the context of a masked first document based on the first content embedding vector; and inputting the first layout embedding vector into a decoder and predicting the format of the first document based on the first layout embedding vector.
  2. The method of claim 1, wherein the text information comprises text data included in the first document and location data of the text data on the first document.
  3. The method of claim 1, wherein obtaining the first content embedding vector comprises: comparing a second content embedding vector, representing context information of a second document different from the first document, with the first content embedding vector; and adjusting the first content embedding vector based on a degree of similarity between the second content embedding vector and the first content embedding vector, using a result of the comparison.
  4. The method of claim 3, wherein adjusting the first content embedding vector comprises: masking some of the text embedding vectors; predicting the masked text embedding vectors using the first content embedding vector; and adjusting the first content embedding vector using a result of the prediction.
  5. The method of claim 1, wherein obtaining the first layout embedding vector comprises: comparing a second layout embedding vector, representing format information of a second document different from the first document, with the first layout embedding vector; and adjusting the first layout embedding vector based on a degree of similarity between the second layout embedding vector and the first layout embedding vector, using a result of the comparison.
  6. The method of claim 5, wherein adjusting the first layout embedding vector comprises: extracting an arbitrary first embedding vector from among the text embedding vector and the image embedding vector; predicting layout information of a token corresponding to the first embedding vector using the first layout embedding vector; and adjusting the first layout embedding vector using a result of the prediction.
  7. The method of claim 1, wherein predicting the context of the masked first document comprises: masking some of the data included in the first document to generate a first masked document; inputting the first content embedding vector into the decoder and obtaining a predicted document, which is a prediction result for the masked data included in the first masked document; comparing the predicted document with the first document and calculating a loss between the predicted document and the first document; and adjusting the first content embedding vector based on the loss.
  8. The method of claim 1, wherein predicting the format of the first document comprises: inputting the first layout embedding vector into the decoder and obtaining predicted layout information for each item of data included in the first document; comparing the predicted layout information for each item of data included in the first document with actual layout information for each item of data included in the first document, and calculating a loss between the predicted layout information and the actual layout information; and adjusting the first layout embedding vector based on the loss, wherein the layout information includes a category, location information, and size information of the data included in the first document.
  9. A system for training a document embedding generation model, comprising: a communication interface; a memory into which a computer program is loaded; and one or more processors that execute the computer program, wherein the computer program includes instructions for performing operations of: generating a content token containing context information of a first document and a layout token containing format information of the first document; inputting the content token, a text embedding vector representing text information included in the first document, and an image embedding vector representing image information of the first document into an encoder, and obtaining a first content embedding vector representing the context information of the first document; inputting the layout token, the text embedding vector, and the image embedding vector into an encoder, and obtaining a first layout embedding vector representing the format information of the first document; inputting the first content embedding vector into a decoder and predicting the context of a masked first document based on the first content embedding vector; and inputting the first layout embedding vector into a decoder and predicting the format of the first document based on the first layout embedding vector.
  10. A computer program stored on a computer-readable recording medium that, in combination with a computing device, executes the steps of: generating a content token containing context information of a first document and a layout token containing format information of the first document; inputting the content token, a text embedding vector representing text information included in the first document, and an image embedding vector representing image information of the first document into an encoder, and obtaining a first content embedding vector representing the context information of the first document; inputting the layout token, the text embedding vector, and the image embedding vector into an encoder, and obtaining a first layout embedding vector representing the format information of the first document; inputting the first content embedding vector into a decoder and predicting the context of a masked first document based on the first content embedding vector; and inputting the first layout embedding vector into a decoder and predicting the format of the first document based on the first layout embedding vector.
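Claims 3 to 8 describe two adjustment signals for the embeddings: a similarity-based comparison against a second document's embedding (a contrastive-style objective) and a masked-prediction loss computed through the decoder. A minimal sketch of how such signals could be combined follows; the dimensions, the hinge-at-zero contrastive term, the stand-in decoder, and all names are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # embedding dimension (hypothetical)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Content embeddings of two different documents (claims 3 and 5 compare
# the first document's embedding against a second document's).
emb_doc1 = rng.standard_normal(D)
emb_doc2 = rng.standard_normal(D)

# Similarity-based signal: penalize high similarity between embeddings
# of different documents, pushing them apart.
contrastive_loss = max(0.0, cosine(emb_doc1, emb_doc2))

# Masked-prediction signal (claims 4 and 7): mask some text embedding
# vectors and score how well a stand-in decoder reconstructs them from
# the first document's content embedding.
text_emb = rng.standard_normal((8, D))
mask = np.zeros(8, dtype=bool)
mask[[2, 5]] = True                  # masked positions chosen for illustration
W_dec = rng.standard_normal((D, D))  # stand-in decoder
predicted = emb_doc1 @ W_dec         # one guess, broadcast over masked slots
reconstruction_loss = float(((text_emb[mask] - predicted) ** 2).mean())

# Both signals would jointly drive the adjustment of the embedding.
total_loss = contrastive_loss + reconstruction_loss
```

Claims 6 and 8 apply the analogous construction on the layout branch, with the loss taken against the category, location, and size of each data item rather than against the text embeddings.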

Description

The present disclosure relates to a method and system for training a model for generating document embeddings. More specifically, it relates to a method and system for training a model that generates embedding vectors based on both the content and the layout of a document.

A multimodal large language model (MLLM) is a large-scale language model that processes multiple forms of data simultaneously, such as text, images, audio, and video. Unlike conventional language models that handle only text, multimodal LLMs can understand and process multiple data modalities in an integrated manner.

Traditionally, embeddings for measuring similarity between documents have relied primarily on the document text: keywords were extracted from the full text or from summarized portions of a document and represented as keyword-specific weight vectors; these weight vectors were treated as document embeddings, and document similarity was measured by the similarity between two such embeddings.

FIG. 1 is a system configuration diagram for explaining the configuration and operation of a system for training a document embedding generation model according to some embodiments of the present disclosure.
FIG. 2 is a flowchart for explaining a method for training a document embedding generation model according to some embodiments of the present disclosure.
FIG. 3 is a detailed flowchart for explaining some operations of the method described with reference to FIG. 2.
FIG. 4 is an illustrative diagram for explaining a method for generating content embedding vectors according to some embodiments of the present disclosure.
FIG. 5 is a detailed flowchart for explaining some operations of the method described with reference to FIG. 3.
FIG. 6 is a detailed flowchart for explaining some operations of the method described with reference to FIG. 2.
FIG. 7 is an illustrative diagram for explaining a method for generating layout embedding vectors according to some embodiments of the present disclosure.
FIG. 8 is a detailed flowchart for explaining some operations of the method described with reference to FIG. 6.
FIG. 9 and FIG. 10 are detailed flowcharts for explaining some operations of the method described with reference to FIG. 2.
FIG. 11 is an illustrative diagram for explaining layout information referenced by some embodiments of the present disclosure.
FIG. 12 is an illustrative diagram for explaining an example in which an embedding generation model trained by the method of the present disclosure is applied.
FIG. 13 is a hardware configuration diagram of a computing device described in some embodiments of the present disclosure.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the attached drawings. The advantages and features of the present disclosure, and the methods for achieving them, will become clear by referring to the embodiments described below in detail together with the attached drawings. However, the technical concept of the present disclosure is not limited to the following embodiments and can be implemented in various other forms. The following embodiments are provided merely to complete the technical concept of the present disclosure and to fully inform those skilled in the art of its scope; the technical concept of the present disclosure is defined only by the scope of the claims.

In describing the present disclosure, if it is determined that a detailed description of related known configurations or functions could obscure the essence of the invention, such detailed description is omitted. Unless otherwise defined, terms used in the following embodiments (including technical and scientific terms) have the meaning commonly understood by those skilled in the art to which this disclosure pertains, although this may vary depending on the intent of those skilled in the art, case law, the emergence of new technology, and so on. The terms used in this disclosure are for describing the embodiments and are not intended to limit its scope. In the following embodiments, singular expressions include plural concepts unless the context clearly specifies them as singular, and plural expressions include singular concepts unless the context clearly specifies them as plural. In addition, terms such as first, second, A, B, (a), (b), etc.