KR-102962034-B1 - METHOD FOR TRAINING TEXT EMBEDDING MODEL, METHOD FOR TEXT EMBEDDING AND SYSTEM THEREOF

KR 102962034 B1

Abstract

The present disclosure relates to an information processing system comprising a memory and a processor connected to the memory and configured to execute at least one computer-readable program stored in the memory. The at least one program may include instructions for inputting a document into a pre-trained text embedding model and generating, through the text embedding model, at least one first token associated with a word included in the document and a second token associated with the document.

Inventors

  • 임예원
  • 박우명

Assignees

  • Sionic AI Inc. (사이오닉에이아이 주식회사)

Dates

Publication Date
2026-05-07
Application Date
2025-06-11

Claims (18)

  1. A method of training a text embedding model, the method being performed by at least one processor and comprising: receiving a training document; generating at least one first token associated with a word included in the received training document, and a second token to be added to an input sequence of the received training document to summarize the overall meaning of the training document; updating vector representations of the at least one first token and the second token according to context, from a first encoder to a first specific encoder; reconstructing output tokens of the first specific encoder into a plurality of chunks; and updating vector representations of tokens included in the plurality of chunks according to context, from a second specific encoder located after the first specific encoder to a last encoder, wherein each of the plurality of chunks input to the second specific encoder includes first chunk tokens reconstructed, based on a batch size, from the at least one first token output from the first specific encoder, and a second chunk token corresponding to the second token.
  2. The method of claim 1, further comprising outputting a chunk embedding contained in the second chunk token of each of the plurality of chunks output from the last encoder.
  3. The method of claim 1, further comprising adding padding tokens when a sum of the first chunk tokens and the second chunk token included in at least one of the plurality of chunks is smaller than the batch size.
  4. The method of claim 1, wherein each of the plurality of chunks overlappingly includes at least a portion of the first chunk tokens included in an adjacent chunk.
  5. The method of claim 1, further comprising: counting a number of words included in at least one of a table, a graph, a chart, a formula, code, an item having a hierarchical structure, a text paragraph, or a map included in the received training document; and determining the batch size based on the counted number of words.
  6. The method of claim 1, wherein the first specific encoder or the second specific encoder is an encoder placed 22nd.
  7. A text embedding method performed by at least one processor, the method comprising: inputting a document into a pre-trained text embedding model; generating, through the text embedding model, at least one first token associated with a word included in the document, and a second token to be added to an input sequence of the document to summarize the overall meaning of the document; updating vector representations of the at least one first token and the second token according to context, from a first encoder of the text embedding model to a first specific encoder; reconstructing, through the text embedding model, output tokens of the first specific encoder into a plurality of chunks; and updating, through the text embedding model, vector representations of tokens included in the plurality of chunks according to context, from a second specific encoder located after the first specific encoder to a last encoder, wherein each of the plurality of chunks input to the second specific encoder includes first chunk tokens reconstructed, based on a batch size, from the at least one first token output from the first specific encoder, and a second chunk token corresponding to the second token.
  8. The method of claim 7, further comprising outputting a chunk embedding contained in the second chunk token of each of the plurality of chunks output from the last encoder.
  9. The method of claim 7, further comprising adding padding tokens when a sum of the first chunk tokens and the second chunk token included in at least one of the plurality of chunks is smaller than the batch size.
  10. The method of claim 7, wherein each of the plurality of chunks overlappingly includes at least a portion of the first chunk tokens included in an adjacent chunk.
  11. The method of claim 7, further comprising: counting a number of words included in at least one of a table, a graph, a chart, a formula, code, an item having a hierarchical structure, a text paragraph, or a map included in the document; and determining the batch size based on the counted number of words.
  12. The method of claim 7, wherein the first specific encoder or the second specific encoder is an encoder placed 22nd.
  13. An information processing system comprising: a memory; and a processor connected to the memory and configured to execute at least one computer-readable program contained in the memory, wherein the at least one program comprises instructions for: inputting a document into a pre-trained text embedding model; generating, through the text embedding model, at least one first token associated with a word included in the document and a second token associated with the document; updating vector representations of the at least one first token and the second token according to context, from a first encoder of the text embedding model to a first specific encoder; reconstructing, through the text embedding model, output tokens of the first specific encoder into a plurality of chunks; and updating, through the text embedding model, vector representations of tokens included in the plurality of chunks according to context, from a second specific encoder located after the first specific encoder to a last encoder, and wherein each of the plurality of chunks input to the second specific encoder includes first chunk tokens reconstructed, based on a batch size, from the at least one first token output from the first specific encoder, and a second chunk token corresponding to the second token.
  14. The information processing system of claim 13, wherein the at least one program further comprises instructions for outputting a chunk embedding contained in the second chunk token of each of the plurality of chunks output from the last encoder.
  15. The information processing system of claim 13, wherein the at least one program further comprises instructions for adding padding tokens when a sum of the first chunk tokens and the second chunk token included in at least one of the plurality of chunks is smaller than the batch size.
  16. The information processing system of claim 13, wherein each of the plurality of chunks overlappingly includes at least a portion of the first chunk tokens included in an adjacent chunk.
  17. The information processing system of claim 13, wherein the at least one program further comprises instructions for counting a number of words included in at least one of a table, a graph, a chart, a formula, code, an item having a hierarchical structure, a text paragraph, or a map included in the document, and determining the batch size based on the counted number of words.
  18. The information processing system of claim 13, wherein the first specific encoder or the second specific encoder is an encoder placed 22nd.
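
To make the pipeline recited in claims 1 to 6 concrete, the following is a minimal, hypothetical PyTorch sketch: a full-length token sequence (summary token first) passes through the early encoders, the hidden states are regrouped into overlapping, padded chunks of a fixed "batch size" that each receive a chunk token copied from the summary token, and the remaining encoders process chunks independently, with the chunk embedding read from the chunk-token position. All layer counts, dimensions, and identifiers (ChunkedEmbeddingModel, split_layer, chunk_size, overlap) are illustrative assumptions; the claims do not specify an implementation.

    import torch
    import torch.nn as nn

    class ChunkedEmbeddingModel(nn.Module):
        """Hypothetical sketch of the split encoder stack in claims 1-6.

        Layer counts, dimensions, and chunking details are illustrative
        assumptions; the claims fix only the general order of operations.
        """

        def __init__(self, d_model=768, n_layers=24, split_layer=22,
                     chunk_size=128, overlap=16):
            super().__init__()
            def layer():
                return nn.TransformerEncoderLayer(d_model, nhead=12,
                                                  batch_first=True)
            self.early = nn.ModuleList([layer() for _ in range(split_layer)])
            self.late = nn.ModuleList([layer()
                                       for _ in range(n_layers - split_layer)])
            self.chunk_size = chunk_size  # the claimed "batch size"
            self.overlap = overlap        # claim 4: adjacent chunks overlap
            self.pad = nn.Parameter(torch.zeros(d_model))  # claim 3: padding

        def forward(self, x):
            # x: (1, 1 + seq_len, d_model); position 0 holds the second token
            # summarizing the whole document, the rest are the first tokens.
            for enc in self.early:            # encoders 1 .. first specific
                x = enc(x)
            summary, words = x[:, :1], x[:, 1:]

            # Reconstruct first tokens into overlapping chunks (claims 1, 4),
            # prepending a second chunk token that copies the summary token.
            body = self.chunk_size - 1
            chunks = []
            for start in range(0, words.size(1), body - self.overlap):
                c = words[:, start:start + body]
                if c.size(1) < body:          # claim 3: pad short chunks
                    fill = self.pad.expand(1, body - c.size(1), -1)
                    c = torch.cat([c, fill], dim=1)
                chunks.append(torch.cat([summary, c], dim=1))
            batch = torch.cat(chunks, dim=0)  # (n_chunks, chunk_size, d_model)

            for enc in self.late:             # second specific .. last encoder
                batch = enc(batch)
            return batch[:, 0]                # claim 2: one embedding per chunk

Under these assumed defaults, calling ChunkedEmbeddingModel()(torch.randn(1, 1025, 768)) would return 10 chunk embeddings of dimension 768, since 1024 first tokens regroup into chunks of 127 tokens with a stride of 111 plus one chunk token each.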

Description

Method for training text embedding model, method for text embedding and system thereof

The present invention relates to a method for training a text embedding model, a text embedding method using the trained model, and a system therefor.

With the recent rapid increase in the volume of text data generated from the Internet, social media, and electronic documents, the importance of natural language processing (NLP) technology is steadily growing. In NLP, it is essential to quantify text data so that computers can process it, and one of the core technologies for this is text embedding. Transformer-based text embedding models can generate sentence- or document-level embeddings that reflect context, and they demonstrate excellent performance in various high-level natural language processing tasks such as semantic search, question answering, and text classification; however, when the input text is long, they require large-scale computational resources and vast training data. Accordingly, chunking techniques have been proposed that divide long text input to a transformer-based text embedding model into multiple semantic units (chunks). While chunking reduces computational load, it causes context loss; context is frequently lost, in particular, for text exceeding the chunk size and for structured data such as tables, so improvements are needed. The information disclosed in this background section is intended only to enhance understanding of the background of the present invention and may therefore include information that does not constitute prior art.

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, wherein similar reference numerals indicate similar elements, but are not limited thereto.

FIG. 1 is a drawing showing an example of processing of an electronic device according to one embodiment of the present disclosure.
FIG. 2 is a schematic diagram showing a configuration in which an information processing system is connected to communicate with a plurality of user terminals to provide a text embedding service according to one embodiment of the present disclosure.
FIG. 3 is a block diagram showing the internal configuration of a user terminal and an information processing system according to one embodiment of the present disclosure.
FIG. 4 is a diagram illustrating an example of a text embedding model training method according to one embodiment of the present disclosure.
FIG. 5 is a diagram illustrating an example of a transformer included in a text embedding model according to one embodiment of the present disclosure.
FIG. 6 is a diagram illustrating an example of generating chunk embeddings of a text embedding model according to one embodiment of the present disclosure.
FIG. 7 is a diagram illustrating an example of the application of a pre-trained text embedding model according to one embodiment of the present disclosure.
FIG. 8 is a diagram illustrating an example of an evaluation metric of a pre-trained text embedding model using an evaluation dataset according to one embodiment of the present disclosure.
FIG. 9 is a diagram illustrating an example of an evaluation metric for a text embedding model pre-trained based on a specific benchmark according to one embodiment of the present disclosure.
FIG. 10 is a sequence diagram illustrating an example of a method for training a text embedding model according to one embodiment of the present disclosure.
FIG. 11 is a sequence diagram illustrating an example of a text embedding method according to one embodiment of the present disclosure.

Hereinafter, specific details for implementing the present disclosure will be described with reference to the attached drawings. In the following description, specific descriptions of widely known functions or configurations are omitted where they could unnecessarily obscure the gist of the present disclosure. In the attached drawings, identical or corresponding components are assigned the same reference numerals, and their descriptions may be omitted in the following embodiments; however, even if a description of a component is omitted, it is not intended that such a component is excluded from any embodiment. The advantages and features of the disclosed embodiments, and the methods for achieving them, will become clear by referring to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below but may be implemented in various different
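
Claims 5, 11, and 17 recite deriving the chunk ("batch") size from a word count over structured elements of the document, but the publication discloses no concrete formula. The heuristic below, including the function name determine_batch_size, the power-of-two rounding, and the clamping thresholds, is therefore a purely hypothetical illustration of how such a determination could work:

    import re

    def determine_batch_size(elements, min_size=64, max_size=512):
        """Hypothetical heuristic for claims 5, 11, and 17: derive the chunk
        ("batch") size from a word count over structured elements such as
        tables, charts, formulas, code, hierarchical items, paragraphs, or
        maps. The rounding rule and thresholds are invented for illustration.
        """
        n_words = sum(len(re.findall(r"\S+", text)) for text in elements)
        avg = max(1, n_words // max(1, len(elements)))
        size = 1
        while size < avg:          # round the average element length up to a
            size *= 2              # power of two so an element fits one chunk
        return max(min_size, min(max_size, size))

    # Example: a paragraph, a small table, and a code snippet.
    elements = ["alpha beta gamma " * 40,
                "col1 col2\n1 2\n3 4",
                "def f():\n    return 0"]
    print(determine_batch_size(elements))  # -> 64

Here 130 words across three elements give an average element length of 43 words, which rounds up to a chunk size of 64; any real implementation of the claims could of course use a different mapping.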