JP-7855396-B2 - Information processing device, information processing method, and program


Inventors

  • 肥後 智昭

Assignees

  • キヤノン株式会社 (Canon Inc.)

Dates

Publication Date
2026-05-08
Application Date
2022-05-09

Claims (17)

  1. An information processing device that extracts named entities from a document using a natural language processing model, the device comprising: an acquisition means for acquiring text data from a document image obtained by reading the document; a conversion means for processing the text data in token units and converting the text data into a token sequence; a calculation means for calculating the number of processing steps required for the natural language processing model to process the token sequence; a division means for dividing the token sequence into blocks that can be processed by the natural language processing model; and a processing means for inputting each of the blocks into the natural language processing model and performing processing to estimate named entities, characterized in that the division means divides the token sequence into blocks, based on the calculated number of processing steps, such that adjacent blocks overlap at least partially, and the processing means selects, for each token belonging to an overlapping portion between adjacent blocks, one of the estimation results obtained from the respective blocks.
  2. The information processing device according to claim 1, characterized in that the calculation means calculates the minimum number of processing steps required to process all tokens included in the token sequence.
  3. The information processing device according to claim 2, characterized in that the division means divides the token sequence into a number of blocks equal to the calculated minimum number of processing steps.
  4. The information processing device according to claim 1, further comprising a condition acquisition means for acquiring a delimiter condition for dividing the token sequence into predetermined units, characterized in that the division means divides the token sequence into blocks based on the delimiter condition.
  5. The information processing device according to claim 4, characterized in that the condition acquisition means acquires a plurality of delimiter conditions of different granularities for delimiting the token sequence, and the division means divides the token sequence into blocks by sequentially changing which of the plurality of delimiter conditions is applied, so as not to increase the calculated number of processing steps.
  6. The information processing device according to claim 5, characterized in that the division means determines, for the token sequence, a number of temporary blocks equal to the number of processing steps according to the upper limit on the number of tokens in a block, and generates the blocks by removing tokens at the ends of the temporary blocks according to the delimiter condition.
  7. The information processing device according to claim 6, characterized in that the division means applies the delimiter conditions in order from the coarsest to the finest, and repeats the process until no non-overlapping area remains between any adjacent blocks after the tokens at the ends of the temporary blocks have been removed.
  8. The information processing device according to claim 7, characterized in that the division means applies the delimiter conditions in order from the coarsest to the finest, and, even if a non-overlapping area remains between some adjacent blocks after the tokens at the ends of the temporary blocks have been removed, terminates the process at that point if the two blocks corresponding to the non-overlapping area overlapped each other at the temporary-block stage and the token following the last token of the preceding block after end-token removal is the first token of the succeeding block.
  9. The information processing device according to claim 5, characterized in that the delimiter conditions include two or more of delimiting by paragraph, delimiting by line break, delimiting by period, delimiting by punctuation mark, and delimiting by token, and the token sequence can be delimited, in order from coarsest to finest, by paragraph, by line break, by period, by punctuation mark, and by token.
  10. The information processing device according to claim 1, characterized in that the processing means divides the overlapping portion into a first half and a second half, adopts the estimation result of the block nearer the beginning as the named entity for tokens belonging to the first half, and adopts the estimation result of the block nearer the end as the named entity for tokens belonging to the second half.
  11. The information processing device according to claim 10, characterized in that the processing means divides the overlapping portion into the first half and the second half according to a predetermined condition for determining the boundary between the first half and the second half.
  12. The information processing device according to claim 1, characterized in that, when recall is prioritized in named entity estimation, if one of the estimation results for a token included in the overlapping portion indicates that the token is not a named entity and another indicates that it is, the estimation result indicating that the token is a named entity is selected.
  13. The information processing device according to claim 1, characterized in that, when precision is prioritized in named entity estimation, if the estimation results for a token included in the overlapping portion differ from each other, the token is determined not to be a named entity.
  14. The information processing device according to claim 1, further comprising a deletion means for deleting unnecessary tokens from the converted token sequence, characterized in that the calculation means calculates the number of processing steps for the token sequence from which the unnecessary tokens have been deleted.
  15. The information processing device according to claim 1, characterized in that the estimation results include a score representing the likelihood of a named entity tag corresponding to a token, and the processing means selects one of the estimation results based on the score for tokens belonging to the overlapping portion between blocks.
  16. A method for controlling an information processing device that extracts named entities from a document using a natural language processing model, the method comprising: an acquisition step of acquiring text data from a document image obtained by reading the document; a conversion step of processing the text data in token units and converting the text data into a token sequence; a calculation step of calculating the number of processing steps required for the natural language processing model to process the token sequence; a division step of dividing the token sequence into blocks that can be processed by the natural language processing model; and a processing step of inputting each of the blocks into the natural language processing model and performing processing to estimate named entities, characterized in that, in the division step, the token sequence is divided into blocks, based on the calculated number of processing steps, such that adjacent blocks overlap at least partially, and, in the processing step, for each token belonging to an overlapping portion between adjacent blocks, one of the estimation results obtained from the respective blocks is selected.
  17. A program for causing a computer to execute the control method according to claim 16.
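As an illustration only (not part of the claims), the core of claims 1 to 3 and 10 can be sketched in Python: a token sequence is split into the minimum number of blocks of at most a given length, block starts are spaced so that adjacent blocks overlap, and a token in an overlap takes the earlier block's estimate in the first half and the later block's estimate in the second half. The block size `max_len`, the even spacing of block starts, and the midpoint boundary are assumptions of this sketch, not details fixed by the claims.

```python
import math

def split_with_overlap(tokens, max_len):
    """Divide a token sequence into the minimum number of blocks of at most
    max_len tokens (the minimum number of processing steps), with block
    starts spaced evenly so that adjacent blocks overlap whenever the
    sequence length is not an exact multiple of max_len."""
    n = len(tokens)
    n_blocks = math.ceil(n / max_len)
    if n_blocks <= 1:
        return [(0, tokens)]
    stride = (n - max_len) / (n_blocks - 1)
    starts = [round(i * stride) for i in range(n_blocks)]
    return [(s, tokens[s:s + max_len]) for s in starts]

def merge_block_tags(tagged_blocks, n):
    """tagged_blocks: list of (start, tags) in reading order.  For a token
    in the overlap of two adjacent blocks, keep the earlier block's tag in
    the first half of the overlap and the later block's tag in the second
    half (the first-half/second-half rule of claim 10)."""
    merged = [None] * n
    prev_end = None
    for start, tags in tagged_blocks:
        end = start + len(tags)
        # Overwrite from the midpoint of the overlap with the previous block.
        lo = start if prev_end is None else (start + prev_end) // 2
        for pos in range(lo, end):
            merged[pos] = tags[pos - start]
        prev_end = end
    return merged
```

For example, a sequence of 10 tokens with `max_len = 4` yields three blocks starting at positions 0, 3, and 6; positions 3 and 6 fall in overlaps and are resolved by the midpoint rule.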

Description

The present invention relates to techniques for extracting named entities from documents. Named Entity Recognition (NER) is a well-known technique for extracting strings (named entities) corresponding to predefined items from a document. With NER, for example, by predefining the items "Company Name" and "Expiration Date," the strings "ABC Company" and "2022/03/07," corresponding to "Company Name" and "Expiration Date" respectively, can be extracted from the document's text. In recent years, natural language processing models such as Seq2Seq and Transformer models, which have become mainstream in natural language processing, obtain processing results by taking as input a token sequence, that is, a sequence of units obtained by dividing the text of a document into tokens. When such a model is used for named entity recognition, named entities can be estimated efficiently from the input token sequence. However, there is an upper limit on the number of tokens that can be input to the model at once, so a long text has to be divided into multiple token sequences for input. In this regard, Patent Document 1 (Japanese Patent Application Publication No. 2021-64143) discloses a method for dividing a document into sections such as chapters, sections, and paragraphs, and extracting named entities for each section.

The drawings are as follows:

  • Fig. 1: A diagram showing the hardware configuration of the information processing device.
  • Fig. 2: A diagram showing an example of the functional configuration of the information processing device.
  • Fig. 3: A diagram illustrating a specific example of extracting named entities from document images.
  • Fig. 4: An example of a table showing the result of assigning GT (ground truth) to the strings corresponding to tokens.
  • Fig. 5: A flowchart illustrating the process of dividing a token sequence to generate input blocks.
  • Fig. 6: A diagram illustrating how temporary blocks are determined by overlapping token sequences of length T.
  • Fig. 7: A diagram showing an example of the result of applying a delimiter condition to a temporary block.
  • Fig. 8: A flowchart illustrating the process of extracting named entities from a set of input blocks.
  • Fig. 9: A diagram illustrating Modification 3.
  • Fig. 10: A flowchart showing the processing procedure for determining the estimation result according to Modification 4.
  • Fig. 11: A diagram illustrating the effect of Modification 4.

The embodiments for carrying out the present invention will be described below with reference to the drawings. Note that the following embodiments do not limit the invention as defined in the claims, and not all combinations of features described in the embodiments are necessarily essential to the solution of the invention.

First, the hardware configuration of the information processing device shown in each embodiment will be described with reference to Figure 1. Figure 1 is a hardware configuration diagram of the information processing device 100. In Figure 1, the CPU 101 controls the various devices connected to the system bus 109. The ROM 102 stores the BIOS (Basic Input/Output System) program and the boot program. The RAM 103 is used as the main memory of the CPU 101. The external memory 104 stores the programs processed by the information processing device. The input unit 105 consists of various devices used for inputting information, such as a touch panel, keyboard, mouse, and robot controller.
The display device 106 consists of a liquid crystal monitor, projector, LED indicator, or the like, and displays a user interface screen (UI screen) and calculation results according to instructions from the CPU 101. The communication interface 107 is an interface that exchanges information with external devices over a network such as a LAN or the Internet, according to communication standards such as Ethernet (registered trademark), USB, and Wi-Fi. The I/O 108 is an input/output unit that, for example, connects to a scanner (not shown) to receive scanned document images (hereinafter referred to as "document images").

[Embodiment 1]

In this embodiment, if the number of tokens in the token sequence obtained from the input text exceeds the upper limit that can be input to the natural language processing model, the number of processing steps required to process all of the tokens is determined, and the input text is divided so that the resulting token sequences overlap within a range that does not increase the number of processing steps. In this way, the accuracy of named entity recognition is improved while suppressing any increase in processing time. Examples of pre-trained natural language processing models include BERT (Bidirectional Encoder Representations from Transformers) and XLNet. Alternatively, instead of using a publicly available model as described above, a model pre-trained from scratch may be used. Furthermore, the model does not necessarily have to have a Transformer-based structure; any pre-trained, high-accuracy natural language processing model will suffice. For example, a model with a uniquely designed structure or a model with a structure designed automatically by AutoML or the like is acceptable. Hereafter, the explanati
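Because overlapping blocks produce two estimates for each token in an overlap, a selection policy is needed; claims 12 and 13 describe recall-first and precision-first policies. A minimal sketch, assuming BIO-style tags in which "O" marks a token outside any named entity (the tag scheme is an assumption of the sketch, not stated in the patent text):

```python
OUTSIDE = "O"  # assumed tag meaning "not a named entity" (BIO-style tagging)

def select_tag(tag_a, tag_b, mode="recall"):
    """Resolve two conflicting estimates for a token that lies in the
    overlap of adjacent blocks.

    mode="recall":    if either block tags the token as part of a named
                      entity, keep that tag (recall-first selection).
    mode="precision": keep a tag only when both blocks agree; otherwise
                      treat the token as not a named entity (precision-first).
    """
    if mode == "recall":
        return tag_a if tag_a != OUTSIDE else tag_b
    return tag_a if tag_a == tag_b else OUTSIDE
```

For a disagreement such as one block tagging a token "B-COMPANY" and the other "O", the recall-first policy keeps "B-COMPANY" while the precision-first policy returns "O".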