JP-7854631-B2 - A system for estimating important words, a method for constructing the system, and a method for classifying words.

Inventors

赤部晃一
竹内俊貴
青木貴司
西村邦裕

Assignees

株式会社テンクー

Dates

Publication Date: 2026-05-07
Application Date: 2021-04-30

Claims (7)

A system for estimating and outputting important words in a specific field from a received document in that specific field, the system comprising: a database including a positive example document corpus consisting of positive example documents defined as cases related to the specific field, and a negative example document corpus consisting of negative example documents defined as cases not related to the specific field; a word classification unit that classifies each of a plurality of words in the received document as at least one of a selected word, a non-selected word, and an undetermined word, according to a score calculated based on the number of times the word appears in the positive example documents and in the negative example documents, respectively; a word selection model that has been machine-learned, using learning labels pre-assigned to words corresponding to at least one of the target words and the non-target words, to take as input a plurality of words extracted from the positive example documents in the positive example document corpus and the negative example documents in the negative example document corpus, and to output as an estimation result a label indicating either a target word or a non-target word according to the context related to the word; and an output unit that outputs output data, wherein the word classification unit inputs at least the words classified as undetermined words into the word selection model, the word selection model, when it estimates a label indicating a target word for a word classified as an undetermined word and input from the word classification unit, classifies that word as a target word, and the output unit outputs the words classified as target words as output data indicating important words in the specific field.
The system described in claim 1, wherein the word classification unit classifies the input word as at least one of the selected word, the non-selected word, and the undetermined word based on the number of times the word appears in the positive example document corpus and the number of times the word appears in the negative example document corpus.
The system according to claim 1, wherein the word classification unit classifies each of the plurality of words in the received document as at least one of a selected word, a non-selected word, and an undetermined word, according to a score calculated based on the number of positive example documents in the positive example document corpus in which the word appears and the number of negative example documents in the negative example document corpus in which the word appears.
The system described in claim 3, wherein the word classification unit classifies the input word as at least one of the selected word, the non-selected word, and the undetermined word based on the number of positive example documents in the positive example document corpus in which the word appears and the number of negative example documents in the negative example document corpus in which the word appears.
The system according to any one of claims 1 to 4, further comprising means for outputting, in a manner distinguished from the other words in the received document, a word that was classified as an undetermined word, input to the machine-learned word selection model, and estimated to have a label indicating a target word.
A method performed in a system for estimating and outputting important words in a document of a specific field, the method comprising: constructing a database that includes a positive example document corpus consisting of positive example documents defined as cases related to the specific field, and a negative example document corpus consisting of negative example documents defined as cases not related to the specific field; constructing a word selection model by machine learning, using learning labels pre-assigned to words corresponding to at least one of the target words and the non-target words, so that the model takes as input a plurality of words extracted from the positive example documents in the positive example document corpus and the negative example documents in the negative example document corpus of the constructed database, and outputs as an estimation result a label indicating either a target word or a non-target word according to the context related to the word; classifying each of a plurality of words in a received document as at least one of a selected word, a non-selected word, and an undetermined word, according to a score calculated based on the number of times the word appears in the positive example documents and in the negative example documents, respectively; inputting at least the words classified as undetermined words into the word selection model; classifying a word as a target word when the word selection model estimates a label indicating a target word for that word classified as an undetermined word; and outputting the words classified as target words as important words in the specific field.
The method described in claim 6, further comprising constructing, by machine learning, a document classification model that calculates the accuracy of document classification according to the frequency of word occurrences in the specific field, using as training data a first set of documents in a specific field document corpus related to the specific field and a second set of documents in a general field document corpus covering a broader field than the specific field, wherein constructing the database includes: accepting documents to be classified; calculating the accuracy for the accepted documents using the machine-learned document classification model; and classifying the documents to be classified as either positive example documents or negative example documents according to the calculated accuracy.
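
The two-stage flow recited in the claims (count-based scoring, three-way classification, then a learned model for the undetermined words) can be illustrated with a minimal Python sketch. The claims do not specify the score formula, the thresholds, or the model architecture, so everything concrete here is an assumption made for illustration: the smoothed log-ratio in `score`, the `upper` and `lower` thresholds, the function names, the toy corpora, and `toy_model`, which merely stands in for the trained word selection model. The sketch also assumes that words classified as "selected" by the score are output together with the words the model labels as "target"; the claim text leaves that relationship open.

```python
import math
from collections import Counter

def count_terms(corpus):
    """Total occurrences of each word over tokenized documents (claims 1 and 2)."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    return counts

def count_documents(corpus):
    """Number of documents in which each word appears (claims 3 and 4)."""
    counts = Counter()
    for doc in corpus:
        counts.update(set(doc))
    return counts

def score(word, pos_counts, neg_counts):
    # ASSUMPTION: add-one smoothed log-ratio; the claims only require a score
    # based on counts in the positive and negative example documents.
    return math.log((pos_counts[word] + 1) / (neg_counts[word] + 1))

def classify(word, pos_counts, neg_counts, upper=1.0, lower=-1.0):
    # ASSUMPTION: the thresholds are hypothetical values chosen for the toy data.
    s = score(word, pos_counts, neg_counts)
    if s >= upper:
        return "selected"
    if s <= lower:
        return "non-selected"
    return "undetermined"

def important_words(document, pos_counts, neg_counts, model):
    """Return the words of `document` that are output as important words."""
    result = []
    for i, word in enumerate(document):
        label = classify(word, pos_counts, neg_counts)
        if label == "selected":
            # ASSUMPTION: score-selected words are treated as important words.
            result.append(word)
        elif label == "undetermined" and model(document, i) == "target":
            # Undetermined words are decided by the context-aware model.
            result.append(word)
    return result

def toy_model(document, index):
    # Stand-in for the trained word selection model: the only "context" it
    # looks at is the preceding word. A real model would be learned.
    prev = document[index - 1] if index > 0 else ""
    return "target" if prev == "egfr" else "non-target"

# Toy corpora (invented for illustration only).
pos_corpus = [["egfr", "mutation", "patient"], ["egfr", "inhibitor", "patient"]]
neg_corpus = [["weather", "patient", "report"], ["weather", "forecast"]]
pos_tf, neg_tf = count_terms(pos_corpus), count_terms(neg_corpus)

received = ["egfr", "mutation", "weather", "patient"]
print(important_words(received, pos_tf, neg_tf, toy_model))  # ['egfr', 'mutation']
```

Passing the results of `count_documents` instead of `count_terms` gives the document-count variant of claims 3 and 4 without changing the rest of the flow.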

Description

This invention relates to document display support technology, and more particularly to a document display support system and a document display support method for visually distinguishing and displaying specific words in a document in a particular field, as well as a program for executing the method.

Currently, a vast number of documents are stored in databases worldwide. Typically, users need to read and understand a document to determine whether it is relevant to a particular field and whether it is important.

For example, in the field of cancer genomics, medical professionals such as doctors and researchers annotate information obtained through data analysis, such as gene mutations, drugs, and clinical trial information, when creating reports. Of this information, information related to gene mutations and clinical significance is almost always published in the form of academic papers, requiring medical professionals to read and correctly understand the complex text of these papers. However, the number of papers that need to be referenced is enormous, the content of each paper is complex, and the sheer volume of information is overwhelming. Therefore, it is difficult to quickly determine whether a particular paper is useful in cancer genomic medicine, and which parts of the paper contain important information.

For this reason, several technologies have been proposed to improve the display of documents that users are trying to reference. For example, Patent Document 1 discloses a technology for checking only the desired sentences from long patent information data. Specifically, Patent Document 1 discloses an information processing device in which the processing means comprises a content discrimination character search unit, a highlighting display processing unit, a sentence segmentation discrimination unit, a display determination unit, and a hide setting unit.
The display means displays, for each case data, the sentences separated by paragraph marks that the display determination unit has determined to contain the highlighted characters, and also displays the highlighted characters that have been colored by the highlighting display processing unit.

Furthermore, Patent Document 2 (described below) discloses a technology for automatically searching for relevant case reports from past medical reports, including the sections for examination purpose, findings, and diagnosis. This technology uses keywords (words or sequences of words weighted according to the frequency and importance of the cases) to automatically search for relevant medical reports from past medical reports.

Japanese Patent Publication No. 2016-207071
Japanese Patent Publication No. 2012-141797

This is a block diagram showing an example of a schematic configuration of a document display support system according to one embodiment of the present invention.
This is a block diagram illustrating the functional configuration of a document display support system according to one embodiment of the present invention.
This figure illustrates the machine learning of a word selection model in a document display support system according to one embodiment of the present invention.
This figure illustrates estimation using a trained word selection model in a document display support system according to one embodiment of the present invention.
This figure shows an example of a document displayed on a browser screen in a document display support system according to one embodiment of the present invention.
This is a block diagram illustrating the functional configuration of a document display support system according to one embodiment of the present invention.
This diagram illustrates an example of the configuration of a document database in a document display support system according to one embodiment of the present invention.
This is a block diagram showing an example of the schematic configuration of a document display support system according to one embodiment of the present invention.
This is a flowchart illustrating the process of building a document database in a document display support system according to one embodiment of the present invention.
This is a flowchart illustrating the learning process of a word selection model in a document display support system according to one embodiment of the present invention.
This is a flowchart illustrating the learning process of a word selection model in a document display support system according to one embodiment of the present invention.
This figure shows an example of the hardware configuration of a document display support system according to one embodiment of the present invention.

The embodiments of the present invention will be described below with reference to the drawings. However, the embodiments described below are merely illustrative, and there is no intention to exclude various modifications or applications of techniques not explicitly stated below. The present invention can be implemented by various modifications (e.g., co