US-12619680-B2 - Learning data generation device, method, and record medium for storing program
Abstract
A learning data generation device includes processing circuitry to extract a cause expression and a result expression from an input text, and to generate a modified text by at least one of a method of interchanging the cause expression and the result expression and a method of specifying one of the cause expression and the result expression as a modification target sentence and replacing the modification target sentence with a replacement candidate sentence dissimilar to the modification target sentence.
Inventors
- Ryunosuke OKA
- Hiroyasu ITSUI
- Hayato UCHIDE
Assignees
- MITSUBISHI ELECTRIC CORPORATION
Dates
- Publication Date: 2026-05-05
- Application Date: 2022-10-17
Claims (12)
- 1. A learning data generation device comprising: processing circuitry configured to, for each of a plurality of input text: extract a cause expression and a result expression from an input text among the plurality of input text; generate a modified text by at least one of a method of interchanging the cause expression and the result expression and a method of specifying one of the cause expression and the result expression as a modification target sentence and replacing the modification target sentence with a replacement candidate sentence dissimilar to the modification target sentence; compare the modified text to an example validity evaluation database, the example validity evaluation database including examples of valid expressions, each valid expression including a valid sentence, the comparing including comparing the replacement candidate sentence with the valid sentence; and in response to the modified text not being identical or similar to any sentence in the example validity evaluation database, add the modified text as a negative example in a negative example data set, wherein the modified text is determined to be not similar to a sentence in the example validity evaluation database using a similar sentence search method.
- 2. The learning data generation device according to claim 1, wherein the processing circuitry extracts a clue expression from the input text and extracts the cause expression and the result expression based on the clue expression.
- 3. The learning data generation device according to claim 2, wherein the processing circuitry extracts the clue expression by referring to a clue expression database accumulating a plurality of clue expressions.
- 4. The learning data generation device according to claim 3, comprising a storage storing the clue expression database.
- 5. The learning data generation device according to claim 1, wherein the processing circuitry extracts the replacement candidate sentence dissimilar to the modification target sentence from a replacement candidate sentence database accumulating a plurality of replacement candidate sentences, and replaces the modification target sentence with the extracted replacement candidate sentence.
- 6. The learning data generation device according to claim 5, wherein the processing circuitry obtains a degree of similarity between the modification target sentence and a text in the replacement candidate sentence database and extracts the replacement candidate sentence based on a result obtained by comparing the degree of similarity with a predetermined threshold value.
- 7. The learning data generation device according to claim 5, comprising a storage storing the replacement candidate sentence database.
- 8. The learning data generation device according to claim 1, comprising a storage storing the example validity evaluation database.
- 9. The learning data generation device according to claim 1, wherein the processing circuitry segments the input text into a plurality of unit expressions, and extracts the cause expression and the result expression from the input text segmented into the unit expressions.
- 10. The learning data generation device according to claim 9, wherein the unit expression is a morpheme or a word including one or more morphemes.
- 11. A learning data generation method executed by a learning data generation device, comprising, for each of a plurality of input text: extracting a cause expression and a result expression from an input text among the plurality of input text; generating a modified text by at least one of a method of interchanging the cause expression and the result expression and a method of specifying one of the cause expression and the result expression as a modification target sentence and replacing the modification target sentence with a replacement candidate sentence dissimilar to the modification target sentence; comparing the modified text to an example validity evaluation database, the example validity evaluation database including examples of valid expressions, each valid expression including a valid sentence, the comparing including comparing the replacement candidate sentence with the valid sentence; and in response to the modified text not being identical or similar to any sentence in the example validity evaluation database, adding the modified text as a negative example in a negative example data set, wherein the modified text is determined to be not similar to a sentence in the example validity evaluation database using a similar sentence search method.
- 12. A non-transitory computer-readable record medium for storing a learning data generation program that causes a computer to execute processing comprising: for each of a plurality of input text: extracting a cause expression and a result expression from an input text among the plurality of input text; generating a modified text by at least one of a method of interchanging the cause expression and the result expression and a method of specifying one of the cause expression and the result expression as a modification target sentence and replacing the modification target sentence with a replacement candidate sentence dissimilar to the modification target sentence; comparing the modified text to an example validity evaluation database, the example validity evaluation database including examples of valid expressions, each valid expression including a valid sentence, the comparing including comparing the replacement candidate sentence with the valid sentence; and in response to the modified text not being identical or similar to any sentence in the example validity evaluation database, adding the modified text as a negative example in a negative example dataset, wherein the modified text is determined to be not similar to a sentence in the example validity evaluation database using a similar sentence search method.
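The processing recited in the independent claims can be illustrated with a rough sketch. Everything concrete below is an assumption made for illustration and is not taken from the claims: the function names, the sentence template "<result> <clue> <cause>", the two threshold values, the toy validity database, and the use of Python's `difflib.SequenceMatcher` as a stand-in for the unspecified "similar sentence search method". Only the result expression is replaced here, although the claims allow either expression to be the modification target.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the claims only mention "a predetermined threshold value".
DISSIMILAR_BELOW = 0.5   # a candidate scoring below this against the target counts as dissimilar
VALID_AT_OR_ABOVE = 0.8  # a modified text scoring this high against a valid sentence is rejected


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a similar sentence search."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_modified(cause, result, candidates, clue="because"):
    """Build modified texts by interchange and by replacing the result expression."""
    # Assumption: the original text reads "<result> <clue> <cause>".
    modified = [f"{cause} {clue} {result}"]  # interchange of cause and result
    for cand in candidates:
        if similarity(result, cand) < DISSIMILAR_BELOW:
            modified.append(f"{cand} {clue} {cause}")  # replacement of the target
    return modified


def filter_negatives(modified, valid_sentences):
    """Keep only texts that are neither identical nor similar to any valid sentence."""
    return [
        text for text in modified
        if all(similarity(text, v) < VALID_AT_OR_ABOVE for v in valid_sentences)
    ]


valid_db = ["crops increase because it rains"]  # invented toy validity database
modified = generate_modified(
    "it rains", "the ground gets wet",
    ["crops increase", "fall down from a ladder"],
)
negatives = filter_negatives(modified, valid_db)
```

In this toy run, the modified text built from "crops increase" coincides with an entry in the validity database and is therefore not added to the negative example data set, while the other two modified texts are kept.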
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application is a continuation application of International Application No. PCT/JP2020/018299 having an international filing date of Apr. 30, 2020.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present disclosure relates to a learning data generation device, a learning data generation method and a learning data generation program.
2. Description of the Related Art
There are technologies for automatically acquiring an expression that is included in a text and describes a causal relationship (referred to as a “causal relationship expression”). These technologies fall roughly into two types.
The first type uses no training data, as typified by technologies that acquire causal relationship expressions by using specific keywords or templates. For example, a technology using clue expressions such as “for” and “from”, which imply the existence of a causal relationship expression, belongs to the first type.
The second type uses training data, as typified by technologies that collect sentences including a causal relationship expression and sentences including no causal relationship expression and execute text classification by machine learning. For example, a technology that takes an input text and a label indicating the position of a cause or a result in the input text, and estimates the causal relationship part of the text by sequence labeling typified by Conditional Random Field (CRF), belongs to the second type.
These two types of technologies are not mutually exclusive; rather, they are complementary. For example, a causal relationship expression estimation model can be acquired by machine learning from training data that has been automatically collected by using keywords, templates or the like.
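As a rough illustration of how the two types complement each other, the sketch below uses a keyword to produce position labels of the kind a sequence labeling model could be trained on. The BIO tag scheme, the tag names `RESULT` and `CAUSE`, the function name, the English clue word "because", and the assumption that the result precedes the clue are all hypothetical choices made for this example; the text above only speaks of a label indicating the position of a cause or a result.

```python
def label_with_clue(tokens, clue="because"):
    """Tag tokens before the clue as RESULT and tokens after it as CAUSE (BIO scheme).

    Returns None when the clue word does not occur, i.e. no label can be derived.
    """
    if clue not in tokens:
        return None
    k = tokens.index(clue)  # position of the first occurrence of the clue word
    tags = []
    for i in range(len(tokens)):
        if i < k:
            tags.append("B-RESULT" if i == 0 else "I-RESULT")
        elif i == k:
            tags.append("O")  # the clue itself is outside both spans
        else:
            tags.append("B-CAUSE" if i == k + 1 else "I-CAUSE")
    return tags


tokens = "the ground gets wet because it rains".split()
tags = label_with_clue(tokens)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token and tag pairs produced this way are one possible input format for a CRF or similar sequence labeler; no particular CRF library is assumed here.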
When training data is collected by using keywords, templates or the like, two types of data have to be collected. The first type is positive examples. In the technology of automatically acquiring causal relationship expressions, a positive example is a text including a causal relationship expression, or a text provided with a label indicating that a cause or a result exists in a certain part of the text. The second type is negative examples. A negative example is a text including no causal relationship expression, or a text provided with a label indicating that neither a cause nor a result exists in a certain part of the text.
Patent Reference 1 proposes a technology in which a causal relationship expression estimation model, trained by using training data automatically collected by using keywords, templates or the like, is used for estimating a relationship between phrases. In Patent Reference 1, clue expressions implying the existence of a causal relationship expression are used for acquiring the positive examples. For example, in the case of the sentence “The ground gets wet because it rains.”, the clue expression “because” is used, and a cause expression (“it rains”), a result expression (“the ground gets wet”) and the clue expression (“because”) are acquired.
On the other hand, to acquire a negative example, one of the elements acquired in a positive example, either the cause expression or the result expression, is replaced randomly. For example, in the case where the cause expression (“it rains”) and the result expression (“the ground gets wet”) have been acquired, the cause expression (“it rains”) and a randomly replaced result expression (“fall down from a ladder”) are acquired as a negative example. In this way, in Patent Reference 1, both positive examples and negative examples can be acquired automatically.
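The acquisition scheme attributed to Patent Reference 1 above can be sketched as follows. This is only an illustration of the description in this section, not the implementation of Patent Reference 1: the function names, the single clue word, the dictionary layout, and the assumption that the sentence has the form "<result> because <cause>." are all hypothetical.

```python
import random

# Hypothetical clue list; this section only gives "because" as an example.
CLUES = ("because",)


def extract_positive(sentence, clues=CLUES):
    """Split '<result> <clue> <cause>.' into its parts, or return None if no clue is found."""
    text = sentence.strip().rstrip(".")
    for clue in clues:
        marker = f" {clue} "
        if marker in text:
            result, cause = text.split(marker, 1)
            return {
                "cause": cause.strip(),
                "result": result.strip().lower(),
                "clue": clue,
            }
    return None


def random_negative(positive, candidates, rng):
    """Swap the result expression for a randomly drawn candidate, with no validity check."""
    pool = [c for c in candidates if c != positive["result"]]
    return {
        "cause": positive["cause"],
        "result": rng.choice(pool),
        "clue": positive["clue"],
    }


positive = extract_positive("The ground gets wet because it rains.")
negative = random_negative(positive, ["fall down from a ladder"], random.Random(0))
```

Because the replacement is drawn at random and never examined, nothing in this scheme prevents a drawn candidate from forming a plausible causal relationship with the retained cause expression.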
Patent Reference 1 is Japanese Patent Application Publication No. 2019-153093.
The negative example acquisition method described in Patent Reference 1 is simple and convenient, since negative examples can be collected by randomly replacing a cause element or a result element among the elements acquired in positive examples. However, the appropriateness of the negative examples acquired by this method is not sufficiently examined, and thus data that is not actually a negative example, or data that is not appropriate Japanese, may be acquired as a negative example. For example, consider a case where the expression “crops increase” is generated by randomly replacing the result expression element in the situation where the cause expression (“it rains”) and the result expression (“the ground gets wet”) have been acquired as a positive example. In Patent Reference 1, “Crops increase because it rains.” is acquired as a negative example. On the other hand, from a human perspective, the acquired result expression is