US-12625925-B2 - Storage medium, estimation method, and information processing device, relearning program, and relearning method

US 12625925 B2

Abstract

A non-transitory computer-readable storage medium storing an estimation program that causes a computer to execute a process, the process including: specifying representative points of each of training clusters that corresponds to each of labels targeted for estimation; setting boundaries between each of input clusters under a condition that a number of the input clusters and a number of the representative points coincide with each other, the input clusters being generated by clustering in a feature space for input data; acquiring estimation results for the labels with respect to the input data based on a correspondence relationship between the input clusters and the training clusters based on the boundaries; and estimating determination accuracy for the labels by using the machine learning model with respect to the input data based on the estimation results.
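As a rough illustration only (not the patented implementation), the flow in the abstract — obtain one representative point per training cluster, then assign labels to operational input data by matching against those points — can be sketched in plain Python. All function names are hypothetical, and the centroid is used as a stand-in representative point (the claims instead use the highest-density point).

```python
# Illustrative sketch of the abstract's flow; names are hypothetical.
from collections import defaultdict

def representative_points(train_points, train_labels):
    """One representative point per training cluster (here: the centroid;
    the patent instead specifies the highest-density point)."""
    groups = defaultdict(list)
    for p, lab in zip(train_points, train_labels):
        groups[lab].append(p)
    return {lab: tuple(sum(c) / len(pts) for c in zip(*pts))
            for lab, pts in groups.items()}

def estimate_labels(input_points, reps):
    """Assign each input point the label of the nearest representative
    point, yielding estimated labels for unlabeled operational data."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(reps, key=lambda lab: dist2(p, reps[lab]))
            for p in input_points]

train = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels = ["A", "A", "B", "B"]
reps = representative_points(train, labels)
estimated = estimate_labels([(0.1, 0.2), (4.8, 5.2)], reps)
```

In the patented scheme the correspondence between input clusters and training clusters is established after the cluster counts are made to coincide; the nearest-point matching here only illustrates the final label read-off.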

Inventors

  • Yuhei UMEDA
  • Takashi Katoh
  • Yuichi Ike
  • Mari Kajitani
  • Masatoshi Takenouchi

Assignees

  • FUJITSU LIMITED

Dates

Publication Date
20260512
Application Date
20220419

Claims (10)

  1. A non-transitory computer-readable storage medium storing an estimation program of retraining a machine learning model using, as supervised learning data, data with labels each of which is targeted for estimation and assigned by a clustering technique, the estimation program comprising instructions which, when executed by at least one computer, cause the at least one computer to execute a process, the process comprising: specifying, for each training cluster of a plurality of training clusters, a representative point from among a plurality of points included in the each training cluster to obtain a plurality of representative points for the plurality of training clusters, each of the plurality of training clusters identifiably corresponding to one of the labels targeted for estimation, the plurality of training clusters being generated by clustering, in a feature space, a training data set used in a training phase for training of the machine learning model that estimates the labels according to input data in the training data set, wherein a number of the plurality of training clusters is equal to a number of the labels targeted for estimation; setting boundaries between each of a plurality of input clusters under a condition that a number of the plurality of input clusters and the number of the representative points coincide with each other, the plurality of input clusters being generated by clustering, in the feature space, an input data set used in an operation phase of the trained machine learning model; acquiring estimation results for the labels with respect to the input data set based on a correspondence relationship between the plurality of input clusters and the plurality of training clusters after the setting of the boundaries; and re-training the machine learning model using, as the supervised learning data, a combination of the input data set and labels derived from the estimation results.
  2. The non-transitory computer-readable storage medium according to claim 1, wherein the specifying includes: acquiring density of the training data set in the feature space; and specifying, for each training cluster of the plurality of training clusters, a piece of the input data with a highest density as the representative point for the training cluster.
  3. The non-transitory computer-readable storage medium according to claim 2, wherein the process further comprises: acquiring the density of the input data set in the feature space; acquiring a number of pieces of the plurality of input clusters based on a number of pieces of zero-dimensional connection information equal to or greater than a certain value obtained by extracting one or more pieces of the input data set that corresponds to the density equal to or higher than a threshold value and executing a persistent homology conversion process on the extracted one or more pieces of the input data set; acquiring the number of the plurality of input clusters while altering the threshold value until the number of the plurality of input clusters coincides with the number of the representative points; specifying each of the plurality of input clusters with respect to the input data set based on the number of the plurality of input clusters; and giving the labels that correspond to the specified each of the plurality of input clusters to the input data set.
  4. The non-transitory computer-readable storage medium according to claim 3, wherein the setting includes: specifying, for each of a plurality of input data groups, each input data group being a set of the input data that includes a plurality of pieces of the zero-dimensional connection information equal to or greater than the certain value when the number of the input clusters coincides with the number of the representative points, the piece of the input data with the highest density as a representative point for the each input data group; and associating the plurality of input clusters with each of the plurality of input data groups based on the correspondence relationship between the specified representative point for the each input data group and each of the representative points of the training cluster, wherein the giving includes giving the labels that correspond to the each of the plurality of associated input clusters to each of the plurality of input data groups.
  5. The non-transitory computer-readable storage medium according to claim 4, wherein the setting includes: acquiring, for input data that does not belong to the input data groups, distances from the representative points based on the density; and associating the plurality of input clusters with a closest distance among the distances from the representative points to the input data that does not belong to the input data groups when a second closest distance among the distances from the representative points is longer than a maximum value among distances between the representative points, wherein the giving includes giving the labels that correspond to the plurality of associated input clusters to the input data set that does not belong to the input data groups.
  6. The non-transitory computer-readable storage medium according to claim 5, wherein the setting includes acquiring probabilities of belonging to each of the plurality of input clusters by a probabilistic approach based on a certain condition when the second closest distance is shorter than the maximum value among the distances between the representative points, and the giving includes giving, as the labels, probabilistic labels derived from the probabilities of belonging to each of the plurality of input clusters, to the data that does not belong to the input data groups.
  7. The non-transitory computer-readable storage medium according to claim 6, wherein the acquiring includes acquiring combinations of the input data set and the probabilistic labels, and the process further includes: acquiring an element-wise product of a first matrix based on determination results obtained by inputting the acquired input data set to the machine learning model and a second matrix based on the probabilistic labels, and when a value obtained by dividing the element-wise product by the number of pieces of the input data set is less than a threshold value, detecting accuracy deterioration of the machine learning model.
  8. The non-transitory computer-readable storage medium according to claim 7, wherein the re-training includes re-training the machine learning model by using, as the supervised learning data, the acquired combinations of the input data set and the probabilistic labels when the accuracy deterioration of the machine learning model is detected.
  9. An estimation method for a computer of retraining a machine learning model using, as supervised learning data, data with labels each of which is targeted for estimation and assigned by a clustering technique, the estimation method comprising: specifying, for each training cluster of a plurality of training clusters, a representative point from among a plurality of points included in the each training cluster to obtain a plurality of representative points for the plurality of training clusters, each of the plurality of training clusters identifiably corresponding to one of the labels targeted for estimation, the plurality of training clusters being generated by clustering, in a feature space, a training data set used in a training phase for training of the machine learning model that estimates the labels according to input data in the training data set, wherein a number of the plurality of training clusters is equal to a number of the labels targeted for estimation; setting boundaries between each of a plurality of input clusters under a condition that a number of the plurality of input clusters and the number of the representative points coincide with each other, the plurality of input clusters being generated by clustering, in the feature space, an input data set used in an operation phase of the trained machine learning model; acquiring estimation results for the labels with respect to the input data based on a correspondence relationship between the plurality of input clusters and the plurality of training clusters after the setting of the boundaries; and re-training the machine learning model using, as the supervised learning data, a combination of the input data set and labels derived from the estimation results.
  10. An estimation device of retraining a machine learning model using, as supervised learning data, data with labels each of which is targeted for estimation and assigned by a clustering technique, the estimation device comprising: one or more memories; and one or more processors coupled to the one or more memories, the one or more processors configured to: specify, for each training cluster of a plurality of training clusters, a representative point from among a plurality of points included in the each training cluster to obtain a plurality of representative points for the plurality of training clusters, each of the plurality of training clusters identifiably corresponding to one of the labels targeted for estimation, the plurality of training clusters being generated by clustering, in a feature space, a training data set used in a training phase for training of the machine learning model that estimates the labels according to input data in the training data set, wherein a number of the plurality of training clusters is equal to a number of the labels targeted for estimation, set boundaries between each of a plurality of input clusters under a condition that a number of the plurality of input clusters and the number of the representative points coincide with each other, the plurality of input clusters being generated by clustering, in the feature space, an input data set used in an operation phase of the trained machine learning model, acquire estimation results for the labels with respect to the input data set based on a correspondence relationship between the plurality of input clusters and the plurality of training clusters after the setting of the boundaries, and re-train the machine learning model using, as the supervised learning data, a combination of the input data set and labels derived from the estimation results.
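Claim 3 counts zero-dimensional connection information (connected components) among high-density input points and adjusts the density threshold until the component count matches the number of representative points. The following is a simplified sketch of that idea only — it replaces the persistent homology conversion process with a plain union-find over points within a fixed linking radius, and all parameter names (`radius`, `step`) are illustrative assumptions, not terms from the patent.

```python
# Simplified stand-in for claim 3's component counting: keep points whose
# density meets a threshold, link points closer than `radius`, and count
# connected components with union-find. The threshold is lowered until the
# component count equals the target number of representative points.

def count_components(points, radius):
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= radius ** 2:
                parent[find(i)] = find(j)  # union the two components
    return len({find(i) for i in range(len(points))})

def clusters_matching_reps(points, densities, n_reps, radius, step=0.05):
    """Lower the density threshold until the number of connected
    components among the retained points equals n_reps."""
    threshold = max(densities)
    while threshold > 0:
        kept = [p for p, d in zip(points, densities) if d >= threshold]
        if kept and count_components(kept, radius) == n_reps:
            return threshold, kept
        threshold -= step
    return None, []

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
densities = [1.0, 0.9, 1.0, 0.9]
thr, kept = clusters_matching_reps(points, densities, n_reps=2, radius=1.0)
```

A genuine implementation would instead read the component count off the zero-dimensional persistence barcode (bars with lifetime at or above the certain value), which is robust to the order in which points merge.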

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation application of International Application PCT/JP2019/041581 filed on Oct. 23, 2019 and designated the U.S., the entire contents of which are incorporated herein by reference.

FIELD

The present invention relates to a storage medium, an estimation method, an information processing device, a retraining program, and a retraining method.

BACKGROUND

Conventionally, a machine learning model (hereinafter sometimes simply referred to as a "model") that performs data discrimination, classification, and similar functions has been used. Since the machine learning model discriminates and classifies in line with the teacher data on which it was trained, its accuracy deteriorates when the tendency (data distribution) of the input data changes during operation. For this reason, it is common to perform maintenance work to restore the model. For example, accuracy deterioration of the machine learning model is suppressed by labeling data at regular intervals, measuring the model accuracy against the labeled data, and retraining when the accuracy falls below a certain level. In this approach, the cost of labeling for retraining is low when correct answer information for the input data can be acquired after the prediction result is obtained, as with weather forecasts and stock price predictions. However, for data whose correct answer information is infeasible to acquire, labeling for retraining involves a higher cost.
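The conventional maintenance loop described above — periodically label a sample, measure accuracy, retrain when it drops — can be sketched as follows. This is a minimal illustration of the background approach only; `label_fn`, `train_fn`, and `model` are hypothetical stand-ins, and the costly part in practice is the `label_fn` call.

```python
# Minimal sketch of the conventional maintenance loop: label data at
# regular intervals, measure accuracy on the labeled batch, and retrain
# when accuracy falls below a threshold. All names are hypothetical.

def maintain(model, batches, label_fn, train_fn, threshold=0.9):
    history = []
    for batch in batches:
        truth = [label_fn(x) for x in batch]   # costly labeling step
        preds = [model(x) for x in batch]
        acc = sum(p == t for p, t in zip(preds, truth)) / len(batch)
        history.append(acc)
        if acc < threshold:                    # accuracy deterioration
            model = train_fn(batch, truth)     # retrain on labeled data
    return model, history

# Toy drift: the model learned "x >= 0" but the true boundary moved to 1.
model = lambda x: x >= 0.0
label_fn = lambda x: x >= 1.0
train_fn = lambda xs, ys: (lambda x: x >= 1.0)  # pretend retraining fits the new boundary
batches = [[-2.0, -1.0, 2.0, 3.0], [0.5, 0.2, 1.5, 2.0]]
model, history = maintain(model, batches, label_fn, train_fn)
```

The embodiments that follow aim to remove the need for `label_fn` on operational data by estimating labels from cluster correspondences instead.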
In recent years, a technique has been known that clusters the feature space in which the machine learning model classifies the input data (a space representing information organized by removing unwanted data from the input data), associates a label with each cluster, and automatically labels the input data with the result of the association as correct answer information (labels).

Non-Patent Document 1: Sethi, Tegjyot Singh, and Mehmed Kantardzic, "On the reliable detection of concept drift from streaming unlabeled data", Expert Systems with Applications 82 (2017).

SUMMARY

According to an aspect of the embodiments, a non-transitory computer-readable storage medium stores an estimation program that causes at least one computer to execute a process, the process including: specifying representative points of each of training clusters that corresponds to each of labels targeted for estimation, the training clusters being generated by clustering in a feature space for training data used for training of a machine learning model that estimates the labels according to input data; setting boundaries between each of input clusters under a condition that a number of the input clusters and a number of the representative points coincide with each other, the input clusters being generated by clustering in a feature space for input data; acquiring estimation results for the labels with respect to the input data based on a correspondence relationship between the input clusters and the training clusters based on the boundaries; and estimating determination accuracy for the labels by using the machine learning model with respect to the input data based on the estimation results.

The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
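The determination-accuracy estimate in the summary (made concrete in claim 7) compares the model's determinations against the cluster-derived probabilistic labels via an element-wise matrix product, divided by the number of input pieces. A hypothetical sketch under that reading, with a made-up deterioration threshold:

```python
# Hypothetical sketch of the accuracy estimate: element-wise product of a
# one-hot determination matrix and a probabilistic-label matrix, summed
# and divided by the number of input pieces. Threshold value is illustrative.

def estimated_accuracy(one_hot_preds, prob_labels):
    """Sum of element-wise products over all pieces / number of pieces."""
    total = sum(
        p * q
        for pred_row, lab_row in zip(one_hot_preds, prob_labels)
        for p, q in zip(pred_row, lab_row)
    )
    return total / len(one_hot_preds)

preds = [[1, 0], [0, 1], [1, 0]]               # model determinations (one-hot)
labels = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # probabilistic labels from clustering
acc = estimated_accuracy(preds, labels)         # (0.9 + 0.8 + 0.5) / 3
deteriorated = acc < 0.8                        # hypothetical threshold
```

When the estimate falls below the threshold, accuracy deterioration is detected and the input data set with its probabilistic labels can serve as supervised learning data for retraining, as in claim 8.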
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a diagram explaining a performance estimation device according to a first embodiment.
FIG. 2 is a diagram explaining general deterioration detection using a Kullback-Leibler (KL) distance.
FIG. 3 is a diagram explaining general deterioration detection using a confidence level.
FIG. 4 is a diagram explaining a disadvantage in clustering.
FIG. 5 is a functional block diagram illustrating a functional configuration of the performance estimation device according to the first embodiment.
FIG. 6 is a diagram illustrating an example of information stored in an input data database (DB).
FIG. 7 is a diagram illustrating an example of information stored in a determination result DB.
FIG. 8 is a diagram illustrating an example of information stored in an estimation result DB.
FIG. 9 is a diagram explaining the specification of the number of clusters and center points.
FIG. 10 is a diagram explaining clustering under operation.
FIG. 11A, FIG. 11B, FIG. 11C, and FIG. 11D are diagrams explaining persistent homology.
FIG. 12 is a diagram explaining barcode data.
FIG. 13 is a diagram explaining cluster allocation.
FIG. 14 is a diagram explaining a matching process.
FIG. 15 is a diagram explaining the continuation of clustering under operation.
FIG. 16 is a diagram