KR-102963523-B1 - Efficient ground truth annotation
Abstract
A computer-implemented method is provided for determining a set of target items to be annotated for training a machine learning application. The method comprises providing a training data set having a set of data samples and an auto-encoder having a classifier. The auto-encoder includes an embedding model that maps the set of data samples to a set of compressed feature vectors, and the set of compressed feature vectors defines a compressed feature matrix. The method further comprises providing a definition of a graph associated with the compressed feature matrix; applying a clustering algorithm to identify node clusters of the graph; applying a centrality algorithm to identify central nodes of the node clusters; retrieving node labels for the central nodes from annotators; propagating the annotated node labels to other nodes of the graph; and training the embedding model and the classifier with the annotated and propagated node labels.
Inventors
- Staar, Peter
- Dolfi, Michele
- Auer, Christoph
- Georgopoulos, Leonidas
- Kaestner, Ralf
- Velizhev, Alexander
- Noguer Hidalgo, Dal
- Kuznetsova, Rita
- Bekas, Konstantinos
Assignees
- International Business Machines Corporation
Dates
- Publication Date
- 20260511
- Application Date
- 20210128
- Priority Date
- 20200306
Claims (20)
- A computer-implemented method comprising: providing a training data set comprising a set of data samples; providing an auto-encoder, the auto-encoder comprising a classifier; performing initial training of an embedding model of the auto-encoder, the embedding model being configured to map the set of data samples to a set of compressed feature vectors comprising feature elements, the set of compressed feature vectors defining a compressed feature matrix of the set of data samples; providing a definition of a graph associated with the compressed feature matrix; applying a clustering algorithm to identify one or more node clusters of the graph; applying a centrality algorithm to identify one or more central nodes of the one or more node clusters; retrieving one or more node labels for the one or more central nodes of the one or more node clusters from an annotator, thereby generating annotated node labels; propagating the annotated node labels of the one or more central nodes to other nodes of the graph, thereby generating propagated node labels; and performing further training of the auto-encoder, wherein the further training comprises training the embedding model of the auto-encoder and the classifier of the auto-encoder with the annotated and propagated node labels, the classifier being configured to predict the one or more node labels for the elements of the compressed feature vectors.
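The selection steps of claim 1 can be illustrated with a minimal, hypothetical NumPy sketch. None of the concrete choices below are mandated by the claims: a truncated SVD stands in for the trained auto-encoder embedding, a leading-eigenvector sign split stands in for the clustering algorithm, and within-cluster total similarity stands in for the centrality algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unannotated data set: two well-separated blobs of samples.
X = np.vstack([rng.normal(0.0, 1.0, (20, 8)),
               rng.normal(5.0, 1.0, (20, 8))])

# Stand-in for the trained embedding model: project onto the top-2
# principal directions (a real system would use the auto-encoder).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
F = Xc @ Vt[:2].T                    # compressed feature matrix (40 x 2)

# Graph definition from the claims: adjacency A = F @ F.T.
A = F @ F.T

# Stand-in clustering: the sign of the leading eigenvector of A
# separates the two blobs.
_, v = np.linalg.eigh(A)
clusters = (v[:, -1] > 0).astype(int)

# Stand-in centrality: in each cluster, the node with the highest
# total similarity to its own cluster is taken as the central node,
# i.e. the node an annotator would be asked to label first.
central = []
for c in (0, 1):
    idx = np.where(clusters == c)[0]
    within = A[np.ix_(idx, idx)].sum(axis=1)
    central.append(int(idx[np.argmax(within)]))
print(clusters, central)
```

The retrieved labels for the two central nodes would then be propagated over the graph and used to further train the embedding model and the classifier.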
- The computer-implemented method of claim 1, further comprising repeating, in one or more iteration rounds until convergence, the applying of the clustering algorithm, the applying of the centrality algorithm, the propagating of the annotated node labels, and the performing of further training of the auto-encoder.
- The computer-implemented method of claim 2, further comprising outputting the one or more node labels of the last iteration round prior to convergence as a set of target items to be annotated.
- The computer-implemented method of claim 1, further comprising repeating the retrieving of node labels for the one or more central nodes of the one or more node clusters from one or more annotators.
- The computer-implemented method of claim 1, wherein the graph is defined by an adjacency matrix, the adjacency matrix being the product of the compressed feature matrix and the transpose of the compressed feature matrix.
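The adjacency construction of this claim is a plain matrix product. A short sketch with a hypothetical compressed feature matrix F (the shape and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical compressed feature matrix F: one row (one compressed
# feature vector) per data sample, as produced by the embedding model.
F = rng.normal(size=(5, 3))

# Graph definition of the claim: the adjacency matrix is the product
# of F and its transpose, so each weight A[i, j] is the inner product
# of samples i and j in the compressed feature space.
A = F @ F.T
print(A.shape)               # (5, 5)
print(np.allclose(A, A.T))   # True: the graph is undirected
```

Because A is symmetric by construction, the resulting graph is undirected, which is what the graph-based clustering and centrality algorithms of the dependent claims operate on.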
- The computer-implemented method of claim 1, wherein applying the clustering algorithm comprises applying a graph-based clustering algorithm.
- The computer-implemented method of claim 5, wherein applying the centrality algorithm comprises applying a graph-based centrality algorithm.
- The computer-implemented method of claim 6, wherein the graph-based clustering algorithm is selected from the group consisting of: k-spanning tree or minimum spanning tree algorithms; shared nearest neighbor algorithms; algorithms based on betweenness centrality; and spectral clustering algorithms.
- The computer-implemented method of claim 7, wherein the graph-based centrality algorithm comprises: approximating the product of the matrix exponential of the adjacency matrix and a random probe vector; calculating the diagonal of the matrix exponential of the adjacency matrix based on the product of the matrix exponential and the random probe vector; and calculating node centralities based on the calculated diagonal until a predefined number of the one or more central nodes has been detected.
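These steps correspond to stochastic estimation of the diagonal of the matrix exponential, i.e. subgraph centrality. A minimal sketch, assuming a small path graph and Rademacher probe vectors; the truncated Taylor series stands in for whatever polynomial or Krylov approximation of exp(A) @ v an implementation would actually use:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small undirected path graph as the adjacency matrix of the claim.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

def expm_vec(A, v, terms=30):
    """Approximate exp(A) @ v with a truncated Taylor series
    (stand-in for the matrix-exponential / probe-vector product)."""
    out, term = v.copy(), v.copy()
    for k in range(1, terms):
        term = A @ term / k
        out += term
    return out

# Hutchinson-style estimate of diag(exp(A)) from Rademacher probe
# vectors z: diag(exp(A)) ~ mean over probes of z * (exp(A) @ z).
m = 3000
est = np.zeros(n)
for _ in range(m):
    z = rng.choice([-1.0, 1.0], size=n)
    est += z * expm_vec(A, z)
est /= m

# Exact subgraph centralities for comparison (eigendecomposition).
w, U = np.linalg.eigh(A)
expA = (U * np.exp(w)) @ U.T
print(np.max(np.abs(est - np.diag(expA))))
```

The estimated diagonal ranks the nodes by centrality without ever forming exp(A) explicitly, which is the point of the probe-vector formulation for large graphs.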
- The computer-implemented method of claim 1, further comprising identifying one or more boundary nodes, the boundary nodes being defined as nodes located at the boundary between two of the clusters.
- The computer-implemented method of claim 10, further comprising retrieving node labels for the one or more boundary nodes from one or more annotators.
- The computer-implemented method of claim 1, further comprising identifying one or more farthest nodes of the one or more clusters.
- The computer-implemented method of claim 12, further comprising retrieving node labels for the one or more farthest nodes of the one or more clusters from one or more annotators.
- The computer-implemented method of claim 5, wherein propagating the annotated node labels comprises applying a propagation function to the annotated node labels, the propagation function being a function of the adjacency matrix.
- The computer-implemented method of claim 14, wherein the propagation function is defined as a Chebyshev expansion f(A) = Σ_j c_j P_j(A) of the adjacency matrix A, where P_j is the j-th Chebyshev polynomial and the c_j are expansion coefficients.
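One way to read this claim (a sketch, not the patent's implementation): approximate a filter f of the rescaled adjacency matrix by a truncated Chebyshev series, f(A_hat) ~ sum_j c_j P_j(A_hat), and apply it to a one-hot label matrix through the three-term recurrence, so that f(A_hat) is never formed explicitly. The graph, the heat-kernel-like filter exp, and the expansion order below are all illustrative assumptions.

```python
import numpy as np

# Small symmetric adjacency (two triangles joined by one edge),
# a hypothetical stand-in for the graph of the claims.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Rescale A so its spectrum lies in [-1, 1], the domain of the P_j.
lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))
A_hat = A / lam_max

def cheb_coeffs(f, order, samples=64):
    """Chebyshev coefficients c_j of f on [-1, 1] via cosine sampling."""
    k = np.arange(samples)
    theta = np.pi * (k + 0.5) / samples
    fx = f(np.cos(theta))
    c = np.array([2.0 / samples * np.sum(fx * np.cos(j * theta))
                  for j in range(order + 1)])
    c[0] /= 2.0
    return c

# One annotated (one-hot) label per cluster: node 0 -> class 0,
# node 5 -> class 1.
Y = np.zeros((6, 2))
Y[0, 0] = 1.0
Y[5, 1] = 1.0

# Propagate: out = f(A_hat) @ Y with f = exp, using the recurrence
# P_0 = I, P_1 = A_hat, P_j = 2 A_hat P_{j-1} - P_{j-2}.
c = cheb_coeffs(np.exp, order=8)
T_prev, T_cur = Y, A_hat @ Y
out = c[0] * T_prev + c[1] * T_cur
for j in range(2, len(c)):
    T_prev, T_cur = T_cur, 2.0 * A_hat @ T_cur - T_prev
    out += c[j] * T_cur
labels = out.argmax(axis=1)
print(labels)
```

The two seed labels spread along graph edges, so each unlabeled node inherits the class of the nearer annotated central node; these propagated labels are what the further training step of claim 1 consumes.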
- The computer-implemented method of claim 1, further comprising training a cognitive model of a machine learning application with the training data set.
- (Deleted)
- (Deleted)
- (Deleted)
- A system comprising one or more processors for executing computer-readable instructions, the computer-readable instructions controlling the one or more processors to perform operations comprising: providing a training data set comprising a set of data samples; providing an auto-encoder, the auto-encoder comprising a classifier; performing initial training of an embedding model of the auto-encoder, the embedding model being configured to map the set of data samples to a set of compressed feature vectors comprising feature elements, the set of compressed feature vectors defining a compressed feature matrix of the set of data samples; providing a definition of a graph associated with the compressed feature matrix; applying a clustering algorithm to identify one or more node clusters of the graph; applying a centrality algorithm to identify one or more central nodes of the one or more node clusters; retrieving one or more node labels for the one or more central nodes of the one or more node clusters from an annotator, thereby generating annotated node labels; propagating the annotated node labels of the one or more central nodes to other nodes of the graph, thereby generating propagated node labels; and performing further training of the auto-encoder, wherein the further training comprises training the embedding model of the auto-encoder and the classifier of the auto-encoder with the annotated and propagated node labels, the classifier being configured to predict the one or more node labels for the elements of the compressed feature vectors.
Description
Efficient ground truth annotation

[0001] The present invention relates to a computer-implemented method for determining a set of target items to be annotated in order to train a machine learning application.

[0002] The present invention also relates to a corresponding system and a corresponding computer program product. Prior art includes U.S. Patent Application Publication US2014/0376804 (publication date: Dec. 25, 2014).

[0020] FIGS. 1a, 1b and 1c illustrate a computer-implemented method according to an embodiment of the present invention;

[0021] FIG. 2a illustrates a corresponding flowchart of the method illustrated in FIGS. 1a, 1b and 1c;

[0022] FIG. 2b illustrates a computer-implemented method for training a machine learning application with a training data set;

[0023] FIG. 2c illustrates a computer-implemented method for running a machine learning application;

[0024] FIG. 3 illustrates a schematic block diagram of a computing system that can be used to perform the computer-implemented methods illustrated in FIGS. 1a, 1b, 1c and FIGS. 2a, 2b and 2c;

[0025] FIG. 4 illustrates the mapping performed by the auto-encoder during training;

[0026] FIG. 5 illustrates an exemplary block diagram of an auto-encoder according to an embodiment of the present invention;

[0027] FIG. 6 illustrates a graph to which a computer-implemented method according to embodiments of the present invention is applied; and

[0028] FIG. 7 illustrates a more detailed block diagram of a server according to an embodiment of the present invention.

[0029] With reference to FIGS. 1 to 7, some general terms used in embodiments of the present invention are described.

[0030] The term ground truth generally refers to information provided by direct observation (i.e., empirical evidence), as opposed to information provided by inference.
[0031] Embodiments of the present invention provide a computer-implemented method for efficiently generating training data and ground truth data for machine learning algorithms and applications.

[0032] Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. In this context, a graph is made up of vertices or nodes connected by lines called edges. Graphs are widely used in applications to model many types of relationships and process dynamics in physical, biological, social and information systems. Hence, many practical problems in modern technology, science and business applications are typically represented by graphs.

[0033] The centrality of a node is a widely used measure for determining the relative importance of a node within a full network or graph. Node centralities can be used to determine which nodes are important in complex networks, for example to understand influencers or to find hot-spot links. For example, node centralities are typically used to determine how influential a person is within a social network or, in the theory of space syntax, how important a room is within a building or how well used a road is within a city network.

[0034] FIGS. 1a, 1b and 1c illustrate a computer-implemented method according to an embodiment of the present invention.

[0035] FIG. 2a illustrates a corresponding flowchart of the method illustrated in FIGS. 1a, 1b and 1c. FIG. 3 illustrates a schematic block diagram of a computing system that can be used to perform the computer-implemented method illustrated in FIGS. 1a, 1b, 1c and 2a.

[0036] First, referring to FIG.
3, a computing system (300) is illustrated that comprises a server (310) configured to execute a machine learning application program (MLAP) (311), a machine learning training program (MLTP) (312) for training the application program, and a target item determination program (TIDP) (313) for determining, in an efficient manner, the target items of large data sets that need to be annotated in order to generate a training data set for the machine learning application. The machine learning application (311) may in particular be a deep learning application.

[0037] The server (310) is coupled to a database (320). The database (320) may comprise a storage (321) for storing one or more data sets, in particular large data sets. The data sets stored in the storage (321) may in particular be unannotated data sets. Accordingly, the data sets stored in the storage (321) may also be referred to as initial data sets, raw data sets or initial training data sets. Each data set may comprise a plurality of data samples. The plurality of data samples may include, for example, various kinds of images (e.g., {cats, dogs}, {line-plots, scatter-plots, histograms, geological maps}) or text (sentences, paragraphs, whole texts). Each of the plurality of data samples may comprise a plurality of data items, e.g., a plurality of pixels.

[0038] The database (320) also comprises a storage (322) for storing training data sets. The training data sets may be derived from the initial or raw data sets by annotating one or more target items in the plurality of data samples of the initial or