JP-2026074705-A - A method for selecting training data for training a deep learning model and a training data selection device using the same.

Abstract

[Problem] To provide a method for selecting training data for training a deep learning model. [Solution] The method includes the steps of: a training data selection device acquiring at least one individual type corresponding to each of the numerous training data included in the total training data stored in a data pool, and generating a binary graph in which each of the numerous training data is matched with its individual types; and the training data selection device, referring to the binary graph, selecting n training data for training the deep learning model from the total training data such that the number of individual types matching each of the n training data is within a predetermined threshold deviation. [Selection Diagram] Figure 2

Inventors

金桂賢
李鉉東

Assignees

スパーブエーアイカンパニーリミテッド

Dates

Publication Date: 20260507
Application Date: 20241021

Claims (20)

A method for selecting training data for training a deep learning model, comprising: (a) a step in which a training data selection device acquires at least one individual type corresponding to each of the numerous training data included in the total training data stored in a data pool, and generates a binary graph in which each of the numerous training data included in the total training data is matched with the individual type; and (b) a step in which the training data selection device, referring to the binary graph, selects n training data matched with the individual type from the total training data (where n is the target number of training data for training the deep learning model and is an integer denoting a plurality) such that the number of individual types matching each of the n training data is within a predetermined threshold deviation.
The method according to claim 1, wherein the individual types include a 1_1 individual type to a 1_x individual type (where x is an integer of 1 or more) corresponding to a first type of the numerous training data, and a 2_1 individual type to a 2_y individual type (where y is an integer of 1 or more) corresponding to a second type of the numerous training data, and wherein, in step (b), the training data selection device selects the n training data such that the number of 1_1 to 1_x individual types corresponding to the first type and the number of 2_1 to 2_y individual types corresponding to the second type are within the threshold deviation, the number of 1_1 to 1_x individual types is within a first threshold deviation, and the number of 2_1 to 2_y individual types is within a second threshold deviation.
The method according to claim 1, wherein, in step (b), the training data selection device (i) refers to the binary graph to confirm the number of corresponding individual types matching each of the total training data, and selects the specific training data having the largest number of corresponding individual types; (ii) executes a cycle that selects partial training data matching all of the individual types by repeating a process of confirming, among the remaining individual types excluding the corresponding individual types, the number of remaining corresponding individual types matching each of the total training data, and selecting other specific training data having the largest number of remaining corresponding individual types; and repeats the cycle on the remaining training data, obtained by excluding the selected partial training data from the total training data, until the n training data are selected.
The method according to claim 1, wherein, in step (b), the training data selection device, referring to the binary graph, uses an optimization algorithm to select a specific subset consisting of a predetermined number of training data that include all of the individual types, the specific subset having the smallest number of training data; calculates remaining training data by removing the training data included in the specific subset; and repeats, through the optimization algorithm, the process of selecting from the remaining training data at least one other specific subset consisting of a predetermined number of training data that include all of the individual types, so that the number of selected training data is n or more.
The method according to claim 4, wherein, in step (b), the training data selection device uses linear programming to calculate the product of a P×Q binary matrix, corresponding to the P individual types and the Q training data in the binary graph, and a Q-dimensional vector representing a selection goodness-of-fit variable for each of the Q training data in each of the P individual types, thereby generating a P-dimensional vector (the P-dimensional vector representing the sum of the goodness-of-fit of the Q training data belonging to each of the P individual types); selects, under the conditions that each sum of the goodness-of-fit in the P-dimensional vector is 1 or more and each selection goodness-of-fit variable in the Q-dimensional vector is 0 or more and 1 or less, a specific subset including the specific training data that correspond to the specific selection goodness-of-fit variables for which the sum of the selection goodness-of-fit variables of the Q-dimensional vector takes its minimum value; calculates remaining training data by removing the specific training data included in the specific subset from the Q training data; and repeats the process of selecting at least one other specific subset from the remaining training data using linear programming, so that the number of selected training data is n or more.
The method according to claim 5, wherein, in step (b), the training data selection device selects the specific subset using a dual linear programming method to which at least one of constraint merging, constraint separation, and sign changing in linear programming is applied.
The method according to claim 4, wherein, in step (b), the training data selection device uses integer programming to calculate the product of a P×Q binary matrix, corresponding to the P individual types and the Q training data in the binary graph, and a Q-dimensional vector representing a selection variable for each of the Q training data in each of the P individual types, thereby generating a P-dimensional vector (the P-dimensional vector representing the selection quantity of training data belonging to each of the P individual types); selects, under the conditions that each selection quantity in the P-dimensional vector is 1 or more and each selection variable in the Q-dimensional vector is 0 or 1, a specific subset including the specific training data that correspond to the specific selection variables for which the sum of the selection variables of the Q-dimensional vector takes its minimum value; calculates remaining training data by removing the specific training data included in the specific subset from the Q training data; and repeats the process of selecting other specific subsets from the remaining training data using integer programming, so that the number of selected training data is n or more.
The method according to claim 7, wherein, in step (b), the training data selection device selects the specific subset using a dual integer programming method to which at least one of constraint merging, constraint separation, and sign changing in integer programming is applied.
The method according to claim 1, wherein, in step (a), the training data selection device transmits the numerous training data to a labeler terminal and causes a labeler corresponding to the labeler terminal to generate at least one individual type corresponding to each of the numerous training data.
The method according to claim 1, wherein the training data consist of training images, and wherein, in step (a), the training data selection device performs a process of performing a first embedding operation on each of the training images to generate a first scene vector corresponding to each of the training images and clustering the first scene vectors to generate a first scene cluster, or a process of performing a k-th embedding operation (where k is an integer of 1 or more) on each of the training images to generate a k-th scene vector corresponding to each of the training images and clustering the k-th scene vectors to generate a k-th scene cluster, and generates the individual type corresponding to each of the training images by referring to the first scene cluster to the k-th scene cluster.
The method according to claim 1, wherein the training data consist of training images, and wherein, in step (a), the training data selection device performs object detection on each of the training images to detect at least one object in each of the training images, crops the region corresponding to the bounding box of each detected object in each of the training images to generate cropped images, performs an embedding operation on each of the cropped images to generate an object vector corresponding to each of the cropped images, clusters the object vectors to generate an object cluster, and generates the individual type corresponding to each of the training images by referring to the object cluster.
The method according to claim 1, wherein the training data consist of training images, and wherein, in step (a), the training data selection device refers to the ground truth information contained in each of the training images to crop the region corresponding to the bounding box of each object in each of the training images and thereby generate cropped images, performs an embedding operation on each of the cropped images to generate an object vector corresponding to each of the cropped images, clusters the object vectors to generate an object cluster, and generates the individual type corresponding to each of the training images by referring to the object cluster.
The training data consist of training images, and in step (a), the training data selection device performs: (i) a process of performing a first embedding operation on each of the training images to generate a first scene vector corresponding to each of the training images and clustering the first scene vectors to generate a first scene cluster, or a process of performing a k-th embedding operation (where k is an integer of 1 or more) on each of the training images to generate a k-th scene vector corresponding to each of the training images and clustering the k-th scene vectors to generate a k-th scene cluster; (ii) a process of performing object detection on each of the training images to detect at least one object in each of the training images, cropping the region corresponding to the bounding box of each detected object in each of the training images to generate cropped images, performing an embedding operation on each of the cropped images to generate an object vector corresponding to each of the cropped images, and clustering the object vectors to generate an object cluster; and (iii) a process of generating the individual type corresponding to each of the training images by referring to the first scene cluster to the k-th scene cluster and the object cluster.
The training data consist of training images, and in step (a), the training data selection device performs: (i) a process of performing a first embedding operation on each of the training images to generate a first scene vector corresponding to each of the training images and clustering the first scene vectors to generate a first scene cluster, or a process of performing a k-th embedding operation (where k is an integer of 1 or more) on each of the training images to generate a k-th scene vector corresponding to each of the training images and clustering the k-th scene vectors to generate a k-th scene cluster; (ii) a process of referring to the ground truth information contained in each of the training images to crop the region corresponding to the bounding box of each object in each of the training images and thereby generate cropped images, performing an embedding operation on each of the cropped images to generate an object vector corresponding to each of the cropped images, and clustering the object vectors to generate an object cluster; and (iii) a process of generating the individual type corresponding to each of the training images by referring to the first scene cluster to the k-th scene cluster and the object cluster.
A training data selection device for selecting training data for training a deep learning model, comprising: a memory storing instructions for selecting training data for training the deep learning model; and a processor that performs operations for selecting the training data for training the deep learning model in accordance with the instructions stored in the memory, wherein the processor performs: (I) a process of acquiring at least one individual type corresponding to each of the numerous training data included in the total training data stored in a data pool, and generating a binary graph in which each of the numerous training data included in the total training data is matched with the individual type; and (II) a process of, referring to the binary graph, selecting n training data matched with the individual type from the total training data (where n is the target number of training data for training the deep learning model and is an integer denoting a plurality) such that the number of individual types matching each of the n training data is within a predetermined threshold deviation.
The training data selection device according to claim 15, wherein the individual types include a 1_1 individual type to a 1_x individual type (where x is an integer of 1 or more) corresponding to a first type of the numerous training data, and a 2_1 individual type to a 2_y individual type (where y is an integer of 1 or more) corresponding to a second type of the numerous training data, and wherein, in the process of (II), the processor selects the n training data such that the number of 1_1 to 1_x individual types corresponding to the first type and the number of 2_1 to 2_y individual types corresponding to the second type that match the n training data are within the threshold deviation, the number of 1_1 to 1_x individual types is within a first threshold deviation, and the number of 2_1 to 2_y individual types is within a second threshold deviation.
The training data selection device according to claim 15, wherein, in the process of (II), the processor (i) refers to the binary graph to confirm the number of corresponding individual types matching each of the total training data, and selects the specific training data having the largest number of corresponding individual types; (ii) executes a cycle that selects partial training data matching all of the individual types by repeating a process of confirming, among the remaining individual types excluding the corresponding individual types, the number of remaining corresponding individual types matching each of the total training data, and selecting other specific training data having the largest number of remaining corresponding individual types; and repeats the cycle on the remaining training data, obtained by excluding the selected partial training data from the total training data, until the n training data are selected.
The training data selection device according to claim 15, wherein, in the process of (II), the processor, referring to the binary graph, uses an optimization algorithm to select a specific subset consisting of a predetermined number of training data that include all of the individual types, the specific subset having the smallest number of training data; calculates remaining training data by removing the training data included in the specific subset; and repeats, through the optimization algorithm, the process of selecting from the remaining training data at least one other specific subset consisting of a predetermined number of training data that include all of the individual types, so that the number of selected training data is n or more.
The training data selection device according to claim 18, wherein, in the process of (II), the processor uses linear programming to calculate the product of a P×Q binary matrix, corresponding to the P individual types and the Q training data in the binary graph, and a Q-dimensional vector representing a selection goodness-of-fit variable for each of the Q training data in each of the P individual types, thereby generating a P-dimensional vector (the P-dimensional vector representing the sum of the goodness-of-fit of the Q training data belonging to each of the P individual types); selects, under the conditions that each sum of the goodness-of-fit in the P-dimensional vector is 1 or more and each selection goodness-of-fit variable in the Q-dimensional vector is 0 or more and 1 or less, a specific subset including the specific training data that correspond to the specific selection goodness-of-fit variables for which the sum of the selection goodness-of-fit variables of the Q-dimensional vector takes its minimum value; calculates remaining training data by removing the specific training data included in the specific subset from the Q training data; and repeats the process of selecting at least one other specific subset from the remaining training data using linear programming, so that the number of selected training data is n or more.
The training data selection device according to claim 19, wherein, in the process of (II), the processor selects the specific subset using a dual linear programming method to which at least one of constraint merging, constraint separation, and sign changing in linear programming is applied.
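The greedy selection recited in the claims (pick the training data matching the most individual types, then the data matching the most still-unmatched types, and repeat the cycle on the remaining data) can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the binary graph is represented as a dictionary from a data identifier to its set of individual types, and the function names, the identifiers `img1` to `img7`, the type labels, the alphabetical tie-breaking rule, and the use of max-minus-min as the deviation measure are all hypothetical.

```python
from collections import Counter

def greedy_cycle(graph, candidates, all_types):
    """One cycle: keep picking the item that matches the most uncovered types."""
    uncovered = set(all_types)
    pool = set(candidates)
    picked = []
    while uncovered and pool:
        # sorted() makes tie-breaking deterministic (assumed rule: smallest id wins)
        best = max(sorted(pool), key=lambda d: len(graph[d] & uncovered))
        if not graph[best] & uncovered:
            break  # no remaining item matches any uncovered type
        picked.append(best)
        pool.remove(best)
        uncovered -= graph[best]
    return picked

def select_training_data(graph, n):
    """Repeat cycles on the remaining data until n items are selected."""
    all_types = set().union(*graph.values())
    remaining = set(graph)
    selected = []
    while len(selected) < n and remaining:
        cycle = greedy_cycle(graph, remaining, all_types)
        if not cycle:
            break
        take = cycle[: n - len(selected)]  # the last cycle may be cut short
        selected.extend(take)
        remaining.difference_update(take)
    return selected

def type_count_spread(graph, selected):
    """Assumed deviation measure: max minus min of per-type counts."""
    counts = Counter(t for d in selected for t in graph[d])
    all_types = set().union(*graph.values())
    per_type = [counts[t] for t in all_types]
    return max(per_type) - min(per_type)

# Hypothetical binary graph: data id -> individual types it matches
graph = {
    "img1": {"sunny", "car"},
    "img2": {"sunny", "person"},
    "img3": {"rainy", "car"},
    "img4": {"foggy", "person"},
    "img5": {"rainy", "person"},
    "img6": {"sunny", "car"},
    "img7": {"foggy", "car"},
}

chosen = select_training_data(graph, 4)
print(chosen, type_count_spread(graph, chosen))
```

Because every completed cycle matches every individual type at least once, the per-type counts of the selected data stay close to one another; whether the spread falls within a given threshold deviation still has to be checked against that threshold.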

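The integer-programming selection in the claims (a P×Q binary matrix multiplied by a Q-dimensional 0/1 selection vector, each entry of the resulting P-dimensional vector being 1 or more, with the sum of the selection variables minimized) can be illustrated on a toy instance. The Python sketch below is an illustration under loud assumptions: it solves the small integer program by exhaustive enumeration instead of a real integer-programming solver, the names `min_cover` and `select_by_repeated_cover` and the 3×6 matrix `A` are hypothetical, and stopping when the remaining data can no longer cover every individual type is an assumed behavior that the source does not specify.

```python
from itertools import combinations

def min_cover(A, columns):
    """Smallest set of columns j (x_j = 1) with sum_j A[p][j] * x_j >= 1 for every row p.

    Exhaustive search over subsets by increasing size, so the first hit
    minimizes the sum of the 0/1 selection variables. Returns None if infeasible.
    """
    P = len(A)
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            if all(sum(A[p][j] for j in subset) >= 1 for p in range(P)):
                return list(subset)
    return None

def select_by_repeated_cover(A, n):
    """Select minimum covers from the remaining data until n or more are chosen."""
    remaining = list(range(len(A[0])))
    selected = []
    while len(selected) < n:
        subset = min_cover(A, remaining)
        if subset is None:
            break  # assumed: stop when the remaining data cannot cover all types
        selected.extend(subset)
        remaining = [j for j in remaining if j not in subset]
    return selected

# Hypothetical binary matrix: rows are P = 3 individual types,
# columns are Q = 6 training data; A[p][j] = 1 if data j matches type p.
A = [
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]

print(select_by_repeated_cover(A, 4))
```

Enumeration is exponential in Q and is only practical for toy sizes. Relaxing the selection variables from {0, 1} to the interval [0, 1] gives the linear-programming form recited in the claims.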
Description

This invention relates to a method for uniformly selecting, without bias or variability in the data, training data for training a deep learning model from the total training data stored in a data pool, and to a training data selection device using this method.
Generally, deep learning models recognize complex patterns in images, text, sound, and other data to generate accurate insights and predictions, and are applied in various fields such as computer vision, speech recognition, autonomous vehicles, robotics, natural language processing, and medical image analysis. For such deep learning models to perform their intended tasks accurately, they must be trained on a large amount of training data.
Conventional methods for selecting training data for deep learning models from a collected data pool include random sampling, which randomly selects a target number of training data from the total training data stored in the data pool, and vector quantization, which clusters the vectors generated for each of the training data by embedding extraction into groups and then selects a representative value for each group.
For example, Patent Document 1 discloses a clustering device and method based on similarity using deep learning techniques, and Patent Document 2 discloses a system, method, and program for extracting high-contribution items to improve the performance of multilayer neural networks (deep learning). Furthermore, Patent Document 3 discloses a training device and method for deep learning classification models, and Patent Document 4 discloses a system and method for training machine learning models using active learning.
However, conventional methods for selecting training data result in bias and variability across data types.
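The bias of random sampling on a skewed pool can be checked with a short calculation. The Python sketch below is illustrative only: the function name `expected_sample_counts` is hypothetical, and the pool composition is assumed to be 1,000,000 images split 70/20/5/5 across four weather types. It computes how many items of each type uniform random sampling selects on average.

```python
def expected_sample_counts(pool_counts, sample_size):
    """Expected per-type counts when sample_size items are drawn uniformly at random."""
    total = sum(pool_counts.values())
    return {t: sample_size * c / total for t, c in pool_counts.items()}

# Assumed pool composition (hypothetical labels)
pool_counts = {
    "sunny": 700_000,
    "cloudy": 200_000,
    "foggy": 50_000,
    "snowy_or_rainy": 50_000,
}

expected = expected_sample_counts(pool_counts, 10_000)
for weather, count in expected.items():
    print(f"{weather}: about {count:.0f} of {pool_counts[weather]} images")
```

The sample reproduces the proportions of the pool, so rare types stay rare in the selected training data.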
For example, if a data pool contains 1 million training images, of which 70% relate to sunny weather, 20% to cloudy weather, 5% to foggy weather, and 5% to snowy and/or rainy weather, then randomly sampling 10,000 training images would select only about 500 of the 50,000 images related to snowy and/or rainy weather. The selected training images would thus be biased and uneven across weather types.
Furthermore, although selecting training images by vector quantization, through embedding extraction and clustering, can somewhat mitigate the bias and variability in the types of training images selected, it cannot fundamentally prevent them.
Therefore, the applicant proposes a method for uniformly selecting training data for training a deep learning model, type by type and without bias or variability, from the total training data stored in a data pool.
Patent Document 1: Korean Published Patent No. 10-2023-0068941
Patent Document 2: Patent No. 6458072
Patent Document 3: Patent No. 7225614
Patent Document 4: U.S. Patent No. 11663409
The following drawings, attached for use in describing embodiments of the present invention, represent only a portion of the embodiments, and a person with ordinary skill in the art to which the present invention pertains (hereinafter referred to as a "person skilled in the art") can obtain other drawings based on these drawings without performing any inventive work.
Figure 1 is a schematic diagram showing a training data selection device for selecting training data for training a deep learning model according to one embodiment of the present invention.
Figure 2 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to the first embodiment of the present invention.
Figure 3 is a schematic diagram illustrating an example of generating individual types of training data in the first embodiment of the present invention.
Figure 4 is a schematic diagram illustrating another example of generating individual types of training data in the first embodiment of the present invention.
Figure 5 is a diagram illustrating a binary graph obtained by matching each of the training data with an individual type in the first embodiment of the present invention.
Figure 6a is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 6b is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 6c is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 6d is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 7 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to a second embodiment of the present invention.
Figure 8 is a schematic diagram illustrating a method for selecting training data for tr