JP-2026074726-A - A method for selecting training data for training a deep learning model and a training data selection device using the same.
Abstract
[Problem] To provide a method for selecting training data for training a deep learning model. [Solution] The method includes the steps of: a training data selection device generating at least one individual type corresponding to each of a large number of training images stored in a data pool, and generating a binary graph that matches each of the training images with its individual types; and the training data selection device, referring to the binary graph, (i) using an optimization algorithm to select, from among subsets each consisting of a predetermined number of training images that include all of the individual types, a specific subset having the fewest training images, and determining the remaining training images by removing a specific number of training images included in the specific subset, and (ii) using the optimization algorithm to select, from the remaining training images, at least one other specific subset consisting of a predetermined number of training images that include all of the individual types, and repeating this process until n training images have been selected. [Selection Diagram] Figure 8
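The selection loop summarized in the abstract can be illustrated with a minimal sketch. It is not the applicant's implementation: the function names, the sample pool, and the exhaustive search (which stands in for the unspecified optimization algorithm and is practical only for tiny pools) are all assumptions made for illustration. The sketch also assumes that each round must cover only the individual types still present among the remaining images, and that selection stops once at least n images have been chosen.

```python
from itertools import combinations


def build_graph(image_types):
    """Store the image -> individual-type edges of the graph as sets."""
    return {image: set(types) for image, types in image_types.items()}


def smallest_cover(graph, pool):
    """Exhaustively find a smallest subset of `pool` whose images together
    carry every individual type still present in `pool`.
    (Stand-in for the optimization algorithm; exponential in the pool size.)"""
    pool = sorted(pool)
    needed = set().union(*(graph[image] for image in pool))
    for size in range(1, len(pool) + 1):
        for subset in combinations(pool, size):
            covered = set().union(*(graph[image] for image in subset))
            if covered == needed:
                return list(subset)
    return []


def select_training_images(image_types, n):
    """Repeat: pick a smallest covering subset, remove it from the pool,
    and continue until at least n images are selected or the pool is empty."""
    graph = build_graph(image_types)
    remaining = set(graph)
    selected = []
    while len(selected) < n and remaining:
        subset = smallest_cover(graph, remaining)
        if not subset:
            break
        selected.extend(subset)
        remaining -= set(subset)
    return selected


# Hypothetical pool: each image is tagged with a weather type and a time type.
pool = {
    "img1": ["sunny", "day"],
    "img2": ["sunny", "night"],
    "img3": ["rainy", "day"],
    "img4": ["rainy", "night"],
    "img5": ["sunny", "day"],
}
picked = select_training_images(pool, n=4)
print(picked)
```

In this toy pool, each round picks two images that jointly cover all four individual types, so the rare "rainy" images are selected as often as the common "sunny" ones.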
Inventors
- 金 桂賢
- 李 鉉東
Assignees
- スパーブエーアイ カンパニー リミテッド
Dates
- Publication Date
- 20260507
- Application Date
- 20241021
Claims (20)
- A method for selecting training data for training a deep learning model, the method comprising: (a) generating, by a training data selection device, at least one individual type corresponding to each of a large number of training images stored in a data pool, and generating a binary graph that matches each of the training images with its individual types; and (b) by the training data selection device, (i) referring to the binary graph and selecting, through an optimization algorithm, from among subsets each consisting of a predetermined number of training images that include all of the individual types, a specific subset having the fewest training images, and determining the remaining training images by excluding a specific number of training images included in the specific subset, and (ii) selecting from the remaining training images, through the optimization algorithm, at least one other specific subset consisting of a predetermined number of training images that include all of the individual types, and repeating this process until n training images for training the deep learning model are selected, where n is the target number of training images for training the deep learning model and is an integer of two or more.
- The method according to claim 1, wherein, in step (b), the individual types include a first_1 individual type to a first_x individual type (where x is an integer of 1 or more) corresponding to a first type that each of the training images has, and a second_1 individual type to a second_y individual type (where y is an integer of 1 or more) corresponding to a second type that each of the training images has, and the training data selection device selects the n training images such that, among the individual types matched with the n training images, the numbers of the first_1 to first_x individual types corresponding to the first type and the numbers of the second_1 to second_y individual types corresponding to the second type are within a threshold deviation, the numbers of the first_1 to first_x individual types are within a first threshold deviation, and the numbers of the second_1 to second_y individual types are within a second threshold deviation.
- The method according to claim 1, wherein, in step (b), the training data selection device, using linear programming, calculates the product of a P×Q binary matrix, which corresponds to the P individual types and the Q training images in the binary graph, and a Q-dimensional vector representing a selection goodness-of-fit variable for each of the Q training images, to generate a P-dimensional vector representing, for each of the P individual types, the sum of the selection goodness-of-fit variables of the training images belonging to that individual type; under the conditions that each sum in the P-dimensional vector is 1 or more and each selection goodness-of-fit variable in the Q-dimensional vector is 0 or more and 1 or less, selects a specific subset that includes the specific training images corresponding to the specific selection goodness-of-fit variables for which the total of the selection goodness-of-fit variables is minimized; determines the remaining training images by removing the specific training images included in the specific subset from the Q training images; and repeats the process of selecting at least one other specific subset from the remaining training images using the linear programming until the number of selected training images is n or more.
- The method according to claim 3, wherein, in step (b), the training data selection device selects the specific subset using a dual linear programming method that applies at least one of constraint merging, constraint separation, and sign changing in the linear programming.
- The method according to claim 1, wherein, in step (b), the training data selection device, using integer programming, calculates the product of a P×Q binary matrix, which corresponds to the P individual types and the Q training images in the binary graph, and a Q-dimensional vector representing a selection variable for each of the Q training images, to generate a P-dimensional vector representing, for each of the P individual types, the number of selected training images belonging to that individual type; under the conditions that each number in the P-dimensional vector is 1 or more and each selection variable in the Q-dimensional vector is 0 or 1, selects a specific subset that includes the specific training images corresponding to the specific selection variables for which the total of the selection variables is minimized; determines the remaining training images by removing the specific training images included in the specific subset from the Q training images; and repeats the process of selecting at least one other specific subset from the remaining training images using the integer programming until the number of selected training images is n or more.
- The method according to claim 5, wherein, in step (b), the training data selection device selects the specific subset using a dual integer programming method that applies at least one of constraint merging, constraint separation, and sign changing in the integer programming.
- The method according to claim 1, wherein, in step (a), the training data selection device transmits the training images to a labeler terminal so that a labeler corresponding to the labeler terminal generates at least one individual type corresponding to each of the training images.
- The method according to claim 1, wherein, in step (a), the training data selection device performs a process of generating a first scene vector corresponding to each of the training images by performing a first embedding operation on each of the training images and clustering the first scene vectors to generate a first scene cluster, or performs a process of generating a kth scene vector corresponding to each of the training images by performing a kth embedding operation (where k is an integer of 1 or more) on each of the training images and clustering the kth scene vectors to generate a kth scene cluster, and generates the individual type corresponding to each of the training images by referring to the first scene cluster to the kth scene cluster.
- The method according to claim 1, wherein, in step (a), the training data selection device checks the metadata contained in each of the training images and further refers to the shooting time contained in the metadata to generate the individual type corresponding to each of the training images.
- The method according to claim 1, wherein, in step (a), the training data selection device (i) performs a specific embedding operation on each of the training images to generate a specific scene vector corresponding to each of the training images, and clusters the specific scene vectors to generate a specific scene cluster, (ii) refers to the metadata contained in each of the training images to confirm the shooting time of each of the training images, and (iii) refers to the specific scene cluster and the shooting time to generate the individual type corresponding to each of the training images.
- A training data selection device for selecting training data for training a deep learning model, comprising: a memory storing instructions for selecting training data for training the deep learning model; and a processor that performs operations for selecting the training data for training the deep learning model in accordance with the instructions stored in the memory, wherein the processor performs: (I) a process of generating at least one individual type corresponding to each of a large number of training images stored in a data pool, and generating a binary graph that matches each of the training images with its individual types; and (II) a process of (i) referring to the binary graph and selecting, through an optimization algorithm, from among subsets each consisting of a predetermined number of training images that include all of the individual types, a specific subset having the fewest training images, and determining the remaining training images by excluding a specific number of training images included in the specific subset, and (ii) selecting from the remaining training images, through the optimization algorithm, at least one other specific subset consisting of a predetermined number of training images that include all of the individual types, and repeating this process until n training images for training the deep learning model are selected, where n is the target number of training images for training the deep learning model and is an integer of two or more.
- The training data selection device according to claim 11, wherein, in the process of (II), the individual types include a first_1 individual type to a first_x individual type (where x is an integer of 1 or more) corresponding to a first type that each of the training images has, and a second_1 individual type to a second_y individual type (where y is an integer of 1 or more) corresponding to a second type that each of the training images has, and the processor selects the n training images such that, among the individual types matched with the n training images, the numbers of the first_1 to first_x individual types corresponding to the first type and the numbers of the second_1 to second_y individual types corresponding to the second type are within a threshold deviation, the numbers of the first_1 to first_x individual types are within a first threshold deviation, and the numbers of the second_1 to second_y individual types are within a second threshold deviation.
- The training data selection device according to claim 11, wherein, in the process of (II), the processor, using linear programming, calculates the product of a P×Q binary matrix, which corresponds to the P individual types and the Q training images in the binary graph, and a Q-dimensional vector representing a selection goodness-of-fit variable for each of the Q training images, to generate a P-dimensional vector representing, for each of the P individual types, the sum of the selection goodness-of-fit variables of the training images belonging to that individual type; under the conditions that each sum in the P-dimensional vector is 1 or more and each selection goodness-of-fit variable in the Q-dimensional vector is 0 or more and 1 or less, selects a specific subset that includes the specific training images corresponding to the specific selection goodness-of-fit variables for which the total of the selection goodness-of-fit variables is minimized; determines the remaining training images by removing the specific training images included in the specific subset from the Q training images; and repeats the process of selecting at least one other specific subset from the remaining training images using the linear programming until the number of selected training images is n or more.
- The training data selection device according to claim 13, wherein, in the process of (II), the processor selects the specific subset using a dual linear programming method that applies at least one of constraint merging, constraint separation, and sign changing in the linear programming.
- The training data selection device according to claim 11, wherein, in the process of (II), the processor, using integer programming, calculates the product of a P×Q binary matrix, which corresponds to the P individual types and the Q training images in the binary graph, and a Q-dimensional vector representing a selection variable for each of the Q training images, to generate a P-dimensional vector representing, for each of the P individual types, the number of selected training images belonging to that individual type; under the conditions that each number in the P-dimensional vector is 1 or more and each selection variable in the Q-dimensional vector is 0 or 1, selects a specific subset that includes the specific training images corresponding to the specific selection variables for which the total of the selection variables is minimized; determines the remaining training images by removing the specific training images included in the specific subset from the Q training images; and repeats the process of selecting at least one other specific subset from the remaining training images using the integer programming until the number of selected training images is n or more.
- The training data selection device according to claim 15, wherein, in the process of (II), the processor selects the specific subset using a dual integer programming method that applies at least one of constraint merging, constraint separation, and sign changing in the integer programming.
- The training data selection device according to claim 11, wherein, in the process of (I), the processor transmits the training images to a labeler terminal so that a labeler corresponding to the labeler terminal generates at least one individual type corresponding to each of the training images.
- The training data selection device according to claim 11, wherein, in the process of (I), the processor performs a process of generating a first scene vector corresponding to each of the training images by performing a first embedding operation on each of the training images and clustering the first scene vectors to generate a first scene cluster, or performs a process of generating a kth scene vector corresponding to each of the training images by performing a kth embedding operation (where k is an integer of 1 or more) on each of the training images and clustering the kth scene vectors to generate a kth scene cluster, and generates the individual type corresponding to each of the training images by referring to the first scene cluster to the kth scene cluster.
- The training data selection device according to claim 11, wherein, in the process of (I), the processor checks the metadata contained in each of the training images and further refers to the shooting time contained in the metadata to generate the individual type corresponding to each of the training images.
- The training data selection device according to claim 11, wherein, in the process of (I), the processor (i) performs a specific embedding operation on each of the training images to generate a specific scene vector corresponding to each of the training images, and clusters the specific scene vectors to generate a specific scene cluster, (ii) refers to the metadata contained in each of the training images to confirm the shooting time of each of the training images, and (iii) refers to the specific scene cluster and the shooting time to generate the individual type corresponding to each of the training images.
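The linear-programming and integer-programming selections recited in the claims can be written compactly. The formulation below is one reading of the claim wording, not a formula quoted from the specification. The symbols are introduced here for illustration only: A is the P-by-Q binary matrix with A_{pq} = 1 when training image q is matched with individual type p, x is the Q-dimensional vector of selection variables, and \mathbf{1}_P is the all-ones vector of dimension P.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{q=1}^{Q} x_q \\
\text{subject to} \quad & A x \ge \mathbf{1}_P, \\
& 0 \le x_q \le 1 \quad (q = 1, \dots, Q) && \text{(linear programming)}, \\
& x_q \in \{0, 1\} \quad (q = 1, \dots, Q) && \text{(integer programming)}.
\end{aligned}
```

Each entry of Ax is the per-type sum described in the claims, so the constraint A x \ge \mathbf{1}_P requires every individual type to be represented. After each solve, the columns of A belonging to the selected images are removed and the problem is solved again on the remaining images until n or more images have been selected. How fractional linear-programming values are converted into a selected subset is not stated in the claims and is left open here.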
Description
This invention relates to a method for uniformly selecting training data for training a deep learning model from all of the training data stored in a data pool, without bias or variability in the data, and to a training data selection device using this method.

Generally, deep learning models recognize complex patterns in images, text, sound, and other data to generate accurate insights and predictions, and are applied in various fields such as computer vision, speech recognition, autonomous vehicles, robotics, natural language processing, and medical image analysis. For such deep learning models to perform their intended tasks accurately, they must be trained on a large amount of training data.

Conventional methods for selecting training data for deep learning models from a collected data pool include random sampling, which selects a target number of training data from all of the training data stored in the data pool, and vector quantization, which clusters the vectors generated for each item of training data by embedding extraction into groups and then selects a representative value for each group.

For example, Patent Document 1 discloses a method for preparing cognitive data for training a deep learning model, and Patent Document 2 discloses a similarity-based clustering device and method utilizing deep learning training techniques. Furthermore, Patent Document 3 discloses a training device and method for a deep learning classification model, and Patent Document 4 discloses a system and method for training a machine learning model using active learning.

However, conventional methods for selecting training data have the problem of producing bias and variability in the types of data selected.
For example, if a data pool contains 1 million training images, of which 70% relate to sunny weather, 20% to cloudy weather, 5% to foggy weather, and 5% to snowy and/or rainy weather, then randomly sampling 10,000 training images would select only about 500 of the 50,000 images related to snowy and/or rainy weather. The selected training images would therefore be biased and variable with respect to weather type.

Furthermore, while selecting training images by vector quantization, that is, by embedding extraction and clustering, can somewhat mitigate the bias and variability in the types of training images selected, it cannot fundamentally prevent problems related to data bias and variability.

Therefore, the applicant proposes a method for uniformly selecting, by type, training data for training a deep learning model from all of the training data stored in a data pool, without bias or variability.

Patent Document 1: U.S. Patent No. 11475335
Patent Document 2: Korean Published Patent No. 10-2023-0068941
Patent Document 3: Patent No. 7225614
Patent Document 4: U.S. Patent No. 11663409

The following drawings, attached for use in describing embodiments of the present invention, represent only a portion of the embodiments, and a person with ordinary skill in the art to which the present invention pertains (hereinafter referred to as "a person skilled in the art") can obtain other drawings based on these drawings without performing any inventive work.
Figure 1 is a schematic diagram showing a training data selection device for selecting training data for training a deep learning model according to one embodiment of the present invention.
Figure 2 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to the first embodiment of the present invention.
Figure 3 is a schematic diagram illustrating an example of generating individual types of training data in the first embodiment of the present invention.
Figure 4 is a schematic diagram illustrating another example of generating individual types of training data in the first embodiment of the present invention.
Figure 5 is a diagram illustrating a binary graph obtained by matching each of the training data with an individual type in the first embodiment of the present invention.
Figure 6a is a schematic diagram showing the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 6b is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 6c is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 6d is a schematic diagram illustrating the process of selecting training data by referring to a binary graph in the first embodiment of the present invention.
Figure 7 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to a second embodiment of the present invention.
Figure 8 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to a third embodi