JP-2026074717-A - A method for selecting training data for training a deep learning model and a training data selection device using the same.
Abstract
[Problem] To provide a method for selecting training data for training a deep learning model. [Solution] The method includes the steps of: a training data selection device generating a bipartite graph that matches each of a large number of training images with individual types; and the training data selection device (i) referring to the bipartite graph to determine, for each training image, the number of corresponding individual types it matches among the individual types, and selecting a specific training image that matches the largest number of corresponding individual types; (ii) determining, for each training image, the number of remaining corresponding individual types it matches among the individual types left after the covered corresponding individual types are removed, selecting another specific training image with the largest number of remaining corresponding individual types, and repeating this process until a subset of training images matching all individual types has been selected; and (iii) repeating this cycle on the training images remaining after the selected subset is removed from the large number of training images, until n training images are selected. [Selection Diagram] Figure 7
Inventors
- 金 桂賢 (Kim, Kye-Hyeon)
- 李 鉉東 (Lee, Hyeon-Dong)
Assignees
- Superb AI Co., Ltd. (スパーブエーアイ カンパニー リミテッド)
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-10-21
Claims (14)
- A method for selecting training data for training a deep learning model, comprising the steps of: (a) a training data selection device generating at least one individual type corresponding to each of a large number of training images stored in a data pool, and generating a bipartite graph matching each of the training images with the individual types; and (b) the training data selection device (i) referring to the bipartite graph to determine, for each training image, the number of corresponding individual types it matches among the individual types, and selecting a specific training image with the largest number of corresponding individual types, (ii) determining, for each training image, the number of remaining corresponding individual types it matches among the individual types excluding the covered corresponding individual types, selecting another specific training image with the largest number of remaining corresponding individual types, and repeating this process to select a subset of training images matching all of the individual types, and (iii) repeating the cycle on the remaining training images, excluding the selected subset from the large number of training images, until n training images for training the deep learning model are selected (where n is the target number of training images for training the deep learning model and is an integer of 2 or more).
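The selection cycle of claim 1 amounts to a repeated greedy set cover over the bipartite graph. Below is a minimal Python sketch, assuming the graph is encoded as a dict from image id to the set of individual types it matches; the function name, the encoding, and the example labels are illustrative, not from the patent:

```python
def select_training_images(graph, n):
    """Greedy cycle selection over a bipartite image/type graph.

    graph: dict mapping an image id to the set of individual types it
           matches (one side of the bipartite graph).
    n:     target number of training images to select.
    """
    all_types = set().union(*graph.values())
    remaining = dict(graph)              # images not yet selected
    selected = []
    while remaining and len(selected) < n:
        uncovered = set(all_types)       # each cycle restarts full coverage
        while uncovered and remaining and len(selected) < n:
            # pick the image matching the most still-uncovered types
            best = max(remaining, key=lambda img: len(graph[img] & uncovered))
            if not graph[best] & uncovered:
                break                    # no remaining image covers a new type
            selected.append(best)
            uncovered -= graph[best]
            del remaining[best]
    return selected
```

With four toy images tagged `{"a": {"sunny", "day", "urban"}, "b": {"rainy", "night"}, "c": {"sunny"}, "d": {"foggy", "day"}}` and `n = 3`, the cycle picks `a` (3 uncovered types), then `b` (2), then `d` (covers the last type, `foggy`), mirroring steps (i) and (ii) of the claim.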
- The method according to claim 1, wherein, in step (b), the individual types include a 1_1 individual type to a 1_x individual type (where x is an integer of 1 or more) corresponding to a first type category of each of the training images, and a 2_1 individual type to a 2_y individual type (where y is an integer of 1 or more) corresponding to a second type category of each of the training images, and wherein the training data selection device selects the n training images such that the counts of the 1_1 to 1_x individual types and of the 2_1 to 2_y individual types matching the n training images are within a threshold deviation, the counts of the 1_1 to 1_x individual types are within a first threshold deviation, and the counts of the 2_1 to 2_y individual types are within a second threshold deviation.
- The method according to claim 1, wherein, in step (b), if there are multiple training images with the largest number of corresponding individual types, the training data selection device selects one of them according to a first criterion as the specific training image, and if there are multiple training images with the largest number of remaining corresponding individual types, it selects one of them according to a second criterion as the other specific training image.
- The method according to claim 1, wherein, in step (a), the training data selection device transmits the training images to a labeler terminal, and causes a labeler corresponding to the labeler terminal to generate at least one individual type corresponding to each of the training images.
- The method according to claim 1, wherein, in step (a), the training data selection device performs at least one of a process of generating first scene vectors corresponding to the respective training images by applying a first embedding operation to each of the training images and clustering the first scene vectors to generate a first scene cluster, through a process of generating k-th scene vectors corresponding to the respective training images by applying a k-th embedding operation (where k is an integer of 1 or more) to each of the training images and clustering the k-th scene vectors to generate a k-th scene cluster, and generates the individual types corresponding to the training images by referring to the first scene cluster to the k-th scene cluster.
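The scene-vector clustering named in this claim can be illustrated with a toy k-means, assuming the embedding operation has already produced one vector per training image and treating each cluster id as an individual type. The deterministic farthest-point initialization and all names here are illustrative choices, not the patent's:

```python
import numpy as np

def scene_clusters(vectors, k, iters=10):
    """Toy k-means over precomputed scene vectors.

    vectors: (N, D) array, one scene vector per training image
             (assumed output of some embedding operation, not shown).
    Returns one cluster label per image; labels act as individual types.
    """
    # farthest-point initialization keeps the sketch deterministic
    centers = [vectors[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(vectors - c, axis=1) for c in centers],
                      axis=0)
        centers.append(vectors[dist.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each vector to its nearest center
        d = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster went empty
        for j in range(k):
            if (labels == j).any():
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels
```

On two well-separated blobs of vectors, the labels split cleanly into two groups, which is all the individual-type generation in the claim needs from the clustering step.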
- The method according to claim 1, wherein, in step (a), the training data selection device checks metadata contained in each of the training images, and further refers to a shooting time contained in each piece of metadata to generate the individual type corresponding to each of the training images.
- The method according to claim 1, wherein, in step (a), the training data selection device (i) applies a specific embedding operation to each of the training images to generate specific scene vectors corresponding to the respective training images, and clusters the specific scene vectors to generate a specific scene cluster, (ii) refers to the metadata contained in each of the training images to determine the shooting time of each of the training images, and (iii) generates the individual type corresponding to each of the training images by referring to the specific scene cluster and the shooting time.
- A training data selection device for selecting training data for training a deep learning model, comprising: a memory storing instructions for selecting training data for training a deep learning model; and a processor that performs operations for selecting the training data in accordance with the instructions stored in the memory; wherein the processor performs: (I) a process of generating at least one individual type corresponding to each of a large number of training images stored in a data pool, and generating a bipartite graph matching each of the training images with the individual types; and (II) a process of (i) referring to the bipartite graph to determine, for each training image, the number of corresponding individual types it matches among the individual types, and selecting a specific training image with the largest number of corresponding individual types, (ii) determining, for each training image, the number of remaining corresponding individual types it matches among the individual types excluding the covered corresponding individual types, selecting another specific training image with the largest number of remaining corresponding individual types, and repeating this process to select a subset of training images matching all of the individual types, and (iii) repeating the cycle on the remaining training images, excluding the selected subset from the large number of training images, until n training images for training the deep learning model are selected (where n is the target number of training images for training the deep learning model and is an integer of 2 or more).
- The training data selection device according to claim 8, wherein, in process (II), the individual types include a 1_1 individual type to a 1_x individual type (where x is an integer of 1 or more) corresponding to a first type category of each of the training images, and a 2_1 individual type to a 2_y individual type (where y is an integer of 1 or more) corresponding to a second type category of each of the training images, and wherein the processor selects the n training images such that the counts of the 1_1 to 1_x individual types and of the 2_1 to 2_y individual types matching the n training images are within a threshold deviation, the counts of the 1_1 to 1_x individual types are within a first threshold deviation, and the counts of the 2_1 to 2_y individual types are within a second threshold deviation.
- The training data selection device according to claim 8, wherein, in process (II), if there are multiple training images with the largest number of corresponding individual types, the processor selects one of them according to a first criterion as the specific training image, and if there are multiple training images with the largest number of remaining corresponding individual types, it selects one of them according to a second criterion as the other specific training image.
- The training data selection device according to claim 8, wherein, in process (I), the processor transmits the training images to a labeler terminal and causes a labeler corresponding to the labeler terminal to generate at least one individual type corresponding to each of the training images.
- The training data selection device according to claim 8, wherein, in process (I), the processor performs at least one of a process of applying a first embedding operation to each of the training images to generate first scene vectors corresponding to the respective training images and clustering the first scene vectors to generate a first scene cluster, through a process of applying a k-th embedding operation (where k is an integer of 1 or more) to each of the training images to generate k-th scene vectors corresponding to the respective training images and clustering the k-th scene vectors to generate a k-th scene cluster, and generates the individual types corresponding to the training images by referring to the first scene cluster to the k-th scene cluster.
- The training data selection device according to claim 8, wherein, in process (I), the processor checks metadata contained in each of the training images, and further refers to a shooting time contained in each piece of metadata to generate the individual type corresponding to each of the training images.
- The training data selection device according to claim 8, wherein, in process (I), the processor (i) applies a specific embedding operation to each of the training images to generate specific scene vectors corresponding to the respective training images, and clusters the specific scene vectors to generate a specific scene cluster, (ii) refers to the metadata contained in each of the training images to determine the shooting time of each of the training images, and (iii) generates the individual type corresponding to each of the training images by referring to the specific scene cluster and the shooting time.
Description
This invention relates to a method for uniformly selecting training data for training a deep learning model from all of the training data stored in a data pool, without bias or variability across data types, and to a training data selection device using the method.

Generally, deep learning models recognize complex patterns in images, text, sound, and other data to produce accurate insights and predictions, and are applied in fields such as computer vision, speech recognition, autonomous vehicles, robotics, natural language processing, and medical image analysis. For such deep learning models to perform their intended tasks accurately, they must be trained on a large amount of training data.

Conventional methods for selecting training data for a deep learning model from a collected data pool include random sampling, which selects a target number of training data items from all of the training data stored in the data pool, and vector quantization, which clusters the vectors obtained by embedding extraction from each training data item into groups and then selects a representative value from each group. For example, Patent Document 1 discloses a method for preparing cognitive data for training a deep learning model, and Patent Document 2 discloses a similarity-based clustering device and method using deep learning training techniques. Further, Patent Document 3 discloses a training device and method for a deep learning classification model, and Patent Document 4 discloses a system and method for training a machine learning model using active learning. However, these conventional methods for selecting training data suffer from bias and variability across data types.
For example, if a data pool contains 1 million training images, of which 70% relate to sunny weather, 20% to cloudy weather, 5% to foggy weather, and 5% to snowy and/or rainy weather, then randomly sampling 10,000 training images would select only about 500 images from the 50,000 images related to snowy and/or rainy weather, producing bias and variability in the selected training images by weather type. Further, while vector quantization, through embedding extraction and clustering, can somewhat mitigate the bias and variability in the types of training images selected, it cannot fundamentally prevent problems of data bias and variability. The applicant therefore proposes a method for uniformly selecting, by type, training data for training a deep learning model from all of the training data stored in a data pool, without bias or variability.

[Patent Document 1] U.S. Patent No. 11,475,335
[Patent Document 2] Korean Published Patent No. 10-2023-0068941
[Patent Document 3] Japanese Patent No. 7225614
[Patent Document 4] U.S. Patent No. 11,663,409

The following drawings, attached for use in describing embodiments of the present invention, represent only a portion of the embodiments, and a person having ordinary skill in the art to which the present invention pertains (hereinafter, "ordinarily skilled person") could obtain other drawings from these drawings without inventive work.
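The arithmetic behind this example can be reproduced in a few lines of Python; the weather proportions come from the passage above, while the variable names are illustrative:

```python
pool_size = 1_000_000
sample_size = 10_000
weather_share = {"sunny": 0.70, "cloudy": 0.20, "foggy": 0.05, "snow_rain": 0.05}

# images of each weather type present in the pool
pool_counts = {w: round(pool_size * p) for w, p in weather_share.items()}

# expected counts under uniform random sampling: the sample simply
# mirrors the pool's skew, so rare types stay rare
expected_sampled = {w: round(sample_size * p) for w, p in weather_share.items()}

print(pool_counts["snow_rain"])       # 50000 snowy/rainy images in the pool
print(expected_sampled["snow_rain"])  # only about 500 expected in the sample
```

This is exactly the imbalance the bipartite-graph selection cycle is meant to avoid: it selects by type coverage rather than by the pool's proportions.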
Figure 1 is a schematic diagram showing a training data selection device for selecting training data for training a deep learning model according to one embodiment of the present invention.
Figure 2 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to a first embodiment of the present invention.
Figure 3 is a schematic diagram illustrating an example of generating individual types of training data in the first embodiment of the present invention.
Figure 4 is a schematic diagram illustrating another example of generating individual types of training data in the first embodiment of the present invention.
Figure 5 is a diagram illustrating a bipartite graph obtained by matching each of the training data with individual types in the first embodiment of the present invention.
Figures 6a to 6d are schematic diagrams illustrating the process of selecting training data by referring to the bipartite graph in the first embodiment of the present invention.
Figure 7 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to a second embodiment of the present invention.
Figure 8 is a schematic diagram illustrating a method for selecting training data for training a deep learning model according to a third embodiment of the present invention.