US-12620207-B2 - Method and apparatus for generating an augmented sample set

Abstract

A method and apparatus are provided for generating an augmented sample set for enriching a first training dataset for training a model. The method comprises: using data augmentation and corresponding labeling or using label augmentation to add a first augmented sample set to the first training dataset, wherein the data augmentation and corresponding labeling, or the label augmentation purposely puts a first distinguishing characteristic of a first part-of-interest or an associated label into the first training dataset to cause the first distinguishing characteristic of the first part-of-interest to be emphasized to enable the model to learn a generalizable principle of the first distinguishing characteristic, wherein the first distinguishing characteristic is for differentiating the first part-of-interest from a second part-of-interest. Methods for training a model, using a model to differentiate parts-of-interest, and using a model to infer a dataset are also provided.

Inventors

Ya-Jian CHENG

Assignees

Ya-Jian CHENG

Dates

Publication Date: 2026-05-05
Application Date: 2022-11-18

Claims (20)

1 . A method for generating an augmented sample set for enriching a first training dataset for training a first model for differentiating a plurality of parts-of-interest from each other, wherein the plurality of parts-of-interest comprises a first part-of-interest and a second part-of-interest, the method comprising: using data augmentation and corresponding labeling or label augmentation to add a first augmented sample set to the first training dataset, wherein the data augmentation and corresponding labeling or the label augmentation purposely puts a first distinguishing characteristic of the first part-of-interest or an associated label into the first training dataset to cause the first distinguishing characteristic of the first part-of-interest to be emphasized such that the first model learns a generalizable principle of the first distinguishing characteristic, wherein the first distinguishing characteristic is for differentiating the first part-of-interest from the second part-of-interest.
2 . The method of claim 1 , wherein the step of using the data augmentation and corresponding labeling or the label augmentation to add the first augmented sample set to the first training dataset comprises: causing the first distinguishing characteristic of the first part-of-interest to have a first appearance and a first non-distinguishing characteristic of the first part-of-interest to have a second appearance in the first augmented sample set, wherein the first appearance and the second appearance are differential, and wherein the first model is prone to overfit to the first non-distinguishing characteristic when differentiating the first part-of-interest from the second part-of-interest; and labeling the first augmented sample set according to the first appearance and the second appearance that are differential.
3 . The method of claim 2 , wherein the data augmentation comprises superimposing the first part-of-interest with the second part-of-interest in two parts of the first augmented sample set; wherein in one of the two parts, a first superimposition weight of the first distinguishing characteristic is higher than a second superimposition weight of a second distinguishing characteristic of the second part-of-interest, and in the other of the two parts, a fourth superimposition weight of the second distinguishing characteristic is higher than a third superimposition weight of the first distinguishing characteristic; wherein the first appearance has a first differentiable degree of the first distinguishing characteristic with the first superimposition weight with respect to the second distinguishing characteristic with the fourth superimposition weight; wherein in the one of the two parts, the first superimposition weight of the first non-distinguishing characteristic is higher than the second superimposition weight of a second non-distinguishing characteristic of the second part-of-interest, and in the other of the two parts, the fourth superimposition weight of the second non-distinguishing characteristic is higher than the third superimposition weight of the first non-distinguishing characteristic; wherein the second appearance has a second differentiable degree of the first non-distinguishing characteristic with the first superimposition weight with respect to the second non-distinguishing characteristic with the fourth superimposition weight; wherein the first differentiable degree is higher than the second differentiable degree, and the second differentiable degree is indifferentiable.
4 . The method of claim 2 , wherein the step of using the data augmentation and corresponding labeling or the label augmentation to add the first augmented sample set to the first training dataset further comprises: in a plurality of first parts of the first augmented sample set, using the data augmentation to cause a third appearance of the first distinguishing characteristic to change from being more different from the first appearance to being more similar to the first appearance; and labeling so that the third appearance of the first distinguishing characteristic causes a characteristic corresponding to the first distinguishing characteristic that is learned during training the first model to be refined to be an adequate range of distinguishing characteristic of the first part-of-interest, wherein the data augmentation configures regions of interest to be selected in a sample to reflect the adequate range of the distinguishing characteristic.
5 . The method of claim 2 , wherein the step of using the data augmentation and corresponding labeling or the label augmentation to add the first augmented sample set to the first training dataset further comprises: in a plurality of first parts of the first augmented sample set, using the data augmentation to cause a third appearance of the first distinguishing characteristic to change from being more different from the first appearance to being more similar to the first appearance; and labeling so that the third appearance of the first distinguishing characteristic causes a characteristic corresponding to the first distinguishing characteristic that is learned during training the first model to be refined to be an adequate range of distinguishing characteristic of the first part-of-interest, wherein the plurality of first parts are separated by a threshold into a plurality of first groups that correspond to an inadequate range of the distinguishing characteristic and the adequate range of the distinguishing characteristic, respectively, wherein the threshold is adjusted according to a sensitivity and specificity requirement.
6 . The method of claim 5 , wherein the data augmentation comprises superimposing the first part-of-interest with the second part-of-interest in the first parts of the first augmented sample set; wherein in a first one of the first parts, a first superimposition weight of the first distinguishing characteristic is higher than a second superimposition weight of a second distinguishing characteristic of the second part-of-interest, and in a second one of the first parts, a fourth superimposition weight of the second distinguishing characteristic is higher than a third superimposition weight of the first distinguishing characteristic; wherein the first appearance has a first differentiable degree of the first distinguishing characteristic with the first superimposition weight with respect to the second distinguishing characteristic with the fourth superimposition weight; wherein in the first one of the first parts, the first superimposition weight of the first non-distinguishing characteristic is higher than the second superimposition weight of a second non-distinguishing characteristic of the second part-of-interest, and in the second one of the first parts, the fourth superimposition weight of the second non-distinguishing characteristic is higher than the third superimposition weight of the first non-distinguishing characteristic; wherein the second appearance has a second differentiable degree of the first non-distinguishing characteristic with the first superimposition weight with respect to the second non-distinguishing characteristic with the fourth superimposition weight; wherein the first differentiable degree is higher than the second differentiable degree, and the second differentiable degree is indifferentiable; and wherein the third appearance has a plurality of third differentiable degrees of the first distinguishing characteristic with respect to the second distinguishing characteristic, wherein each of the third differentiable degrees corresponds to two of the first parts, wherein the third differentiable degrees range from a fourth differentiable degree to a self of the first differentiable degree, wherein the fourth differentiable degree is lower than the first differentiable degree due to decreasing a first difference between the first superimposition weight and the third superimposition weight, and decreasing a second difference between the fourth superimposition weight and the second superimposition weight.
7 . The method of claim 2 , wherein the data augmentation that causes the first appearance and the second appearance that are differential forms at least one combination of the parts-of-interest, and the data augmentation further forms at least one additional combination of the parts-of-interest, wherein the at least one combination and the at least one additional combination are exhaustive combinations of the parts-of-interest or a subset of the exhaustive combinations of the parts-of-interest, wherein when the at least one combination and the at least one additional combination are the subset, the at least one additional combination is selected on the basis of at least one prediction error of the first model or an application requirement of the first model.
8 . The method of claim 1 , wherein a mechanism of the data augmentation is selected to reproduce an appearance of the first distinguishing characteristic in a rare sample using available samples, wherein the rare sample and the available samples are in the first training dataset before the first training dataset is enriched; and the first augmented sample set is formed using the available samples.
9 . The method of claim 1 , wherein labeling corresponding to the data augmentation comprises one or both of labeling for a main task of differentiating the parts-of-interest emphasizing the first distinguishing characteristic, or de-emphasizing non-distinguishing characteristic and further comprises labeling for at least one auxiliary task that assists the first model to perform the main task using a characteristic relevant to the main task, wherein the at least one auxiliary task is specific to a mechanism of the data augmentation.
10 . The method of claim 1 , wherein the first augmented sample set comprises a first sample that has an artifact caused by the data augmentation, and the first augmented sample set further comprises a second sample that has the artifact caused by the data augmentation and has a second label value differential with respect to a first label value of the first sample.
11 . The method of claim 1 , wherein the step of using the data augmentation and corresponding labeling or label augmentation to add the first augmented sample set to the first training dataset comprises: in a first part in the first augmented sample set, superimposing a basic learning part with an enhancing part with a first superimposition weight for the basic learning part and a second superimposition weight for the enhancing part, wherein the basic learning part has the first distinguishing characteristic of the first part-of-interest and a first non-distinguishing characteristic of the first part-of-interest having appearances differential in a first manner and the enhancing part has the first distinguishing characteristic of the first part-of-interest and the first non-distinguishing characteristic of the first part-of-interest having appearances differential in a second manner opposite to the first manner; and labeling according to one of the appearances differential in the first manner or the appearances differential in the second manner, and further labeling according to the first superimposition weight and the second superimposition weight that are differential so that the other one of the appearances differential in the first manner or the appearances differential in the second manner is implicitly labeled.
12 . The method of claim 11 , wherein the step of using the data augmentation and corresponding labeling or label augmentation to add the first augmented sample set to the first training dataset further comprises: in a plurality of second parts of the first augmented sample set, using the data augmentation to cause third superimposition weights of the first distinguishing characteristic to change from being more different from the first superimposition weight to being more similar to the first superimposition weight; and labeling so that the third superimposition weights of the first distinguishing characteristic cause a characteristic corresponding to the first distinguishing characteristic that is learned during training the first model to be refined to be an adequate range of distinguishing characteristic of the first part-of-interest, wherein the second parts are separated by a threshold into a plurality of first groups that correspond to an inadequate range of the distinguishing characteristic and the adequate range of the distinguishing characteristic, respectively, wherein the threshold is adjusted according to a sensitivity and specificity requirement of the first model.
13 . The method of claim 1 , wherein the step of using the data augmentation and corresponding labeling or the label augmentation to add the first augmented sample set to the first training dataset comprises: using first data in a first standard as an augmented label for reconstructing the first data from second data in a second standard, wherein the first augmented sample set comprises the second data with the augmented label; and wherein before reconstructing, in the first data, a first appearance of the first distinguishing characteristic is clear for the first distinguishing characteristic to be distinguishing and in the second data, a second appearance of the first distinguishing characteristic is not as clear as the first appearance for the first distinguishing characteristic to be distinguishing.
14 . The method of claim 1 , wherein the step of using the data augmentation and corresponding labeling or the label augmentation to add the first augmented sample set to the first training dataset comprises: using data collection to collect two sets of data between which the first distinguishing characteristic of the first part-of-interest has a first differentiable degree and a first non-distinguishing characteristic of the first part-of-interest has a second differentiable degree, wherein the first differentiable degree and the second differentiable degree are different; wherein the first model is prone to overfit to the first non-distinguishing characteristic when differentiating the first part-of-interest from the second part-of-interest; and wherein only a subset of samples come with both a first standard and a second standard while most samples come with the second standard; using the label augmentation to label whether a sample of the first standard comes with a corresponding second standard sample.
15 . The method of claim 14 , wherein the step of using the data augmentation and corresponding labeling or the label augmentation to add the first augmented sample set to the first training dataset further comprises: using the data collection to collect a plurality of first parts to synthesize the first augmented sample set, wherein third differentiable degrees each of which are between a corresponding two of the first parts change from being more different from the first differentiable degree to being more similar to the first differentiable degree; and using the label augmentation to label so that the third differentiable degrees of the first distinguishing characteristic cause a characteristic corresponding to the first distinguishing characteristic that is learned during training the first model to be refined to be an adequate range of distinguishing characteristic of the first part-of-interest.
16 . The method of claim 1 , further comprising training a second model by using a second training dataset enriched by a second augmented sample set generated by the method of claim 1 .
17 . The method of claim 16 , further comprising using the second model to differentiate a plurality of third parts-of-interest from each other.
18 . The method of claim 1 , further comprising using a second model to infer a first dataset, wherein the second model is trained using a second training dataset enriched by a second augmented sample set generated by the method of claim 1 .
19 . A method for generating an augmented sample set for enriching a first training dataset for training a first model for differentiating a plurality of parts-of-interest from each other, wherein the plurality parts-of-interest comprises a first part-of-interest and a second part-of-interest, comprising: a data augmentation step for using data augmentation to cause the first distinguishing characteristic of the first part-of-interest to have a first appearance and a first non-distinguishing characteristic of the first part-of-interest to have a second appearance in the first augmented sample set, wherein the first appearance and the second appearance are differential, and wherein the first model is prone to overfit to the first non-distinguishing characteristic when differentiating the first part-of-interest from the second part-of-interest; and a labeling step for labeling according to the first appearance and the second appearance that are differential.
20 . An apparatus for generating an augmented sample set for enriching a first training dataset for training a first model for differentiating a plurality of parts-of-interest from each other, the parts-of-interest comprising a first part-of-interest and a second part-of-interest, wherein the apparatus comprises a memory storing a plurality of program instructions and a processor coupled to the memory, wherein the program instructions, when called or run by the processor, cause the processor to execute the step of: using data augmentation and corresponding labeling or using label augmentation to add a first augmented sample set to the first training dataset, wherein the data augmentation and corresponding labeling, or the label augmentation purposely put a first distinguishing characteristic of the first part-of-interest or an associated label into the first training dataset to cause the first distinguishing characteristic of the first part-of-interest to be emphasized such that the first model learns a generalizable principle of the first distinguishing characteristic, wherein the first distinguishing characteristic is for differentiating the first part-of-interest from the second part-of-interest.
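
The weighted superimposition and threshold-based grouping recited in claims 3, 5, 6 and 12 may be easier to follow with a small sketch. The following Python is one possible reading only and is not taken from the disclosure: the function names, the representation of a sample as a list of floats, the choice of a low weight equal to one minus the high weight, and the "inadequate" label string are all assumptions made for illustration. Whole samples are blended, so distinguishing and non-distinguishing characteristics receive the same weights here; the sketch does not model how the non-distinguishing characteristic is made indifferentiable.

```python
from typing import List, Tuple

Sample = List[float]


def superimpose(first: Sample, second: Sample,
                w_first: float, w_second: float) -> Sample:
    """Return the element-wise weighted sum of two equally long samples."""
    if len(first) != len(second):
        raise ValueError("samples must have the same length")
    return [w_first * a + w_second * b for a, b in zip(first, second)]


def label_for(w_first: float, w_second: float, threshold: float) -> str:
    """Label a blended sample from the difference between its weights.

    Below the threshold the weight difference is treated as falling in an
    inadequate range (hypothetical label); otherwise the dominant source wins.
    """
    diff = w_first - w_second
    if abs(diff) < threshold:
        return "inadequate"
    return "first" if diff > 0 else "second"


def build_augmented_set(first: Sample, second: Sample,
                        high_weights: List[float],
                        threshold: float) -> List[Tuple[Sample, str]]:
    """For each high weight, emit two mirrored blends and their labels."""
    augmented = []
    for w_high in high_weights:
        w_low = 1.0 - w_high  # assumption: the two weights sum to one
        # One part is dominated by the first source, the other by the second.
        for w_first, w_second in ((w_high, w_low), (w_low, w_high)):
            blended = superimpose(first, second, w_first, w_second)
            augmented.append((blended, label_for(w_first, w_second, threshold)))
    return augmented


if __name__ == "__main__":
    for sample, label in build_augmented_set(
            [1.0, 0.0], [0.0, 1.0], [0.55, 0.9], threshold=0.3):
        print(label, sample)
```

In this sketch, raising `threshold` moves more blended samples into the "inadequate" group and lowering it moves more into the "first" and "second" groups, which is one way the sensitivity and specificity adjustment of claims 5 and 12 might be expressed.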

Description

BACKGROUND OF DISCLOSURE

1. Field of Disclosure

The present application relates to data processing, and more particularly, to a method and apparatus for generating an augmented sample set.

2. Description of Related Art

This background section introduces aspects that may facilitate a better understanding of the disclosure. Accordingly, the statements of this section are to be read in this light and are not to be understood as admissions about what is in the prior art or what is not in the prior art.

A common mistake of Artificial Intelligence (AI) models is that they respond more strongly to a non-distinguishing part of an image, such as the background, than to the object of interest, because the background occupies more pixels than the object of interest. For example, if one searches for “wolf” on a web search engine, many of the images returned have snowy backgrounds, whereas if one searches for “dog”, almost none of the images returned have snowy backgrounds. When an AI Neural Network (NN) is trained to detect wolves with such naturally biased images, the NN is easily biased to think its job is to detect snow, while the NN designer thinks the NN's job ought to be detecting the wolf.

A medical NN often fails on rare conditions, which are usually the dangerous ones. For example, melanoma is rarer and more dangerous than acne, and a skin-disease-detecting NN is more likely to wrongly reject melanoma than acne. In an example of developing a medical AI to reject False Positives in Cardiac Pause, a 3-second Pause is considered short and a 10-second Pause is considered long. Longer Pauses are more dangerous. However, when an ECG recorder detects a 10-second Pause, there is a 99% chance that it is a False Pause caused by loss of contact. The mechanism is similar to the reason that 99% of the tornado alarms a person has ever heard are False Alarms: true tornados are rare, and a person is not likely to survive multiple true tornados.
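
The base-rate effect in the Cardiac Pause example can be made concrete with a short arithmetic sketch. The 99% figure comes from the text above; the count of 1000 detected long Pauses, the function name, and the rule itself (reject every long Pause as False) are assumptions chosen only for illustration, not data or a method from the disclosure.

```python
def shortcut_rule_outcome(n_long_pauses: int, false_fraction: float) -> dict:
    """Score a shortcut rule that rejects every detected long Pause as False.

    n_long_pauses and false_fraction are illustrative inputs, not measured data.
    """
    if n_long_pauses <= 0:
        raise ValueError("n_long_pauses must be positive")
    n_false = round(n_long_pauses * false_fraction)
    n_true = n_long_pauses - n_false
    # The rule is right on every False Pause and wrong on every True Pause.
    accuracy = n_false / n_long_pauses
    rejected_true_fraction = 1.0 if n_true > 0 else 0.0
    return {
        "false_pauses": n_false,
        "true_pauses": n_true,
        "accuracy": accuracy,
        "rejected_true_fraction": rejected_true_fraction,
    }


# Hypothetical counts: 1000 detected 10-second Pauses, 99% of them False.
outcome = shortcut_rule_outcome(1000, 0.99)
print(outcome)
```

Under these assumed counts the shortcut rule is correct on 990 of the 1000 long Pauses, yet it rejects all 10 True long Pauses, which is the failure on dangerous and rare cases that the background describes.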
Therefore, when trained with naturally biased data, an NN is naturally biased to think that long Pauses are False. If one made two medical AIs, a High Sensitivity one and a Low Sensitivity one, the High Sensitivity AI would wrongly reject 0.5% of True Pauses, all of them 10-second Pauses (dangerous and rare), while the Low Sensitivity AI would wrongly reject 5% of True Pauses, all of them 3-second Pauses (less dangerous and less rare). The irony is that some High Sensitivity AIs are easier to make and easier to get approved, yet are unfortunately more dangerous.

In the existing art, some propose to train an AI model with a training dataset in which each augmented sample is generated from randomly and arbitrarily chosen subsets of a sample; some propose to use label augmentation that unselectively transforms distinguishing and non-distinguishing characteristics, or label augmentation that selectively transforms subsets of the samples but still unselectively transforms the distinguishing and non-distinguishing characteristics; and some propose to train a model by memorizing a rare sample instead of learning a more generalizable principle of the distinguishing characteristic in the rare sample. None of these proposals solves the above-identified problem.

SUMMARY

An objective of the present application is to provide a method and apparatus for generating an augmented sample set for solving the problems in the existing art.
In a first aspect, an embodiment of the present application provides a method for generating an augmented sample set for enriching a first training dataset for training a first model for differentiating a plurality of parts-of-interest from each other, wherein the parts-of-interest comprise a first part-of-interest and a second part-of-interest, the method comprising: using data augmentation and corresponding labeling or using label augmentation to add a first augmented sample set to the first training dataset, wherein the data augmentation and corresponding labeling, or the label augmentation purposely puts a first distinguishing characteristic of the first part-of-interest or an associated label into the first training dataset to cause the first distinguishing characteristic of the first part-of-interest to be emphasized to enable the first model to learn a generalizable principle of the first distinguishing characteristic, wherein the first distinguishing characteristic is for differentiating the first part-of-interest from the second part-of-interest.

In a second aspect, an embodiment of the present application provides a method for training, using a second training dataset enriched by a second augmented sample set generated by the afore-described method, a second model for differentiating a plurality of third parts-of-interest from each other.

In a third aspect, an embodiment of the present application provides a method for using a second model to differentiate a plurality of third parts-of-interest from each other, wherein the second model is train