
CN-115577285-B - Training set processing method and device for classification, electronic equipment and storage medium

CN115577285B

Abstract

A training set processing method and apparatus for classification, an electronic device, and a storage medium. The method comprises: obtaining a classification sample set; determining, for each sample, its prediction classifications, the probability of each prediction classification, and a disambiguation score; determining, for each sample, a most similar sample from among the samples having the sample's target prediction classification; determining, based on the disambiguation scores, a most similar prediction classification for each classification label; determining, from all samples carrying a classification label, first target samples whose prediction classification is that label's most similar prediction classification; and determining, from all samples carrying the most similar prediction classification, second target samples whose prediction classification is the classification label. By analyzing the first target samples, the second target samples, and the most similar sample of each sample, defects such as confused categories and mislabeled samples can be identified quickly, which speeds up sample cleaning.
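The scoring step summarized above can be made concrete. The sketch below follows claim 3's definitions (confusion probability = gap between the two largest predicted probabilities; maximum error classification probability = largest probability assigned to a wrong class) and claim 1's rule that the disambiguation score is the maximum of the two. The function name and NumPy array layout are illustrative assumptions, not part of the patent.

```python
import numpy as np

def disambiguation_scores(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-sample score = max(maximum error classification prob, confusion prob).

    probs:  (n_samples, n_classes) predicted probabilities per class
    labels: (n_samples,) integer classification labels
    """
    order = np.argsort(probs, axis=1)            # ascending by probability
    rows = np.arange(len(probs))
    top1 = order[:, -1]                          # class with the largest probability
    p1 = probs[rows, top1]                       # first maximum probability
    p2 = probs[rows, order[:, -2]]               # second maximum probability
    confusion = p1 - p2                          # claim 3: gap between the top two
    # Claim 3: if the top prediction matches the label, the strongest wrong
    # class is the runner-up; otherwise the top prediction itself is wrong.
    max_error = np.where(top1 == labels, p2, p1)
    return np.maximum(max_error, confusion)      # claim 1: take the maximum
```

A sample scores high either when the model confidently prefers a class other than its label or when the top two classes are far apart, which is what makes the score useful for ranking candidates for cleaning.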

Inventors

  • Luo Huan
  • Li Xin
  • Dai Chenyue
  • Hou Yuanchun

Assignees

  • Shanghai Ximalaya Technology Co., Ltd. (上海喜马拉雅科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2022-09-28

Claims (9)

  1. A training set processing method for classification, the method comprising: obtaining a classification sample set, wherein the classification sample set comprises a plurality of samples and classification labels corresponding to the samples, one classification label corresponds to at least one sample, and the samples are corpora, texts, or images; obtaining each prediction classification of each sample and the probability of each prediction classification by using a trained classification model; for each sample, determining a maximum error classification probability and a confusion probability corresponding to the sample based on a comparison result between the prediction classifications and the classification label corresponding to the sample, and determining the maximum of the maximum error classification probability and the confusion probability as a disambiguation score corresponding to the sample, wherein the disambiguation score characterizes the degree of error of the classification label corresponding to the sample; for each sample, determining a most similar sample corresponding to the sample from among the samples having the target prediction classification corresponding to the sample, wherein the target prediction classification is the prediction classification with the greatest probability; determining a most similar prediction classification corresponding to each classification label based on the disambiguation scores, determining, from all samples having the classification label, first target samples whose prediction classification is the most similar prediction classification, and determining, from all samples having the most similar prediction classification, second target samples whose prediction classification is the classification label; wherein the first target samples and the second target samples are used for indicating execution of a merging strategy on the classification label and the most similar prediction classification, or a correction strategy on the first target samples and/or the second target samples, and the most similar sample is used for indicating correction of the sample.
  2. The method of claim 1, wherein obtaining each prediction classification of each sample and the probability of each prediction classification by using a trained classification model comprises: determining a plurality of sample subsets and a prediction sample set corresponding to each sample subset from the training set according to the sample identifier corresponding to each sample, wherein within one sample subset, the remainder of each sample's identifier divided by the number of sample subsets is the same, and differs from the remainder corresponding to each prediction sample in that subset's prediction sample set; and training a classification model with each sample subset in turn, and predicting, with the trained classification model, the prediction sample set corresponding to that sample subset to obtain each prediction classification of each prediction sample in the prediction sample set and the probability of each prediction classification.
  3. The method of claim 1, wherein determining, for each sample, the maximum error classification probability and the confusion probability corresponding to the sample based on a comparison result between the prediction classifications and the classification label corresponding to the sample comprises: determining a first maximum probability and a second maximum probability by ordering the probabilities from largest to smallest, and taking the difference between the first maximum probability and the second maximum probability as the confusion probability; if the prediction classification corresponding to the first maximum probability is consistent with the classification label, taking the second maximum probability as the maximum error classification probability; and if the prediction classification corresponding to the first maximum probability is inconsistent with the classification label, taking the first maximum probability as the maximum error classification probability.
  4. The method of claim 1, wherein determining a most similar prediction classification corresponding to each classification label based on the disambiguation scores, determining, from all samples having the classification label, first target samples whose prediction classification is the most similar prediction classification, and determining, from all samples having the most similar prediction classification, second target samples whose prediction classification is the classification label comprises: determining, from all samples having the classification label, samples to be confirmed that share the same prediction classification; determining the sum of the disambiguation scores of the samples to be confirmed for each shared prediction classification, and determining the prediction classification with the largest sum of disambiguation scores as the most similar prediction classification; and determining all samples to be confirmed corresponding to the most similar prediction classification as the first target samples, and determining, from all samples having the most similar prediction classification, second target samples whose prediction classification is the classification label.
  5. The method of claim 1, wherein determining, for each sample, a most similar sample corresponding to the sample from among the samples having the target prediction classification corresponding to the sample comprises: if the target prediction classification is consistent with the classification label corresponding to the sample, outputting an empty most similar sample; and if the target prediction classification is inconsistent with the classification label corresponding to the sample, extracting all candidate samples having the target prediction classification from the classification sample set, computing the similarity between each candidate sample and the sample, and determining the candidate sample with the greatest similarity as the most similar sample.
  6. The method of claim 4, further comprising: displaying, in descending order of the sum of disambiguation scores, each classification label, the number of samples corresponding to the classification label, the most similar prediction classification corresponding to the classification label, the first target samples together with the probability of the most similar prediction classification for each first target sample, and the second target samples together with the probability of the classification label for each second target sample; and displaying, in descending order of disambiguation score, each sample together with its corresponding classification label, prediction classification, disambiguation score, and most similar sample.
  7. A training set processing apparatus for classification, comprising: an acquisition module, configured to acquire a classification sample set, wherein the classification sample set comprises a plurality of samples and classification labels corresponding to the samples; and a determining module, configured to obtain each prediction classification of each sample and the probability of each prediction classification by using a trained classification model, to determine, for each sample, a maximum error classification probability and a confusion probability corresponding to the sample based on a comparison result between the prediction classifications and the classification label corresponding to the sample, and to determine the maximum of the maximum error classification probability and the confusion probability as a disambiguation score corresponding to the sample, wherein the disambiguation score characterizes the degree of error of the classification label corresponding to the sample; the determining module is further configured to determine, for each sample, a most similar sample corresponding to the sample from among samples having the target prediction classification corresponding to the sample, where the target prediction classification is the prediction classification with the greatest probability; the determining module is further configured to determine a most similar prediction classification corresponding to each classification label based on the disambiguation scores, determine, from all samples having the classification label, first target samples whose prediction classification is the most similar prediction classification, and determine, from all samples having the most similar prediction classification, second target samples whose prediction classification is the classification label; the first target samples and the second target samples are used for indicating execution of a merging strategy on the classification label and the most similar prediction classification, or a correction strategy on the first target samples and/or the second target samples, and the most similar sample is used for indicating correction of the samples.
  8. An electronic device, comprising a processor and a memory, the memory storing a computer program executable by the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 6.
  9. A storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6.
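Claim 2 describes obtaining out-of-fold predictions by partitioning samples on their identifier modulo the number of subsets, so that every sample is scored by a model that never saw it in training. A minimal sketch under stated assumptions: integer identifiers, integer class labels 0..C-1, and every class present in each training split; the toy centroid classifier merely stands in for the claim's trained classification model.

```python
import numpy as np

def _centroid_proba(X_tr, y_tr, X_te, n_classes):
    # Toy stand-in for the claim's trained classification model: softmax over
    # negative distances to class centroids (assumes every class occurs in X_tr).
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in range(n_classes)])
    d = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    e = np.exp(-d + d.min(axis=1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def out_of_fold_probs(X, y, ids, k=5):
    """Claim-2 style prediction: the sample with identifier r is predicted by a
    model trained only on samples whose identifier has a different remainder
    modulo k, so no sample is scored by a model that trained on it."""
    n_classes = int(y.max()) + 1
    probs = np.zeros((len(y), n_classes))
    ids = np.asarray(ids)
    for fold in range(k):
        hold = ids % k == fold      # this fold's prediction sample set
        probs[hold] = _centroid_proba(X[~hold], y[~hold], X[hold], n_classes)
    return probs
```

The out-of-fold arrangement matters because probabilities produced by a model on its own training samples would understate label errors; claim 2's modulo partition guarantees each prediction is made on held-out data.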

Description

Training set processing method and device for classification, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of text classification, and in particular to a training set processing method and apparatus for classification, an electronic device, and a storage medium.

Background

Classification tasks are valuable in many scenarios, such as intent recognition for intelligent customer service and recognition of flowers, plants, and animals in the wild. Classification tasks depend on classification-model technology, and the classification effect depends heavily on the data quality and the number of samples in the training set: when the annotated samples of the training set contain a large number of annotation errors, the performance of the algorithm is affected and the classification effect is seriously degraded.

Disclosure of the Invention

One purpose of the present invention is to provide a training set processing method, apparatus, electronic device, and storage medium for classification that can quickly clean the data of a training set and identify mislabeled or confusingly labeled samples or classes. Embodiments of the invention are implemented as follows.

In a first aspect, the present invention provides a training set processing method for classification, the method comprising: obtaining a classification sample set, wherein the classification sample set comprises a plurality of samples and classification labels corresponding to the samples, and one classification label corresponds to at least one sample; determining each prediction classification corresponding to each sample, the probability of each prediction classification, and a disambiguation score corresponding to each sample, wherein the disambiguation score characterizes the degree of error of the classification label corresponding to the sample; for each sample, determining a most similar sample corresponding to the sample from among the samples having the target prediction classification corresponding to the sample, wherein the target prediction classification is the prediction classification with the greatest probability; determining a most similar prediction classification corresponding to each classification label based on the disambiguation scores, determining, from all samples having the classification label, first target samples whose prediction classification is the most similar prediction classification, and determining, from all samples having the most similar prediction classification, second target samples whose prediction classification is the classification label; wherein the first target samples and the second target samples are used for indicating execution of a merging strategy on the classification label and the most similar prediction classification, or a correction strategy on the first target samples and/or the second target samples, and the most similar sample is used for indicating correction of the sample.
In a second aspect, the present invention provides a training set processing apparatus for classification, comprising: an acquisition module, configured to acquire a classification sample set, wherein the classification sample set comprises a plurality of samples and classification labels corresponding to the samples, and one classification label corresponds to at least one sample; and a determining module, configured to determine each prediction classification corresponding to each sample, the probability of each prediction classification, and a disambiguation score corresponding to each sample, wherein the disambiguation score characterizes the degree of error of the classification label corresponding to the sample; the determining module is further configured to determine, for each sample, a most similar sample corresponding to the sample from among samples having the target prediction classification corresponding to the sample, where the target prediction classification is the prediction classification with the greatest probability; the determining module is further configured to determine a most similar prediction classification corresponding to each classification label based on the disambiguation scores, determine, from all samples having the classification label, first target samples whose prediction classification is the most similar prediction classification, and determine, from all samples having the most similar prediction classification, second target samples whose prediction classification is the classification label; the first target samples and the second target samples are used for indicating execution of a merging strategy on the classification label and the most similar prediction classification, or a correction strategy on the first target samples and/or the second target samples, and the most similar sample is used for indicating correction of the samples.
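The "most similar prediction classification" of the second aspect (detailed in claim 4) is the predicted class whose samples, among those carrying a given label, accumulate the largest sum of disambiguation scores. A minimal sketch; the function name is illustrative, and restricting the sum to samples whose prediction disagrees with their label is an interpretive assumption the claims leave open.

```python
from collections import defaultdict

def most_similar_prediction(labels, preds, scores):
    """For each classification label, sum the disambiguation scores of its
    samples grouped by predicted class, and return the predicted class with
    the largest sum (the label's 'most similar prediction classification')."""
    sums = defaultdict(lambda: defaultdict(float))
    for lab, pred, s in zip(labels, preds, scores):
        if pred != lab:      # assumption: only disagreements signal confusion
            sums[lab][pred] += s
    return {lab: max(by_pred, key=by_pred.get) for lab, by_pred in sums.items()}
```

If two labels each name the other as most similar (as in a pair of near-duplicate categories), that is exactly the situation where the merging strategy of the first aspect would apply.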