CN-115840884-B - Sample selection method, device, equipment and medium
Abstract
The invention discloses a sample selection method, apparatus, device, and medium. Among the clean samples and noise samples produced by a classification data-augmentation strategy, the method retains the high-confidence samples as high-quality samples, and re-selects from the low-confidence samples (the noise samples and the low-confidence clean samples) to supplement the high-confidence clean samples with any high-quality samples among them, thereby completing the screening of high-quality samples from the augmented samples. The invention not only effectively screens the high-quality samples produced by data augmentation, but also increases the diversity of the augmented samples, so that the model can learn more patterns, improving its performance and thereby further improving its generalization. Correspondingly, the invention also provides a sample selection apparatus, a device, and a medium.
Inventors
- Jiang Shengyi
- Lin Xiaodian
- Lin Nankai
- Fu Yingwen
- Yang Ziyu
Assignees
- Guangdong University of Foreign Studies (广东外语外贸大学)
Dates
- Publication Date: 20260512
- Application Date: 20221214
Claims (10)
- 1. A sample selection method, comprising: obtaining an enhanced sample, predicting a class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample from that class probability distribution, wherein the enhanced sample is an unlabeled sample generated by augmentation of original labeled sample data; classifying the enhanced sample as a clean sample or a noise sample according to a comparison of the label of the original labeled sample corresponding to the enhanced sample with the pseudo label of the enhanced sample; introducing class probability distributions of the clean sample predicted, under different model parameters, by the pre-training model after Monte Carlo sampling training, so as to obtain a confidence for the clean sample from the class probability distributions under the different model parameters, and classifying the clean sample as a high-confidence sample or a low-confidence sample according to that confidence; confirming recall samples according to a comparison of the vocabulary similarity of each sample to be recalled with a set vocabulary similarity threshold, and of the semantic fluency of each sample to be recalled with a set semantic fluency threshold, wherein the samples to be recalled comprise the low-confidence samples and the noise samples; and taking the high-confidence samples and the recall samples as the finally selected samples.
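The first two steps of the claim — assigning a pseudo label from the predicted class probability distribution and splitting the augmented samples into clean and noise sets by comparing it with the original label — can be sketched as follows. This is a minimal illustration, not the patent's implementation; the dictionary keys (`probs`, `orig_label`, `text`) are hypothetical, and the probability vectors would in practice come from the trained pre-training model.

```python
def pseudo_label(prob_dist):
    """Take the index of the most probable class as the pseudo label."""
    return max(range(len(prob_dist)), key=lambda c: prob_dist[c])

def split_clean_noise(samples):
    """Split augmented samples into clean (pseudo label agrees with the
    original sample's label) and noise (it disagrees) sets."""
    clean, noise = [], []
    for s in samples:
        if pseudo_label(s["probs"]) == s["orig_label"]:
            clean.append(s)
        else:
            noise.append(s)
    return clean, noise

samples = [
    {"text": "a", "orig_label": 1, "probs": [0.1, 0.8, 0.1]},  # agrees -> clean
    {"text": "b", "orig_label": 0, "probs": [0.2, 0.7, 0.1]},  # disagrees -> noise
]
clean, noise = split_clean_noise(samples)
```

Note that the noise samples are not discarded at this point; claim 1 later routes them, together with the low-confidence clean samples, into the recall step.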
- 2. The sample selection method of claim 1, further comprising, prior to predicting a class probability distribution for the enhanced sample based on the trained pre-training model: and acquiring the original labeling sample, and training the pre-training model by adopting a semi-supervision method based on the original labeling sample to obtain the trained pre-training model.
- 3. The sample selection method of claim 1, wherein the vocabulary similarity threshold is obtained by: calculating the vocabulary similarity between each high-confidence sample and its corresponding original labeled sample, and obtaining the vocabulary similarity threshold from the vocabulary similarities between all the high-confidence samples and their corresponding original labeled samples; the vocabulary similarity being calculated by the following formula: J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|, wherein J(x) denotes the vocabulary similarity, and x_g and x_l denote the enhanced sample and its corresponding original labeled sample, respectively.
- 4. The sample selection method of claim 1, wherein the semantic fluency threshold is obtained by: calculating the difference between the perplexity of the original labeled sample corresponding to the high-confidence sample and the perplexity of the high-confidence sample, to obtain the semantic fluency of the high-confidence sample; and obtaining the semantic fluency threshold from the semantic fluencies of all the high-confidence samples.
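Claim 4's fluency measure reduces to a subtraction of two perplexities; the perplexity values themselves would come from a language model, which the claim does not specify. A minimal sketch, again using the mean over high-confidence samples as one plausible way to derive the threshold:

```python
def semantic_fluency(ppl_original, ppl_enhanced):
    """Fluency of an enhanced sample per claim 4: perplexity of the original
    sample minus perplexity of the enhanced sample. A positive value means
    the enhanced sample is no less fluent than its source."""
    return ppl_original - ppl_enhanced

def fluency_threshold(ppl_pairs):
    """Mean fluency over all (original, enhanced) high-confidence pairs --
    an illustrative aggregation, not specified by the claim."""
    vals = [semantic_fluency(o, e) for o, e in ppl_pairs]
    return sum(vals) / len(vals)
```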
- 5. The sample selection method according to claim 1, wherein the introducing of the class probability distributions of the clean sample predicted, under different model parameters, by the pre-training model after Monte Carlo sampling training, to obtain the confidence of the clean sample from the class probability distributions under the different model parameters, specifically comprises: introducing the class probability distributions of the clean sample predicted, under different model parameters, by the pre-training model after Monte Carlo sampling training, and calculating the confidence of the clean sample with an information-entropy formula from the class probability distributions under the different model parameters.
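Claim 5 combines the per-parameter-setting predictions via an information-entropy formula. A common reading (assumed here, since the claim does not give the formula) is Monte Carlo dropout: average the class probability vectors from several stochastic forward passes, then take the Shannon entropy of the mean distribution, with lower entropy meaning higher confidence.

```python
import math

def predictive_entropy(prob_runs):
    """prob_runs: one class-probability vector per stochastic forward pass
    (i.e. per sampled set of model parameters). Returns the Shannon entropy
    of the mean distribution; lower entropy = higher confidence."""
    n, k = len(prob_runs), len(prob_runs[0])
    mean = [sum(run[c] for run in prob_runs) / n for c in range(k)]
    return -sum(p * math.log(p) for p in mean if p > 0)

# Runs that agree on a peaked distribution yield lower entropy (higher
# confidence) than runs that are uniform or disagree.
confident = predictive_entropy([[0.9, 0.1], [0.8, 0.2]])
uncertain = predictive_entropy([[0.5, 0.5], [0.5, 0.5]])
```

Clean samples whose entropy falls below a chosen cutoff would be the high-confidence samples; the rest join the pool of samples to be recalled.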
- 6. The sample selection method as claimed in claim 1, wherein the step of confirming the recall samples according to the comparison of the vocabulary similarity of each sample to be recalled with the set vocabulary similarity threshold and of the semantic fluency of each sample to be recalled with the set semantic fluency threshold specifically comprises: selecting, as the recall samples, the samples to be recalled whose vocabulary similarity is greater than the vocabulary similarity threshold and whose semantic fluency is greater than the semantic fluency threshold.
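The recall step of claim 6 is a conjunctive filter over the two scores. A sketch, with hypothetical `similarity` and `fluency` fields standing in for the values computed per claims 3 and 4:

```python
def recall_samples(candidates, sim_thresh, flu_thresh):
    """Keep the samples to be recalled (low-confidence + noise samples)
    whose vocabulary similarity AND semantic fluency both exceed the
    set thresholds."""
    return [c for c in candidates
            if c["similarity"] > sim_thresh and c["fluency"] > flu_thresh]

candidates = [
    {"text": "x", "similarity": 0.6, "fluency": 5.0},   # passes both
    {"text": "y", "similarity": 0.4, "fluency": 5.0},   # fails similarity
    {"text": "z", "similarity": 0.6, "fluency": -1.0},  # fails fluency
]
recalled = recall_samples(candidates, sim_thresh=0.5, flu_thresh=0.0)
```

The final selection of claim 1 is then simply the union of the high-confidence samples and these recalled samples.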
- 7. The sample selection method of claim 1, wherein the pre-training model is a pre-trained language model with a BERT architecture.
- 8. A sample selection apparatus, comprising: a pseudo label acquisition module, configured to obtain an enhanced sample, predict a class probability distribution of the enhanced sample based on a trained pre-training model, and obtain a pseudo label of the enhanced sample from that class probability distribution, wherein the enhanced sample is an unlabeled sample generated by augmentation of original labeled sample data; a first classification module, configured to classify the enhanced sample as a clean sample or a noise sample according to a comparison of the label of the original labeled sample corresponding to the enhanced sample with the pseudo label of the enhanced sample; a second classification module, configured to introduce class probability distributions of the clean sample predicted, under different model parameters, by the pre-training model after Monte Carlo sampling training, obtain a confidence for the clean sample from the class probability distributions under the different model parameters, and classify the clean sample as a high-confidence sample or a low-confidence sample according to that confidence; a recall module, configured to confirm recall samples according to a comparison of the vocabulary similarity of each sample to be recalled with a set vocabulary similarity threshold and of the semantic fluency of each sample to be recalled with a set semantic fluency threshold, wherein the samples to be recalled comprise the low-confidence samples and the noise samples; and a selection module, configured to take the high-confidence samples and the recall samples as the finally selected samples.
- 9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the sample selection method according to any one of claims 1 to 7 when executing the computer program.
- 10. A storage medium comprising a stored computer program, wherein the computer program, when run, controls a device in which the storage medium is located to perform the sample selection method according to any one of claims 1 to 7.
Description
Sample selection method, device, equipment and medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular to a method, an apparatus, a device, and a medium for sample selection.

Background

Text classification is the basis of many natural language processing (NLP) tasks and is widely applied in fields such as sentiment analysis and intelligent question answering. Training a classifier with strong generalization ability usually requires a large amount of labeled data, but the high cost of manual labeling and the large amount of time and effort required to build a large corpus are often unaffordable. To address this problem, data augmentation (Data Augmentation) strategies have been proposed: data augmentation can greatly increase the amount of data, alleviate data scarcity, and improve the generalization ability of a model. In natural language processing, however, data augmentation faces a particular challenge. Besides the discreteness of text data, a major cause is that language itself tolerates little interference: randomly modifying language data is likely to destroy its semantics and so produce low-quality samples, which strongly mislead the classifier and feed back negatively into the model. How to select high-quality samples from data-augmented samples is therefore important.

Disclosure of the Invention

Embodiments of the present invention provide a sample selection method, apparatus, device, and medium that can effectively screen the high-quality samples generated by a data augmentation strategy.
A first aspect of an embodiment of the present invention provides a sample selection method, including: obtaining an enhanced sample, predicting a class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample from that class probability distribution, wherein the enhanced sample is an unlabeled sample generated by augmentation of original labeled sample data; classifying the enhanced sample as a clean sample or a noise sample according to a comparison of the label of the original labeled sample corresponding to the enhanced sample with the pseudo label of the enhanced sample; introducing class probability distributions of the clean sample predicted, under different model parameters, by the pre-training model after Monte Carlo sampling training, so as to obtain a confidence for the clean sample from the class probability distributions under the different model parameters, and classifying the clean sample as a high-confidence sample or a low-confidence sample according to that confidence; confirming recall samples according to a comparison of the vocabulary similarity of each sample to be recalled with a set vocabulary similarity threshold and of the semantic fluency of each sample to be recalled with a set semantic fluency threshold, wherein the samples to be recalled comprise the low-confidence samples and the noise samples; and taking the high-confidence samples and the recall samples as the finally selected samples.
A second aspect of an embodiment of the present invention provides a sample selection apparatus, including: a pseudo label acquisition module, configured to obtain an enhanced sample, predict a class probability distribution of the enhanced sample based on a trained pre-training model, and obtain a pseudo label of the enhanced sample from that class probability distribution, wherein the enhanced sample is an unlabeled sample generated by augmentation of original labeled sample data; a first classification module, configured to classify the enhanced sample as a clean sample or a noise sample according to a comparison of the label of the original labeled sample corresponding to the enhanced sample with the pseudo label of the enhanced sample; a second classification module, configured to introduce class probability distributions of the clean sample predicted, under different model parameters, by the pre-training model after Monte Carlo sampling training, obtain a confidence for the clean sample from the class probability distributions under the different model parameters, and classify the clean sample as a high-confidence sample or a low-confidence sample according to that confidence; a recall module, configured to confirm recall samples according to a comparison of the vocabulary similarity of each sample to be recalled with a set vocabulary similarity threshold and of the semantic fluency of each sample to be recalled with a set semantic fluency threshold, wherein the samples to be recalled comprise the low-confidence samples and the noise samples; and a selection module, configured to take the high-confidence samples and the recall samples as the finally selected samples.