CN-115374865-B - Training data processing method, device, equipment and readable medium


Abstract

The application provides a training data processing method, device, equipment and readable medium. The method selects a plurality of key optimization categories from the categories that a classification model can classify, and then screens a candidate sample set according to at least a first screening condition to obtain training sample data. The first screening condition is that the category of the candidate sample data to be screened matches any one of the key optimization categories; the candidate sample set comprises a plurality of candidate sample data. The similarity between any two of the key optimization categories is smaller than a similarity threshold, where the similarity between two key optimization categories is the similarity between the sample data under those two categories. The training sample data obtained by screening under the first screening condition therefore belongs to categories that are dissimilar to one another, which improves the optimization effect when the classification model to be optimized is trained on it.

Inventors

WANG SIRUI
KE FENG

Assignees

科大讯飞股份有限公司

Dates

Publication Date: 20260508
Application Date: 20220826

Claims (10)

1. A method for processing training data, comprising: selecting a plurality of key optimization categories from a plurality of categories that a classification model can classify, wherein the similarity between any two of the plurality of key optimization categories is smaller than a similarity threshold, the similarity between two key optimization categories is the similarity between the sample data under the two key optimization categories, the sample data is text sample data when the classification model is a text classification model, and the sample data is image sample data when the classification model is an image classification model; screening a candidate sample set according to at least a first screening condition to obtain training sample data, wherein the first screening condition is that the category of the candidate sample data to be screened matches any one of the key optimization categories; and adding the training sample data obtained by screening to a first training set to obtain a second training set, wherein the second training set is used for training to obtain an optimized classification model.
2. The method of claim 1, further comprising, before selecting the plurality of key optimization categories from the plurality of categories that the classification model can classify: acquiring a first classification accuracy of a classification model to be optimized on the verification sample data under each category in a verification set, wherein the classification model to be optimized is obtained by training on the training sample data in the first training set; and determining each category whose first classification accuracy is smaller than a first accuracy threshold as a category to be optimized; wherein the plurality of categories that the classification model can classify are the determined categories to be optimized.
3. The method of claim 1, further comprising, before screening the candidate sample set according to at least the first screening condition to obtain training sample data: for each key optimization category, inputting each model sample data under the key optimization category into a feature extraction model, the feature extraction model obtaining and outputting the feature of each model sample data, wherein the feature extraction model is obtained by training a neural network model on a plurality of sample data, and the model sample data is verification sample data or training sample data; wherein screening the candidate sample set according to at least the first screening condition to obtain training sample data comprises: screening the candidate sample set according to the first screening condition and a second screening condition to obtain training sample data, wherein the second screening condition is that the feature of the candidate sample data to be screened matches the feature of the model sample data under any one of the key optimization categories.
4. The method of claim 1, further comprising, before screening the candidate sample set according to at least the first screening condition to obtain training sample data: obtaining a second classification accuracy of the classification model to be optimized on the candidate sample data under each category in the candidate sample set; and determining the candidate sample data under each category whose second classification accuracy is smaller than a second accuracy threshold as processed candidate sample data; wherein screening the candidate sample set according to at least the first screening condition to obtain training sample data comprises: screening the processed candidate sample data according to at least the first screening condition to obtain training sample data.
5. The method of claim 1, wherein selecting the plurality of key optimization categories from the plurality of categories that the classification model can classify comprises: selecting one category from the plurality of categories that the classification model can classify, and determining the selected category as a key optimization category; selecting, from the remaining categories other than those already determined as key optimization categories, the category that has the highest similarity to a target category and satisfies a third screening condition, wherein the target category is the category most recently determined as a key optimization category; and if such a category is selected, determining the selected category as a key optimization category, taking the selected category as the new target category, and returning to the step of selecting, from the remaining categories other than those already determined as key optimization categories, the category that has the highest similarity to the target category and satisfies the third screening condition, until no category that has the highest similarity to the target category and satisfies the third screening condition can be selected.
6. The method of claim 5, wherein selecting, from the remaining categories other than those already determined as key optimization categories, the category that has the highest similarity to the target category and satisfies the third screening condition comprises: for each verification sample data under the target category in a verification set, obtaining, from all training sample data under the categories other than those already determined as key optimization categories, two training sample data matched with the verification sample data, wherein the two matched training sample data are the two training sample data in the first training set that have the highest similarity to the verification sample data and belong to different categories; for each verification sample data under the target category in the verification set, calculating a similarity difference between the two training sample data matched with the verification sample data, and, if the absolute value of the similarity difference is smaller than a first threshold, adding one to the similar-pair frequency count of the category combination formed by the categories of the two matched training sample data; for each counted category combination, determining the ratio of the total similar-pair frequency of the category combination to the total number of verification sample data under the target category in the verification set as the similarity of the category combination, wherein the similarity of the category combination is the similarity between the two categories included in the category combination; and selecting, according to the similarities of all counted category combinations, the category that has the highest similarity to the target category and satisfies the third screening condition.
7. The method according to claim 2, further comprising, after adding the training sample data obtained by screening to the first training set to obtain the second training set: performing model training using the training sample data in the second training set to obtain an optimized classification model; updating the candidate sample data in the candidate sample set; and taking the optimized classification model as a new classification model to be optimized, taking the second training set as a new first training set, and returning to the step of acquiring the first classification accuracy of the classification model to be optimized on the verification sample data under each category in the verification set.
8. A training data processing device, comprising: a first selection unit configured to select a plurality of key optimization categories from a plurality of categories that a classification model can classify, wherein the similarity between any two of the plurality of key optimization categories is smaller than a similarity threshold, the similarity between two key optimization categories is the similarity between the sample data under the two key optimization categories, the sample data is text sample data when the classification model is a text classification model, and the sample data is image sample data when the classification model is an image classification model; a first screening unit configured to screen a candidate sample set according to at least a first screening condition to obtain training sample data, wherein the first screening condition is that the category of the candidate sample data to be screened matches any one of the key optimization categories; and an adding unit configured to add the training sample data obtained by screening to a first training set to obtain a second training set, wherein the second training set is used for training to obtain an optimized classification model.
9. A computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 7.
10. Training data processing equipment, comprising: one or more processors; and a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
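
The category similarity statistic of claim 6 and the iterative selection of claim 5 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several points are assumptions: the function names, the token-overlap `jaccard` similarity, the reading of "similarity difference" as the difference between the two matched samples' similarity scores to the verification sample, and the form of the third screening condition, which the claims do not define and which is therefore passed in as a caller-supplied predicate.

```python
from collections import Counter


def jaccard(a, b):
    """Hypothetical text similarity: token-set overlap (an assumption)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def category_pair_similarity(target_samples, training_set, excluded,
                             first_threshold, similarity=jaccard):
    """Similar-pair statistic in the spirit of claim 6.

    target_samples: verification samples under the target category.
    training_set:   list of (sample, category) pairs.
    excluded:       categories left out of the matching pool (caller's choice).
    Returns {frozenset({cat_a, cat_b}): similar-pair count / len(target_samples)}.
    """
    pool = [(s, c) for s, c in training_set if c not in excluded]
    counts = Counter()
    for v in target_samples:
        # Rank training samples by similarity to the verification sample.
        scored = sorted(((similarity(v, s), c) for s, c in pool),
                        key=lambda x: x[0], reverse=True)
        if not scored:
            continue
        top_score, top_cat = scored[0]
        # Best-ranked sample that belongs to a different category.
        second = next(((sc, c) for sc, c in scored[1:] if c != top_cat), None)
        if second is None:
            continue
        # Assumed reading: compare the two similarity scores.
        if abs(top_score - second[0]) < first_threshold:
            counts[frozenset((top_cat, second[1]))] += 1
    n = len(target_samples)
    return {pair: k / n for pair, k in counts.items()} if n else {}


def select_key_categories(categories, start, pair_similarity, third_condition):
    """Iterative selection in the spirit of claim 5.

    pair_similarity(a, b) -> float and third_condition(candidate, selected)
    -> bool are supplied by the caller; both are placeholders here.
    """
    selected = [start]
    target = start
    while True:
        eligible = [c for c in categories
                    if c not in selected and third_condition(c, selected)]
        if not eligible:
            return selected
        # Among eligible categories, take the one most similar to the target.
        target = max(eligible, key=lambda c: pair_similarity(target, c))
        selected.append(target)
```

In this sketch a third condition such as "the candidate's similarity to every already selected category is below the similarity threshold" would keep the result consistent with the pairwise requirement of claim 1, but that choice is an illustration rather than something the claims state.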

Description

Training data processing method, device, equipment and readable medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a method, a device, equipment and a readable medium for processing training data.

Background

A classification model is used to classify sample data. In the prior art, a classification model is optimized by feeding a large amount of sample data, each carrying its category, directly into a feature classifier for model training; the feature classifier then obtains and outputs the optimized classification model.

However, some sample data of different categories used in the optimization process are highly similar. For example, the two samples "I want to top up my phone credit" and "I want to top up my mobile data" are highly similar but belong to different categories. Similar sample data are therefore easily confused during optimization of the classification model, so the optimization effect is poor, that is, the classification accuracy improves only slightly.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, a device, equipment and a readable medium for processing training data, which obtain training sample data under a plurality of different categories and use it to train the classification model to be optimized, thereby improving the optimization effect of the classification model.
To achieve the above object, embodiments of the present invention provide the following technical solutions.

A first aspect of the application discloses a training data processing method, comprising: selecting a plurality of key optimization categories from a plurality of categories that a classification model can classify, wherein the similarity between any two of the plurality of key optimization categories is smaller than a similarity threshold; screening a candidate sample set according to at least a first screening condition to obtain training sample data, wherein the first screening condition is that the category of the candidate sample data to be screened matches any one of the key optimization categories; and adding the training sample data obtained by screening to a first training set to obtain a second training set, wherein the second training set is used for training to obtain an optimized classification model.

Optionally, in the above method, before selecting the plurality of key optimization categories from the plurality of categories that the classification model can classify, the method further comprises: acquiring a first classification accuracy of a classification model to be optimized on the verification sample data under each category in a verification set, wherein the classification model to be optimized is obtained by training on the training sample data in the first training set; and determining each category whose first classification accuracy is smaller than a first accuracy threshold as a category to be optimized; wherein the plurality of categories that the classification model can classify are the determined categories to be optimized.
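
As a rough illustration of the steps just described, the following Python sketch computes the per-category accuracy on a verification set, derives the categories to be optimized, screens candidates by the first screening condition, and builds the second training set. All function names and the `(sample, category)` pair representation are assumptions made for the example; the key optimization categories are taken as given, since their selection is a separate step.

```python
from collections import defaultdict


def per_category_accuracy(predict, validation_set):
    """Return {category: accuracy} for (sample, category) pairs.

    `predict` is any callable mapping a sample to a predicted category;
    it stands in for the classification model to be optimized.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, category in validation_set:
        total[category] += 1
        if predict(sample) == category:
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}


def categories_to_optimize(accuracy, first_accuracy_threshold):
    """Categories whose accuracy is below the first accuracy threshold."""
    return sorted(c for c, acc in accuracy.items()
                  if acc < first_accuracy_threshold)


def screen_candidates(candidate_set, key_categories):
    """First screening condition: keep candidates whose category
    matches any one of the key optimization categories."""
    keys = set(key_categories)
    return [(s, c) for s, c in candidate_set if c in keys]


def build_second_training_set(first_training_set, screened):
    """Add the screened samples to the first training set."""
    return list(first_training_set) + list(screened)
```

A caller would typically pass the output of `categories_to_optimize` to the key-category selection step, and then pass the selected key categories to `screen_candidates`.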
Optionally, in the above method, before screening the candidate sample set according to at least the first screening condition to obtain training sample data, the method further comprises: for each key optimization category, acquiring the feature of each model sample data under the key optimization category, wherein the model sample data is verification sample data or training sample data, and the feature of the model sample data represents key information of the model sample data. Screening the candidate sample set according to at least the first screening condition to obtain training sample data then comprises: screening the candidate sample set according to the first screening condition and a second screening condition to obtain training sample data, wherein the second screening condition is that the feature of the candidate sample data to be screened matches the feature of the model sample data under any one of the key optimization categories.

Optionally, in the above method, acquiring, for each key optimization category, the feature of each model sample data under the key optimization category comprises: for each key optimization category, inputting each model sample data under the key optimization category into a feature extraction model, and obtaining and outputting the feature of each model sample data by the feature ext