CN-116129190-B - Sample data determining method, data processing method, device, equipment and medium

Abstract

The application discloses a sample data determining method, a data processing method, a device, equipment and a medium. The method comprises: obtaining an unlabeled data set; taking out part of the unlabeled data from the unlabeled data set and labeling it to obtain a labeled first data set corresponding to that part of the unlabeled data; training a type recognition model using the first data set, the type recognition model being used to recognize unlabeled data to obtain a recognition result; obtaining the corresponding remaining unlabeled data from the unlabeled data set; and determining a labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data.
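
The labeling procedure summarized above, together with the confidence-based loop recited in claim 1, can be sketched roughly as follows. This is a minimal illustration and not the applicant's implementation: the function name `build_labeled_pool`, the callables `annotate` and `train`, the threshold values, the `max_rounds` guard and the choice of "no unlabeled data left" as the preset condition are all assumptions introduced for the example. The sketch also accepts every high-confidence item, whereas the claim only requires at least part of them.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Item = Hashable
Model = Callable[[Item], Tuple[str, float]]   # returns (type, confidence)


def build_labeled_pool(
    unlabeled: List[Item],
    seed_size: int,
    annotate: Callable[[Item], str],          # hypothetical stand-in for manual labeling
    train: Callable[[Dict[Item, str]], Model],  # hypothetical training routine
    high: float = 0.9,                        # "first preset threshold" (assumed value)
    low: float = 0.5,                         # "second preset threshold" (assumed value)
    max_rounds: int = 10,                     # assumed safety guard, not in the source
) -> Tuple[Dict[Item, str], Model]:
    # Manually label a seed subset (the "first data set") and add it to the pool.
    pool = {x: annotate(x) for x in unlabeled[:seed_size]}
    model = train(pool)
    for _ in range(max_rounds):
        # Data to be analyzed: everything not yet in the labeled sample pool.
        pending = [x for x in unlabeled if x not in pool]
        if not pending:                       # assumed "preset condition"
            break
        added: Dict[Item, str] = {}
        for x in pending:
            kind, conf = model(x)             # input step
            if conf > high:
                added[x] = kind               # analysis step: accept the model's label
            elif conf < low:
                added[x] = annotate(x)        # manual labeling step
            # Items with confidence between low and high wait for a later round.
        if not added:
            break                             # no progress; stop instead of spinning
        pool.update(added)                    # adding step
        model = train(pool)                   # training step
    return pool, model
```

Items whose confidence falls between the two thresholds are left unlabeled in a given round, on the assumption that a retrained model may classify them more decisively later.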

Inventors

Lu Xiangzhe
Yang Yang
Hu Guanglong

Assignees

NetEase (Hangzhou) Network Co., Ltd. (网易（杭州）网络有限公司)

Dates

Publication Date: 20260508
Application Date: 20230208

Claims (13)

1. A method of determining sample data, comprising:
acquiring an unlabeled data set, wherein the unlabeled data in the unlabeled data set is data of any one of the following data types: image data, video data, text data and audio data;
taking out part of the unlabeled data from the unlabeled data set and labeling that part of the unlabeled data to obtain a labeled first data set corresponding to that part of the unlabeled data;
training a type recognition model using the first data set, wherein the type recognition model is used to recognize unlabeled data to obtain a recognition result, the recognition result comprises the type to which the unlabeled data belongs and a confidence for the corresponding type, and the type is related to the content of the unlabeled data;
acquiring the corresponding remaining unlabeled data from the unlabeled data set; and
determining a labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data;
wherein determining the labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data comprises: determining a labeled second data set corresponding to the remaining unlabeled data based on the type recognition model and the remaining unlabeled data;
the method further comprises: adding the first data set to a labeled sample pool;
wherein determining the labeled second data set corresponding to the remaining unlabeled data based on the type recognition model and the remaining unlabeled data comprises:
a data determination step: taking the remaining unlabeled data as unlabeled data to be analyzed;
an input step: inputting the unlabeled data to be analyzed into the type recognition model to obtain a first recognition result corresponding to the unlabeled data to be analyzed, wherein the first recognition result comprises a first type to which the unlabeled data to be analyzed belongs and a first confidence for the first type, and the first type is related to the content of the unlabeled data to be analyzed;
an analysis step: taking out, from the unlabeled data set, at least part of first unlabeled data among the unlabeled data to be analyzed whose corresponding first confidence is greater than a first preset threshold, and taking the first recognition result corresponding to the first unlabeled data as the label of the first unlabeled data to obtain a labeled third data set;
a manual labeling step: taking out, from the unlabeled data set, second unlabeled data among the unlabeled data to be analyzed whose corresponding first confidence is smaller than a second preset threshold, and labeling the second unlabeled data to obtain a labeled fourth data set;
an adding step: adding the third data set and the fourth data set to the labeled sample pool;
a training step: training the type recognition model using the data in the labeled sample pool to obtain an updated type recognition model;
a loop judgment step: determining whether the unlabeled data set satisfies a preset condition; if so, taking the set of data in the labeled sample pool other than the data contained in the first data set as the labeled second data set corresponding to the remaining unlabeled data; if not, executing a loop step: taking the unlabeled data in the unlabeled data set other than the data contained in the labeled sample pool as new unlabeled data to be analyzed, and returning to execute the input step, the analysis step, the manual labeling step, the adding step, the training step and the loop judgment step until the labeled second data set corresponding to the remaining unlabeled data is determined;
the method further comprises: obtaining a plurality of preset types that the type recognition model can recognize; traversing each preset type in the plurality of preset types, and obtaining, from the labeled sample data set, a plurality of first sample data that belong to the preset type and whose confidence for the preset type is greater than a third preset threshold; inputting the plurality of first sample data into the type recognition model to obtain a plurality of prediction confidences corresponding to the plurality of first sample data; determining a first target threshold of the type recognition model for the preset type according to the plurality of prediction confidences; and performing data cleaning on the data in the labeled sample data set according to the plurality of first target thresholds of the type recognition model for the plurality of preset types to obtain a cleaned sample data set.
2. The method according to claim 1, wherein the method further comprises: grouping the labeled sample data set multiple times according to a preset grouping rule to obtain a plurality of training sample data sets and a plurality of verification sample data sets corresponding to the plurality of training sample data sets; determining a plurality of candidate recognition models based on the plurality of training sample data sets and the type recognition model; and determining a new type recognition model from the plurality of candidate recognition models based on the plurality of verification sample data sets.
3. The method of claim 1, wherein determining the first target threshold of the type recognition model for the preset type according to the plurality of prediction confidences comprises: averaging the plurality of prediction confidences to obtain the first target threshold of the type recognition model for the preset type.
4. The method of claim 1, wherein determining the first target threshold of the type recognition model for the preset type according to the plurality of prediction confidences comprises: averaging the plurality of prediction confidences to obtain a threshold to be processed of the type recognition model for the preset type; determining a threshold range of the type recognition model for the preset type according to the threshold to be processed and a preset step length; and taking any value in the threshold range as the first target threshold of the type recognition model for the preset type.
5. The method of claim 1, wherein performing data cleaning on the data in the labeled sample data set according to the plurality of first target thresholds of the type recognition model for the plurality of preset types to obtain the cleaned sample data set comprises: traversing second sample data to be cleaned in the labeled sample data set, and inputting the second sample data into the type recognition model to obtain a second recognition result corresponding to the second sample data, wherein the second recognition result comprises a second type to which the second sample data belongs and a second confidence for the second type, and the second type is related to the content of the second sample data; acquiring a second target threshold of the type recognition model for the second type; and when the second confidence is smaller than the second target threshold, removing the second sample data from the labeled sample data set, and when the second confidence is not smaller than the second target threshold, retaining the second sample data, to obtain the cleaned sample data set.
6. A sample data determining apparatus, comprising:
a first acquisition unit, configured to acquire an unlabeled data set, wherein the unlabeled data in the unlabeled data set is data of any one of the following data types: image data, video data, text data and audio data;
a labeling unit, configured to take out part of the unlabeled data from the unlabeled data set and label that part of the unlabeled data to obtain a labeled first data set corresponding to that part of the unlabeled data;
a training unit, configured to train a type recognition model using the first data set, wherein the type recognition model is used to recognize unlabeled data to obtain a recognition result, the recognition result comprises the type to which the unlabeled data belongs and a confidence for the corresponding type, and the type is related to the content of the unlabeled data;
a second acquisition unit, configured to acquire the corresponding remaining unlabeled data from the unlabeled data set; and
a determining unit, configured to determine a labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data;
wherein, when determining the labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data, the sample data determining apparatus is specifically configured to determine a labeled second data set corresponding to the remaining unlabeled data based on the type recognition model and the remaining unlabeled data;
the sample data determining apparatus is further configured to add the first data set to a labeled sample pool;
when determining the labeled second data set corresponding to the remaining unlabeled data based on the type recognition model and the remaining unlabeled data, the sample data determining apparatus is specifically configured to perform:
a data determination step: taking the remaining unlabeled data as unlabeled data to be analyzed;
an input step: inputting the unlabeled data to be analyzed into the type recognition model to obtain a first recognition result corresponding to the unlabeled data to be analyzed, wherein the first recognition result comprises a first type to which the unlabeled data to be analyzed belongs and a first confidence for the first type, and the first type is related to the content of the unlabeled data to be analyzed;
an analysis step: taking out, from the unlabeled data set, at least part of first unlabeled data among the unlabeled data to be analyzed whose corresponding first confidence is greater than a first preset threshold, and taking the first recognition result corresponding to the first unlabeled data as the label of the first unlabeled data to obtain a labeled third data set;
a manual labeling step: taking out, from the unlabeled data set, second unlabeled data among the unlabeled data to be analyzed whose corresponding first confidence is smaller than a second preset threshold, and labeling the second unlabeled data to obtain a labeled fourth data set;
an adding step: adding the third data set and the fourth data set to the labeled sample pool;
a training step: training the type recognition model using the data in the labeled sample pool to obtain an updated type recognition model;
a loop judgment step: determining whether the unlabeled data set satisfies a preset condition; if so, taking the set of data in the labeled sample pool other than the data contained in the first data set as the labeled second data set corresponding to the remaining unlabeled data; if not, executing a loop step: taking the unlabeled data other than the data contained in the labeled sample pool as new unlabeled data to be analyzed, and returning to execute the input step, the analysis step, the manual labeling step, the adding step and the training step until the labeled second data set corresponding to the remaining unlabeled data is determined;
the sample data determining apparatus is further configured to: obtain a plurality of preset types that the type recognition model can recognize; traverse each preset type in the plurality of preset types, and obtain, from the labeled sample data set, a plurality of first sample data that belong to the preset type and whose confidence for the preset type is greater than a third preset threshold; input the plurality of first sample data into the type recognition model to obtain a plurality of prediction confidences corresponding to the plurality of first sample data; determine a first target threshold of the type recognition model for the preset type according to the plurality of prediction confidences; and perform data cleaning on the data in the labeled sample data set according to the plurality of first target thresholds of the type recognition model for the plurality of preset types to obtain a cleaned sample data set.
7. The apparatus of claim 6, wherein the sample data determining apparatus is further configured to: group the labeled sample data set multiple times according to a preset grouping rule to obtain a plurality of training sample data sets and a plurality of verification sample data sets corresponding to the plurality of training sample data sets; determine a plurality of candidate recognition models based on the plurality of training sample data sets and the type recognition model; and determine a new type recognition model from the plurality of candidate recognition models based on the plurality of verification sample data sets.
8. The apparatus of claim 6, wherein, when determining the first target threshold of the type recognition model for the preset type according to the plurality of prediction confidences, the sample data determining apparatus is specifically configured to: average the plurality of prediction confidences to obtain the first target threshold of the type recognition model for the preset type.
9. The apparatus according to claim 6, wherein, when determining the first target threshold of the type recognition model for the preset type according to the plurality of prediction confidences, the sample data determining apparatus is specifically configured to: average the plurality of prediction confidences to obtain a threshold to be processed of the type recognition model for the preset type; determine a threshold range of the type recognition model for the preset type according to the threshold to be processed and a preset step length; and take any value in the threshold range as the first target threshold of the type recognition model for the preset type.
10. The apparatus according to claim 6, wherein, when performing data cleaning on the data in the labeled sample data set according to the plurality of first target thresholds of the type recognition model for the plurality of preset types to obtain the cleaned sample data set, the sample data determining apparatus is specifically configured to: traverse second sample data to be cleaned in the labeled sample data set, and input the second sample data into the type recognition model to obtain a second recognition result corresponding to the second sample data, wherein the second recognition result comprises a second type to which the second sample data belongs and a second confidence for the second type, and the second type is related to the content of the second sample data; acquire a second target threshold of the type recognition model for the second type; and when the second confidence is smaller than the second target threshold, remove the second sample data from the labeled sample data set, and when the second confidence is not smaller than the second target threshold, retain the second sample data, to obtain the cleaned sample data set.
11. An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of claims 1-5 via execution of the executable instructions.
12. A computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-5.
13. A computer program product comprising computer instructions which, when executed by a processor, implement a method as claimed in any one of claims 1 to 5.
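
The per-type threshold determination of claim 3 and the cleaning rule of claim 5 can be illustrated with the rough sketch below. It is not the applicant's implementation. The function names, the `(label, confidence)` record layout, the default value of the third preset threshold, and the decision to keep samples whose predicted type has no calibrated threshold are all assumptions made for the example. The threshold range of claim 4 is not covered.

```python
from statistics import mean
from typing import Callable, Dict, Hashable, Iterable, Tuple

Item = Hashable
Model = Callable[[Item], Tuple[str, float]]   # returns (type, confidence)
Record = Tuple[str, float]                    # (label, stored confidence), assumed layout


def calibrate_thresholds(
    model: Model,
    samples: Dict[Item, Record],
    types: Iterable[str],
    third_threshold: float = 0.9,             # "third preset threshold" (assumed value)
) -> Dict[str, float]:
    """First target threshold per type: mean model confidence over trusted samples."""
    thresholds: Dict[str, float] = {}
    for t in types:
        # First sample data: labeled as type t with stored confidence above the threshold.
        trusted = [x for x, (label, conf) in samples.items()
                   if label == t and conf > third_threshold]
        if trusted:                            # types without trusted samples get no threshold
            thresholds[t] = mean(model(x)[1] for x in trusted)
    return thresholds


def clean_dataset(
    model: Model,
    samples: Dict[Item, Record],
    thresholds: Dict[str, float],
) -> Dict[Item, Record]:
    """Keep a sample unless its confidence is below the threshold of its predicted type."""
    kept: Dict[Item, Record] = {}
    for x, record in samples.items():
        kind, conf = model(x)                  # second recognition result
        target = thresholds.get(kind)          # second target threshold
        # Assumption: a sample whose predicted type has no threshold is retained.
        if target is None or conf >= target:
            kept[x] = record
    return kept
```

Because the threshold is an average over the trusted samples, some of those samples necessarily score below it and are removed during cleaning, which is one reason claim 4 allows choosing a value from a range around the average instead.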

Description

Sample data determining method, data processing method, device, equipment and medium

Technical Field

The present application relates to the field of data processing, and in particular to a sample data determining method, a data processing method, a device, equipment and a medium.

Background

In the era of artificial intelligence, deep learning has achieved very good results in computer vision tasks such as image classification and object detection. A high-quality deep learning model depends on training with a large-scale, accurately labeled data set, so a large amount of labor must be invested in labeling data during model training.

Disclosure of Invention

The embodiments of the present application provide an implementation different from the related art, so as to solve the technical problems in the related art that labeling data for model training requires a large investment of labor, that labeling data manually takes a long time, and that the efficiency of labeling data is low.

In a first aspect, the present application provides a sample data determining method, including: acquiring an unlabeled data set; taking out part of the unlabeled data from the unlabeled data set and labeling that part of the unlabeled data to obtain a labeled first data set corresponding to that part of the unlabeled data; training a type recognition model using the first data set, wherein the type recognition model is used to recognize unlabeled data to obtain a recognition result, the recognition result comprises the type to which the unlabeled data belongs and a confidence for the corresponding type, and the type is related to the content of the unlabeled data; acquiring the corresponding remaining unlabeled data from the unlabeled data set; and determining a labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data.

In a second aspect, the present application provides a data processing method, including: acquiring a labeled sample data set; traversing second sample data to be cleaned in the labeled sample data set, and inputting the second sample data into the type recognition model to obtain a second recognition result corresponding to the second sample data, wherein the second recognition result comprises a second type to which the second sample data belongs and a second confidence for the second type, and the second type is related to the content of the second sample data; acquiring a second target threshold of the type recognition model for the second type; and when the second confidence is smaller than the second target threshold, removing the second sample data from the labeled sample data set, and when the second confidence is not smaller than the second target threshold, retaining the second sample data, to obtain the cleaned sample data set; wherein the labeled sample data set is obtained by taking out part of the unlabeled data from an unlabeled data set for labeling by relevant personnel, obtaining a labeled first data set corresponding to that part of the unlabeled data, and processing the remaining unlabeled data in the unlabeled data set based on the first data set.

In a third aspect, the present application provides a sample data determining apparatus, comprising: a first acquisition unit, configured to acquire an unlabeled data set; a labeling unit, configured to take out part of the unlabeled data from the unlabeled data set and label that part of the unlabeled data to obtain a labeled first data set corresponding to that part of the unlabeled data; a training unit, configured to train a type recognition model using the first data set, wherein the type recognition model is used to recognize unlabeled data to obtain a recognition result, the recognition result comprises the type to which the unlabeled data belongs and a confidence for the corresponding type, and the type is related to the content of the unlabeled data; a second acquisition unit, configured to acquire the corresponding remaining unlabeled data from the unlabeled data set; and a determining unit, configured to determine a labeled sample data set based on the first data set, the type recognition model and the remaining unlabeled data.

In a fourth aspect, the present application provides an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of the first aspect, the second aspect, or any possible implementation of the first aspect or the second aspect, via execution of the execu