CN-113934721-B - Data cleaning method and computer readable storage medium

Abstract

The invention provides a data cleaning method comprising: calculating a first distance between the feature vectors of every two data items in an initial data set; constructing, according to the first distance, a neighbor set for each data item in one-to-one correspondence; training an original classification model on the initial data set to obtain an intermediate classification model; inputting each feature vector into the intermediate classification model to obtain a corresponding predicted label vector; constructing a sample set according to a second distance between the original label vector and the predicted label vector; updating the corresponding data according to the sample set and the neighbor set; and judging whether the number of data updates has reached a preset number. If it has not, the data are updated again; if it has, the current data form the target data set. This technical scheme can effectively clean the initial data set so that the labels of the data are more accurate.

Inventors

LIU GUOQING
YANG GUANG
WANG QICHENG
ZHENG WEI
KONG LINGYU
YANG GUOWU

Assignees

深圳佑驾创新科技有限公司

Dates

Publication Date: 20260505
Application Date: 20211215

Claims (6)

1. An image tag data cleaning method, characterized in that the image tag data cleaning method comprises:
   calculating a first distance between the feature vectors of every two data items in an initial data set, wherein the initial data set comprises a plurality of images, and the data in the initial data set comprise the feature vector of each image and an original label vector of each image;
   constructing, according to the first distance, a neighbor set for each data item in one-to-one correspondence;
   training an original classification model on the initial data set to obtain an intermediate classification model;
   inputting each feature vector into the intermediate classification model to obtain a corresponding predicted label vector;
   constructing a sample set according to a second distance between the original label vector and the predicted label vector;
   updating the corresponding data according to the sample set and the neighbor set;
   judging whether the number of times the data have been updated reaches a preset number;
   when the number of updates has not reached the preset number, training the intermediate classification model with the updated data as the initial data set to obtain a new intermediate classification model, and updating the data again using the new intermediate classification model; and
   when the number of updates reaches the preset number, forming the current data into a target data set;
   wherein constructing the sample set according to the second distance between the original label vector and the predicted label vector specifically comprises: calculating the second distance between the original label vector and the predicted label vector; and selecting a first quantity of data from the initial data set in ascending order of the second distance to form the sample set;
   wherein updating the corresponding data according to the sample set and the neighbor set specifically comprises: analyzing the intersection of the sample set and the neighbor set; judging whether the number of data items in the intersection is greater than or equal to a preset value; and, when the number of data items in the intersection is greater than or equal to the preset value, updating the data corresponding to the neighbor set according to the data in the intersection;
   wherein updating the data corresponding to the neighbor set according to the data in the intersection specifically comprises: calculating the average of the original label vectors of all data in the intersection as a first average vector; calculating the average of the first average vector and the original label vector of the data corresponding to the neighbor set as a second average vector; and updating the original label vector of the data corresponding to the neighbor set to the second average vector; or calculating a weighted average vector of the original label vectors of all data in the intersection and the original label vector of the data corresponding to the neighbor set; and updating the original label vector of the data corresponding to the neighbor set to the weighted average vector.
2. The image tag data cleaning method according to claim 1, wherein calculating the second distance between the original label vector and the predicted label vector comprises: calculating the Hamming distance between the original label vector and the predicted label vector as the second distance.
3. The image tag data cleaning method according to claim 1, wherein constructing, according to the first distance, a neighbor set for each data item in one-to-one correspondence specifically comprises: selecting each data item in the initial data set in turn as reference data; sorting the first distances between the reference data and the other data in the initial data set in ascending order; and selecting a second number of data items in sequence, starting from the data item with the smallest first distance, to form the neighbor set corresponding to the reference data.
4. The image tag data cleaning method according to claim 1, wherein calculating the first distance between the feature vectors of every two data items in the initial data set comprises: calculating the Euclidean distance between the feature vectors of every two data items as the first distance.
5. The image tag data cleaning method according to claim 1, wherein training an original classification model on the initial data set to obtain an intermediate classification model specifically comprises: inputting the feature vectors of the data in the initial data set into the original classification model to obtain corresponding training label vectors; and updating the parameters of the original classification model according to the original label vectors and the training label vectors to obtain the intermediate classification model.
6. A computer readable storage medium storing program instructions executable by a processor to implement the image tag data cleaning method of any one of claims 1 to 5.
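
The distance and set constructions recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names (euclidean, hamming, build_neighbor_sets, select_sample_set) and the parameters k (the "second number") and m (the "first quantity") are hypothetical, plain lists stand in for feature and label vectors, and ties are broken by index order, which the claims do not specify.

```python
import math


def euclidean(u, v):
    """First distance (claim 4): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hamming(y, y_hat):
    """Second distance (claim 2): number of positions where two label vectors differ."""
    return sum(1 for a, b in zip(y, y_hat) if a != b)


def build_neighbor_sets(features, k):
    """Claim 3: for each item, the indices of its k nearest other items.

    Assumption: k (hypothetical name) is the "second number" of the claim.
    """
    n = len(features)
    neighbor_sets = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Sort the other items by ascending first distance to the reference item.
        others.sort(key=lambda j: euclidean(features[i], features[j]))
        neighbor_sets.append(set(others[:k]))
    return neighbor_sets


def select_sample_set(labels, predictions, m):
    """Claim 1: indices of the m items with the smallest second distance.

    Assumption: m (hypothetical name) is the "first quantity" of the claim.
    """
    order = sorted(range(len(labels)),
                   key=lambda i: hamming(labels[i], predictions[i]))
    return set(order[:m])
```

Items whose original and predicted label vectors agree most closely end up in the sample set, which matches the description's intent of treating low-loss data as the more credible data.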

Description

Data cleaning method and computer readable storage medium

Technical Field

The present invention relates to the field of machine learning technologies, and in particular to a data cleaning method and a computer readable storage medium.

Background

Machine learning is now widely applied in fields such as data classification and data clustering. Training a machine learning model typically requires a large amount of labeled data. However, because of limitations such as unclear data and insufficient annotator expertise, annotators often make errors when labeling the data, so the data contain noisy labels. If a machine learning model is trained on a data set that contains noisy labels, the interference of those labels may degrade the performance of the model.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a data cleaning method and a computer readable storage medium that can effectively clean an initial data set so that the labels of the data are more accurate.

In a first aspect, an embodiment of the present invention provides a data cleaning method, where the data cleaning method includes:
   calculating a first distance between the feature vectors of every two data items in an initial data set, wherein the data in the initial data set comprise the feature vectors and original label vectors;
   constructing, according to the first distance, a neighbor set for each data item in one-to-one correspondence;
   training an original classification model on the initial data set to obtain an intermediate classification model;
   inputting each feature vector into the intermediate classification model to obtain a corresponding predicted label vector;
   constructing a sample set according to a second distance between the original label vector and the predicted label vector;
   updating the corresponding data according to the sample set and the neighbor set;
   judging whether the number of times the data have been updated reaches a preset number;
   when the number of updates has not reached the preset number, training the intermediate classification model with the updated data as the initial data set to obtain a new intermediate classification model, and updating the data again using the new intermediate classification model; and
   when the number of updates reaches the preset number, forming the current data into a target data set.

In a second aspect, embodiments of the present invention provide a computer readable storage medium for storing program instructions executable by a processor to implement the data cleaning method described above.

According to the data cleaning method and the computer readable storage medium, the neighbor set is constructed according to the first distance between the feature vectors of every two data items, so that data with similar features are found. The sample set is constructed according to the second distance between the original label vector and the predicted label vector of the data, so that data with higher reliability are selected as the sample set. The corresponding data are then updated according to the sample set and the neighbor set, so that the label vectors of the data become increasingly reliable over the course of the updates.

The method adopts the idea of label propagation. It uses data that have similar features and a smaller loss, that is, less noise, and relies on the observation that data with similar features very probably have similar labels: the labels of some of the data in the neighbor set are propagated to the corresponding data, which effectively enhances the labels of the data, that is, reduces noise. Because the sample set used for label propagation is built from data with low loss, that is, relatively credible data, the negative influence of noisy labels can be effectively avoided. Using only a small number of cleaning iterations avoids the data diversification and information loss that the label propagation process could otherwise cause. Cleaning the data of a multi-label data set in this way handles the interference of noisy labels well, reduces their influence on the data set, yields more accurate labels, and allows better classification results to be obtained subsequently.
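
The label update described above can be sketched in Python as follows. This is an illustrative reading of the first update variant in claim 1 (first average vector, then second average vector), not the patent's implementation. The names update_labels, clean and min_overlap and the callable select_samples are hypothetical; select_samples stands in for retraining the classification model and selecting the sample set by the second distance. The sketch also assumes that the neighbor sets are built once because the feature vectors are not modified, and that every update within a pass reads the labels as they stood before that pass; the source does not state either point.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def update_labels(labels, neighbor_sets, sample_set, min_overlap):
    """One cleaning pass over all items.

    labels:        list of label vectors, one per item
    neighbor_sets: list of sets of item indices, one set per item
    sample_set:    set of indices of the items treated as credible
    min_overlap:   hypothetical name for the "preset value" of claim 1
    """
    updated = [list(y) for y in labels]  # assumption: read from a snapshot
    for i, neighbors in enumerate(neighbor_sets):
        trusted = neighbors & sample_set  # intersection of the two sets
        if len(trusted) < min_overlap:
            continue  # too few credible neighbors: leave this label as is
        first_avg = mean_vector([labels[j] for j in trusted])
        # Second average: mean of the first average and the item's own label.
        updated[i] = mean_vector([first_avg, labels[i]])
    return updated


def clean(labels, neighbor_sets, select_samples, rounds, min_overlap):
    """Repeat the pass a preset number of times.

    select_samples(labels) is a hypothetical callable that retrains the model
    on the current labels and returns the indices of the sample set.
    """
    for _ in range(rounds):
        sample_set = select_samples(labels)
        labels = update_labels(labels, neighbor_sets, sample_set, min_overlap)
    return labels
```

Averaging the first average vector with the item's own label keeps half of the weight on the existing label in each pass, so a label moves toward its credible neighbors gradually rather than being overwritten.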

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from the structures shown in these drawings without inventive effort.

Fig. 1 is a flowchart of a data cleansing method according to a first embodiment of the pre