CN-120544215-B - Label noise detection method based on multi-time-step loss sequence
Abstract
The invention discloses a label noise detection method based on a multi-time-step loss sequence. A character recognition module recognizes characters from an input image; an MSL-MentorNet label noise detection module analyzes the loss sequence of the input image over consecutive training epochs according to the character recognition result, dynamically assigns sample weights to detect and filter noise labels, and feeds the screened data back to the character recognition module to optimize model training. The invention adopts MSL-MentorNet label noise detection together with a noise filtering mechanism based on dynamic curriculum learning: it analyzes the loss sequence of each sample over consecutive training epochs, uses a BiLSTM network to distinguish noise labels from clean hard samples, and combines a linear scheduling strategy to gradually increase the amount of training data. The method improves recognition accuracy on complex text images and is particularly suited to robust recognition and noise filtering of complex text images in scenarios such as ancient-book digitization, historical document restoration, and industrial OCR quality inspection.
Inventors
- LU MIN
- LUO ZICHENG
- SHI BAO
- LIU NA
Assignees
- 内蒙古工业大学 (Inner Mongolia University of Technology)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-05-21
Claims (8)
- 1. A label noise detection method based on a multi-time-step loss sequence, characterized by comprising the following steps: Step 1, recognizing characters from an input image with a character recognition module; Step 2, analyzing, with an MSL-MentorNet label noise detection module and according to the character recognition result, the loss sequence of the input image over consecutive training epochs, dynamically assigning sample weights to detect and filter noise labels, and feeding the screened data back to the character recognition module to optimize model training; before the MSL-MentorNet detection of Step 2 is executed, coarse alignment processing is performed on an initial clean subset by feature clustering or a distribution similarity measure, dividing the samples into a well-aligned subset and a poorly-aligned subset; the well-aligned subset is retained for subsequent noise injection, while the poorly-aligned subset is set aside or removed outright, so as to avoid potential noise interference; the MSL-MentorNet label noise detection module comprises: a loss sequence generating unit, which records the loss value of each input image over the consecutive training epochs and constructs a loss sequence; a BiLSTM discrimination network, which takes the loss sequence, the label type and training progress information as input and outputs a sample weight reflecting the probability that the sample is a noise sample, thereby detecting and distinguishing noise-labelled samples; and a linear scheduling unit, which dynamically adjusts the proportion of input images participating in training according to a preset noise rate and the sample weights output by the BiLSTM discrimination network, thereby optimizing the training data; the linear scheduling unit adjusts the amount of data participating in training according to the following rules: Step 1, in the initial stage, only high-weight samples participate in training, a high-weight sample being one judged by the BiLSTM discrimination network in the MSL-MentorNet label noise detection module to be more reliable and more likely to be correctly labelled; and Step 2, the sample retention rate is increased linearly with the training epoch until it reaches the preset value 1 - epsilon, where epsilon is the noise rate.
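For illustration only (this sketch is not part of the claims), the linear scheduling rule of claim 1 can be written as follows; the function names, the initial retention rate, and the ranking by weight are assumptions chosen to make the sketch self-contained.

```python
def retention_rate(epoch, total_epochs, noise_rate, initial_rate=0.5):
    """Linearly grow the fraction of samples kept for training from an
    assumed initial value up to the ceiling 1 - epsilon, where epsilon
    is the preset noise rate (Step 2 of the scheduling rule)."""
    ceiling = 1.0 - noise_rate
    progress = min(epoch / max(total_epochs - 1, 1), 1.0)
    return initial_rate + (ceiling - initial_rate) * progress

def select_training_subset(weights, epoch, total_epochs, noise_rate):
    """Keep the highest-weight samples first (those the discrimination
    network judged most likely to be correctly labelled), up to the
    current retention rate (Step 1 of the scheduling rule)."""
    rate = retention_rate(epoch, total_epochs, noise_rate)
    k = max(1, int(round(rate * len(weights))))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])
```

Early in training only the most reliable samples are selected; as the epoch counter advances, the retained fraction grows linearly toward 1 - epsilon.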
- 2. The method for detecting label noise based on a multi-time-step loss sequence according to claim 1, wherein the character recognition module performs geometric correction, visual feature extraction and sequence decoding on the input image to obtain the character recognition result, the geometric correction rectifying curved and tilted text through a thin-plate spline conversion layer, and the thin-plate spline conversion layer realizing the image transformation by means of a differentiable grid.
- 3. The method for detecting label noise based on a multi-time-step loss sequence according to claim 1 or 2, wherein the character recognition module comprises: a thin-plate spline conversion layer, which predicts reference points of the input image through differentiable matrix operations, calculates TPS transformation parameters based on the reference points and fixed base reference points, generates a differentiable grid, and produces a corrected regular text image by interpolation; a ResNet feature extraction layer, which extracts multi-scale visual features from the corrected regular text image; a bidirectional LSTM layer, which performs context modeling on the feature sequence; and an attention decoding layer, which decodes the feature sequence into a character sequence, namely the character recognition result, in combination with an implicit language model.
- 4. The method for detecting label noise based on a multi-time-step loss sequence according to claim 3, wherein the thin-plate spline conversion layer is implemented as follows: Step 1, predicting K reference points of the input image with a localization network; Step 2, calculating the TPS transformation parameters with a grid generator, based on the reference points and the fixed base reference points, through differentiable matrix operations, and generating a differentiable grid; and Step 3, based on the differentiable grid, performing pixel mapping and weighted-average computation through bilinear interpolation with a sampler to generate the corrected regular text image, the bilinear interpolation also supporting gradient computation during back-propagation so that the thin-plate spline conversion layer can be optimized in neural network training.
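For illustration only (not part of the claims), the weighted-average computation of the sampler in Step 3 above is ordinary bilinear interpolation, sketched here for a single pixel of a greyscale image; the function name and the list-of-rows image representation are assumptions.

```python
import math

def bilinear_sample(image, x, y):
    """Sample an image (a list of rows of grey values) at a fractional
    coordinate (x, y) as the weighted average of its 4 nearest pixels.
    This weighted average is smooth in (x, y), which is what makes the
    sampling step differentiable for back-propagation."""
    h, w = len(image), len(image[0])
    x0 = min(max(int(math.floor(x)), 0), w - 1)
    y0 = min(max(int(math.floor(y)), 0), h - 1)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

In practice this per-pixel sampling is done in batch by the deep-learning framework (e.g. a grid-sampling operator), applied at every coordinate of the differentiable grid produced by the grid generator.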
- 5. The method for detecting label noise based on a multi-time-step loss sequence according to claim 3, wherein the attention decoding layer employs a beam search strategy and generates the character sequence as follows: Step 1, computing a semantic vector as the weighted sum of the encoder hidden states; Step 2, combining the embedding of the character predicted at the previous time step and updating the decoding state through an LSTM; and Step 3, outputting the character probability distribution of the current time step.
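For illustration only (not part of the claims), Step 1 of the decoding procedure above can be sketched as a single attention step; the claim does not specify a scoring function, so the dot-product scoring used here is an assumption, as are all names.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Score each encoder hidden state against the decoder state
    (dot product, assumed), softmax-normalise the scores, and return
    the semantic vector as the weighted sum of encoder states."""
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]   # attention weights, sum to 1
    dim = len(encoder_states[0])
    context = [sum(a * hs[j] for a, hs in zip(alphas, encoder_states))
               for j in range(dim)]
    return context, alphas
```

The semantic (context) vector would then be concatenated with the previous character embedding and fed to the decoder LSTM (Step 2), whose output yields the character distribution of the current time step (Step 3).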
- 6. The method for detecting label noise based on a multi-time-step loss sequence according to claim 1, wherein the loss sequence generating unit constructs the loss sequence as follows: let the time steps in the training process be t = 1, 2, …, T, where T is the total number of time steps, and the consecutive training epochs be n = 1, 2, …, N, where N is the total number of training epochs; for the i-th input image, at each time step of each training epoch the image is fed into the character recognition module and the MSL-MentorNet label noise detection module for training, and at each training a loss value is computed from the model prediction and the true label using the cross-entropy loss function l_i^(t,n) = −Σ_{c=1}^{C} y_{i,c} · log p_{i,c}, where C is the number of classes, y_{i,c} is the c-th component of the one-hot true label, and p_{i,c} is the probability the model predicts for class c; the loss values are recorded in order over the time steps of the consecutive training epochs to construct the loss sequence L_i = [l_i^(1,1), l_i^(2,1), …, l_i^(T,N)], which reflects the loss variation of each image across the multiple time steps and consecutive training epochs, where i = 1, 2, … indexes the input images.
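For illustration only (not part of the claims), the loss-sequence construction of claim 6 can be sketched as follows; the nested-list layout of the recorded prediction history is an assumption.

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy loss for a one-hot label: since y_{i,c} is 1 only
    at the true class, the sum collapses to -log of the probability the
    model assigns to that class."""
    return -math.log(probs[true_class])

def build_loss_sequence(prob_history, true_class):
    """prob_history[n][t] holds the predicted class distribution for one
    image at time step t of epoch n; recording the losses in order over
    time steps and epochs yields the multi-time-step loss sequence L_i."""
    return [cross_entropy(probs, true_class)
            for epoch in prob_history
            for probs in epoch]
```

A noisy label typically keeps the model's probability on the (wrong) recorded class low, so its loss sequence stays high across epochs, while a clean sample's losses decay; this shape difference is what the BiLSTM discrimination network learns to exploit.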
- 7. The method for detecting label noise based on a multi-time-step loss sequence according to claim 1, wherein the BiLSTM discrimination network is trained as follows: Step 1, synthesizing IDN and RCN noise data on a clean subset, the clean subset being a portion of data selected from the original dataset whose labels are accurate and contain no noise; Step 2, taking the loss sequence, the label type and the training-epoch percentage as input and a binary weight as output, so as to minimize the mean squared error; and Step 3, updating the network parameters with an Adam optimizer.
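For illustration only (not part of the claims), Step 2 above fixes the input/output contract of the discrimination network; a minimal sketch of that contract, with all names and the flat-concatenation encoding assumed, is:

```python
def discriminator_input(loss_sequence, label_type, epoch_pct):
    """Assemble the three inputs named in the claim into one feature
    vector: the per-time-step losses, an (assumed numeric) encoding of
    the label type, and the training-epoch percentage."""
    return list(loss_sequence) + [float(label_type), float(epoch_pct)]

def mse(predicted_weights, target_weights):
    """Mean squared error minimised in Step 2; targets are binary
    weights (1 = clean sample, 0 = injected-noise sample)."""
    return sum((p - t) ** 2
               for p, t in zip(predicted_weights, target_weights)) / len(predicted_weights)
```

The BiLSTM itself and the Adam update of Step 3 are supplied by any deep-learning framework; what the claim pins down is this supervision signal: synthetic IDN/RCN samples give target weight 0, untouched clean samples give target weight 1.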
- 8. The method for detecting label noise based on a multi-time-step loss sequence according to claim 7, wherein the IDN noise data is synthesized on the clean subset as follows: 1) setting an IDN noise rate ρ_IDN, which represents the proportion of samples in the clean subset to which feature-dependent noise is to be added; 2) randomly selecting n_IDN = ρ_IDN × P samples from the clean subset, where P is the total number of samples in the clean subset; 3) analyzing the features of each selected sample; and 4) modifying the label according to the association between the sample features and the features of other classes, so as to generate a noise label correlated with the sample features, completing the synthesis of the IDN noise data; and the RCN noise data is synthesized on the clean subset as follows: 1) setting an RCN noise rate ρ_RCN, which represents the proportion of samples in the clean subset to which random class noise is to be added; 2) randomly selecting n_RCN = ρ_RCN × P samples from the clean subset; and 3) randomly replacing the true label of each selected sample with a label of another class in the dataset, completing the synthesis of the RCN noise data.
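For illustration only (not part of the claims), the RCN synthesis of claim 8 can be sketched directly; IDN synthesis is not shown because it additionally requires a feature-dependent corruption model (steps 3–4 of the IDN procedure), which the claim leaves open. Function and parameter names are assumptions.

```python
import random

def synthesize_rcn(labels, num_classes, noise_rate, rng=None):
    """Random class noise: flip the labels of a noise_rate fraction of
    the clean subset (n_RCN = rho_RCN * P samples) to a uniformly chosen
    *different* class, so every corrupted label is guaranteed wrong."""
    rng = rng or random.Random(0)   # fixed seed here only for reproducibility
    labels = list(labels)           # leave the caller's clean labels intact
    n_flip = int(noise_rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels
```

The corrupted samples receive target weight 0 and the untouched samples target weight 1 when training the discrimination network of claim 7.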
Description
Label noise detection method based on multi-time-step loss sequence
Technical Field
The invention belongs to the technical field at the intersection of computer vision and natural language processing, and particularly relates to a label noise detection method based on a multi-time-step loss sequence.
Background
Character recognition technology is widely applied at the intersection of computer vision and natural language processing. However, label noise problems often exist in annotated data. Traditional character recognition models rely on large amounts of annotated data for training, and noise labels are easily introduced during manual annotation through subjective errors, inconsistent standards and the like. These noise labels interfere with model training, causing the model to learn erroneous patterns and reducing its recognition accuracy and generalization ability. Existing methods for handling label noise have many limitations. Manual re-annotation, while it removes noise, is time-consuming, labor-intensive and costly for large-scale data. Methods based on statistical analysis, such as filtering noise labels by computing sample loss statistics, struggle to cope with complex noise patterns and have poor accuracy. Deep-learning-based methods attempt to design specific network structures to learn noise characteristics, but they often rely on a great deal of prior knowledge, involve complex network designs, and generalize poorly across different datasets and noise types. This is especially true for Mongolian: it is an alphabetic script whose letters take different forms depending on their position within a word, and it has no obvious spaces between words, which increases the difficulty of character segmentation, feature extraction and related steps for Mongolian text images; existing label noise detection methods have difficulty adapting to these characteristics.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a label noise detection method based on a multi-time-step loss sequence, so as to improve the accuracy and reliability of Mongolian character recognition. To this end, the technical scheme adopted by the invention is as follows. A label noise detection method based on a multi-time-step loss sequence comprises the following steps: Step 1, recognizing characters from an input image with a character recognition module; and Step 2, analyzing, with the MSL-MentorNet label noise detection module and according to the character recognition result, the loss sequence of the input image over consecutive training epochs, dynamically assigning sample weights to detect and filter noise labels, and feeding the screened data back to the character recognition module to optimize model training. In one embodiment, the character recognition module performs geometric correction, visual feature extraction and sequence decoding on the input image to obtain the character recognition result, the geometric correction rectifying curved and tilted text through a thin-plate spline conversion layer, and the thin-plate spline conversion layer realizing accurate image transformation by means of a differentiable grid.
In one embodiment, the character recognition module includes: a thin-plate spline conversion layer, which predicts reference points of the input image through differentiable matrix operations, calculates TPS transformation parameters based on the reference points and fixed base reference points, generates a differentiable grid, and produces a corrected regular text image by interpolation, the fixed base reference points being a set of fixed coordinate points preset for the image correction process, which provide a standardized reference frame for curved or tilted text images so that the text region of the original image can be corrected into a regular rectangular form through thin-plate spline (TPS) transformation; a ResNet feature extraction layer, which extracts multi-scale visual features from the corrected regular text image; a bidirectional LSTM layer, which performs context modeling on the feature sequence; and an attention decoding layer, which decodes the feature sequence into a character sequence, namely the character recognition result, in combination with an implicit language model. In one embodiment, the thin-plate spline conversion layer is implemented as follows: Step 1, predicting K reference points of the input image with a localization network; Step 2, calculating the TPS transformation parameters with a grid generator, based on the reference points and the fixed base reference points, through differentiable matrix operations, and generating a differentiable grid; and Step 3, based on the differentiable grid, performing pixel mapping and weighted-average computation through bilinear interpolation with a sampler to generate a corrected regular text image,