CN-119992563-B - Plant station wiring diagram text robust generalization detection and recognition method based on improved SwinTextSpotter v2
Abstract
The invention belongs to the field of smart power grids and computer vision, and particularly relates to a robust, generalizable plant station wiring diagram text detection and identification method based on an improved SwinTextSpotter v2. The method comprises the following steps: 1) the input image is sent to a multi-modal-learning text detection and recognition network for training and prediction; a shared feature map is obtained through a shared feature extraction backbone network, and the shared feature map is further sent to a text detection module to obtain a text detection result and a text feature map; 2) the text feature map is sent to a visual feature extraction and prediction module to obtain a feature sequence, and the predicted feature sequence is then matched with the canonical representations produced by a character structure feature extraction and prediction module to obtain a recognition result; and so on. The method robustly improves the detection and recognition accuracy of the model on irregular text and Chinese-character text, and improves its generalization when detecting and recognizing many kinds of wiring diagram text.
Inventors
- ZHANG DONGDONG
- ZHAO YUQIAN
- CHENG DAWEI
Assignees
- 同济大学
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-01-16
Claims (6)
- 1. The plant station wiring diagram text robust generalization detection and identification method based on the improved SwinTextSpotter v2 is characterized by comprising the following steps:
Step 1, the input image is sent to a multi-modal-learning text detection and recognition network for training and prediction; a shared feature map is obtained through a shared feature extraction backbone network, and the shared feature map is further sent to a text detection module to obtain a text detection result and a text feature map;
Step 2, the text feature map is sent to a visual feature extraction and prediction module to obtain a feature sequence, and the predicted feature sequence is then matched with the canonical representation obtained by the character structure feature extraction and prediction module to obtain a recognition result;
Step 3, the text recognition result is sent to a fine-tuning post-processing module, and part of the recognition result is adjusted based on power-grid prior knowledge to obtain the final text recognition result;
Step 4, the text detection result, the text recognition result and the corresponding ground truth are used to compute the detection and recognition losses;
Step 5, the whole network model is jointly optimized according to the losses;
Step 6, the general scene text data set and the power grid station wiring drawing data set are each divided into a training set and a test set; the general scene training set is screened with a data-mining-based training-set screening strategy and used for model pre-training, while the wiring drawing training set is used for fine-tuning training; pre-training and fine-tuning training cyclically execute steps 1-5 until the network converges, and the model file is saved;
Step 7, a new station wiring drawing dataset is used to build a dual-stream feature extraction network based on text region mask generation, and incremental learning is performed on the current model with a knowledge-distillation-based multi-class wiring diagram text detection and recognition incremental learning strategy;
Step 8, the test drawing is input into the model designed and trained in steps 1-7 to obtain the detection and recognition results.
In step 2:
The visual feature extraction and prediction module is based on multi-level attention; it extracts receptive-field features of different scales through local and global attention branches, further captures long-range pixel relations by introducing dilated (atrous) convolution, improves the fitting of neighborhood and global features, and, by virtue of joint optimization, uses the recognition loss to optimize the detection branch so as to correct the text detection segmentation results.
The character structure feature extraction and prediction module is based on CCR-CLIP; an image encoder and a character encoder are constructed and pre-trained with a contrastive loss, so that character structure features are introduced into the visual model and the recognition of Chinese characters is improved from the perspective of multi-modal learning. To improve the fitting of the multi-modal predictions, a multi-modal prediction fusion module based on contrastive learning is designed: the prediction of the visual branch and the canonical character representation of the character structure feature branch are each passed through a linear layer and then multiplied, and the text recognition accuracy of the model in complex scenes is improved through a series of convolution structures and linear layers.
The multi-level-attention visual feature extraction and prediction module comprises three sub-modules: a multi-level attention branch, a recognition conversion module, and a text recognition encoder-decoder based on equal-scale up-sampling (an illustrative sketch of the attention branches is given after this claim).
The multi-level attention branch comprises local attention and global attention. The local attention branch adopts a residual dilated convolution structure and a window self-attention mechanism to attend to local detail features of the image; the dilated convolution builds a multi-scale receptive field in each convolution layer, which helps the model extract texture features of different granularities, while the residual dilated structure learns and preserves finer spatial information during back-propagation. On this basis, the residual dilated convolution structure allows features of irregular text regions to be extracted more accurately, so that irregular characters receive more reasonable and accurate weights in the subsequent attention mechanism. Concretely, the RoI feature map $F$ is fed into the residual dilated convolution structure
$$F_{d}=F+\mathrm{DConv}(F),$$
where $\mathrm{DConv}$ is built from a dilated convolution layer with a 5×5 kernel and a dilation rate of 2 together with an accompanying layer. The fitted features are then fed into the window self-attention that captures local dependencies:
$$\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}}+B\right)V,$$
where $Q$, $K$, $V$ are the query, key and value matrices of the self-attention mechanism, $d$ is the dimension of the key/value matrix, $M^{2}$ is the number of sampling points in a window, and $B\in\mathbb{R}^{M^{2}\times M^{2}}$ is the relative position bias introduced into the self-attention weight matrix to realize relative position encoding and construct the local dependencies inside an image window. Finally, the window self-attention output is passed through a feed-forward network $\mathrm{FFN}$ to obtain a non-linear transformation of the features, enabling the network to capture higher-order image features.
The global attention branch adopts dilated convolution and a multi-head self-attention mechanism to attend to the global contour features of the image: dilated convolution enlarges the convolution receptive field to obtain a feature map $F_{c}$ that fits neighborhood spatial information; $F_{c}$ is taken as the query matrix, its globally average-pooled version as the key and value matrices, and global dependencies are captured through multi-head self-attention:
$$F_{g}=\mathrm{MHSA}\big(F_{c},\ \mathrm{GAP}(F_{c}),\ \mathrm{GAP}(F_{c})\big),$$
where $\mathrm{GAP}$ is the global average pooling layer and $\mathrm{MHSA}$ is the multi-head self-attention mechanism, with the number of heads set to 8. Through this structure, every position of the dilated-convolution feature map interacts with the global features to form the self-attention output, comparing each position with the global information of the whole image to extract long-range dependencies and global context; through the multi-head arrangement, the structure captures rich context information from different subspaces and improves the feature expression ability and global feature fitting of the network in complex scenes.
The recognition conversion module is the original SwinTextSpotter v2 structure and is used to generate a tight text-region mask so as to realize two-stage joint optimization of text detection and recognition.
The text recognition encoder-decoder based on equal-scale up-sampling follows the SwinTextSpotter v2 architecture overall; on the basis of the original SwinTextSpotter v2 text recognition encoder-decoder, a bilinear-interpolation up-sampling layer is introduced into the sequence encoding to raise the resolution of the text feature map along the height dimension, so that details along the image height are richer, pixels are smoother, and more high-frequency information is preserved during down-sampling.
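A minimal PyTorch sketch of the two attention branches described above, under stated assumptions: module and parameter names (LocalBranch, GlobalBranch, window size, head counts) are illustrative, and the relative position bias $B$ is omitted for brevity; this is not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBranch(nn.Module):
    """Residual dilated convolution followed by window self-attention."""
    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        # 5x5 dilated conv, dilation 2 (padding keeps the spatial size)
        self.dilated = nn.Conv2d(dim, dim, 5, padding=4, dilation=2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.window = window

    def forward(self, x):                       # x: (B, C, H, W), H and W divisible by window
        x = x + self.dilated(x)                 # residual dilated convolution
        B, C, H, W = x.shape
        w = self.window
        # partition into non-overlapping windows -> (B*nW, w*w, C)
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)
        t = t + self.attn(t, t, t, need_weights=False)[0]   # window self-attention
        t = t + self.ffn(t)                                  # feed-forward transform
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(B, C, H, W)

class GlobalBranch(nn.Module):
    """Dilated-conv features as queries, globally pooled features as keys/values."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.dilated = nn.Conv2d(dim, dim, 3, padding=2, dilation=2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        q = self.dilated(x)
        B, C, H, W = q.shape
        q_seq = q.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        kv = F.adaptive_avg_pool2d(q, 1).flatten(2).transpose(1, 2)  # pooled K/V
        out, _ = self.attn(q_seq, kv, kv)                    # multi-head attention
        return out.transpose(1, 2).view(B, C, H, W)

if __name__ == "__main__":
    feat = torch.randn(2, 64, 28, 28)                        # RoI feature map
    fused = LocalBranch(64)(feat) + GlobalBranch(64)(feat)
    print(fused.shape)                                       # torch.Size([2, 64, 28, 28])
```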
- 2. The method for robust generalization detection and identification of station wiring diagram text based on the improved SwinTextSpotter v2 as claimed in claim 1, wherein the CCR-CLIP-based character structure feature extraction and prediction module is composed of an image encoder and a character encoder; the module extracts the structural stroke features of Chinese characters to build their canonical representations, so that the recognition result of a Chinese character is obtained through the idea of contrastive learning, and the canonical character representations are obtained by a single pre-training. The training set of this module consists of printed images of all characters and the radical/stroke sequences of those characters; the image encoder is responsible for extracting the visual features of the input character image, and the text encoder extracts the features of the corresponding radical sequence. Specifically, the image encoder uses ResNet-50 as the backbone network to obtain image features, which are globally average-pooled and then embedded into the visual feature space as
$$v=W_{v}\,\mathrm{GAP}(F_{img}),$$
where $W_{v}$ is a projection matrix used to align dimensions. The text encoder consists of a 2-layer Transformer encoder and an embedding layer; the radical sequence is encoded into sequence features $\{h_{1},\dots,h_{L}\}$, where $h_{i}$ is the output feature of the $i$-th time step and $L$ is the length of the radical sequence; the feature of the last step is taken as the sequence representation and embedded into the text feature space as
$$t=W_{t}\,h_{L},$$
where $W_{t}$ is a projection matrix. To match image and character structure features, a contrastive loss between images and character structures is designed, which makes the loss between corresponding image/character-structure pairs as small as possible and that between non-corresponding pairs as large as possible. To reduce prediction errors caused by different font styles and similar characters, a contrastive loss over the visual features of the input images is also introduced, which pulls together images that share the same radical-stroke label and pushes apart visually similar but different characters. The final loss function of the CCR-CLIP model combines the two contrastive losses (a hedged formulation is sketched after this claim). From this point on, the CCR-CLIP model can be trained on printed Chinese character images, and the text encoder is used to generate the canonical representations of all candidate Chinese characters.
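A minimal sketch of a CCR-CLIP-style character-structure branch as described in claim 2: a ResNet-50 image encoder and a 2-layer Transformer radical-sequence encoder projected into a shared space and trained with a symmetric contrastive loss. Names such as RadicalEncoder, the embedding dimension and the temperature value are assumptions, not the patented settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class RadicalEncoder(nn.Module):
    """2-layer Transformer encoder over the radical/stroke sequence of a character."""
    def __init__(self, vocab_size, dim=256, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim, embed_dim)            # projection matrix W_t

    def forward(self, radical_ids):                      # (B, L) radical indices
        h = self.encoder(self.embed(radical_ids))        # (B, L, dim)
        return self.proj(h[:, -1])                       # last step -> (B, embed_dim)

class CharImageEncoder(nn.Module):
    """ResNet-50 backbone + global average pooling + projection W_v."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # ends in GAP
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images):                            # (B, 3, H, W) printed glyphs
        f = self.features(images).flatten(1)              # (B, 2048)
        return self.proj(f)                               # (B, embed_dim)

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """CLIP-style loss: matching image/radical pairs close, non-matching pairs far."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                  # (B, B) similarity matrix
    target = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.t(), target)) / 2

if __name__ == "__main__":
    imgs = torch.randn(4, 3, 32, 32)
    rads = torch.randint(0, 500, (4, 12))
    loss = contrastive_loss(CharImageEncoder()(imgs), RadicalEncoder(500)(rads))
    print(loss.item())
```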
- 3. The method for robust generalization detection and recognition of plant wiring diagram text based on the improved SwinTextSpotter v2 as claimed in claim 1, wherein in the contrastive-learning-based multi-modal prediction fusion module the prediction of the visual branch and the canonical character representation of the character structure feature branch are each passed through a linear layer and then multiplied; prediction fusion is further performed by a set of depthwise separable convolutions, and finally the final text recognition result is obtained, the output dimensions of the two linear layers being kept consistent so that the two branches can be multiplied (an illustrative sketch follows this claim).
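A sketch of the fusion step in claim 3, under assumptions: the patent only states that the two linear-layer outputs are multiplied and then refined by depthwise-separable convolutions, so the aggregation of the canonical representations via the softmax of the visual logits, the layer counts and the dimensions below are illustrative choices, not the patented design.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, num_classes, dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(num_classes, dim)    # linear layer for visual logits
        self.chr_proj = nn.Linear(dim, dim)            # linear layer for canonical reps
        # a small stack of depthwise-separable 1D convolutions over the sequence
        self.dsconv = nn.Sequential(
            nn.Conv1d(dim, dim, 3, padding=1, groups=dim),   # depthwise
            nn.Conv1d(dim, dim, 1),                          # pointwise
            nn.ReLU(),
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, vis_logits, char_reps):
        # vis_logits: (B, T, num_classes) from the visual branch
        # char_reps:  (num_classes, dim) canonical representations of all characters
        v = self.vis_proj(vis_logits)                        # (B, T, dim)
        c = self.chr_proj(char_reps)                         # (num_classes, dim)
        fused = v * (vis_logits.softmax(-1) @ c)             # multiply the two branches
        fused = self.dsconv(fused.transpose(1, 2)).transpose(1, 2)
        return self.head(fused)                              # (B, T, num_classes)

if __name__ == "__main__":
    m = MultiModalFusion(num_classes=1000)
    out = m(torch.randn(2, 25, 1000), torch.randn(1000, 256))
    print(out.shape)   # torch.Size([2, 25, 1000])
```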
- 4. The method for robust generalization detection and recognition of station wiring diagram text based on the improved SwinTextSpotter v2 as set forth in claim 1, wherein in step 3:
The prior knowledge comprises electrical-element naming rules and identifier writing rules; the prior-knowledge-based fine-tuning post-processing strategy comprises regular-expression-based post-processing of adjacent text segmentation and merge-and-reconstruction-based post-processing of inconsistent text segmentation and omissions.
Specifically, the regular-expression-based adjacent text segmentation post-processing solves the problem of separate texts being detected as a whole because they lie too close together, while the merge-and-reconstruction-based post-processing solves the problems of a single text identifier being recognized as two parts because of a larger gap and of the 'I' character being omitted.
The regular-expression-based adjacent text segmentation post-processing proceeds as follows (a sketch is given after this claim):
S3.1.1 length-threshold screening: after text detection and recognition are completed, all text boxes whose aspect ratio is greater than a set threshold are extracted based on an aspect-ratio screening strategy; this effectively screens out the long, thin text boxes that are likely to cause segmentation problems;
S3.1.2 regular-expression matching: the long text is divided by a regular expression so that the divided text conforms to the set format;
S3.1.3 text box cutting: according to the regular-expression matching and segmentation results, the text boxes are cut using the relative positions of the text contents, yielding several processed new text boxes.
The merge-and-reconstruction-based post-processing of inconsistent text segmentation and omissions proceeds as follows:
S3.2.1 keyword screening: text boxes containing the keywords P, Q, temperature and gear are screened, and subsequent operations are performed on these boxes;
S3.2.2 extension and merging: for a selected text box that contains a keyword and only the keyword, the box is extended rightwards by a certain pixel range; if a text box whose content is a number lies within that range, it is merged with the original text box;
S3.2.3 traversal reconstruction: to solve the problem of the missing 'I' character, a certain pixel range below every text box containing the keywords is traversed; if a text box whose content is a number exists in that area, the numeric text box is extended leftwards by a certain distance to reconstruct a new text box, the distance being determined by the position of the text box containing the keyword.
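A minimal sketch of the regular-expression-based adjacent-text post-processing of S3.1.1-S3.1.3: filter elongated boxes by aspect ratio, split the recognized string with a pattern that reflects electrical-element naming rules, and cut the box proportionally. The threshold value, the pattern, and the box format are illustrative assumptions.

```python
import re

ASPECT_RATIO_THR = 5.0
# e.g. device designators such as "1QF" followed immediately by "2QF"
PATTERN = re.compile(r"\d+[A-Z]+|\d+")

def split_adjacent_texts(boxes):
    """boxes: list of dicts {'x': int, 'y': int, 'w': int, 'h': int, 'text': str}."""
    result = []
    for b in boxes:
        if b["w"] / max(b["h"], 1) <= ASPECT_RATIO_THR:
            result.append(b)                      # not elongated: keep as-is
            continue
        parts = PATTERN.findall(b["text"])        # regex segmentation of the long text
        if len(parts) < 2:
            result.append(b)
            continue
        total = len(b["text"]) or 1
        cursor = 0
        for p in parts:                            # cut the box proportionally to where
            start = b["text"].index(p, cursor)     # each part sits in the string
            cursor = start + len(p)
            result.append({
                "x": b["x"] + int(b["w"] * start / total),
                "y": b["y"],
                "w": int(b["w"] * len(p) / total),
                "h": b["h"],
                "text": p,
            })
    return result

if __name__ == "__main__":
    merged = [{"x": 0, "y": 0, "w": 300, "h": 20, "text": "1QF2QF"}]
    print(split_adjacent_texts(merged))   # two boxes: "1QF" and "2QF"
```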
- 5. The method for robust generalization detection and recognition of station wiring diagram text based on the improved SwinTextSpotter v2 as set forth in claim 1, wherein in step 6:
When the network model built in steps 1-5 is trained, a pre-train-then-fine-tune strategy is adopted: the model is first pre-trained on general scenes so that it has basic text detection and recognition ability, and is then fine-tuned with the power grid station wiring diagram data set.
The pre-training stage uses a data-mining-based training-set screening strategy: the network model built in steps 1-5 is first coarsely trained on the original general scene data set to obtain a low-precision model; the low-precision model is used to filter high-confidence data from the training set; a model based on the improved PP-OCRv3 is then taken as a high-precision model to filter low-confidence data; after the two filtering passes a high-quality training set is obtained; finally, the low-precision model is finely trained and iterated with the high-quality training set. Concretely (see the sketch after this claim):
(1) High-confidence data filtering based on the low-precision model. The proposed model is first coarsely trained on part of the training data to obtain a low-precision model capable of fast prediction on large-scale data; the low-precision model performs text detection and recognition on the tens-of-millions-scale general scene text data set, and text prediction boxes with confidence greater than 0.95 are screened out; this portion is regarded as redundant text data.
(2) Low-confidence data filtering based on the high-precision model. After redundant high-confidence text data have been removed in the previous step, and in order to further filter low-confidence training data, an algorithm model based on the improved PP-OCRv3 is used as a high-precision model to predict the remaining training data, and text prediction boxes with confidence less than 0.15 are classified as negative samples that are of poor quality or hard to recognize and could interfere with model training. Through these steps, redundant high-confidence data and low-confidence negative data are screened out of the general scene training set respectively, and the remaining training data are high-quality data that positively influence the next-stage pre-training of the model.
(3) Fine training and iteration. The screened high-quality training data are used to finely train the network model built in steps 1-5 to obtain a higher-precision network model, which is then used for the subsequent fine-tuning training based on the plant station wiring diagram drawings.
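A sketch of the two-pass screening in claim 5: a coarse (low-precision) model drops redundant samples with confidence above 0.95, and a high-precision reference model drops negative samples with confidence below 0.15. The `predict()` callables and the sample dictionary format are assumptions.

```python
from typing import Callable, Iterable, List

HIGH_CONF = 0.95   # redundant-sample threshold for the low-precision model
LOW_CONF = 0.15    # negative-sample threshold for the high-precision model

def screen_training_set(samples: Iterable[dict],
                        low_precision_predict: Callable[[dict], float],
                        high_precision_predict: Callable[[dict], float]) -> List[dict]:
    """Return the high-quality subset used for pre-training."""
    kept = []
    for s in samples:
        # Pass 1: drop samples the coarse model already handles with ease.
        if low_precision_predict(s) > HIGH_CONF:
            continue                      # redundant data
        # Pass 2: drop samples even the strong reference model cannot handle.
        if high_precision_predict(s) < LOW_CONF:
            continue                      # low-quality / hard negative data
        kept.append(s)
    return kept
```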
- 6. The method for robust generalization detection and recognition of station wiring diagram text based on the improved SwinTextSpotter v2 as set forth in claim 1, wherein in step 7:
The knowledge-distillation-based multi-class wiring diagram text detection and recognition incremental learning strategy takes the model obtained from the initial training as the original model; on the basis of the original model, all parameters of the text content recognition stage of step 2 are frozen into the new model, and only the incremental-learning feature extraction and the model parameters of the text detection stage of step 1 are involved. The specific training strategy is as follows:
Assume that before time $T$ the original model has finished its earlier training on the historical wiring diagram training set $D_{old}$; at time $T$ a new set of wiring diagrams to be trained appears. A partial historical wiring diagram dataset $D_{keep}$ is then built by randomly retaining part of the images of $D_{old}$ and is used together with the new wiring diagram training set $D_{new}$; in addition, all parameters of the text content recognition stage of the original model are frozen into the new model.
In the knowledge distillation of the new model, assume the batch size is $b$; samples are then drawn separately from $D_{keep}$ and $D_{new}$ for training. The input is sent to the dual-stream feature extraction network based on text region mask generation, which extracts image features and generates a coarse text region mask respectively; the masked feature map is sent to the subsequent modules to predict the text detection and recognition results, and the prediction loss $\mathcal{L}_{pred}$ is computed from the detection and recognition results and the ground-truth labels. The prediction loss comprises the original loss of the original model and the weighted cross-entropy loss of the text region mask generation branch:
$$\mathcal{L}_{pred}=\mathcal{L}_{box}+\mathcal{L}_{seg}+\mathcal{L}_{rec}+\lambda_{mask}\,\mathcal{L}_{mask},$$
where $\lambda_{mask}$ is the mask-branch scaling factor, $\mathcal{L}_{box}$ is the L1 loss of the text candidate boxes in the text detection stage, $\mathcal{L}_{seg}$ is the cross-entropy loss of the text segmentation results in the text detection stage, and $\mathcal{L}_{rec}$ is the loss of the text recognition stage.
To let the original model supervise the training of the new model, the training samples drawn from $D_{keep}$ are also sent to the original model, whose outputs, the detection prediction vectors and the soft labels of the text prediction sequences, are taken as prior knowledge; the distillation loss $\mathcal{L}_{distill}$ is computed between them and the predictions of the new model, where $p_{i}^{old}$ and $p_{i}^{new}$ denote the detection prediction vectors of the original model and the new model at the $i$-th position, $s_{j}^{old}$ and $s_{j}^{new}$ denote the text prediction sequences of the two models at the $j$-th position, and $N_{det}$ and $N_{seq}$ denote the numbers of detection predictions and text prediction sequences in the current round. Finally, the prediction loss and the distillation loss are added to obtain the loss function of the new-model knowledge distillation process:
$$\mathcal{L}=\mathcal{L}_{pred}+\beta\,\mathcal{L}_{distill},$$
where $\beta$ is the distillation-loss proportionality coefficient (a hedged loss sketch follows this claim).
In the incremental learning process, the constructed dual-stream feature extraction network based on text region mask generation comprises a feature extraction backbone based on the Swin Transformer and an FPN, and a lightweight text region mask generation branch based on an improved MobileNetv3. In the shared feature extraction process, the feature extraction branch over the whole image is kept unchanged: the Swin Transformer + FPN backbone is still used to obtain four feature maps of different scales. To improve the precision and generalization of the text detection stage across different kinds of wiring diagrams, the lightweight MobileNetv3-based text region mask generation branch is introduced on top of the original backbone: the input image, i.e. the whole input drawing, is fitted by the network layers into a coarse text region mask, which is then combined with the features of the original backbone. To ensure that the text region mask fully covers the text, a weighted cross-entropy loss function is designed that emphasizes recall while supervising the overall segmentation accuracy of the image.
Specifically, in the feature extraction stage the input, i.e. the original drawing image, is fed to a convolution layer with stride 2 to extract features, and then successively through 11 depthwise separable convolution blocks to extract depth feature information; the depthwise separable convolution block is the inverted residual structure of MobileNetv3. In the feature fusion stage the feature maps are fused based on deconvolution and up-sampling strategies to obtain a preliminary text region mask, whose dimension is further reduced by a linear layer to obtain a text region mask whose height and width are 1/4 of the original image. Finally, in the text region mask generation stage this mask is gradually down-sampled to obtain three further down-scaled text region masks, and the masks are dot-multiplied with the backbone feature maps to obtain the text-region-mask-based feature maps.
After the features of this network structure are extracted and fused, the obtained text region masks at four different scales represent the coarse regions of the drawing image where text exists; the text detection stage can then fit text candidate boxes based on the features of these regions, which requires the text region mask to completely cover all possible text regions. To achieve this objective, a weighted cross-entropy loss function is designed in the text region mask generation branch:
$$\mathcal{L}_{mask}=-\frac{1}{|\Omega|}\sum_{p\in\Omega}\Big[w_{pos}\,y_{p}\log\hat{y}_{p}+(1-y_{p})\log(1-\hat{y}_{p})\Big],$$
where $\hat{y}_{p}$ is the predicted value of the text region mask at pixel $p$, $y_{p}$ is the ground-truth label at pixel $p$, i.e. whether the pixel is contained in a ground-truth text box, and the positive-example weight hyper-parameter $w_{pos}$ is larger than 1, meaning that more attention is paid to positive pixels in the supervision: while the overall prediction of the text region is kept accurate, the recall of the text region is emphasized, so that the predicted coarse text region fully covers the ground truth.
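A minimal sketch of the incremental-learning losses in claim 6: a weighted binary cross-entropy for the text-region mask branch (positive weight greater than 1 to favour recall) and a distillation term that keeps the new model's detection and sequence predictions close to the frozen original model. The distance functions (MSE / KL) and the default coefficients lambda_mask and beta are assumptions, since the patent does not fix them here.

```python
import torch
import torch.nn.functional as F

def weighted_mask_loss(pred_mask, gt_mask, pos_weight=2.0):
    """pred_mask: (B, 1, H, W) logits; gt_mask: (B, 1, H, W) in {0, 1}."""
    return F.binary_cross_entropy_with_logits(
        pred_mask, gt_mask, pos_weight=torch.tensor(pos_weight))

def distillation_loss(old_det, new_det, old_seq, new_seq):
    """Match the new model's outputs to the original model's soft labels."""
    det_term = F.mse_loss(new_det, old_det)               # detection prediction vectors
    seq_term = F.kl_div(new_seq.log_softmax(-1),          # text prediction sequences
                        old_seq.softmax(-1), reduction="batchmean")
    return det_term + seq_term

def total_loss(task_loss, mask_loss, distill, lambda_mask=1.0, beta=1.0):
    """L = L_orig + lambda_mask * L_mask + beta * L_distill."""
    return task_loss + lambda_mask * mask_loss + beta * distill
```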
Description
Plant station wiring diagram text robust generalization detection and recognition method based on improved SwinTextSpotter v2
Technical field: The invention belongs to the field of smart power grids and computer vision, and particularly relates to a robust, generalizable plant station wiring diagram text detection and identification method based on an improved SwinTextSpotter v2.
Background: As the scale of power grid drawings grows, traditional manual identification of drawing content is inefficient and error-prone, so automating the management and lookup of grid drawings is urgent. Because of the complex information in power grid wiring diagrams, detecting the positions of textual annotations in the drawings and recognizing their content is essential for this automation. Text labels in grid station wiring diagrams tend to be varied, with different shapes, orientations and sizes; the characters involve not only Arabic numerals and letters but also many Chinese characters. In addition, the character scale within a text instance in a plant station wiring diagram often varies, and the drawings contain a large number of easily confused characters that carry no natural-language semantics. In recent years, deep-learning-based methods have achieved excellent results in optical character recognition, and deep-learning-based OCR has become a promising approach to the automation and intelligent processing of power grid wiring diagrams: applying deep learning to character recognition of engineering design drawings can effectively realize intelligent automatic recognition of power grid wiring diagrams.
Closest prior art and its evaluation: Li et al. (Shanbin L., Haoyu W., Junhao Z. Electrical cabinet wiring detection method based on improved YOLOv5 and PP-OCRv3. Proceedings of the 2022 Chinese Automation Congress (CAC 2022), November 25-27, 2022, Xiamen, China. IEEE.) use YOLOv5 and PP-OCRv3 to improve the accuracy of text detection and recognition in electrical cabinet wiring diagrams; Wei Wei et al. (Wei Wei, Long Na, Tian Yue, et al. Power plant nameplate text detection method based on improved DBNet. High Voltage Technology, 2023, 49: 63-67.) use pixel-level interpolation and pooling on top of DBNet to detect text contours in power plant nameplates more accurately; Liu Wei (Liu Wei. Secondary loop terminal strip text recognition and wiring verification system. Enshi national university, 2023.) introduces transfer learning to adapt OCR models to the application scenario of terminal strip design drawings. These results show that although the selected deep learning methods can detect and recognize text in simple scenes, challenges remain for more complex text detection and recognition scenarios, such as text located near primitives, and the mixed horizontal and vertical text of different scales found in substation wiring diagram datasets. SwinTextSpotter v2 is an end-to-end text detection and recognition model aimed at text spotting tasks in complex scenes.
Unlike traditional convolutional neural networks (CNNs), SwinTextSpotter utilizes the powerful feature extraction and modeling capability of the Swin Transformer [58], which can effectively capture long-distance context information and global features, thereby improving text region detection and text content recognition accuracy; it realizes iterative optimization with a query-based detector, gradually improving text detection accuracy, and uses a recognition conversion module to connect the detector and the recognizer, eliminating the error accumulation of two-stage text detection and recognition pipelines. However, SwinTextSpotter v2 performs poorly on small characters in Chinese text and on irregular text without natural-language semantics, and it does not generalize across wiring diagrams of many different specifications.
Disclosure of the Invention: The invention aims to provide a robust, generalizable plant station wiring diagram text detection and identification method based on an improved SwinTextSpotter v2. Addressing the robustness and generalization problems of existing power grid station wiring diagram text detection and recognition methods, the text position detection and content recognition tasks are trained jointly by introducing an end-to-end integrated baseline model, which greatly reduces error accumulation, and the robustness and generalization of the model are effectively improved through a series of deep-learning-based strategies. In the preprocessing stage, a training set screening strategy based on data mining is designed, which ensures model accuracy while