CN-121789197-B - Natural scene text detection method based on semantic feature gradual recombination
Abstract
The invention relates to a natural scene text detection method based on gradual semantic feature recombination. The method operates as follows: a visual concept matching module constructs visual concept clusters with a CLIP encoder and a clustering algorithm, and divides them into mixed concepts and pure concepts according to their text co-occurrence rate; a recombination weight controller combines the concept purity difference with a text saliency estimate to determine recombination weights; a semantic feature recombination module uses the adaptive recombination weights to recombine feature maps at two levels, local and global; and a dual-network collaborative optimization module predicts coordinate offsets and systematic errors with an independently trained boundary point displacement prediction network and a systematic error prediction network, respectively, and optimizes the text box coordinates iteratively. The semantic feature recombination module improves the generalization and robustness of the model, the dual-network collaborative optimization module corrects the coordinates of the initial text detection box, and the overall performance of natural scene text detection is significantly improved.
Inventors
- CHEN YUTONG
- WANG RUNMIN
- YI KE
- ZHANG HUI
- YE SHAN
- TANG HANYU
Assignees
- Hunan Normal University (湖南师范大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260306
Claims (5)
- 1. A natural scene text detection method based on progressive semantic feature recombination, characterized by comprising the following steps: a visual concept matching step: constructing visual concept clusters with a CLIP encoder and a clustering algorithm, dividing them into mixed concepts and pure concepts according to text co-occurrence rate, and obtaining mixed-pure concept matching pairs based on inter-concept similarity to form a pure visual concept library; a recombination weight determining step: using a recombination weight controller to determine recombination weights by combining the concept purity difference and the text saliency estimate, with the global recombination weight obtained from the cosine distance; a semantic feature recombination step: using a semantic feature recombination module to purify visual clutter factors at the region level and the global level, respectively, and obtain a recombined pure feature map, comprising: extracting multi-scale features of the natural scene image with an FPN to generate multi-level feature maps; matching the concatenated low-level feature maps against the pure visual concept library; computing, with the recombination weight controller, the region recombination weight of each region feature block from the difference between the matched local pure visual features and the local clutter visual features, and superimposing this difference on the original feature block according to the region recombination weight to obtain a region-recombined feature map; concatenating the region-recombined feature map with the high-level feature map and applying a 1×1 convolution layer for feature fusion and dimension reduction to obtain an input feature map; globally matching the input feature map against the pure visual concept library, obtaining the global recombination weight from the cosine distance to the matched pure visual features, and using a global recombination unit to superimpose the same adjustment vector on all spatial positions of the feature map to obtain the recombined feature map; and a dual-network collaborative optimization step: predicting correction quantities with a two-way approximator to correct the initial text candidate box and iteratively update the text detection box.
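The final stage of the recombination pipeline in claim 1, the global recombination unit, can be sketched in a few lines. This is a minimal NumPy sketch, assuming the global descriptor is the spatial mean of the fused feature map and the adjustment vector is the cosine-distance-weighted difference to the best-matching pure concept; neither detail is fixed by the claim, and `global_reorganize` is a hypothetical name:

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def global_reorganize(feat, pure_concepts):
    """Hypothetical sketch of the global recombination unit.

    feat: (C, H, W) input feature map after the 1x1-conv fusion.
    pure_concepts: (K, C) pure visual concept library.
    The global descriptor is matched to the most similar pure concept;
    the cosine distance serves as the global recombination weight, and
    the same adjustment vector is added at every spatial position.
    """
    g = feat.reshape(feat.shape[0], -1).mean(axis=1)  # global descriptor
    sims = [cosine_sim(g, p) for p in pure_concepts]
    best = int(np.argmax(sims))
    w_global = 1.0 - sims[best]                       # cosine distance as weight
    adjust = w_global * (pure_concepts[best] - g)     # shared adjustment vector
    return feat + adjust[:, None, None]               # broadcast to all positions
```

If the global descriptor already coincides with a pure concept, the distance is zero and the feature map passes through unchanged, which matches the intent of only correcting cluttered features.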
- 2. The method according to claim 1, wherein the visual concept matching step specifically comprises: extracting high-level semantic concept vectors with a CLIP encoder to construct an overall feature matrix; clustering the feature vectors with the K-means++ algorithm to obtain a plurality of visual concept clusters; and counting the co-occurrence rate of each visual concept with text, setting a threshold to divide the clusters into mixed concepts and pure concepts, and obtaining mixed-pure concept matching pairs according to similarity.
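The split-and-match logic of claim 2 can be illustrated with a toy sketch. The threshold value of 0.5, the direction of the split (high text co-occurrence treated as "mixed"), and the use of cluster centroids as concept vectors are all assumptions made for illustration, not details fixed by the claim:

```python
import numpy as np

def split_concepts(cluster_text_cooc, threshold=0.5):
    """Divide concept clusters by text co-occurrence rate (assumed direction:
    clusters frequently co-occurring with text are 'mixed', the rest 'pure')."""
    mixed = [c for c, r in cluster_text_cooc.items() if r >= threshold]
    pure = [c for c, r in cluster_text_cooc.items() if r < threshold]
    return mixed, pure

def match_mixed_to_pure(centroids, mixed, pure):
    """Pair each mixed concept with its most similar pure concept (cosine)."""
    pairs = {}
    for m in mixed:
        a = centroids[m]
        best, best_sim = None, -2.0
        for p in pure:
            b = centroids[p]
            sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            if sim > best_sim:
                best, best_sim = p, sim
        pairs[m] = best
    return pairs
```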
- 3. The method according to claim 1, wherein the recombination weight determining step specifically comprises adaptively determining the region recombination weight from the concept purity difference and the text saliency, and determining the global recombination weight from the cosine distance. The region recombination weight of the i-th region feature block is expressed as: λ_i = σ(α · ΔP_i + β · Ŝ_i + b), ΔP_i = P(c̃_i) − P(c_i), wherein σ(·) represents the Sigmoid activation function; α and β both represent trainable positive weight parameters; ΔP_i represents the concept purity difference of the i-th region feature block; P(·) represents the concept purity function; c̃_i represents the cleaned visual concept of the i-th region feature block; c_i represents the original visual concept of the i-th region feature block; Ŝ_i represents the preliminary text-saliency estimate of the i-th region feature block; and b represents a trainable bias parameter.
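The adaptive region weight of claim 3, a Sigmoid over a linear combination of the concept purity difference and the text-saliency estimate, reduces to a one-line function. In this sketch, `alpha`, `beta` and `bias` stand in for the trainable parameters, which the patent learns end to end:

```python
import math

def region_weight(purity_cleaned, purity_original, saliency,
                  alpha=1.0, beta=1.0, bias=0.0):
    """Sketch of the region recombination weight: Sigmoid of the concept
    purity difference and the preliminary text-saliency estimate, with
    stand-in values for the trainable weights and bias."""
    delta_p = purity_cleaned - purity_original  # concept purity difference
    return 1.0 / (1.0 + math.exp(-(alpha * delta_p + beta * saliency + bias)))
```

The Sigmoid keeps the weight in (0, 1): a region whose cleaned concept is much purer than its original concept, or whose text saliency is high, receives a larger recombination weight.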
- 4. The method according to claim 1, wherein the dual-network collaborative optimization step specifically comprises: a boundary point displacement prediction network for predicting the coordinate offset required to move a given input text box state to the target text box state; a systematic error prediction network for predicting the inherent, consistent systematic error of the upstream model caused by model bias and imbalance in the training data; and jointly correcting the initial text candidate box by combining the coordinate offset and the systematic error, finally obtaining a high-precision text box.
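The joint correction of claim 4 amounts to adding the two predicted quantities to the current box state for a small number of iterations. In this minimal sketch, `offset_net` and `bias_net` are hypothetical stand-ins for the two trained networks, modeled as plain callables on the current box state, and the step count of 3 is an assumed value:

```python
import numpy as np

def refine_box(box, offset_net, bias_net, steps=3):
    """Iteratively apply the predicted coordinate offset and the predicted
    systematic error of the upstream detector to the current box state."""
    box = np.asarray(box, dtype=float)
    for _ in range(steps):  # a small number of correction iterations
        box = box + offset_net(box) + bias_net(box)
    return box
```

With a toy offset network that predicts half the remaining displacement to the target and a zero systematic error, three iterations recover 87.5% of the gap, illustrating the geometric convergence of such an update.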
- 5. The method of claim 4, wherein intermediate text box state samples between the initial text box and the target text box are obtained by point-wise linear interpolation of the coordinates at the feature map scale: B(t) = (1 − t) · B_0 + t · B_gt, wherein t represents a continuous time variable, B_0 represents the initial text box state in the feature map coordinate system, and B_gt represents the manually annotated text box state at the feature map scale.
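The intermediate-state sampling of claim 5 is plain point-wise linear interpolation between the initial and annotated box states, which can be stated directly (the function name is ours; boxes are given as arrays of point coordinates at the feature map scale):

```python
import numpy as np

def interpolate_box(box_init, box_gt, t):
    """Point-wise linear interpolation of box states at the feature map
    scale: B(t) = (1 - t) * B_init + t * B_gt, with t in [0, 1]."""
    box_init = np.asarray(box_init, dtype=float)
    box_gt = np.asarray(box_gt, dtype=float)
    return (1.0 - t) * box_init + t * box_gt
```

Sampling t uniformly in [0, 1] yields the intermediate text box states used to supervise the displacement prediction network at every point along the path from the initial box to the ground truth.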
Description
Natural scene text detection method based on semantic feature gradual recombination
Technical Field
The invention relates to the field of natural scene text detection methods and discloses a natural scene text detection method based on progressive semantic feature recombination.
Background
Natural scene text detection is a key field at the intersection of artificial intelligence and computer vision research, and aims to accurately identify text regions in images and obtain the corresponding text detection box coordinates. The technology plays a vital role in image translation, traffic management and intelligent driving. Compared with document text images, the text detection task for natural scene images faces significant challenges: text forms are highly random, making it difficult for a model to generate a detection box that accurately fits the shape of the real text; background and text features are easily confused, increasing the risk of false recognition; and illumination changes or object occlusion in complex scenes damage the visual integrity of the text. Traditional natural scene text detection methods mainly rely on image processing techniques and mathematical models, such as threshold segmentation, connected component analysis, region growing, and region proposals. Such approaches can be effective on images of relatively simple structure, but have significant limitations on complex and varied natural scene images. With the progress of computer vision technology, natural scene text detection methods based on deep learning have become an important research direction. The current mainstream methods for improving natural scene text detection performance are highly dependent on the encoder structure.
Whether a convolutional neural network-based approach captures local feature information of an image or a vision Transformer (ViT)-based approach models global context information, one inherent limitation is often ignored: the learned feature representation is essentially a passive fit of all statistical associations in the training data, in which non-causal spurious associations between text and specific textures or overall styles are inevitably confounded. In the prior art, most deep-learning-based natural scene text detection methods focus on improving the internal architecture of the encoder; the training process of the encoder remains dependent on spurious correlations and is easily interfered with by background prior knowledge, causing semantic confusion and limiting the generalization capability and robustness of the model in complex natural scenes. Some existing methods attempt to introduce additional loss functions or structures aimed at improving the discriminative power of the model. However, such methods generally fail to fundamentally identify and strip out visual clutter factors or to proactively reconstruct pure feature maps, leaving room for improvement in establishing true causal relationships. Notably, suppressing confounding factors at the semantic level reduces the improper influence of background information on text detection at the level of causal relationships, cutting off spurious associations and constructing a pure discrimination environment. However, the application of semantic feature recombination in image processing, especially in fields with high precision requirements such as natural scene text detection, has not been fully explored.
Therefore, in the face of the inherent challenges of natural scene text detection (such as spurious statistical associations and systematic noise interference), and the large number of visual clutter factors introduced by data bias that are mixed into the local and global information fused by existing deep learning methods, there is a need for a natural scene text detection method that is efficient, accurate, and capable of effectively cutting off spurious associations.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a natural scene text detection method based on gradual semantic feature recombination. The method comprises: designing a semantic feature recombination module that purifies and recombines feature maps at the region and global levels according to recombination weights at a high semantic level, obtaining multi-level feature maps free of visual prior bias; and designing a dual-network collaborative optimization module that predicts coordinate offsets with a boundary point displacement prediction network, models and predicts the systematic noise of small-text, low-resolution and occluded images with a systematic error prediction network, and generates a high-precision text box in a small number of iterative steps. The