CN-121982692-A - Lightweight scene text recognition method for low-quality image
Abstract
The invention discloses a lightweight scene text recognition method for low-quality images, relating to the technical field of scene text recognition, and comprising the following steps: step 1, calculating the quality perception domain distance of each sample image in a collected training data set, and dividing the training data set into a plurality of batch data sets according to the quality perception domain distance; step 2, constructing a lightweight scene text recognition model comprising a lightweight feature extraction backbone network and a sequence decoding module connected in series; step 3, training the lightweight scene text recognition model on the plurality of batch data sets with a hierarchical training strategy to obtain a trained lightweight scene text recognition model; and step 4, inputting a collected image to be recognized into the trained lightweight scene text recognition model to generate a recognition result. The invention integrates lightweight feature extraction, quality-aware domain adaptation and progressive hierarchical fine-tuning to improve the efficiency, robustness and generalization capability of image text recognition.
Inventors
- GE LIHONG
- WEI JIAN
- ZHEN YINGCHAO
- ZHAO ZINI
- PAN HONGFANG
- ZHOU LIN
- WANG LEI
- WEI YAXING
- ZHAO BIN
- CHEN YUJIA
Assignees
- 内蒙古电力(集团)有限责任公司数字研究分公司 (Inner Mongolia Power (Group) Co., Ltd., Digital Research Branch)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-20
Claims (10)
- 1. A lightweight scene text recognition method for low-quality images, characterized by comprising the following steps: step 1, calculating a quality perception domain distance of each sample image in an acquired training data set, and dividing the training data set into a plurality of batch data sets according to the quality perception domain distance; step 2, constructing a lightweight scene text recognition model, wherein the lightweight scene text recognition model comprises a lightweight feature extraction backbone network based on LS convolution and a sequence decoding module connected in series, the lightweight feature extraction backbone network comprises a local perception module, an LS module, a multi-head attention module and a convolution module sequentially connected in series, the local perception module extracts feature tensors from an image, the LS module generates detail features from the feature tensors based on a cross-scale context modeling mechanism, the multi-head attention module captures global long-range dependencies from the detail features to generate a final feature representation, and the convolution module generates a visual feature sequence from the final feature representation; step 3, training the lightweight scene text recognition model on the plurality of batch data sets with a hierarchical training strategy to obtain a trained lightweight scene text recognition model; and step 4, inputting an acquired image to be recognized into the trained lightweight scene text recognition model to generate a recognition result.
- 2. The lightweight scene text recognition method for low-quality images according to claim 1, wherein the specific process of processing the sample images and dividing the training data set into a plurality of batch data sets in step 1 is as follows: step 11, performing multi-index weighted fusion over a plurality of core perceptual dimensions of each sample image to calculate a quality score; step 12, calculating a penalty coefficient from the quality score; step 13, modulating and recalibrating, according to the penalty coefficient, the original distance of each sample image in the training data set computed by a harmonic domain-difference estimator, to obtain the quality perception domain distance; step 14, sorting all sample images in the training data set in ascending order of quality perception domain distance, and uniformly dividing the sorted training data set into N batch data sets, where N is a positive integer greater than or equal to 2.
- 3. The lightweight scene text recognition method for low-quality images according to claim 1, wherein the local perception module comprises a depthwise convolution layer, a channel attention layer and a feedforward network layer sequentially connected in series; the depthwise convolution layer extracts local spatial features from the image, the channel attention layer screens the local spatial features through its channel attention mechanism and adaptively purifies them to obtain purified features, and the feedforward network layer performs nonlinear fusion and enhancement of the purified features along the channel dimension and outputs the feature tensors.
- 4. The lightweight scene text recognition method for low-quality images according to claim 1, wherein the LS module comprises 2 sets of depthwise separable convolution layers and 4 sets of LS layers, with the first set of depthwise separable convolution layers, the first set of LS layers, the second set of depthwise separable convolution layers and the remaining three sets of LS layers sequentially connected in series.
- 5. The lightweight scene text recognition method for low-quality images according to claim 4, wherein the depthwise separable convolution layer comprises a depthwise convolution layer and a pointwise convolution layer stacked in sequence; the depthwise convolution layer performs spatial feature extraction on each input channel independently while halving the height of the feature map to obtain separated features, and the pointwise convolution layer fuses and integrates the separated features by linear combination across channels to obtain depth features.
- 6. The lightweight scene text recognition method for low-quality images according to claim 4, wherein the LS layer comprises a depthwise convolution layer, a channel attention layer, a first feedforward network layer, an LS convolution layer and a second feedforward network layer stacked in sequence; the depthwise convolution layer builds and strengthens local spatial features to obtain a basic feature representation; the channel attention layer adaptively enhances the basic feature representation through channel-level recalibration to obtain initial enhanced features; the first feedforward network layer performs a nonlinear transformation of the initial enhanced features along the channel dimension and obtains transformed features through a residual connection; the large-kernel perception module of the LS convolution layer provides character-level context cues to generate spatial relation weights for the transformed features, and the small-kernel aggregation module dynamically optimizes the local feature representation by reshaping the spatial relation weights into grouped convolution kernels and applying adaptive dynamic convolution to the transformed features within small local neighborhoods according to the grouped convolution kernels to generate convolution features; the second feedforward network layer performs a nonlinear transformation of the convolution features along the channel dimension to obtain the enhanced features.
- 7. The lightweight scene text recognition method for low-quality images is characterized in that the multi-head attention module comprises a depthwise separable convolution layer, a depthwise convolution layer, a channel attention layer, a first feedforward network layer, a multi-head self-attention layer and a second feedforward network layer sequentially connected in series; spatial downsampling and channel expansion of the feature tensors are performed by the depthwise separable convolution layer to construct a feature pyramid; the features of the input feature pyramid are preprocessed by the depthwise convolution layer and the channel attention layer to obtain enhanced features; the first feedforward network layer performs a nonlinear transformation of the enhanced features along the channel dimension to obtain transformed enhanced features; the multi-head self-attention layer comprehensively captures long-range interactions among all positions of the transformed enhanced features using a multi-head self-attention mechanism to obtain global interaction features; and the second feedforward network layer integrates and nonlinearly enhances the global interaction features to output the final feature representation.
- 8. The lightweight scene text recognition method for low-quality images according to claim 1, wherein the convolution module uses a convolution layer to perform a convolution operation on the final feature representation output by the multi-head attention module to obtain the visual feature sequence.
- 9. The lightweight scene text recognition method for low-quality images according to claim 1, wherein the sequence decoding module decodes the visual feature sequence and maps it to a text output to obtain the recognition result.
- 10. The lightweight scene text recognition method for low-quality images according to claim 2, wherein the hierarchical training strategy in step 3 trains the lightweight scene text recognition model in a sequential, progressive manner over the partitioned batch data sets.
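Steps 11-14 of claim 2 (quality scoring, penalty calculation, distance recalibration, and batch partitioning) can be sketched in Python. The fusion weights, the linear penalty form `1 + alpha * (1 - q)`, and the multiplicative recalibration are illustrative assumptions only; the claims do not disclose the actual formulas or the harmonic domain-difference estimator, so a precomputed raw distance is taken as input here.

```python
import numpy as np

def quality_score(sharpness, contrast, noise_level, weights=(0.4, 0.3, 0.3)):
    """Multi-index weighted fusion of core perceptual dimensions (weights are illustrative)."""
    w1, w2, w3 = weights
    # Higher sharpness/contrast raise the score; higher noise lowers it.
    return w1 * sharpness + w2 * contrast + w3 * (1.0 - noise_level)

def penalty_coefficient(q, alpha=2.0):
    """Map a quality score in [0, 1] to a penalty that grows as quality drops (assumed linear form)."""
    return 1.0 + alpha * (1.0 - q)

def partition_into_batches(raw_distances, qualities, n_batches=3):
    """Recalibrate raw domain distances by the quality penalty, sort ascending,
    and uniformly split the sample indices into n_batches batch data sets."""
    q = np.asarray(qualities, dtype=float)
    d = np.asarray(raw_distances, dtype=float)
    calibrated = d * penalty_coefficient(q)   # quality perception domain distance
    order = np.argsort(calibrated)            # ascending: "easy" samples first
    return np.array_split(order, n_batches)   # index arrays, one per batch data set
```

The multiplicative penalty pushes low-quality samples toward later batches even when their raw domain distance is moderate, which is consistent with the easy-to-hard ordering the hierarchical training strategy relies on.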
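The depthwise separable convolution of claim 5 (a depthwise layer that halves the feature-map height, followed by a pointwise layer that linearly combines channels) can be illustrated with a minimal NumPy sketch. The 3x3 kernel size, same-padding, and stride-(2, 1) height halving are assumptions; a real model would use an optimized library convolution rather than these explicit loops.

```python
import numpy as np

def depthwise_conv(x, dw_kernels, stride_h=2):
    """x: (C, H, W); dw_kernels: (C, k, k). Each channel is filtered
    independently; stride_h=2 halves the feature-map height, width is kept."""
    C, H, W = x.shape
    k = dw_kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out_h, out_w = H // stride_h, W
    out = np.zeros((C, out_h, out_w))
    for c in range(C):
        for i in range(out_h):
            for j in range(out_w):
                patch = xp[c, i * stride_h:i * stride_h + k, j:j + k]
                out[c, i, j] = np.sum(patch * dw_kernels[c])
    return out

def pointwise_conv(x, pw_weights):
    """pw_weights: (C_out, C_in). 1x1 convolution: a linear combination
    across channels that fuses the separated features into depth features."""
    return np.einsum('oc,chw->ohw', pw_weights, x)
```

Splitting the spatial filtering (per channel) from the channel mixing (1x1) is what makes the layer "lightweight": it replaces one dense C_out x C_in x k x k kernel with C_in k x k kernels plus a C_out x C_in matrix.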
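The hierarchical training strategy of claim 10 (sequential, progressive training over the partitioned batch data sets) reduces to a simple driver loop. In this sketch `train_one_stage` is a hypothetical caller-supplied update step, since the patent does not disclose the per-stage optimization details; the only structural commitment is that the same model is carried from one batch data set to the next.

```python
def hierarchical_training(model, batch_datasets, train_one_stage):
    """Sequential progressive training: fine-tune the same model on each
    batch data set in order (lowest quality perception domain distance
    first), so earlier, cleaner batches initialise the weights used for
    the harder ones."""
    history = []
    for stage, dataset in enumerate(batch_datasets):
        model = train_one_stage(model, dataset)  # caller-supplied update
        history.append((stage, len(dataset)))    # record stage and batch size
    return model, history
```

Because each stage starts from the previous stage's weights rather than from scratch, the loop implements the easy-to-hard curriculum implied by the ascending sort in claim 2.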
Description
Lightweight scene text recognition method for low-quality images
Technical Field
The invention relates to the technical field of scene text recognition, and in particular to a lightweight scene text recognition method for low-quality images.
Background
Scene text recognition (Scene Text Recognition, STR) in the real world is often subject to severe degradation caused by complex ambient lighting, variable text-carrier geometry, and unstable imaging processes. In surveillance video, text is hard to identify because of low resolution, compression artifacts, and motion blur; vehicle-mounted cameras, limited by factors such as vehicle speed, illumination, and shake, struggle to capture clear characters stably; and in unmanned retail, warehouse logistics, and mobile-terminal applications, hardware constraints further aggravate the loss of image quality. Together these factors frequently cause blurred text edges, character adhesion, missing strokes, and similar problems, seriously affecting the practicality and robustness of scene text recognition models. Existing methods for scene text recognition on low-quality images have developed mainly along three technical routes, each with advantages and limitations in specific application scenarios.
Among methods based on vision-language large models, the TrOCR, MaskOCR, and DTrOCR series enhance robustness to low-quality images by introducing language-modeling capability, but their parameter scale is huge and they depend on a large amount of pre-training data, making deployment costly; CLIP4STR and CLIP-LLaMA exploit cross-modal alignment, yet their computational intensity still falls short of the real-time requirements of edge devices. In the lightweight-model direction, KD-LTR and CCFPlus use knowledge distillation to strengthen a student network's adaptation to degraded features, but recognition performance still drops sharply under extremely low-quality input, while FLCL, which enhances capability through contrastive learning, handles combined multi-type degradation insufficiently. Overall, existing approaches have not achieved an effective balance among efficiency, robustness, and deployment cost. How to improve the efficiency and robustness of scene text recognition on low-quality images is therefore a problem that those skilled in the art need to solve.
Disclosure of Invention
In view of the above problems, the present invention provides a lightweight scene text recognition method for low-quality images that overcomes, or at least partially solves, the above problems. It integrates lightweight feature extraction, quality-aware domain adaptation, and progressive hierarchical fine-tuning, and aims to jointly resolve the difficulty of reconciling efficiency, robustness, and generalization capability in scene text recognition on low-quality images.
In order to achieve the above purpose, the present invention adopts the following technical scheme. In a first aspect, an embodiment of the present invention provides a lightweight scene text recognition method for low-quality images, including the steps of: step 1, processing the sample images in an acquired training data set with a quality perception module, calculating the quality perception domain distance of each sample image, and dividing the training data set into a plurality of batch data sets according to the quality perception domain distance; step 2, constructing a lightweight scene text recognition model, wherein the lightweight scene text recognition model comprises a lightweight feature extraction backbone network based on LS convolution and a sequence decoding module connected in series, the lightweight feature extraction backbone network comprises a local perception module, an LS module, a multi-head attention module and a convolution module sequentially connected in series, the local perception module extracts feature tensors from an image, the LS module generates detail features from the feature tensors based on a cross-scale context modeling mechanism, the multi-head attention module captures global long-range dependencies from the detail features to generate a final feature representation, and the convolution module generates a visual feature sequence from the final feature representation; step 3, training the lightweight scene text recognition model on the plurality of batch data sets with a hierarchical training strategy to obtain a trained lightweight scene text recognition model; and step 4, inputting an acquired image to be recognized into the trained lightweight scene text recognition model to generate a recognition result. Preferably, step 1 processes the sample image by using a quality perception module