CN-121999503-A - Text recognition method, device and storage medium
Abstract
The application provides a text recognition method, a device, and a storage medium. The method comprises: obtaining a plurality of first text images; for each first text image, performing text recognition with a plurality of pre-trained first models to obtain a plurality of text recognition results; taking a first text image as an image to be processed when its text recognition results meet a preset condition, the preset condition including that the results contain differing text recognition results; synthesizing a plurality of pieces of first training data based on the text recognition results; training with a first training set, which comprises the plurality of pieces of first training data, to obtain a second model for text recognition; and performing text recognition on each image to be processed with the second model. The application can improve the accuracy of character recognition and the labeling efficiency of character data sets.
Inventors
- LEI XIAOKANG
- DONG BIN
- JIANG SHANSHAN
- DING LEI
Assignees
- Ricoh Co., Ltd. (株式会社理光)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-01
Claims (15)
- 1. A method of text recognition, comprising the steps of: S1, acquiring a plurality of first text images; S2, performing text recognition on each first text image with a plurality of pre-trained first models, respectively, to obtain a plurality of text recognition results, wherein when the plurality of text recognition results meet a preset condition, the first text image is taken as an image to be processed, the preset condition comprising that the plurality of text recognition results comprise different text recognition results; S3, synthesizing a plurality of pieces of first training data based on the text recognition results, and training with a first training set to obtain a second model for text recognition, wherein the first training set comprises the plurality of pieces of first training data; and S4, performing text recognition on each image to be processed with the second model.
- 2. The method as recited in claim 1, further comprising: taking the text recognition result as the final text recognition result of the first text image when the plurality of text recognition results are identical.
- 3. The method of claim 1, wherein the preset condition further comprises that the proportion of the first text recognition result, the result with the largest proportion among the plurality of text recognition results, does not exceed a preset proportion.
- 4. The method as recited in claim 3, further comprising: taking the first text recognition result as the final text recognition result of the first text image when the plurality of text recognition results do not meet the preset condition.
- 5. The method of claim 1, wherein synthesizing a plurality of pieces of first training data based on the plurality of text recognition results comprises: performing de-duplication processing on the text recognition results; taking the de-duplicated text recognition results as foreground texts and synthesizing the foreground texts with background images to obtain sample images; and taking the text recognition result synthesized into each sample image as the text labeling information of that sample image, to obtain the first training data.
- 6. The method of claim 1, wherein S4 comprises: performing text recognition on each image to be processed with the second model to obtain a second text recognition result and its confidence; taking the second text recognition result as the final text recognition result of the image to be processed when the confidence is greater than or equal to a first threshold; prompting that recognition of the image to be processed has failed when the confidence is less than or equal to a second threshold; and adding the image to be processed to a first image library when the confidence is less than the first threshold and greater than the second threshold.
- 7. The method as recited in claim 6, further comprising: performing steps S2 to S4 on the images in the first image library when the number of images in the first image library exceeds a preset number, to obtain final text recognition results for the images in the first image library.
- 8. The method as recited in claim 1, further comprising: generating text labeling data from all the obtained first text images and their final text recognition results.
- 9. The method as recited in claim 1, further comprising: acquiring a plurality of pieces of second training data, wherein the second training data are training data from a public data set for text recognition; taking pre-selected text samples as foreground texts and synthesizing the foreground texts with background images to obtain sample images, thereby obtaining a plurality of pieces of third training data; generating a plurality of second training sets based on the plurality of pieces of second training data and/or the plurality of pieces of third training data; and training with the plurality of second training sets to obtain a plurality of first models, wherein one first model is trained from each second training set.
- 10. The method according to claim 5 or 9, wherein, when synthesizing the foreground text with the background image to obtain the sample image, the method further comprises: performing enhancement processing on the sample image, wherein the enhancement processing comprises at least one of the following enhancement modes: setting a font of the foreground text, deforming the foreground text, deleting pixel points in the sample image, adding a preset pattern to the sample image, and adjusting the color of the background image.
- 11. The method as recited in claim 10, further comprising: receiving configuration information input by a user; and selecting, according to the configuration information, the synthesis processing mode adopted by the synthesis processing.
- 12. The method of claim 11, wherein the configuration information includes a business scenario and/or a text recognition accuracy, and wherein selecting the synthesis processing mode adopted by the synthesis processing according to the configuration information includes at least one of: selecting, according to the business scenario, images related to the business scenario as the background images; and determining, according to the text recognition accuracy, the number of background images adopted by the synthesis processing and/or the types of enhancement modes adopted by the enhancement processing.
- 13. A text recognition device, comprising: an information transceiving module, configured to acquire a plurality of first text images; a text recognition module, configured to perform text recognition on each first text image with a plurality of pre-trained first models to obtain a plurality of text recognition results, and to take the first text image as an image to be processed when the plurality of text recognition results meet a preset condition, the preset condition comprising that the plurality of text recognition results comprise different text recognition results; and a model generation module, configured to synthesize a plurality of pieces of first training data based on the text recognition results and to train with a first training set to obtain a second model for text recognition, wherein the first training set comprises the plurality of pieces of first training data; the text recognition module being further configured to perform text recognition on each image to be processed with the second model.
- 14. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
- 15. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 12.
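For illustration only (not part of the claims), the confidence-based routing of claims 6 and 7 can be sketched in Python. The `recognize` method, its `(text, confidence)` return shape, and the threshold values are assumptions introduced here, not features recited in the claims:

```python
def apply_second_model(images, second_model, t_high=0.9, t_low=0.5):
    """Route each image to be processed based on the second model's
    confidence (claim 6). t_high/t_low are illustrative thresholds
    standing in for the claimed first and second thresholds.

    Returns (finals, failed, library):
      finals  - images whose result is accepted (confidence >= t_high)
      failed  - images flagged as recognition failures (confidence <= t_low)
      library - images queued in the 'first image library' for another
                round of steps S2-S4 once enough accumulate (claim 7)
    """
    finals, failed, library = {}, [], []
    for img in images:
        text, conf = second_model.recognize(img)  # hypothetical API
        if conf >= t_high:
            finals[img] = text          # accept as final result
        elif conf <= t_low:
            failed.append(img)          # prompt: recognition failed
        else:
            library.append(img)         # t_low < conf < t_high
    return finals, failed, library
```

Under this sketch, the caller would re-run steps S2 to S4 on `library` whenever `len(library)` exceeds the preset number of claim 7.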
Description
Text recognition method, device and storage medium

Technical Field

The invention relates to the technical fields of deep learning, optical character recognition (OCR), and the like, and in particular to a text recognition method, a text recognition device, and a storage medium.

Background

Text recognition methods based on deep learning require massive amounts of labeled text data, and the quantity and quality of the labeled text directly affect the accuracy of text recognition. In general, the more text images are labeled and the more accurate the labels, the better the text recognition performance. The most common labeling approach at present is to predict a text image with a text recognition model and then manually correct the prediction results. Training a text recognition model requires a large amount of labeled text, while manual labeling is labor-intensive and inefficient; moreover, rarely used characters and special symbols are difficult to label manually (annotators may not recognize them and thus cannot label them), and visually similar characters are easily mislabeled. A text recognition method is therefore needed that automatically recognizes the text in text images and improves both the efficiency of text labeling and the accuracy of text recognition.

Disclosure of Invention

At least one embodiment of the application provides a text recognition method, a text recognition device, and a storage medium, which can improve the efficiency of text labeling and the accuracy of text recognition.
In order to solve the above technical problems, the application is realized as follows. In a first aspect, an embodiment of the present application provides a text recognition method, including: S1, acquiring a plurality of first text images; S2, performing text recognition on each first text image with a plurality of pre-trained first models, respectively, to obtain a plurality of text recognition results, wherein when the plurality of text recognition results meet a preset condition, the first text image is taken as an image to be processed, the preset condition comprising that the plurality of text recognition results comprise different text recognition results; S3, synthesizing a plurality of pieces of first training data based on the text recognition results, and training with a first training set to obtain a second model for text recognition, wherein the first training set comprises the plurality of pieces of first training data; and S4, performing text recognition on each image to be processed with the second model. Optionally, the method further comprises: taking the text recognition result as the final text recognition result of the first text image when the plurality of text recognition results are identical. Optionally, the preset condition further includes that the proportion of the first text recognition result, the result with the largest proportion among the plurality of text recognition results, does not exceed a preset proportion. Optionally, the method further comprises: taking the first text recognition result as the final text recognition result of the first text image when the plurality of text recognition results do not meet the preset condition.
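The routing logic of steps S1 and S2 and the preset condition can be illustrated with a minimal Python sketch. This is explanatory only: the model objects, their `recognize` method, and the `max_ratio` value are assumptions introduced for the example, not elements of the disclosure:

```python
from collections import Counter

def route_image(image, models, max_ratio=0.6):
    """Run several pre-trained recognizers (the 'first models') on one
    text image and decide how to handle it under the preset condition.

    Returns a (status, text_or_none) pair:
      ('final', text)      - all results are identical, or the most
                             frequent result's proportion exceeds
                             max_ratio, so it is accepted directly
      ('to_process', None) - results differ and no result is dominant,
                             so the image becomes an image to be
                             processed by the second model
    """
    results = [m.recognize(image) for m in models]  # hypothetical API
    counts = Counter(results)
    top_text, top_count = counts.most_common(1)[0]

    if len(counts) == 1:
        # All text recognition results are identical.
        return ('final', top_text)
    if top_count / len(results) > max_ratio:
        # The most frequent result exceeds the preset proportion.
        return ('final', top_text)
    # Preset condition met: differing results, none dominant.
    return ('to_process', None)
```

Images routed to `'to_process'` would then feed the training-data synthesis of step S3 and be re-recognized by the second model in step S4.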
Optionally, synthesizing a plurality of pieces of first training data based on the plurality of text recognition results includes: performing de-duplication processing on the text recognition results; taking the de-duplicated text recognition results as foreground texts and synthesizing the foreground texts with background images to obtain sample images; and taking the text recognition result synthesized into each sample image as the text labeling information of that sample image, to obtain the first training data. Optionally, S4 includes: performing text recognition on each image to be processed with the second model to obtain a second text recognition result and its confidence; taking the second text recognition result as the final text recognition result of the image to be processed when the confidence is greater than or equal to a first threshold; prompting that recognition of the image to be processed has failed when the confidence is less than or equal to a second threshold; and adding the image to be processed to a first image library when the confidence is less than the first threshold and greater than the second threshold. Optionally, the method further comprises: performing steps S2 to S4 on the images in the first image library when the number of images in the first image library exceeds a preset number, to obtain final text recognition results for the images in the first image library. Optionally, the method further comprises: generating text labeling data