CN-121999499-A - Text recognition method, apparatus, device, storage medium and computer program product


Abstract

The application provides a text recognition method, apparatus, device, storage medium and computer program product. The text recognition method comprises: processing a picture to be recognized to obtain an input picture of a target size; processing the input picture using a U-Net model having a U-shaped network structure to obtain an output picture of the target size, wherein the output picture is a text picture meeting a first condition, and the first condition comprises at least one of using a target font, using a target font size, and using a black-text-on-white-background format; and processing the output picture using an encoder-decoder model to obtain text content corresponding to the picture to be recognized. The application helps address the problem that a whole model cannot be reused across scenes, and the problem that text recognition schemes have low recognition efficiency and accuracy.
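The three steps summarized in the abstract can be read as a simple function composition. The following is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `recognize_text`, the three callables, and the list-of-rows grayscale image type are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Image = List[List[int]]  # assumed representation: grayscale rows of 0-255 values


def recognize_text(
    picture: Image,
    preprocess: Callable[[Image], Image],
    normalize: Callable[[Image], Image],
    decode: Callable[[Image], str],
) -> str:
    """Run the three stages in order and return the recognized text."""
    # Stage 1: bring the picture to be recognized to the target size.
    input_picture = preprocess(picture)
    # Stage 2: stand-in for the U-Net step that renders a normalized text picture.
    output_picture = normalize(input_picture)
    # The method requires the output picture to keep the target size.
    in_shape = (len(input_picture), len(input_picture[0]))
    out_shape = (len(output_picture), len(output_picture[0]))
    if in_shape != out_shape:
        raise ValueError("stage 2 must preserve the target size")
    # Stage 3: stand-in for the encoder-decoder model that emits the text.
    return decode(output_picture)
```

Keeping the stages as separate callables mirrors the reuse argument of the application: the first two stages can stay fixed while only the last one is swapped or adjusted for a given scene.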

Inventors

Yang Xiagan

Assignees

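China Mobile Communication Co., Ltd. Research Institute (中国移动通信有限公司研究院)
China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)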

Dates

Publication Date: 20260508
Application Date: 20241105

Claims (18)

1. A text recognition method, comprising: processing a picture to be recognized to obtain an input picture of a target size; processing the input picture using a U-Net model having a U-shaped network structure to obtain an output picture of the target size, wherein the output picture is a text picture meeting a first condition, and the first condition comprises at least one of: using a target font, using a target font size, and using a black-text-on-white-background format; and processing the output picture using an encoder-decoder model to obtain text content corresponding to the picture to be recognized.
2. The text recognition method according to claim 1, wherein processing the picture to be recognized to obtain the input picture of the target size comprises: performing, according to the target size, a first processing operation on the picture to be recognized to obtain the input picture of the target size; wherein the first processing operation comprises at least one of: size conversion; and short-side padding.
3. The text recognition method according to claim 1, wherein processing the input picture using the U-Net model having the U-shaped network structure to obtain the output picture of the target size comprises: performing image feature extraction and image restoration on the input picture using the U-Net model to obtain an intermediate picture meeting the first condition; and performing a second processing operation on the intermediate picture to obtain the output picture of the target size; wherein the second processing operation comprises at least one of: white-margin removal; size conversion; and short-side padding.
4. The text recognition method according to claim 3, wherein the U-Net model comprises a first encoder structure and a first decoder structure, the first encoder structure comprising a convolutional layer and a pooling layer, and the first decoder structure comprising an upsampling layer and a convolutional layer; and performing image feature extraction and image restoration on the input picture using the U-Net model to obtain the intermediate picture meeting the first condition comprises: extracting image features of the input picture using the first encoder structure to obtain a feature map; and performing image restoration on the feature map using the first decoder structure to obtain the intermediate picture meeting the first condition.
5. The text recognition method according to claim 1 or 2, further comprising: performing a third processing operation on a training picture to obtain a training input picture; and training the U-Net model using the training input picture; wherein the third processing operation comprises at least one of: performing at least one of size conversion and short-side padding according to the target size; and performing a data augmentation operation comprising at least one of adding color jitter, adding noise, and applying blur.
6. The text recognition method according to claim 1, further comprising, before processing the output picture using the encoder-decoder model to obtain the text content corresponding to the picture to be recognized: fine-tuning the encoder-decoder model according to current application scene information to obtain an adjusted encoder-decoder model; wherein processing the output picture using the encoder-decoder model to obtain the text content corresponding to the picture to be recognized comprises: processing the output picture using the adjusted encoder-decoder model to obtain the text content corresponding to the picture to be recognized.
7. The text recognition method according to claim 1 or 6, wherein the encoder-decoder model is a Transformer model based on a Vision Transformer and decoder structure.
8. A text recognition device, comprising: a first processing module configured to process a picture to be recognized to obtain an input picture of a target size; a second processing module configured to process the input picture using a U-Net model having a U-shaped network structure to obtain an output picture of the target size, wherein the output picture is a text picture meeting a first condition, and the first condition comprises at least one of: using a target font, using a target font size, and using a black-text-on-white-background format; and a third processing module configured to process the output picture using an encoder-decoder model to obtain text content corresponding to the picture to be recognized.
9. The text recognition device according to claim 8, wherein processing the picture to be recognized to obtain the input picture of the target size comprises: performing, according to the target size, a first processing operation on the picture to be recognized to obtain the input picture of the target size; wherein the first processing operation comprises at least one of: size conversion; and short-side padding.
10. The text recognition device according to claim 8, wherein processing the input picture using the U-Net model having the U-shaped network structure to obtain the output picture of the target size comprises: performing image feature extraction and image restoration on the input picture using the U-Net model to obtain an intermediate picture meeting the first condition; and performing a second processing operation on the intermediate picture to obtain the output picture of the target size; wherein the second processing operation comprises at least one of: white-margin removal; size conversion; and short-side padding.
11. The text recognition device according to claim 10, wherein the U-Net model comprises a first encoder structure and a first decoder structure, the first encoder structure comprising a convolutional layer and a pooling layer, and the first decoder structure comprising an upsampling layer and a convolutional layer; and performing image feature extraction and image restoration on the input picture using the U-Net model to obtain the intermediate picture meeting the first condition comprises: extracting image features of the input picture using the first encoder structure to obtain a feature map; and performing image restoration on the feature map using the first decoder structure to obtain the intermediate picture meeting the first condition.
12. The text recognition device according to claim 8 or 9, further comprising: a first execution module configured to perform a third processing operation on a training picture to obtain a training input picture; and a first training module configured to train the U-Net model using the training input picture; wherein the third processing operation comprises at least one of: performing at least one of size conversion and short-side padding according to the target size; and performing a data augmentation operation comprising at least one of adding color jitter, adding noise, and applying blur.
13. The text recognition device according to claim 8, further comprising: a fourth processing module configured to fine-tune the encoder-decoder model according to current application scene information, before the output picture is processed using the encoder-decoder model to obtain the text content corresponding to the picture to be recognized, so as to obtain an adjusted encoder-decoder model; wherein processing the output picture using the encoder-decoder model to obtain the text content corresponding to the picture to be recognized comprises: processing the output picture using the adjusted encoder-decoder model to obtain the text content corresponding to the picture to be recognized.
14. The text recognition device according to claim 8 or 13, wherein the encoder-decoder model is a Transformer model based on a Vision Transformer and decoder structure.
15. A text recognition apparatus, comprising a processor, wherein the processor is configured to: process a picture to be recognized to obtain an input picture of a target size; process the input picture using a U-Net model having a U-shaped network structure to obtain an output picture of the target size, wherein the output picture is a text picture meeting a first condition, and the first condition comprises at least one of: using a target font, using a target font size, and using a black-text-on-white-background format; and process the output picture using an encoder-decoder model to obtain text content corresponding to the picture to be recognized.
16. A text recognition device comprising a memory, a processor and a program stored on the memory and executable on the processor, characterized in that the processor implements the text recognition method according to any one of claims 1 to 7 when executing the program.
17. A readable storage medium, having stored thereon a program, which when executed by a processor, implements the steps of the text recognition method according to any one of claims 1 to 7.
18. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the text recognition method of any of claims 1 to 7.
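
The data augmentation (enhancement) operation recited in claims 5 and 12 names three perturbations: color jitter, added noise, and blur. The sketch below illustrates one way such a step might look on a small grayscale image. It rests on several assumptions that do not come from the application: images are lists of rows of 0-255 values, a single random brightness shift stands in for color jitter, noise is Gaussian, blur is a 3x3 mean filter, and each operation is applied with probability 0.5. All function names are hypothetical.

```python
import random
from typing import List

Image = List[List[int]]  # assumed representation: grayscale rows of 0-255 values


def clamp(value: float) -> int:
    """Round to the nearest integer and keep the result inside 0..255."""
    return max(0, min(255, int(round(value))))


def jitter(img: Image, rng: random.Random, max_shift: int = 30) -> Image:
    """Add one random brightness offset to every pixel (grayscale stand-in for color jitter)."""
    shift = rng.randint(-max_shift, max_shift)
    return [[clamp(p + shift) for p in row] for row in img]


def add_noise(img: Image, rng: random.Random, sigma: float = 10.0) -> Image:
    """Add independent Gaussian noise to each pixel."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def box_blur(img: Image) -> Image:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(clamp(sum(window) / len(window)))
        out.append(row)
    return out


def augment(img: Image, rng: random.Random) -> Image:
    """Apply each perturbation with probability 0.5 (an assumed policy)."""
    for op in (jitter, add_noise):
        if rng.random() < 0.5:
            img = op(img, rng)
    if rng.random() < 0.5:
        img = box_blur(img)
    return img
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible for a given seed, which is convenient when comparing training runs.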

Description

Text recognition method, apparatus, device, storage medium and computer program product

Technical Field

The present application relates to the field of image processing, and in particular to a text recognition method, apparatus, device, storage medium, and computer program product.

Background

In the prior art, artificial-intelligence-based text recognition systems mainly rely on three steps, namely text detection, text recognition, and text relationship sorting, to acquire optical character recognition (OCR) information from a document. Text recognition is applied in a wide range of scenes, such as signpost detection and recognition, scene text translation, and book text recognition and extraction.

However, the recognition models used for text recognition in the prior art are trained separately for each scene. The whole model cannot be used directly across scenes, and the recognition efficiency and accuracy of the model are low. Text recognition schemes in the prior art therefore suffer from problems such as the whole model not being reusable across scenes and low recognition efficiency and accuracy.

Disclosure of Invention

The application aims to provide a text recognition method, apparatus, device, storage medium and computer program product, so as to solve the problems that, in prior-art text recognition schemes, the whole model cannot be reused across scenes and the recognition efficiency and accuracy are low.
In order to solve the above technical problems, an embodiment of the present application provides a text recognition method, including: processing a picture to be recognized to obtain an input picture of a target size; processing the input picture using a U-Net model having a U-shaped network structure to obtain an output picture of the target size, wherein the output picture is a text picture meeting a first condition, and the first condition comprises at least one of using a target font, using a target font size, and using a black-text-on-white-background format; and processing the output picture using an encoder-decoder model to obtain text content corresponding to the picture to be recognized.

Optionally, processing the picture to be recognized to obtain the input picture of the target size includes: performing, according to the target size, a first processing operation on the picture to be recognized to obtain the input picture of the target size; wherein the first processing operation comprises at least one of: size conversion; and short-side padding.

Optionally, processing the input picture using the U-Net model having the U-shaped network structure to obtain the output picture of the target size includes: performing image feature extraction and image restoration on the input picture using the U-Net model to obtain an intermediate picture meeting the first condition; and performing a second processing operation on the intermediate picture to obtain the output picture of the target size; wherein the second processing operation comprises at least one of: white-margin removal; size conversion; and short-side padding.
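
The first and second processing operations described above combine three elementary steps: size conversion, padding of the short side, and removal of white margins. The following minimal sketch shows one plausible reading of them. It is not taken from the application: the nearest-neighbour resize, the centered white padding, the darkness threshold of 250, the list-of-rows grayscale image type, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: grayscale, 0 = black, 255 = white
WHITE = 255


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Size conversion by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


def fit_to_target(img: Image, target: Tuple[int, int]) -> Image:
    """Scale to fit inside target (height, width), then pad the short side with white."""
    th, tw = target
    h, w = len(img), len(img[0])
    scale = min(th / h, tw / w)  # keep the aspect ratio
    new_h = max(1, min(th, round(h * scale)))
    new_w = max(1, min(tw, round(w * scale)))
    resized = resize_nearest(img, new_h, new_w)
    top, left = (th - new_h) // 2, (tw - new_w) // 2
    canvas = [[WHITE] * tw for _ in range(th)]
    for y, row in enumerate(resized):
        canvas[top + y][left:left + new_w] = row
    return canvas


def trim_white_margin(img: Image, threshold: int = 250) -> Image:
    """Crop to the bounding box of pixels darker than threshold; a blank image is returned as is."""
    rows = [y for y, row in enumerate(img) if any(p < threshold for p in row)]
    cols = [x for x in range(len(img[0])) if any(row[x] < threshold for row in img)]
    if not rows:
        return img
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]
```

Under these assumptions, the first processing operation corresponds to `fit_to_target(picture, target)`, and the second to `fit_to_target(trim_white_margin(intermediate), target)`, so both the input picture and the output picture end up at the same target size.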
Optionally, the U-Net model comprises a first encoder structure and a first decoder structure, wherein the first encoder structure comprises a convolutional layer and a pooling layer, and the first decoder structure comprises an upsampling layer and a convolutional layer; and performing image feature extraction and image restoration on the input picture using the U-Net model to obtain the intermediate picture meeting the first condition includes: extracting image features of the input picture using the first encoder structure to obtain a feature map; and performing image restoration on the feature map using the first decoder structure to obtain the intermediate picture meeting the first condition.

Optionally, the method further includes: performing a third processing operation on a training picture to obtain a training input picture; and training the U-Net model using the training input picture; wherein the third processing operation comprises at least one of: performing at least one of size conversion and short-side padding according to the target size; and performing a data augmentation operation comprising at least one of adding color jitter, adding noise, and applying blur.

Optionally, before processing the output picture using the encoder-decoder model to obtain the text content corresponding to the picture to be recognized, the method further includes: fine-tuning the encoder-decoder model according to current application scene information to obtain an adjusted encoder-decoder model; wherein processing the output picture using the encoder-decoder model to obtain the text content corresponding to the picture to be recognized includes: processing the output picture using the adjusted encoder-decoder model to ob