CN-116453139-B - Pre-training method and related method and equipment

Abstract

The invention provides a pre-training method and related methods and equipment. The pre-training method comprises: obtaining a training data set; performing text detection and recognition on the training document pictures in the training data set to obtain text detection boxes and the text corresponding to each text detection box; taking a training document picture as the input of a pre-training model and obtaining the features of the text detection boxes based on the pre-training model; obtaining the semantic features of the text corresponding to each text detection box based on a text encoder; updating the parameters of the pre-training model with the objective of making the features of each text detection box approach the semantic features of its corresponding text; and taking the pre-training model obtained through training as the target pre-training model. Because the target pre-training model takes the document picture as input, it can mine text semantic information from the picture itself and output features rich in text semantics without any text input. Consequently, when the target pre-training model is applied to a downstream task, no text recognition is required, and the overhead of an OCR engine is saved.
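
The training objective in the abstract can be illustrated with a toy example. The sketch below is a minimal illustration under stated assumptions, not the patented implementation: the pre-training model is reduced to a hypothetical element-wise scaling (`box_feature`), the text encoder's output is a made-up fixed vector, and the loss is the mean square error that claim 6 names. Only the stand-in model's weights are updated; the text-side target stays fixed.

```python
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean square error between two equal-length feature vectors."""
    if len(pred) != len(target):
        raise ValueError("feature vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def box_feature(weights: List[float], visual_input: List[float]) -> List[float]:
    """Hypothetical stand-in for the pre-training model: element-wise scaling."""
    return [w * x for w, x in zip(weights, visual_input)]


def train_step(weights: List[float], visual_input: List[float],
               text_feature: List[float], lr: float = 0.1) -> List[float]:
    """One gradient-descent update of the stand-in model's weights only."""
    pred = box_feature(weights, visual_input)
    n = len(pred)
    # d(MSE)/d(w_i) = 2 * (w_i * x_i - t_i) * x_i / n
    grads = [2.0 * (p - t) * x / n
             for p, t, x in zip(pred, text_feature, visual_input)]
    return [w - lr * g for w, g in zip(weights, grads)]


# Made-up values for a single text detection box (assumptions, not patent data).
visual_input = [1.0, 2.0, -1.0]   # visual evidence inside the box
text_feature = [0.5, -1.0, 2.0]   # semantic feature from the fixed text encoder
weights = [0.0, 0.0, 0.0]

initial_loss = mse(box_feature(weights, visual_input), text_feature)
for _ in range(200):
    weights = train_step(weights, visual_input, text_feature)
final_loss = mse(box_feature(weights, visual_input), text_feature)
```

After training, the box feature is computed from the visual input alone, which is why no recognized text is needed at inference time.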

Inventors

ZHANG ZHENRONG
ZHANG JIANSHU

Assignees

科大讯飞股份有限公司

Dates

Publication Date: 20260505
Application Date: 20230419

Claims (13)

1. A pre-training method, comprising: acquiring a training data set, wherein the training data set comprises a plurality of training document pictures; performing text detection and recognition on the training document pictures in the training data set to obtain text detection boxes and the text corresponding to each text detection box; taking a training document picture as the input of a pre-training model and acquiring the features of the text detection boxes based on the pre-training model, and taking the text corresponding to each text detection box as the input of a text encoder and acquiring the semantic features of that text based on the text encoder, wherein the pre-training model comprises a visual encoder, and the text encoder is obtained in advance by training on a plurality of training texts; and updating the parameters of the pre-training model with the objective of making the features of each text detection box approach the semantic features of the text corresponding to that text detection box, and taking the pre-training model obtained through training as a target pre-training model.
2. The pre-training method according to claim 1, wherein taking the training document picture as the input of the pre-training model and acquiring the features of the text detection boxes based on the pre-training model comprises: encoding the training document picture with the visual encoder of the pre-training model to obtain the features of the training document picture; and acquiring the features of each text detection box based on that text detection box and the features of the training document picture.
3. The pre-training method according to claim 1, wherein the pre-training model further comprises a first feature processing module and a second feature processing module; and taking the training document picture as the input of the pre-training model and acquiring the features of the text detection boxes based on the pre-training model comprises: encoding the training document picture with the visual encoder of the pre-training model to obtain the features of the training document picture; acquiring a first feature of each text detection box based on that text detection box and the features of the training document picture; processing the first feature of each text detection box with the first feature processing module of the pre-training model to obtain a second feature of each text detection box, wherein the second feature of a text detection box contains sequential information about the text in that text detection box; and processing the second feature of each text detection box with the second feature processing module of the pre-training model to obtain a third feature of each text detection box as the final feature of that text detection box, wherein the third feature of a text detection box contains dependency information between that text detection box and each of the text detection boxes.
4. The pre-training method according to claim 3, wherein processing the second feature of each text detection box with the second feature processing module of the pre-training model to obtain the third feature of each text detection box comprises, for each text detection box: determining, based on the second feature processing module of the pre-training model, the relevance weight between that text detection box and each of the text detection boxes, so as to obtain a relevance weight corresponding to each text detection box; and computing a weighted sum of the second features of the text detection boxes according to their corresponding relevance weights to obtain the third feature of that text detection box.
5. The pre-training method according to any one of claims 1 to 4, wherein updating the parameters of the pre-training model with the objective of making the features of each text detection box approach the semantic features of the text corresponding to that text detection box comprises: for each text detection box, determining a feature prediction loss of the pre-training model on that text detection box based on the features of the text detection box and the semantic features of its corresponding text; and updating the parameters of the pre-training model according to the feature prediction loss of the pre-training model on each text detection box.
6. The pre-training method according to claim 5, wherein determining the feature prediction loss of the pre-training model on the text detection box based on the features of the text detection box and the semantic features of its corresponding text comprises: calculating the mean square error between the features of the text detection box and the semantic features of its corresponding text, and taking the mean square error as the feature prediction loss of the pre-training model on that text detection box; and wherein updating the parameters of the pre-training model according to the feature prediction loss of the pre-training model on each text detection box comprises: fusing the feature prediction losses of the pre-training model on the individual text detection boxes to obtain a fused loss; and updating the parameters of the pre-training model according to the fused loss.
7. An information prediction model acquisition method, comprising: constructing an initial information prediction model based on a target pre-training model and a prediction module for a specified task, wherein the target pre-training model is obtained by training with the pre-training method according to any one of claims 1 to 6; and fine-tuning the initial information prediction model using training document pictures that carry annotation data for the specified task, to obtain an information prediction model for the specified task.
8. The information prediction model acquisition method according to claim 7, wherein the specified task is a detection task for a document picture; constructing the initial information prediction model based on the target pre-training model and the prediction module for the specified task comprises: constructing an initial detection model based on the visual encoder in the target pre-training model and a prediction module for the detection task; and fine-tuning the initial information prediction model using training document pictures that carry annotation data for the specified task comprises: fine-tuning the initial detection model using training document pictures that carry annotation data for the detection task.
9. The information prediction model acquisition method according to claim 7, wherein the specified task is a classification task for regions of interest in a document picture; the target pre-training model comprises a first feature processing module and a second feature processing module; constructing the initial information prediction model based on the target pre-training model and the prediction module for the specified task comprises: constructing an initial region-of-interest classification model based on the visual encoder, the first feature processing module and the second feature processing module in the target pre-training model, and a prediction module for the classification task; and fine-tuning the initial information prediction model using training document pictures that carry annotation data for the specified task comprises: fine-tuning the initial region-of-interest classification model using training document pictures annotated with regions of interest and the category of each region of interest.
10. An information prediction method, comprising: acquiring a target document picture for a specified task; and processing the target document picture based on an information prediction model obtained with the information prediction model acquisition method according to any one of claims 7 to 9, to obtain an information prediction result for the target document picture on the specified task.
11. A pre-training device, comprising a training data acquisition module and a model training module; the training data acquisition module is configured to acquire a training data set, wherein the training data set comprises a plurality of training document pictures; the model training module is configured to: perform text detection and recognition on the training document pictures in the training data set to obtain text detection boxes and the text corresponding to each text detection box; take a training document picture as the input of a pre-training model and acquire the features of the text detection boxes based on the pre-training model; take the text corresponding to each text detection box as the input of a text encoder and acquire the semantic features of that text based on the text encoder; update the parameters of the pre-training model with the objective of making the features of each text detection box approach the semantic features of the text corresponding to that text detection box; and take the pre-training model obtained through training as a target pre-training model, wherein the pre-training model comprises a visual encoder, and the text encoder is obtained in advance by training on a plurality of training texts.
12. A processing device, comprising a memory and a processor; the memory is configured to store a program; and the processor is configured to execute the program to implement the steps of the pre-training method according to any one of claims 1 to 6.
13. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the pre-training method according to any one of claims 1 to 6.
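
The feature path in claims 2 and 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not say how a box feature is read from the picture features or how relevance weights are computed, so the code assumes average pooling over the box region and a softmax over dot products (a self-attention-style choice). The first feature processing module of claim 3 is omitted, so the pooled features stand in for the second features. All names (`pool_box`, `relevance_weighted_sum`) and the grid layout are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]
Box = Tuple[int, int, int, int]       # (row0, col0, row1, col1), end-exclusive


def pool_box(feature_map: FeatureMap, box: Box) -> List[float]:
    """Average the feature-map cells covered by one text detection box."""
    r0, c0, r1, c1 = box
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    if not cells:
        raise ValueError("box covers no feature-map cells")
    channels = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax, so the weights are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_weighted_sum(box_feats: List[List[float]]) -> List[List[float]]:
    """For each box, weight every box's feature by its relevance and sum."""
    out = []
    for query in box_feats:
        # Assumed relevance score: dot product between the two box features.
        scores = [sum(q * k for q, k in zip(query, key)) for key in box_feats]
        weights = softmax(scores)
        dim = len(query)
        out.append([sum(w * f[k] for w, f in zip(weights, box_feats))
                    for k in range(dim)])
    return out


# A made-up 2x4 feature map with 2 channels, and two boxes splitting it in half.
feature_map = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
]
boxes = [(0, 0, 2, 2), (0, 2, 2, 4)]
box_feats = [pool_box(feature_map, b) for b in boxes]
mixed = relevance_weighted_sum(box_feats)
```

Each row of `mixed` blends information from every box, which is one way a box's final feature could carry the dependency information that claim 3 describes.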

Description

Pre-training method and related method and equipment

Technical Field

The invention relates to the technical field of unsupervised learning, and in particular to a pre-training method and related methods and equipment.

Background

An information prediction model that includes an encoding part is generally obtained by constructing an initial information prediction model from a target pre-training model and a prediction module, and then fine-tuning the initial information prediction model with training data for a specified task, which yields the information prediction model for that task. The target pre-training model is obtained by training an initial pre-training model on unsupervised data.

Document intelligence technology is widely applied in industries such as finance, insurance, energy, logistics and medical care. At present, for tasks on document pictures, such as text line detection, document classification and region-of-interest classification, the target pre-training model is generally obtained as follows: first, text recognition is performed on the training document pictures with an OCR (Optical Character Recognition) engine; then, part of the recognized text is masked; finally, the initial pre-training model is trained on the masked text with a text prediction task (predicting the masked content), which yields the target pre-training model. A target pre-training model obtained in this way can then be applied to downstream tasks (e.g., text line detection, document classification).
It will be appreciated that when a target pre-training model obtained in this way is applied to a downstream task, text recognition by the OCR engine is still required. Using such a model downstream therefore necessarily incurs OCR engine overhead, even though some downstream tasks, such as document detection and document classification, do not need the text content of the document picture at all.

Disclosure of Invention

In view of the above, the present invention provides a pre-training method and related methods and equipment to solve the problem that applying a target pre-training model obtained by an existing pre-training method to a downstream task incurs OCR engine overhead. The technical scheme is as follows:

A pre-training method, comprising: acquiring a training data set, wherein the training data set comprises a plurality of training document pictures; performing text detection and recognition on the training document pictures in the training data set to obtain text detection boxes and the text corresponding to each text detection box; taking a training document picture as the input of a pre-training model and acquiring the features of the text detection boxes based on the pre-training model, and taking the text corresponding to each text detection box as the input of a text encoder and acquiring the semantic features of that text based on the text encoder, wherein the pre-training model comprises a visual encoder, and the text encoder is obtained in advance by training on a plurality of training texts; and updating the parameters of the pre-training model with the objective of making the features of each text detection box approach the semantic features of the text corresponding to that text detection box, and taking the pre-training model obtained through training as a target pre-training model.

Optionally, taking the training document picture as the input of the pre-training model and acquiring the features of the text detection boxes based on the pre-training model comprises: encoding the training document picture with the visual encoder of the pre-training model to obtain the features of the training document picture; and acquiring the features of each text detection box based on that text detection box and the features of the training document picture.

Optionally, the pre-training model further comprises a first feature processing module and a second feature processing module; taking the training document picture as the input of the pre-training model and acquiring the features of the text detection boxes based on the pre-training model comprises: encoding the training document picture with the visual encoder of the pre-training model to obtain the features of the training document picture; acquiring a first feature of each text detection box based on that text detection box and the features of the training document picture; processing the first feature of each text detection box with the first feature processing module of the pre-training model to obtain