CN-121997216-A - Document identification method, device, equipment and medium based on large language model

Abstract

The invention relates to the technical field of large language models and discloses a document identification method, device, equipment and medium based on a large language model. The method comprises: performing domain-term training on a large language model using corpus data to obtain a business-domain base model; training an adapter layer in the business-domain base model using historical electronic document data to obtain an enterprise business model; optimizing the enterprise business model using historical document anomaly type data to obtain an anomaly identification optimization model for a target enterprise; identifying, through the anomaly identification optimization model, a text-modality anomaly probability of a target business document; analyzing a structural-modality anomaly probability of the target business document; performing bimodal weighted fusion of the text-modality anomaly probability and the structural-modality anomaly probability to obtain a target anomaly probability of the target business document; and identifying the document type of the target business document according to the target anomaly probability. The invention can improve the accuracy of document identification.

Inventors

LIN ZHICHANG
HUANG HETAO
LIU YANG
FANG FANG

Assignees

招商局金融科技有限公司

Dates

Publication Date: 20260508
Application Date: 20251211

Claims (10)

1. A document identification method based on a large language model, comprising: collecting corpus data of a target business domain, and performing domain-term training on a preset large language model by using the corpus data to obtain a business-domain base model; collecting historical electronic document data and historical document anomaly type data of a target enterprise, and training an adapter layer in the business-domain base model by using the historical electronic document data to obtain an enterprise business model; converting positive and negative sample pairs in the historical document anomaly type data into feature representation vectors, calculating feature distances of homogeneous samples and of heterogeneous samples in the positive and negative sample pairs according to the feature representation vectors, and optimizing model parameters in the enterprise business model according to the feature distances to obtain an anomaly identification optimization model; analyzing, through the anomaly identification optimization model, semantic anomaly patterns and coding anomaly patterns among text fields in a preset target business document, and outputting, through an activation function in the anomaly identification optimization model, a text-modality anomaly probability of the target business document for the semantic anomaly patterns and the coding anomaly patterns; constructing a graph structure of the target business document according to document entities of the target business document and logical relationships between the document entities, performing node feature propagation and edge weight learning on the graph structure by using a preset graph neural network to obtain node anomaly patterns and edge anomaly patterns, and outputting, through a classifier in the graph neural network, a structural-modality anomaly probability of the target business document for the node anomaly patterns and the edge anomaly patterns; and performing bimodal weighted fusion of the text-modality anomaly probability and the structural-modality anomaly probability to obtain a target anomaly probability of the target business document, comparing the target anomaly probability with a preset risk probability threshold, and determining a risk type of the target business document based on the comparison result.
2. The document identification method based on a large language model as claimed in claim 1, wherein the performing domain-term training on a preset large language model by using the corpus data to obtain a business-domain base model comprises: performing word segmentation on the corpus data, and replacing each word in the segmented corpus data with the unique integer index corresponding to that word in a preset vocabulary; using each unique integer index as a row index into the embedding matrix of the embedding layer of the large language model, locating the word vector corresponding to the unique integer index in the embedding matrix through the row index, and applying a self-attention mechanism and a feedforward neural network to the word vectors to obtain a hidden state vector sequence containing contextual semantics; performing linear classification on each word vector in the hidden state vector sequence to obtain a probability distribution for each word vector; calculating a training loss value between the probability distribution of each word vector and the true probability value at the corresponding position of the word vector in the sequence of unique integer indices; back-propagating the training loss value through the large language model, and updating model parameters of the large language model based on the back-propagation; and when the training loss value is smaller than a preset loss threshold, taking the large language model with the updated model parameters as the business-domain base model.
3. The document identification method based on a large language model as claimed in claim 1, wherein the training an adapter layer in the business-domain base model by using the historical electronic document data to obtain an enterprise business model comprises: freezing target parameters in the business-domain base model, and taking the parameters of the unfrozen adapter layer in the business-domain base model as model training parameters; performing forward propagation through the business-domain base model on the historical electronic document data to obtain, for each document in the historical electronic document data, a probability distribution over different document categories, and selecting the category with the highest probability as the category label of each document; calculating, through a preset loss function, a loss value between the category label and the true category label of each document, and updating the model training parameters by using a preset low-rank adaptation algorithm; and when the loss value is smaller than a preset convergence threshold, updating the adapter layer with the updated model training parameters, and integrating the updated adapter layer into the business-domain base model to obtain the enterprise business model.
4. The document identification method based on a large language model as claimed in claim 1, wherein the optimizing model parameters in the enterprise business model according to the feature distances to obtain an anomaly identification optimization model comprises: calculating an average similarity between the homogeneous samples based on the feature representation vectors of the homogeneous samples, and calculating an average difference between the heterogeneous samples based on the feature representation vectors of the heterogeneous samples; when the average similarity is smaller than a preset similarity threshold and the average difference is larger than a preset difference threshold, generating a convergence signal for the feature distances; after the convergence signal is generated, calculating the gradient of the feature distances with respect to the model parameters by using a preset back-propagation algorithm, and updating the model parameters according to the gradient by using a preset optimizer; and taking the enterprise business model with the updated model parameters as the anomaly identification optimization model.
5. The document identification method based on a large language model as claimed in claim 1, wherein the analyzing, through the anomaly identification optimization model, semantic anomaly patterns and coding anomaly patterns among text fields in a preset target business document comprises: extracting the contents of a plurality of text fields in the target business document, and segmenting the text field contents by using the tokenizer of the anomaly identification optimization model to obtain a plurality of target word segmentation sequences; calculating a degree of logical association between the plurality of target word segmentation sequences based on their vector representations in the semantic space of the anomaly identification optimization model; determining a target word segmentation sequence whose degree of logical association is lower than a preset association threshold to be a semantic anomaly pattern; and extracting, from the target business document, a coding field that follows a preset coding format rule, parsing the coding field according to coding structure knowledge learned by the anomaly identification optimization model to obtain a conflict result between the constituent elements of the coding field and the field information of the target business document, and determining a coding field for which the conflict result exists to be a coding anomaly pattern.
6. The document identification method based on a large language model as claimed in claim 1, wherein the performing node feature propagation and edge weight learning on the graph structure by using a preset graph neural network to obtain node anomaly patterns and edge anomaly patterns comprises: for each node in the graph structure, iteratively aggregating, through a multi-layer message passing mechanism of the graph neural network, the feature information of the edges associated with the node and the feature information of the neighbor nodes of the node, and synchronously updating the edge weights of the associated edges; after the message passing process is finished, acquiring a final state vector of each node in the graph structure, calculating a feature cluster center of the nodes based on the final state vectors of all the nodes, and comparing the final state vector of each node with the feature cluster center to obtain a degree of deviation of each node; identifying each node whose degree of deviation exceeds a preset node deviation threshold, and determining the state of that node to be a node anomaly pattern; after the weight learning process is completed, acquiring a final weight value of each edge in the graph structure, and comparing the final weight value of each edge with a preset edge weight threshold; and identifying each edge whose final weight value is lower than the preset edge weight threshold, and determining the state of that edge to be an edge anomaly pattern.
7. The document identification method based on a large language model as claimed in claim 1, wherein the performing bimodal weighted fusion of the text-modality anomaly probability and the structural-modality anomaly probability to obtain a target anomaly probability of the target business document comprises: determining a text-modality weight and a structural-modality weight corresponding to the target business document according to the business type of the target business document; weighting the text-modality anomaly probability by the text-modality weight through a preset weighted voting mechanism to obtain first target data, and weighting the structural-modality anomaly probability by the structural-modality weight to obtain second target data; and linearly combining the first target data and the second target data to obtain the target anomaly probability of the target business document.
8. A document identification device based on a large language model, comprising: a domain-term training module, configured to collect corpus data of a target business domain, and perform domain-term training on a preset large language model by using the corpus data to obtain a business-domain base model; an adapter layer training module, configured to collect historical electronic document data and historical document anomaly type data of a target enterprise, and train an adapter layer in the business-domain base model by using the historical electronic document data to obtain an enterprise business model; a model optimization module, configured to convert positive and negative sample pairs in the historical document anomaly type data into feature representation vectors, calculate feature distances of homogeneous samples and of heterogeneous samples in the positive and negative sample pairs according to the feature representation vectors, and optimize model parameters in the enterprise business model according to the feature distances to obtain an anomaly identification optimization model; a text-modality anomaly probability analysis module, configured to analyze, through the anomaly identification optimization model, semantic anomaly patterns and coding anomaly patterns among text fields in a preset target business document, and output, through an activation function in the anomaly identification optimization model, a text-modality anomaly probability of the target business document for the semantic anomaly patterns and the coding anomaly patterns; a structural-modality anomaly probability analysis module, configured to construct a graph structure of the target business document according to document entities of the target business document and logical relationships between the document entities, perform node feature propagation and edge weight learning on the graph structure by using a preset graph neural network to obtain node anomaly patterns and edge anomaly patterns, and output, through a classifier in the graph neural network, a structural-modality anomaly probability of the target business document for the node anomaly patterns and the edge anomaly patterns; and a risk type identification module, configured to perform bimodal weighted fusion of the text-modality anomaly probability and the structural-modality anomaly probability to obtain a target anomaly probability of the target business document, compare the target anomaly probability with a preset risk probability threshold, and determine a risk type of the target business document based on the comparison result.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the large language model based document identification method of any one of claims 1 to 7 when the computer program is executed.
10. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the large language model-based document identification method according to any one of claims 1 to 7.
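
As an illustration of the fusion step recited in claim 7 and the threshold comparison recited in claim 1, the following is a minimal Python sketch, not the applicant's implementation. The weight table, the function names, the 0.5 threshold and the two risk labels are assumptions introduced for the example; the claims only state that the weights depend on the business type and that the fused probability is compared with a preset threshold.

```python
# Hypothetical weights per business type: (text-modality weight, structural-modality weight).
# The values are invented for illustration and each pair is assumed to sum to 1.
MODALITY_WEIGHTS = {
    "invoice": (0.6, 0.4),
    "reimbursement_form": (0.5, 0.5),
}


def fuse_anomaly_probability(business_type, p_text, p_struct):
    """Linearly combine the two weighted modality probabilities."""
    for p in (p_text, p_struct):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    w_text, w_struct = MODALITY_WEIGHTS[business_type]
    first_target = w_text * p_text        # weighted text-modality term
    second_target = w_struct * p_struct   # weighted structural-modality term
    return first_target + second_target


def risk_type(p_target, risk_threshold=0.5):
    """Compare the fused probability with a preset threshold (labels are assumed)."""
    return "anomalous" if p_target >= risk_threshold else "normal"


p = fuse_anomaly_probability("invoice", p_text=0.9, p_struct=0.2)
print(round(p, 2), risk_type(p))
```

Because the assumed weights sum to 1, the fused value stays in [0, 1] and can be compared directly with a probability threshold; a different weighting scheme would need its own normalization.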

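The anomaly checks recited in claim 6 can be sketched in a similar way. The example below assumes the graph neural network has already produced a final state vector per node and a final weight per edge, and only illustrates the post-processing: the cluster center is taken as the mean of the node vectors, the degree of deviation as the Euclidean distance to that center, and both thresholds are invented values. The node names, the function names and the choice of distance are assumptions; the claims do not fix them.

```python
import math


def node_deviations(states):
    """Distance of each node's final state vector from the mean vector (assumed cluster center)."""
    vectors = list(states.values())
    dim = len(vectors[0])
    center = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return {node: math.dist(vec, center) for node, vec in states.items()}


def find_anomaly_patterns(states, edge_weights, node_threshold, edge_threshold):
    """Flag nodes that deviate too far and edges whose learned weight is too low."""
    deviations = node_deviations(states)
    bad_nodes = sorted(n for n, d in deviations.items() if d > node_threshold)
    bad_edges = sorted(e for e, w in edge_weights.items() if w < edge_threshold)
    return bad_nodes, bad_edges


# Hypothetical final node states and edge weights for one document graph.
states = {
    "payer": [0.1, 0.2],
    "payee": [0.0, 0.1],
    "amount": [0.2, 0.1],
    "tax_code": [3.0, 3.0],
}
edge_weights = {
    ("payer", "payee"): 0.9,
    ("payer", "amount"): 0.8,
    ("amount", "tax_code"): 0.1,
}

nodes, edges = find_anomaly_patterns(states, edge_weights, node_threshold=2.0, edge_threshold=0.3)
print(nodes, edges)
```

With a single mean-based center, one extreme node also pulls the center toward itself, so the node threshold has to be chosen with that in mind; a robust center such as the median could be substituted under the same interface.
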
Description

Document identification method, device, equipment and medium based on large language model

Technical Field

The present invention relates to the field of large language models, and in particular to a document identification method, device, equipment and medium based on a large language model.

Background

As enterprises expand their business scale, the number of financial documents grows exponentially, and both document types (such as invoices, reimbursement forms and payment vouchers) and anomaly scenarios (such as forged documents, duplicate reimbursement and amount tampering) become increasingly diverse. Documents of different categories therefore need to be identified to meet diversified identification requirements. Existing document identification technology is mainly based on rule engines or simple machine learning models. A rule engine requires a large number of judgment conditions to be preset manually and has difficulty covering complex and changing anomaly patterns. A simple machine learning model has weak ability to parse unstructured document information (such as handwritten notes and blurred seals), and its generalization is limited by the sample distribution. Both lead to misidentified and missed document anomalies, so the accuracy of document identification is low.

Disclosure of Invention

The invention provides a document identification method, device, equipment and medium based on a large language model, which are used to solve the technical problem of low accuracy in document identification.
In a first aspect, a document identification method based on a large language model is provided, including: collecting corpus data of a target business domain, and performing domain-term training on a preset large language model by using the corpus data to obtain a business-domain base model; collecting historical electronic document data and historical document anomaly type data of a target enterprise, and training an adapter layer in the business-domain base model by using the historical electronic document data to obtain an enterprise business model; converting positive and negative sample pairs in the historical document anomaly type data into feature representation vectors, calculating feature distances of homogeneous samples and of heterogeneous samples in the positive and negative sample pairs according to the feature representation vectors, and optimizing model parameters in the enterprise business model according to the feature distances to obtain an anomaly identification optimization model; analyzing, through the anomaly identification optimization model, semantic anomaly patterns and coding anomaly patterns among text fields in a preset target business document, and outputting, through an activation function in the anomaly identification optimization model, a text-modality anomaly probability of the target business document for the semantic anomaly patterns and the coding anomaly patterns; constructing a graph structure of the target business document according to document entities of the target business document and logical relationships between the document entities, performing node feature propagation and edge weight learning on the graph structure by using a preset graph neural network to obtain node anomaly patterns and edge anomaly patterns, and outputting, through a classifier in the graph neural network, a structural-modality anomaly probability of the target business document for the node anomaly patterns and the edge anomaly patterns; and performing bimodal weighted fusion of the text-modality anomaly probability and the structural-modality anomaly probability to obtain a target anomaly probability of the target business document, comparing the target anomaly probability with a preset risk probability threshold, and determining a risk type of the target business document based on the comparison result.

In a second aspect, a document identification device based on a large language model is provided, including: a domain-term training module, configured to collect corpus data of a target business domain, and perform domain-term training on a preset large language model by using the corpus data to obtain a business-domain base model; an adapter layer training module, configured to collect historical electronic document data and historical document anomaly type data of a target enterprise, and train an adapter layer in the business-domain base model by using the historical electronic document data to obtain an enterprise business model; a model optimization module, configured to convert positive and negative sample pairs in the historical document anomaly type data into feature representation vectors, calculate feature distances of homogeneous samples and of heterogeneous samples in the positive and negative sample pairs according to the feature representation vectors, and optimize model parameters in the enterprise business model according to the feature distances to obtain an anomaly identification optimi