
CN-121981191-A - Fine tuning method for contract examination large language model

CN121981191A

Abstract

The invention provides a fine-tuning method for a contract review large language model, belonging to the technical field of artificial intelligence. The method comprises: S1, fusing multi-source laws and regulations, arbitration ruling elements and industry contract templates to generate a first training data set for supervised fine-tuning; S2, training an initial model on this data set through a chain fine-tuning strategy decoupled by task stage, with adaptive parameter adjustment driven by task complexity, to obtain a supervised fine-tuned model; S3, constructing a second training data set containing positive- and negative-example sample pairs based on expert knowledge; S4, training a reward model for compliance scoring on the second training data set; and S5, taking the supervised fine-tuned model as the policy model, performing reinforcement-learning fine-tuning with a KL-divergence-constrained policy optimization algorithm using signals provided by the reward model, combined with clause-level risk feedback and a high-risk clause penalty mechanism, and outputting the optimized contract review model.

Inventors

  • DUAN GUOHUA
  • WU LIN
  • LIU JIANYU
  • HU RENBING
  • WANG SIYU
  • JIANG TAO
  • FENG WEI

Assignees

  • 武汉数众科技有限公司 (Wuhan Shuzhong Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2025-12-30

Claims (10)

  1. A method for fine-tuning a contract review large language model, comprising the following steps in order: S1, fusing multi-source laws and regulations, arbitration ruling elements and industry contract templates, and generating a first training data set for supervised fine-tuning through a structural labeling and semantic nesting mechanism; S2, training an initial large language model with the first training data set, and obtaining a supervised fine-tuned model through task-complexity-driven adaptive parameter adjustment and a task-stage-decoupled chain fine-tuning strategy, wherein the chain fine-tuning strategy comprises, in order, four dedicated training stages: contract structure understanding, compliance clause matching, risk point identification and review suggestion generation; S3, based on expert knowledge, acquiring and labeling sample pairs each comprising a positive-example contract segment and a negative-example contract segment, wherein risk points are labeled for the negative-example contract segment and a compliance-adjusted example is provided, forming a second training data set comprising legal bases and compliance scores; S4, training a reward model on the second training data set, wherein the reward model is configured to perform compliance scoring on input contract text segments and their candidate revision texts, and the reward model is trained with a loss function based on pairwise sample preference ranking; S5, taking the supervised fine-tuned model as the policy model to be optimized, providing a reward signal with the reward model, performing reinforcement-learning fine-tuning of the policy model with a policy optimization algorithm under a KL-divergence constraint, introducing a clause-level risk feedback mechanism and a high-risk clause penalty mechanism during fine-tuning, and finally outputting an optimized contract review model.
  2. The method according to claim 1, wherein step S1 comprises: respectively performing structural processing and knowledge fusion on the multi-source laws and regulations, arbitration ruling elements and industry contract templates to generate the first training data set, wherein: legal texts are formatted into a tree structure with articles as nodes; fine-grained parsing of the articles is performed based on syntactic structure and role extraction, and a three-layer nested semantic tag system is introduced to annotate the articles, the three layers comprising a first-layer tag marking the contract life-cycle stage, a second-layer tag marking the contract application scenario and a third-layer tag marking contract role responsibilities; dispute case elements are extracted from ruling documents to construct an arbitration element map, and cluster analysis of the dispute points is performed with a clustering algorithm based on the arbitration element map to form a dispute focus network in which each dispute point is given a weight label; the industry contract templates are annotated via the three-layer nested semantic tag system; and the first training data set is generated by fusing the annotated laws and regulations, the cluster-weighted dispute focus network, and the annotated industry contract templates.
  3. The method according to claim 1, wherein in step S2 the learning rate of model training is dynamically controlled by task-complexity-driven adaptive parameter adjustment, specifically comprising: calculating a task complexity coefficient C based on the contract type, industry domain and structural complexity of the samples in the current training batch; dynamically calculating, according to C, an adaptive learning rate η for use in training the batch; wherein C is a weighted sum of the clause structure density D, the degree of semantic variation V and the industry span index S: C = α1·D + α2·V + α3·S, wherein α1, α2 and α3 are preset weight coefficients; and the adaptive learning rate η is computed from a preset base learning rate η0 and the coefficient C, wherein β is a hyper-parameter controlling the sensitivity of the learning rate to variation in task complexity.
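The adaptive-parameter scheme of claim 3 can be sketched in a few lines. The weighted-sum complexity coefficient follows the claim directly; the inverse-scaling learning-rate formula is an assumption, since the claim text does not reproduce the exact function of η0, β and C, and the example α weights are illustrative placeholders:

```python
def task_complexity(structure_density, semantic_variation, industry_span,
                    alphas=(0.4, 0.3, 0.3)):
    # Claim 3: C = a1*D + a2*V + a3*S, a weighted sum of three
    # batch-level indicators; the alpha values here are placeholders.
    a1, a2, a3 = alphas
    return a1 * structure_density + a2 * semantic_variation + a3 * industry_span

def adaptive_learning_rate(base_lr, complexity, beta=0.5):
    # Assumed inverse scaling: more complex batches take smaller steps;
    # beta controls the sensitivity to complexity, as in the claim.
    return base_lr / (1.0 + beta * complexity)
```

With beta = 1, a batch at maximal complexity C = 1 halves the base step size.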
  4. The method according to claim 1, wherein in step S2 training is performed by the task-stage-decoupled chain fine-tuning strategy, which decouples the contract review task into four dedicated stages executed in order, namely structure understanding, compliance matching, risk identification and suggestion generation, and chain fine-tuning is performed with a dedicated model structure or loss function for each stage, specifically comprising: in the structure understanding stage, the structural hierarchy of contract clauses is identified by syntactic parsing and role extraction, and an embedded representation e_i is computed for each clause unit from its clause text t_i and the number l_i of the structural level at which it is located; in the compliance matching stage, a dual-tower model structure is adopted to separately encode contract clauses and the corresponding legal provisions, and training minimizes a semantic cosine distance loss between the two encodings; in the risk identification stage, training is based on triplet samples, each triplet comprising a risk fragment, a risk type and a reason explanation, and the loss at this stage is the weighted sum of a classification loss and a generation loss, L = L_cls + λ·L_gen, wherein L_cls is the risk-type classification loss, L_gen is the generation loss of the explanation sentence, and λ is a hyper-parameter; and in the suggestion generation stage, compliant revision suggestions for contract clauses are generated in a sequence-to-sequence manner based on the output of the risk identification stage, and the revision suggestions are integrated with industry corpus templates to standardize the format and style of the generated text.
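The two stage-specific losses that claim 4 describes in words can be written out directly; plain-Python vector math keeps the sketch dependency-free, and the example encodings are of course placeholders for real dual-tower outputs:

```python
import math

def cosine_distance(u, v):
    # Compliance-matching stage: semantic cosine distance between the
    # clause encoding u and the legal-provision encoding v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def risk_stage_loss(classification_loss, generation_loss, lam=0.5):
    # Risk-identification stage (claim 4): L = L_cls + lambda * L_gen.
    return classification_loss + lam * generation_loss
```

Identical encodings give distance 0; orthogonal ones give distance 1, so minimizing the distance pulls a clause toward its matching provision.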
  5. The method according to claim 1, wherein in step S3 constructing the second training data set comprises: selecting compliant contract segments from real contract texts as positive-example samples, and annotating the applicable legal provision of each positive-example sample as its legal basis; collecting contract segments with compliance defects from contract review cases as negative-example samples, and labeling each negative-example sample with its risk types and corresponding legal consequences, wherein the risk types comprise ambiguous clauses, illegal clauses and missing clauses; and generating sample pairs from the positive-example and negative-example samples and assigning a compliance score to each sample pair, wherein positive-example samples score higher than negative-example samples, thereby forming structured data comprising contract segments, legal bases, risk labels and compliance scores as the second training data set.
  6. The method according to claim 1, wherein step S4 comprises: formatting the samples in the second training data set obtained in step S3 into quadruples (x, y, y*, r), wherein x is the original contract segment, y is the candidate output to be evaluated, y* is the standard compliant output provided by an expert, and r is a reward value preset according to sample compliance, with r = +1 for positive examples and r set to a negative value according to the degree of risk for negative examples; training a reward model on the formatted quadruple samples, wherein the input of the reward model is a contract segment x and a candidate output y, and its output is a scalar score representing the compliance of the candidate output; and, while training the reward model, constructing preference pairs (y_w, y_l) from the quadruple samples, wherein y_w is preferred over y_l, and optimizing with a pairwise ranking loss L = −log σ(s_w − s_l), wherein s_w and s_l are the reward model's predicted scores for the candidate outputs y_w and y_l.
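The pairing-ordering loss in claim 6 matches the standard Bradley-Terry reward-model objective; since the source text does not preserve the exact formula, the sigmoid form below is an assumption:

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    # L = -log(sigmoid(s_w - s_l)): the loss shrinks as the reward
    # model scores the preferred candidate y_w above the rejected y_l.
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; widening the margin in favor of y_w drives it toward zero, which is what pushes compliant revisions above defective ones.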
  7. The method according to claim 1, wherein in step S5 reinforcement-learning fine-tuning of the policy model π_θ is performed with a policy optimization algorithm under a KL-divergence constraint, updating the parameters of the policy model by minimizing the following loss function L_R = −E[R(x, y)] + β·D_KL(π_θ ‖ π_0), wherein the expectation E is taken over outputs y generated by the current policy model π_θ, R(x, y) is the compliance score given by the reward model for an input contract segment x and a generated output y, D_KL(π_θ ‖ π_0) is the KL divergence between the current policy model π_θ and the initial policy model π_0, used to constrain the magnitude of policy updates, and β is a hyper-parameter controlling the strength of the KL-divergence constraint, wherein the initial policy model π_0 is the supervised fine-tuned model obtained in step S2.
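Per sample, the KL-constrained objective of claim 7 reduces to the reward minus a scaled log-probability ratio (a single-sample estimate of the KL term). This is a minimal sketch of the reward shaping only, not a full policy-gradient update:

```python
def kl_penalized_reward(reward, logp_current, logp_initial, beta=0.1):
    # R(x, y) - beta * (log pi_theta(y|x) - log pi_0(y|x)):
    # maximizing this per-sample quantity corresponds to minimizing
    # the loss L_R of claim 7. Outputs the initial model would also
    # produce are not penalized; drifting outputs are.
    return reward - beta * (logp_current - logp_initial)
```

When the policy has not moved (equal log-probabilities) the shaped reward equals the raw compliance score; any drift from π_0 is taxed at rate β.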
  8. The method according to claim 1, wherein in step S5 the clause-level risk feedback mechanism is implemented as follows: the candidate output text y generated by the policy model for an input contract segment x is divided into a plurality of clauses c_1, …, c_n; the reward model independently scores the compliance of each clause c_i to obtain a clause score s_i; each clause c_i is assigned a preset importance weight w_i according to its contract clause type; and the weighted scores of all clauses are aggregated to obtain a total reward value R = Σ_i w_i·s_i for the policy model update.
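The clause-level aggregation of claim 8 is a weighted sum of per-clause scores; normalizing by the total weight is an assumed detail (it keeps the total reward on the same scale as an individual clause score):

```python
def clause_level_reward(clause_scores, clause_weights):
    # Claim 8: aggregate per-clause compliance scores s_i with preset
    # clause-type importance weights w_i into one total reward R.
    total_weight = sum(clause_weights)
    weighted = sum(w * s for w, s in zip(clause_weights, clause_scores))
    return weighted / total_weight
```

A non-compliant clause with triple the weight of a compliant one drags the total reward down three times as hard, which is the intended lever for clause-type importance.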
  9. The method according to claim 1, wherein step S5 further comprises a high-risk clause penalty and logical consistency verification mechanism, specifically comprising: predefining a set of high-risk clause patterns including, but not limited to, indefinite authorization, asymmetric disclaimer clauses and fuzzy performance definitions; introducing a fixed negative penalty term into the calculation of the total reward value when any of the high-risk clause patterns is detected in a candidate output generated by the policy model; predefining, in addition, a set of clause logical consistency rules, including that if a contract clause contains a conditional statement it must correspond to an explicit outcome description; verifying candidate outputs in real time against the logical consistency rules with a rule engine; and converting the verification result into a binary signal, positive if verification passes and negative if it fails, which is added to the total reward value as an additional reward term, together forming the final reward signal for updating the policy model.
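Claim 9 combines a fixed penalty for detected high-risk patterns with a binary consistency signal. The magnitudes below are illustrative assumptions; the claim fixes only their signs:

```python
def final_reward(base_reward, high_risk_detected, consistency_verified,
                 penalty=-1.0, consistency_bonus=0.5):
    # Fixed negative penalty when any high-risk clause pattern
    # (e.g. indefinite authorization) is found in the candidate output.
    reward = base_reward + (penalty if high_risk_detected else 0.0)
    # Binary logical-consistency signal: positive if the rule engine
    # verifies the output, negative otherwise (claim 9).
    reward += consistency_bonus if consistency_verified else -consistency_bonus
    return reward
```

A clean, verified output keeps its base reward plus the bonus; a risky, unverified one is pushed firmly negative, steering the policy away from such clauses.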
  10. The method according to claim 1, further comprising a risk-grading enhancement mechanism during the fine-tuning in step S5: presetting risk levels for different types of contract risks based on the dispute focus network and the weight labels extracted from the arbitration ruling elements; when the reward model scores clause compliance, fusing the risk level into the score calculation as a weighting factor, so that compliance defects corresponding to high risk levels generate stronger negative reward signals; and optimizing the policy model according to the reward signal integrated with the risk-level information, so as to improve the model's sensitivity in recognizing high-risk compliance problems.
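Claim 10 folds the preset risk level into the clause score as a weighting factor; the level-to-weight mapping below is an assumed placeholder, since the claim presets levels but not their numeric values:

```python
RISK_LEVEL_WEIGHTS = {1: 1.0, 2: 1.5, 3: 2.0}  # assumed mapping

def risk_weighted_clause_score(compliance_score, risk_level):
    # Claim 10: a defect (negative score) at a high risk level yields a
    # proportionally stronger negative reward signal; compliant clauses
    # at low levels pass through unchanged.
    return compliance_score * RISK_LEVEL_WEIGHTS[risk_level]
```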

Description

Fine-tuning method for a contract review large language model

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a contract review large language model fine-tuning method based on reinforcement learning and legal knowledge fusion, which is particularly suitable for application scenarios such as legal technology and intelligent contract review.

Background

With the development of artificial intelligence technology, large language models (LLMs) are increasingly used in legal technology fields such as legal document generation, question-answering systems and intelligent review. In contract review tasks in particular, using a language model to automatically understand contract clauses and identify their risks has become an important means of improving legal service efficiency and reducing compliance costs. However, existing large language models still face several technical bottlenecks in practical application. On the one hand, models understand legal text structure insufficiently. Legal clauses typically have complex logical structures and mandatory specifications, with linguistic features such as nested conditions, exceptional cases and multi-party obligations. General-purpose language models lack accurate modeling of these structures, struggle to correctly identify contract hierarchy and the division of responsibilities, and their clause-level semantic analysis suffers accordingly. On the other hand, models lack specialized compliance knowledge when identifying risk clauses. Many legal risks in contract review are not explicit violations but manifest as latent compliance flaws such as "ambiguous terms", "missing clauses" or "unequal obligations".
Current models are often trained on empirical samples, and low-frequency but high-risk compliance problems are difficult to generalize to and identify. In addition, clauses proposed by such models are often linguistically fluent but legally non-standard and logically loose, which undermines the reliability and usability of the model in the review stage. Some existing research attempts directed model optimization by fine-tuning model parameters and introducing legal question-answering and similar modes, but these methods still have the following defects: (1) the training corpus lacks multi-source structured legal knowledge and cannot cover the real review points of different contract types and industry scenarios; (2) the training process does not incorporate a feedback mechanism for compliance risk, so model outputs lack sensitivity to legal consequences and normative logic; and (3) the model optimization objective lacks fine-grained task decomposition and cannot separately model and optimize review stages such as structure understanding, clause compliance, risk localization and suggestion generation. A novel model fine-tuning method is therefore needed that fuses structured legal knowledge, dispute ruling elements and industry contract corpora, and that, combined with a reinforcement learning mechanism, introduces legal expertise into the model training process through explicit labels and risk scoring, significantly improving the interpretability, professionalism and practicality of the model in contract review tasks.
Disclosure of the Invention

In view of the technical drawbacks and shortcomings of the prior art, embodiments of the present invention provide a fine-tuning method for a contract review large language model which overcomes, or at least partially solves, the above problems. The specific scheme is as follows. As a first aspect of the present invention, there is provided a method for fine-tuning a contract review large language model, comprising the following steps in order: S1, fusing multi-source laws and regulations, arbitration ruling elements and industry contract templates, and generating a first training data set for supervised fine-tuning through a structural labeling and semantic nesting mechanism; S2, training an initial large language model with the first training data set, and obtaining a supervised fine-tuned model through task-complexity-driven adaptive parameter adjustment and a task-stage-decoupled chain fine-tuning strategy, wherein the chain fine-tuning strategy comprises, in order, four dedicated training stages: contract structure understanding, compliance clause matching, risk point identification and review suggestion generation; S3, based on expert knowledge, acquiring and labeling sample pairs each comprising a positive-example contract segment and a negative-example contract segment, wherein risk points are labeled for the negative-example contract segment, and a compliance adjustment