CN-121980271-A - Training data set generation method and system based on multi-mode data fusion

Abstract

The invention belongs to the technical field of data processing and discloses a training data set generation method and system based on multi-modal data fusion. The method comprises: performing modality-specific preprocessing on multi-modal raw data to obtain a cross-modal standardized data set and a multi-dimensional basic feature set; generating a cross-modal fusion feature vector based on the multi-dimensional basic feature set and, in combination with a preset text risk examination rule base, determining the actual risk points of a contract and their associated data; determining risk labels for the actual risk points according to preset classification standards; binding the risk labels to the associated data through a unique data identifier (ID) to form a preliminary training data set; and optimizing the preliminary training data set to obtain a training data set for text risk examination. The training data set generated by the method is complete, accurate, balanced and timely, and can provide high-quality training samples for a model that examines text for legal compliance risk.

Inventors

WANG WEIHUA
SONG QINGMIN
ZHANG QIWEN

Assignees

Hangzhou Yixin Digital Development Co., Ltd. (杭州义信数字发展有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2026-02-06

Claims (10)

1. A training data set generation method based on multi-modal data fusion, characterized by comprising the following steps: S1, acquiring multi-modal raw data and performing modality-specific preprocessing on the multi-modal raw data to obtain a cross-modal standardized data set and a multi-dimensional basic feature set, wherein the multi-modal raw data comprises contract text modal data, legal provision modal data, standard specification modal data and contract-associated behavior modal data; S2, generating a cross-modal fusion feature vector based on the multi-dimensional basic feature set, and determining actual risk points of a contract and their associated data in combination with a preset text risk examination rule base; S3, determining risk labels for the actual risk points according to preset classification standards based on the associated data corresponding to the actual risk points, and binding the risk labels to the associated data through a unique data identifier (ID) to form a preliminary training data set; and S4, optimizing the preliminary training data set to obtain the training data set for text risk examination.
2. The training data set generation method based on multi-modal data fusion according to claim 1, wherein performing the modality-specific preprocessing on the multi-modal raw data to obtain the cross-modal standardized data set and the multi-dimensional basic feature set comprises: performing structural conversion and noise removal on the contract text modal data to obtain structured tag text and text semantic features, wherein the structured tag text comprises a clause ID, text content and associated entities corresponding to each text clause unit; performing structural analysis on the legal provision modal data and the standard specification modal data respectively, and constructing association relations to obtain a legal-standard knowledge graph, legal association features and standard association features, wherein the legal association features and the standard association features together form legal-standard association features; wherein the contract-associated behavior modal data comprises at least one of historical contract performance data, credit and litigation data of the contract signing parties, and interaction behavior log data of users on a contract examination platform; and allocating a unique data identifier to each contract in advance, and collating the structured tag text of the contract text modal data, the legal-standard knowledge graph of the legal provision modal data and the standard specification modal data, and the quantized structured data of the contract-associated behavior modal data to obtain the cross-modal standardized data set.
3. The training data set generation method based on multi-modal data fusion according to claim 2, wherein obtaining the structured tag text and the text semantic features comprises: cleaning and standardizing the contract texts in the contract text modal data, identifying and marking entities in the contract texts using a preset entity recognition model, dividing continuous text into independent text clause units according to the inherent chapter titles or natural paragraphs of the contract texts in combination with a deep learning paragraph segmentation model, and assigning clause IDs; and extracting sentence vectors of the text clause units using BERT models pre-trained for the legal and standards domains as basic semantic features, calculating, for each text clause unit, TF-IDF similarity vectors between the text clause unit and the legal provision modal data and between the text clause unit and the standard specification modal data, concatenating the two similarity vectors as legal-standard relevance features, and combining the basic semantic features and the legal-standard relevance features to form the text semantic features of the clause.
4. The training data set generation method based on multi-modal data fusion according to claim 3, wherein obtaining the legal-standard knowledge graph comprises: using the hierarchical structure inherent to the legal provisions and standard specifications, parsing each legal provision and each standard specification, through rule templates and a sequence labeling model, into a structured form comprising a name, a code, a hierarchical path and core content; extracting legal requirements and standard requirements from the core content to form legal/standard associated data; forming structured provisions/standards based on the structured form and the legal/standard associated data; and storing nodes and relation edges in a graph database to obtain a legal knowledge graph subgraph and a standard knowledge graph subgraph, and integrating the two subgraphs to form the legal-standard knowledge graph.
5. The training data set generation method based on multi-modal data fusion according to claim 4, wherein obtaining the quantized structured data and the behavior rule and quantization features comprises: parsing the interaction behavior logs into a standard event stream; parsing the credit and litigation data of the signing parties, which comprises credit reports and historical litigation records, mapping the ratings in the credit reports to numerical scores to obtain credit scores, and quantifying the historical litigation records into litigation indexes; quantifying the historical performance data of the contract, which comprises payment progress, breach records and acceptance results, by expressing the payment progress as a percentage, the breach records as a historical breach count and an average breach amount, and the acceptance results as binary values or scores, to form performance indexes; counting, for each contract, the total number of times risks were marked, the number of modified clauses and the number of questions users put to the AI, to form behavior statistical features; encoding the standard event stream with a recurrent neural network to obtain a behavior sequence vector representing the users' examination habits and points of attention, namely behavior sequence features; normalizing the credit scores, litigation indexes and performance indexes and concatenating them into a signing-party credit and performance feature vector, namely external data features; and concatenating the behavior statistical features, the behavior sequence features and the external data features to form the behavior rule and quantization features corresponding to the contract.
6. The training data set generation method based on multi-modal data fusion according to claim 5, wherein generating the cross-modal fusion feature vector based on the multi-dimensional basic feature set comprises: standardizing and dimension-aligning the text semantic features, the legal-standard association features and the behavior rule and quantization features; calculating initial weights of the text semantic features, the legal-standard association features and the behavior rule and quantization features respectively based on the legal-standard knowledge graph; constructing a multi-head attention layer in which each head corresponds to one type of risk subtask, integrating the outputs of the attention heads through a softmax function to generate a dynamic weight matrix, and dynamically calibrating, in real time, the initial weights of the text semantic features, the legal-standard association features and the behavior rule and quantization features to obtain final weights; performing, based on the final weights, a weighted summation of the text semantic features, the legal-standard association features and the behavior rule and quantization features to obtain initial fusion features; integrating the information in the initial fusion features through two fully connected layers to remove redundant information; and mapping the integrated features to the cross-modal fusion feature vector through a projection layer.
7. The training data set generation method based on multi-modal data fusion according to claim 6, wherein calculating the initial weights of the text semantic features, the legal-standard association features and the behavior rule and quantization features respectively based on the legal-standard knowledge graph comprises: for the initial weight of the text semantic features, extracting the maximum value of the TF-IDF similarity vector and determining the initial weight of the text semantic features according to that maximum value; for the initial weight of the legal-standard association features, assigning the initial weight according to a network centrality index; and for the initial weight of the behavior rule and quantization features, assigning the initial weight based on the signing-party credit and performance feature vector.
8. The training data set generation method based on multi-modal data fusion according to claim 3, wherein determining the actual risk points of the contract and their associated data in combination with the preset text risk examination rule base comprises: analyzing the clauses one by one, taking the text clause unit as the unit of analysis and combining the cross-modal fusion feature vector with the text risk examination rule base, to locate potential risk points; for each potential risk point in the resulting potential risk point list, tracing its root cause along the legal, standard, feature and behavioral dimensions, and integrating the findings into an attribution report; removing, through multi-layer verification, false risk points among the potential risk points that are caused by model misjudgment or rule adaptation errors, to obtain the actual risk points; and recording and storing the associated data corresponding to the actual risk points.
9. The training data set generation method based on multi-modal data fusion according to claim 8, wherein analyzing the clauses one by one, taking the text clause unit as the unit of analysis and combining the cross-modal fusion feature vector with the text risk examination rule base, to locate the potential risk points comprises: inputting the cross-modal fusion feature vector corresponding to each clause, one by one, into a preset risk prediction model, the model outputting a risk probability and a predicted risk type for the clause to form a model prediction result for each clause, and marking potential risk points; calling the text risk examination rule base to check the text content and associated features of each clause rule by rule, and marking potential risk points to obtain rule check results; and merging the model prediction results and the rule check results, eliminating duplicate marks, assigning a unique risk identifier to each potential risk point, and integrating them to form the potential risk point list.
10. A training data set generation system based on multi-modal data fusion, the system comprising: a data acquisition module, configured to acquire multi-modal raw data and perform modality-specific preprocessing on the multi-modal raw data to obtain a cross-modal standardized data set and a multi-dimensional basic feature set, wherein the multi-modal raw data comprises contract text modal data, legal provision modal data, standard specification modal data and contract-associated behavior modal data; a risk point determining module, configured to generate a cross-modal fusion feature vector based on the multi-dimensional basic feature set and to determine actual risk points of a contract and their associated data in combination with a preset text risk examination rule base; a training set generation module, configured to determine risk labels for the actual risk points according to preset classification standards based on the associated data corresponding to the actual risk points, and to bind the risk labels to the associated data through a unique data identifier (ID) to form a preliminary training data set; and a training set optimization module, configured to optimize the preliminary training data set to obtain the training data set for text risk examination.
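
As an illustration only, the weight calibration and weighted summation recited in claims 6 and 7 can be sketched in plain Python. The claims do not state how the attention heads are integrated or how the dynamic weights calibrate the initial weights, so this sketch assumes that per-head scores are averaged before the softmax and that calibration is an element-wise product followed by renormalization. The function names, feature-group names and all numbers are hypothetical, and the two fully connected layers and the projection layer are omitted.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features: dict[str, list[float]],
         initial_weights: dict[str, float],
         head_scores: list[dict[str, float]]) -> tuple[dict[str, float], list[float]]:
    """Calibrate initial weights with softmax scores, then take a weighted sum.

    features: dimension-aligned vectors, one per feature group.
    head_scores: one score per feature group from each attention head
                 (assumed to be computed elsewhere).
    """
    names = list(features)
    # Assumption: heads are integrated by averaging their scores per group.
    mean_scores = [sum(head[n] for head in head_scores) / len(head_scores)
                   for n in names]
    dynamic = softmax(mean_scores)
    # Assumption: calibration = initial weight * dynamic weight, renormalized.
    calibrated = [initial_weights[n] * d for n, d in zip(names, dynamic)]
    total = sum(calibrated)
    final = {n: c / total for n, c in zip(names, calibrated)}
    dim = len(features[names[0]])
    fused = [sum(final[n] * features[n][i] for n in names) for i in range(dim)]
    return final, fused


# Hypothetical inputs: three feature groups already aligned to 2 dimensions.
features = {
    "text_semantic": [1.0, 0.0],
    "legal_standard": [0.0, 1.0],
    "behavior": [1.0, 1.0],
}
initial_weights = {"text_semantic": 0.5, "legal_standard": 0.3, "behavior": 0.2}
head_scores = [
    {"text_semantic": 0.2, "legal_standard": 1.1, "behavior": 0.4},
    {"text_semantic": 0.9, "legal_standard": 0.3, "behavior": 0.1},
]
final_weights, initial_fusion = fuse(features, initial_weights, head_scores)
```

When every head gives all groups the same score, the dynamic weights are uniform and the final weights reduce to the normalized initial weights, which is a convenient sanity check for this particular calibration rule.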

Description

Training data set generation method and system based on multi-modal data fusion

Technical Field

The invention relates to the technical field of data processing, and in particular to a training data set generation method and system based on multi-modal data fusion.

Background

As artificial intelligence technology is applied more deeply across industries, specific vertical scenarios (such as intelligent legal services, legal compliance review and knowledge question answering) increasingly need high-quality, specialized training data sets. Traditional training data sets are built mainly through manual collection and labeling, which is costly, inefficient and limited in scale, and makes consistency hard to guarantee. The prior art includes methods for automatic data augmentation from a single data source (such as plain text or a database alone), but these cannot handle the complex associations among multi-source heterogeneous data in the real world, and the data sets they generate lack diversity and fit downstream business scenarios poorly. For text legal compliance risk examination in particular, existing training data set generation methods cannot take both multi-modal data association and risk guidance into account, and struggle to support the training of an accurate risk examination model. A training data set generation method and system is therefore needed that integrates multi-source multi-modal data, strengthens the association with business scenarios and risk guidance, and improves data structuring and suitability.

Disclosure of Invention

To overcome the defects of the prior art and achieve the above purposes, the invention provides the following technical scheme: a training data set generation method based on multi-modal data fusion, the method comprising: S1, acquiring multi-modal raw data and performing modality-specific preprocessing on the multi-modal raw data to obtain a cross-modal standardized data set and a multi-dimensional basic feature set, wherein the multi-modal raw data comprises contract text modal data, legal provision modal data, standard specification modal data and contract-associated behavior modal data; S2, generating a cross-modal fusion feature vector based on the multi-dimensional basic feature set, and determining actual risk points of a contract and their associated data in combination with a preset text risk examination rule base; S3, determining risk labels for the actual risk points according to preset classification standards based on the associated data corresponding to the actual risk points, and binding the risk labels to the associated data through a unique data identifier (ID) to form a preliminary training data set; and S4, optimizing the preliminary training data set to obtain the training data set for text risk examination.
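
Step S3 can be pictured with a minimal Python sketch, given here only as an illustration. The preset classification standards are not specified in the text, so `classify` below is a hypothetical stand-in that thresholds a hypothetical `risk_probability` field. The record layout and the identifiers `C-001`, `cl-3` and so on are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskPoint:
    data_id: str           # unique data identifier (ID) of the contract
    clause_id: str         # clause in which the actual risk point was found
    associated_data: dict  # data recorded for the actual risk point


def classify(associated_data: dict) -> str:
    """Hypothetical stand-in for the preset classification standards."""
    probability = associated_data.get("risk_probability", 0.0)
    if probability >= 0.8:
        return "high"
    if probability >= 0.5:
        return "medium"
    return "low"


def build_preliminary_dataset(risk_points: list[RiskPoint]) -> dict[str, list[dict]]:
    """Bind each risk label to its associated data under the contract's ID."""
    dataset: dict[str, list[dict]] = {}
    for point in risk_points:
        sample = {
            "clause_id": point.clause_id,
            "risk_label": classify(point.associated_data),
            "associated_data": point.associated_data,
        }
        dataset.setdefault(point.data_id, []).append(sample)
    return dataset


# Hypothetical actual risk points for two contracts.
points = [
    RiskPoint("C-001", "cl-3", {"risk_probability": 0.91}),
    RiskPoint("C-001", "cl-7", {"risk_probability": 0.55}),
    RiskPoint("C-002", "cl-1", {"risk_probability": 0.10}),
]
preliminary = build_preliminary_dataset(points)
```

Keying the samples by the contract's data ID keeps every label traceable to the contract and clause it came from, which is what later optimization in S4 would operate on.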
Further, performing the modality-specific preprocessing on the multi-modal raw data to obtain the cross-modal standardized data set and the multi-dimensional basic feature set comprises: performing structural conversion and noise removal on the contract text modal data to obtain structured tag text and text semantic features, wherein the structured tag text comprises a clause ID, text content and associated entities corresponding to each text clause unit; performing structural analysis on the legal provision modal data and the standard specification modal data respectively, and constructing association relations to obtain a legal-standard knowledge graph, legal association features and standard association features, wherein the legal association features and the standard association features together form legal-standard association features; wherein the contract-associated behavior modal data comprises at least one of historical contract performance data, credit and litigation data of the contract signing parties, and interaction behavior log data of users on a contract examination platform; and allocating a unique data identifier to each contract in advance, and collating the structured tag text of the contract text modal data, the legal-standard knowledge graph of the legal provision modal data and the standard specification modal data, and the quantized structured data of the contract-associated behavior modal data to obtain the cross-modal standardized data set.
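
One way to picture the cross-modal standardized data set is as one record per contract, keyed by the pre-allocated unique data identifier. The Python sketch below is illustrative only: the class names, field names, ID format and sample values are assumptions, and the knowledge graph is represented merely by a list of hypothetical node identifiers rather than a graph database.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ClauseUnit:
    clause_id: str
    text: str
    entities: list[str] = field(default_factory=list)


@dataclass
class StandardizedRecord:
    data_id: str                     # unique data identifier of the contract
    clauses: list[ClauseUnit]        # structured tag text
    graph_node_ids: list[str]        # links into the legal-standard knowledge graph
    behavior_data: dict[str, float]  # quantized structured behavior data


_id_counter = count(1)


def allocate_data_id() -> str:
    """Allocate a unique data identifier (hypothetical format)."""
    return f"CONTRACT-{next(_id_counter):06d}"


def standardize(clauses: list[ClauseUnit],
                graph_node_ids: list[str],
                behavior_data: dict[str, float]) -> StandardizedRecord:
    """Collate the three modality outputs under one contract ID."""
    return StandardizedRecord(allocate_data_id(), list(clauses),
                              list(graph_node_ids), dict(behavior_data))


# Hypothetical contract with one clause, two graph links and two indicators.
record = standardize(
    [ClauseUnit("cl-1", "The buyer shall pay within 30 days.", ["buyer"])],
    ["law-node-12", "std-node-4"],
    {"credit_score": 0.82, "breach_count": 1.0},
)
dataset = {record.data_id: record}
```

Allocating the identifier before collation means every downstream artifact (features, risk points, labels) can be joined back to the same contract by that single key.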
Further, obtaining the structured tag text and the text semantic features comprises: cleaning and standardizing the contract texts in the contract text modal data, identifying and marking entities in the contract texts using a preset entity recognition model, dividing continuous text into independent text clause units according to the inherent chapter titles or natural paragraphs of the contract texts in combination with a deep learning paragraph segmentation model, and assigning clause IDs; and extracting sentence vectors of the text clause units using BERT models pre-trained for the legal and standards domains as basic