CN-122019799-A - AI-based data retrieval system and method
Abstract
The invention discloses an AI-based data retrieval method, belonging to the technical field of big data information. The method comprises the following steps: S1, preprocessing historical data, namely constructing a training data set on the basis of a historical case data set containing case-related personnel information, event details and case classification labels, a historical rule matching result set containing the rule matching records corresponding to historical cases, and a historical structured report set containing generated official reports, report filling content and manual quality assessment records; removing redundancy and filtering invalid records through regular expressions; and marking multidimensional quality labels by combining manual review and data standardization; S2, extracting depth features, namely analyzing and extracting basic features of the historical big data, calculating cross-dimensional associated features and mining deep semantic features with a large model. In the AI accurate retrieval and matching step, the regulations within their validity interval are screened through a MySQL time index based on the case occurrence time, and regulations that have lapsed or have not yet taken effect are automatically excluded.
Inventors
- WU EN
- LIU YONGHUA
- FANG HAIJUN
- HU JIAN
- GU XIN
- FENG TINGTING
- DUAN SHAOPU
- LV XIAOPING
Assignees
- 浙江兰贝斯信息技术有限公司 (Zhejiang Lanbeisi Information Technology Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20251104
Claims (9)
- 1. The AI-based data retrieval method is characterized by comprising the following steps: S1, preprocessing historical data, namely constructing a training data set on the basis of a historical case data set containing case-related personnel information, event details and case classification labels, a historical rule matching result set containing the rule matching records corresponding to the historical cases, and a historical structured report set containing generated official reports, report filling content and manual quality assessment records, removing redundancy and filtering invalid records through regular expressions, and marking multidimensional quality labels by combining manual review and data standardization; S2, extracting depth features, namely analyzing and extracting basic features of the historical big data, calculating cross-dimensional associated features, and mining deep semantic features with a large model; S3, constructing a report evaluation model, namely constructing an XGBoost and BERT dual-module fusion model by taking the preprocessed historical data, basic features, associated features and depth features as inputs, optimizing hyper-parameters and evaluating the indexes of the model; S4, constructing a structured regulation knowledge base, namely cleaning the regulation data with regular expressions and manual review, disassembling the regulation data into atomized regulation items according to the 'Article x' marker, and storing 'illegal feature-regulation-effective time' associated data into a MySQL database after a large model parses the illegal features; S5, acquiring case data and constructing a retrieval instruction, namely acquiring the case data through an API, screening the effective regulations corresponding to the case time, and filling a dynamic prompt word template to generate the retrieval instruction; S6, AI-driven accurate retrieval and matching, namely analyzing the case time and illegal behaviors with a large model, and semantically matching the most suitable regulation items among the regulations within their validity period; S7, generating a structured report, namely calling a Jinja report template, generating a report by fusing case and regulation data through the docxtpl engine, extracting basic features, associated features and depth features of the generated report, and forming a report feature vector; and S8, evaluating the report with the report evaluation model, namely triggering the report evaluation model to take the report feature vector as input and output a quality score, giving problem suggestions, and using the accumulated data for model iteration.
- 2. The AI-based data retrieval method as recited in claim 1, wherein the step S2 comprises the following steps: S21, extracting basic features from the historical cases, rule matching results and report data, wherein the basic features comprise historical case features, namely the case type, level of personnel involved, case duration and number of illegal behavior keywords; historical rule matching features, namely the matched regulation file, regulation validity duration and number of illegal feature sub-behaviors; and historical report features, namely the report template type, number of filled fields, field missing rate and report generation duration; S22, extracting association features among cases, regulations and reports based on the basic features, wherein the association features comprise case-regulation time matching degree, case-regulation illegal feature matching degree, report-template field fitting degree and historical matching stability features; S23, performing deep semantic mining on the unstructured text in the historical data with an open-source large language model based on the basic features and associated features, and extracting high-order semantic features, wherein the high-order semantic features comprise illegal behavior semantic granularity features, report content logical consistency features, regulation applicability semantic compliance features and cross-case semantic migration features.
- 3. The AI-based data retrieval method as set forth in claim 1, wherein in the step S3, the preprocessed historical data samples are spliced with the corresponding basic features, associated features and depth features to form a feature matrix; the comprehensive quality label is used as the core label, and the matching accuracy, content integrity and compliance labels are used as auxiliary labels; the training and validation sets are divided at an 8:2 ratio; the features are adapted by an XGBoost and BERT dual-module architecture; the mean square error between the model's predicted score and the manually labeled score is used as the training objective; and training stops when the accuracy of the comprehensive quality score, the Pearson correlation coefficient between each dimension score and the manual label, and the matching accuracy prediction accuracy reach the preset standards.
- 4. The AI-based data retrieval method as set forth in claim 1, wherein in the step S4, non-core regulation content is matched and deleted through custom regular expressions; an online editing interface is provided for the cleaned files, so that content that is missing, redundant or erroneously deleted can be manually corrected; taking the 'Article x' expression at the start of each provision as the cutting mark, the cleaned regulation file is precisely disassembled into independent, atomized regulation items; each regulation item is semantically parsed by a large language model driven by custom prompt word engineering, wherein the system role is set as a legal specialist familiar with laws and regulations, defining the professional positioning of the large model; the user requirement is set to extract the several illegal behaviors in the regulation, simplify them into a piece of text, output it in the 'Article number: ...' format and keep all core information; the output of the large model is the illegal feature corresponding to the regulation item; and the associated data of the illegal features and regulation items are stored into the MySQL database, wherein the core fields comprise f_index (illegal feature), f_content (regulation item original text), f_title (regulation file name), f_number (regulation item number) and start_time/end_time (regulation validity interval), with the regulation item number and regulation file name as the unique key.
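The atomized disassembly of step S4 amounts to splitting the cleaned text at each "第x条" ("Article x") marker. The sketch below shows one way to express that cutting rule with a zero-width regex split; the sample text and pattern details are illustrative assumptions.

```python
import re

# Zero-width lookahead: split *before* each "第x条" marker so every
# atomized item retains its own article number.
ARTICLE_MARK = re.compile(r"(?=第[0-9一二三四五六七八九十百]+条)")

def split_articles(cleaned_text: str) -> list[str]:
    """Cut a cleaned regulation file into independent, atomized
    regulation items, one per '第x条' marker."""
    parts = [p.strip() for p in ARTICLE_MARK.split(cleaned_text)]
    return [p for p in parts if p]  # drop the empty leading fragment

# Hypothetical three-article sample regulation text.
sample = "第一条 禁止行为A。第二条 禁止行为B。第三条 附则。"
items = split_articles(sample)
```

Each resulting item would then be sent to the large model for illegal-feature extraction and stored as one row (f_index, f_content, f_title, f_number, start_time/end_time) in the MySQL knowledge base.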
- 5. The AI-based data retrieval method as set forth in claim 1, wherein in the step S5, a retrieval instruction adapted to the large model is dynamically generated based on the case data; all regulation items in the knowledge base whose validity interval satisfies case time ∈ [start_time, end_time] are screened out, and the candidate items are filled, in JSON format, into the lawer variable of the prompt word template; the prompt word template comprises a system role 1, a system role 2 and user content; the system role 1 acts as a legal rule expert that analyzes the case and outputs in the format '[name]: xxx [number]: xxx [illegal act]: xxx'; the system role 2 is the candidate item library {json.dumps(lawer, ensure_ascii=False)}, in which all fields must be preserved, the data structure must not be modified and no text content may be omitted; the user content is the existing case {case data}, and the best matching item is judged in combination with the candidate item library.
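Step S5 can be sketched end to end: screen items by validity interval, then fill them into the `lawer` slot with `json.dumps(..., ensure_ascii=False)` so Chinese text survives serialization. The in-memory list stands in for the MySQL table, and all row values and the template wording are hypothetical.

```python
import json

# Hypothetical stand-in for the MySQL knowledge-base rows (claim 4 fields).
knowledge_base = [
    {"f_number": "第一条", "f_title": "条例A", "f_index": "行为A",
     "start_time": "2020-01-01", "end_time": "2030-12-31"},
    {"f_number": "第二条", "f_title": "条例A", "f_index": "行为B",
     "start_time": "2010-01-01", "end_time": "2015-12-31"},
]

def build_instruction(case_data: str, case_time: str) -> str:
    """Screen items whose validity interval contains the case time
    (ISO date strings compare correctly as text), then fill them as
    JSON into the `lawer` slot of the dynamic prompt template."""
    lawer = [r for r in knowledge_base
             if r["start_time"] <= case_time <= r["end_time"]]
    return (
        "System role 1: legal rule expert; analyze the case and output as "
        "[name]: xxx [number]: xxx [illegal act]: xxx\n"
        "System role 2: candidate item library: "
        f"{json.dumps(lawer, ensure_ascii=False)}\n"
        f"User: existing case {case_data}; judge the best matching item "
        "in combination with the candidate item library."
    )

prompt = build_instruction("某案件描述", "2024-06-01")
```

In production the screening would be a MySQL query over the indexed start_time/end_time columns rather than a Python list comprehension.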
- 6. The AI-based data retrieval method as set forth in claim 1, wherein in the step S6, the large model extracts the accurate event occurrence time from the case data through semantic reasoning; taking the case time as a constraint, valid candidate regulations are rapidly screened out through the time index in the MySQL knowledge base; based on the dynamic prompt words, the large model deeply analyzes the illegal behaviors in the case and the illegal features of the candidate regulations, calculates the matching degree through semantic reasoning, and finally outputs the unique, best-matched regulation item.
- 7. The AI-based data retrieval method as set forth in claim 1, wherein in the step S7, a corresponding structured template is called from the template library according to the service requirements; Jinja template syntax is supported, including dynamic content, style inheritance, header and footer variables, logic control statements and multi-source data fusion filling; a docx-format template file is created, a data model is constructed in a Python program to map the template variables to the actual data, the template is loaded and the data is rendered, and a normalized official report is generated; and the basic features, associated features and depth features of the report are automatically extracted to form a report feature vector, which is directly input into the report evaluation model.
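Step S7's variable mapping and rendering can be sketched with docxtpl, whose `DocxTemplate` / `render` / `save` calls are its actual API; the template file, output path and every context key below are illustrative assumptions, so rendering only runs when a template file actually exists.

```python
from pathlib import Path

# Data model mapping Jinja template variables to case/regulation data
# (the variable names are hypothetical, not taken from the patent).
context = {
    "case_title": "案件报告",
    "case_time": "2024-06-01",
    "rule_number": "第一条",
    "rule_text": "禁止行为A。",
}

def render_report(template_path: str, out_path: str, ctx: dict) -> None:
    """Load a docx template, render the Jinja variables with the
    docxtpl engine, and save the normalized report."""
    from docxtpl import DocxTemplate  # third-party engine named in the claim
    doc = DocxTemplate(template_path)
    doc.render(ctx)
    doc.save(out_path)

# Only attempt rendering when the (hypothetical) template is present.
if Path("report_template.docx").exists():
    render_report("report_template.docx", "report.docx", context)
```

The template itself would contain placeholders such as `{{ case_title }}` in the docx body, header or footer, matching the keys of the context dict.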
- 8. The AI-based data retrieval method as set forth in claim 1, wherein in the step S8, the evaluation trigger mechanism is divided into automatic triggering and manual triggering: after report generation is completed, the system automatically transmits the report data and report feature vector to the evaluation model through an API without manual operation; under manual triggering, staff may initiate re-evaluation of a historical report; based on the input report feature vector, the evaluation model outputs a structured evaluation result comprising a comprehensive score, problem positioning and improvement suggestions, and a historical comparison reference; the score output comprises individual scores for matching accuracy, content integrity, semantic compliance and logical coherence, together with a comprehensive quality score; the problem positioning and improvement suggestions comprise automatically labeling the report as high-risk if any individual score is lower than its preset value, and, if the comprehensive quality score is lower than its preset value, analyzing the cause of the problem from the evaluation features, giving suggestions and prompting report regeneration; the report directly enters the subsequent service link when the comprehensive quality score is not lower than a first threshold, is automatically regenerated according to the problem positioning, correction suggestions and historical comparison reference when the score is lower than the first threshold but not lower than a second threshold, and is labeled high-risk for manual handling when the score is lower than the second threshold; and the evaluation results are stored as accumulated data for iteration of the evaluation model.
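The two-threshold routing of step S8 can be sketched as a small decision function; the threshold values and dictionary keys are assumptions for illustration, not values taken from the patent.

```python
def route_report(comprehensive: float, item_scores: dict[str, float],
                 first: float = 0.85, second: float = 0.6,
                 item_floor: float = 0.6) -> dict:
    """Illustrative routing for step S8: pass to the service link at or
    above the first threshold, regenerate with suggestions between the
    two thresholds, and flag high-risk for manual handling below the
    second; any single low item score also flags the report.
    All threshold values here are assumed defaults."""
    result = {"high_risk": any(v < item_floor for v in item_scores.values())}
    if comprehensive >= first:
        result["action"] = "enter_service_link"
    elif comprehensive >= second:
        result["action"] = "regenerate_with_suggestions"
    else:
        result["action"] = "manual_review"
        result["high_risk"] = True
    return result
```

Whatever the routing outcome, the score vector and decision would be stored as accumulated data for the next iteration of the evaluation model.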
- 9. The AI-based data retrieval system is characterized by being used for realizing the AI-based data retrieval method according to any one of claims 1-8, and comprises a historical data preprocessing module, a depth feature extraction module, a report evaluation model construction module, a structured regulation knowledge base module, a case data acquisition and retrieval instruction construction module, an AI accurate retrieval matching module, a structured report generation module and a report evaluation and data accumulation module; the historical data preprocessing module is used for removing redundancy and filtering invalid records through regular expressions on the basis of a historical case data set containing case-related personnel information, event details and case classification labels, a historical rule matching result set containing the rule matching records corresponding to the historical cases, and a historical structured report set containing generated official reports, report filling content and manual quality assessment records, and for constructing a training data set by combining manual review and data standardization to mark multidimensional quality labels; the depth feature extraction module is used for analyzing and extracting basic features of the historical big data, calculating cross-dimensional associated features and mining deep semantic features with a large model; the report evaluation model construction module is used for constructing an XGBoost and BERT dual-module fusion model by taking the preprocessed historical data, basic features, associated features and depth features as inputs, optimizing hyper-parameters and evaluating the indexes of the model; the structured regulation knowledge base module is used for cleaning the regulation data with regular expressions and manual review, disassembling the regulation data into atomized regulation items according to the 'Article x' marker, and storing the 'illegal feature-regulation-effective time' associated data into the MySQL database after the large model parses the illegal features; the case data acquisition and retrieval instruction construction module is used for acquiring case data through an API, screening the effective regulations corresponding to the case time, and filling a dynamic prompt word template to generate a retrieval instruction; the AI accurate retrieval matching module is used for analyzing the case time and illegal behaviors with the large model, and semantically matching the most suitable regulation items among the regulations within their validity period; the structured report generation module is used for calling a Jinja report template, generating a report by fusing case and regulation data through the docxtpl engine, extracting the basic features, associated features and depth features of the generated report, and forming a report feature vector; and the report evaluation and data accumulation module is used for triggering the report evaluation model to take the report feature vector as input and output a quality score, giving problem suggestions, and accumulating data for model iteration.
Description
AI-based data retrieval system and method Technical Field The invention relates to the technical field of big data information, in particular to an AI-based data retrieval system and method. Background The most widely used large-model retrieval approach on the market at present is RAG (retrieval-augmented generation). When RAG is applied to an official-report system, the regulation files are cut into fragments, the fragments are vectorized and stored in a vector database. During retrieval the case is vectorized, and the regulations best matching the case are found in the vector database using cosine similarity, Euclidean distance and other algorithms. However, the method has the following drawbacks: the cutting of the regulation files is unclear, so that several provisions may fall in one fragment and a complete provision may be cut into two fragments; and the accuracy of the vector matching degree is not high, which does not meet the requirements of an official-report system. Disclosure of Invention The invention aims to provide an AI-based data retrieval system and method, which can solve the problems of unclear cutting of regulation files and low accuracy of vector matching.
According to one aspect of the invention, the technical scheme is an AI-based data retrieval method, which specifically comprises the following steps: S1, preprocessing historical data, namely constructing a training data set on the basis of a historical case data set containing case-related personnel information, event details and case classification labels, a historical rule matching result set containing the rule matching records corresponding to the historical cases, and a historical structured report set containing generated official reports, report filling content and manual quality assessment records, removing redundancy and filtering invalid records through regular expressions, and marking multidimensional quality labels by combining manual review and data standardization; S2, extracting depth features, namely analyzing and extracting basic features of the historical big data, calculating cross-dimensional associated features, and mining deep semantic features with a large model; S3, constructing a report evaluation model, namely constructing an XGBoost and BERT dual-module fusion model by taking the preprocessed historical data, basic features, associated features and depth features as inputs, optimizing hyper-parameters and evaluating the indexes of the model; S4, constructing a structured regulation knowledge base, namely cleaning the regulation data with regular expressions and manual review, disassembling the regulation data into atomized regulation items according to the 'Article x' marker, and storing 'illegal feature-regulation-effective time' associated data into a MySQL database after a large model parses the illegal features; S5, acquiring case data and constructing a retrieval instruction, namely acquiring the case data through an API, screening the effective regulations corresponding to the case time, and filling a dynamic prompt word template to generate the retrieval instruction; S6, AI-driven accurate retrieval and matching, namely analyzing the case time and illegal behaviors with a large model, and semantically matching the most suitable regulation items among the regulations within their validity period; S7, generating a structured report, namely calling a Jinja report template, generating a report by fusing case and regulation data through the docxtpl engine, extracting basic features, associated features and depth features of the generated report, and forming a report feature vector; and S8, evaluating the report with the report evaluation model, namely triggering the report evaluation model to take the report feature vector as input and output a quality score, giving problem suggestions, and using the accumulated data for model iteration. Further, the step S2 specifically comprises the following steps: S21, extracting basic features from the historical cases, rule matching results and report data, wherein the basic features comprise historical case features, namely the case type, level of personnel involved, case duration and number of illegal behavior keywords; historical rule matching features, namely the matched regulation file, regulation validity duration and number of illegal feature sub-behaviors; and historical report features, namely the report template type, number of filled fields, field missing rate and report generation duration; S22, extracting association features among cases, regulations and reports based on the basic features, wherein the association features comprise case-regulation time matching degree, case-regulation illegal feature matching degree, report-template field fitting degree and historical matching stability features; S23, performing deep semantic mining on the unstructured text in the historical data with an open-source large language model based on the basic features and associated features, and extracting high-order semantic features, wherein the high-order semantic features comprise illegal behavior semantic granularity features, report content logical consistency features, regulation applicability semantic compliance features and cross-case semantic migration features.