CN-122021619-A - Automatic examination system and method for water conservancy science and technology reports based on an agent architecture

Abstract

The invention discloses an automatic examination system and method for water conservancy science and technology reports based on an agent architecture, and relates to the technical field of information processing in the water conservancy industry. The method comprises: (1) document analysis and semantic recombination based on a large language model (LLM): a document feature portrait is built, and the LLM cooperates with a rule engine to recombine fragmented text into examination units with independent context, solving the semantic splitting caused by segmentation; (2) dynamic examination task planning: based on the semantic features of each examination unit, the LLM autonomously generates a task topology comprising the components to call, their priority and the parallel strategy, so that the examination flow is arranged dynamically and on demand; and (3) domain-adapted hybrid-driven examination execution: for different tasks such as term, calculation and contract consistency examination, the examination is performed in a hybrid mode that combines dedicated fine-tuned models, a domain knowledge base and rigid rules. The invention effectively improves the parsing accuracy for long documents, the examination coverage and the professionalism of the examination conclusions.
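
As an illustration of step (2), the task topology can be pictured as a list of tasks, each naming a component, a priority and a parallel flag. The following is a minimal Python sketch under stated assumptions: the names `ReviewTask` and `plan_batches` and the component labels are hypothetical and do not come from the patent, and the topology is written by hand here, whereas in the described method it is generated by the LLM.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class ReviewTask:
    unit_id: str      # examination unit the task applies to
    component: str    # hypothetical component label, e.g. "term_check"
    priority: int     # lower number runs earlier
    parallel: bool    # whether the task may share a batch with others


def plan_batches(tasks):
    """Group tasks into execution batches ordered by priority.

    Tasks with the same priority that allow parallel execution share one
    batch; tasks that do not allow it each get a batch of their own.
    """
    batches = []
    ordered = sorted(tasks, key=lambda t: t.priority)
    for _, group in groupby(ordered, key=lambda t: t.priority):
        group = list(group)
        concurrent = [t for t in group if t.parallel]
        if concurrent:
            batches.append(concurrent)
        batches.extend([t] for t in group if not t.parallel)
    return batches


# Hand-written example topology (assumed values, for illustration only).
topology = [
    ReviewTask("unit-3", "calculation_check", 2, False),
    ReviewTask("unit-1", "cover_check", 1, True),
    ReviewTask("unit-2", "term_check", 1, True),
    ReviewTask("unit-4", "contract_consistency_check", 2, True),
]
batches = plan_batches(topology)
for i, batch in enumerate(batches):
    print(i, [t.component for t in batch])
```

Each batch could then be handed to a thread pool or run sequentially; the sketch only shows how priority and the parallel flag might determine execution order.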

Inventors

  • ZHOU YIFAN
  • DUAN HAO
  • TAN XINGYAN
  • ZHAO HONGLI
  • WANG JIANHUA
  • LIU SHIDA
  • LI HAO
  • SUN GUIYU

Assignees

  • China Institute of Water Resources and Hydropower Research (中国水利水电科学研究院)

Dates

Publication Date
2026-05-12
Application Date
2026-01-19

Claims (10)

  1. An automatic examination method for water conservancy science and technology reports based on an agent architecture, characterized by comprising the following steps: Step 1, document analysis and semantic recombination based on a large language model (LLM): receiving a water conservancy science and technology report document and a matching contract document uploaded by a user; parsing the documents with an OCR tool into physical text blocks carrying positioning information; extracting global attribute features, content distribution features and hierarchical topological features of the physical text blocks and combining them to generate a document feature portrait; guiding the LLM through prompt engineering to analyze the document feature portrait; and, in combination with a preset semantic recombination strategy and constraint verification by a rule engine, aggregating the fragmented text blocks into examination units with independent context semantics; Step 2, dynamic examination task planning: dynamically generating an examination task topology, wherein the examination task topology defines, for each atomic examination unit, the specific examination component type to be called, the execution priority and the parallel processing strategy; Step 3, domain-adapted hybrid-driven examination execution: according to the examination task topology of Step 2, calling the corresponding examination modules to execute the examination, wherein the examination modules adopt a hybrid driving mechanism combining dedicated fine-tuned models, a domain knowledge base and rigid rules, specifically: for general text examination tasks, calling a general large language model in combination with general rules; for specific professional examination tasks, including professional term examination, calculation examination and contract consistency examination, respectively loading LoRA fine-tuning weights (LoRA being a parameter-efficient fine-tuning method) trained on data of the specific field, and automatically mounting the corresponding domain vector knowledge base; during examination, recalling relevant knowledge from the mounted knowledge base by retrieval-augmented generation (RAG), and outputting an examination result containing problem positioning and modification suggestions by combining the reasoning capability of the dedicated LoRA fine-tuned model with a preset rigid rule template; Step 4, professional knowledge enhancement: constructing a multi-dimensional instruction fine-tuning dataset, training with a parameter-efficient fine-tuning method, and generating LoRA fine-tuned model weights respectively adapted to the term, calculation and contract consistency tasks, so as to improve the logical reasoning and standard adaptation capability of the model on specific tasks; Step 5, generation and summarization of examination results: after the examination by each module is completed, generating a structured result for each examined item, including the examination conclusion, the problem description or judgment basis, and document positioning information; and uniformly organizing all examination results to generate an overall examination quality report containing an overall quality evaluation of the document, a classified summary of the problems, modification suggestions and positioning traceability information.
  2. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein the document feature portrait in Step 1 comprises: global attribute features, including the total number of pages and the total number of text blocks of the document; content distribution features, including the number of tables, the number of formulas and their position indexes; and hierarchical topological features, including title text, title level, word count statistics and chapter nesting relations; the semantic recombination strategy comprises: structural rationality, namely re-judging text block attributes based on context and identifying text paragraphs wrongly recognized as titles; semantic integrity, namely identifying text blocks that have been split too finely; and the independence of text blocks such as the table of contents and the cover; the rule engine comprises granularity control rules, which force the merging of units that do not meet a threshold according to hierarchical topological priority, and structural integrity rules, which merge isolated titles or floating figure and table structures.
  3. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein in Step 2 the LLM is guided by a specially designed task planning prompt to complete the examination task assignment, namely understanding the content nature of the current examination unit, judging whether the examination unit needs to be examined, assigning suitable examination task types to the examination unit, and giving the task execution priority and parallelization suggestions.
  4. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein in Step 3 the examination modules comprise a cover examination module, a table-of-contents examination module, a numbering examination module, a typo examination module, an improper wording examination module, a water conservancy term examination module, a consistency examination module and a water conservancy calculation examination module, each sub-module being independently encapsulated, so that the segmented sub-document content can call the module of the corresponding task to perform the examination; each module can be expressed as: [formula missing in source]; wherein the terms of the expression denote, respectively, the examination model, comprising the base LLM and the fine-tuned model; the domain knowledge base; and the customized examination rules.
  5. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein Step 4 constructs the professional knowledge enhancement system that supports the operation of Step 3, and in its text segmentation: the water conservancy term knowledge base takes a single term as the basic unit; the water conservancy calculation relation base is divided into logical units by formula and its context, each semantic block being kept within the preset input-window length of the adapted vector model, so that it contains complete knowledge semantics while fitting the maximum input length of the vector model; the contract-report corpus adopts a mixed strategy in which a single term is the minimum labeling unit and a semantic paragraph is the basic segmentation unit; in the construction of the vectorized knowledge base, a vector model is selected as the encoder, and the vectorization process is as follows: each block in the set of knowledge semantic blocks is mapped to its vector representation [formula missing in source], wherein Encoder(·) is a pre-trained vector model; all vectors are built into a vector index library, and efficient approximate nearest neighbor search is realized through a vector retrieval algorithm.
  6. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein Step 4 constructs the retrieval and generation part of the professional knowledge enhancement system that supports the operation of Step 3; knowledge-enhanced retrieval adopts a task-oriented multi-knowledge-base collaborative retrieval mechanism that automatically routes to the corresponding knowledge sub-base according to the type of the examination unit: term examination calls the water conservancy term knowledge base, and calculation examination calls the water conservancy calculation relation base; the top-k related knowledge blocks are retrieved from the corresponding knowledge base; the retrieval results are spliced into the prompt template, the input taking the form: [formula missing in source]; wherein the components of the input are, respectively, the task instruction, the text block to be examined, and the associated information retrieved from the knowledge base.
  7. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein in Step 4 the instruction fine-tuning dataset comprises a term examination dataset, a calculation verification dataset and a contract consistency judgment dataset, whose data are derived respectively from the water conservancy term knowledge base, the water conservancy calculation relation base and the contract-report corpus; the whole construction process follows a unified technical path of expert template guidance, automatic generation by a large model, and manual verification and optimization, ensuring high quality and extensibility of the data in terms of semantic correctness, structural consistency and breadth of coverage.
  8. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 7, wherein the expert template guidance comprises three parts, namely task instruction, structured input content and expected output feedback, and defines positive and negative sample generation logic for each type of task; the automatic generation by a large model guides the model through prompts to generate diversified samples covering different semantic scenarios and common error types; the manual verification and optimization sets stratified sampling ratios according to task complexity for the term examination dataset, the calculation verification dataset and the contract consistency judgment dataset, and has experts verify the generated samples.
  9. The automatic examination method for water conservancy science and technology reports based on an agent architecture according to claim 1, wherein in Step 4 independent fine-tuned models are constructed for the three specific tasks of term examination, calculation examination and contract consistency examination, and fine-tuning is implemented as follows: 1) model parameter design: differentiated rank values of low-rank adaptation (LoRA) are configured for examination tasks of different complexity, the rank being set to 6-10 for the term examination task and to 14-18 for the complex calculation examination task that requires logical reasoning, and a regularization mechanism being introduced for the consistency task, which is prone to overfitting; 2) optimization strategy: parameters are updated with an adaptive moment estimation (Adam) optimizer, and weight decay is set for regularization; 3) training implementation: parallel training is realized on a distributed deep learning training framework, a gradient accumulation mechanism is introduced to reduce GPU memory usage, and an early stopping mechanism terminates training when performance on the validation set no longer improves.
  10. An automatic examination system for water conservancy science and technology reports based on an agent architecture, comprising a storage medium storing a computer program/instructions, wherein the computer program/instructions, when executed, perform the steps of the method according to claim 1; the automatic examination system adopts a four-layer architecture of a user interaction layer, a task planning layer, an examination execution layer and a tool layer, the layers cooperating through data interfaces to realize the complete report flow of "upload-parse-plan-examine-output": 1) the user interaction layer comprises a document upload interface, a natural language input interface and a result output interface, wherein the document upload interface supports uploading water conservancy science and technology reports in Word, PDF and scanned-copy formats; the natural language input interface receives task descriptions, examination requirements and special constraints and performs semantic parsing of the instructions; and the result output interface presents the examination report, problem list, modification suggestions and precise positioning information; 2) the task planning layer comprises a document parsing module, a feature portrait construction module, an examination unit construction module and a task planning engine, wherein the document parsing module recognizes document content through OCR/parsing tools, extracts text and records page number and paragraph position information; and the feature portrait construction module extracts global attribute features, special content distribution features and hierarchical topological features and combines them to generate the document feature portrait; 3) the examination execution layer comprises a cover examination module, a table-of-contents examination module, a numbering examination module, a typo examination module, an improper wording examination module, a water conservancy term examination module, a consistency examination module, a water conservancy calculation examination module and a general examination component, wherein the general examination component takes the examination model, the domain knowledge base, the examination rules and the text block to be examined as inputs and outputs the examination result after task planning, model reasoning, rule verification and retrieval augmentation; 4) the tool layer comprises a general large language model, a domain knowledge base, a retrieval augmentation engine, a document parsing and OCR tool chain, and a rule and template engine, wherein the general large language model supports structural analysis, task planning and basic text examination; the domain knowledge base integrates water conservancy terms, calculation relations and specification standards, and recalls relevant knowledge fragments through vector encoding and a retrieval engine; the document parsing and OCR tool chain outputs positioning information containing page number/paragraph/block ID; and the rule and template engine manages the specification templates and verification rules of the water conservancy domain for the examination modules to call.
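
The routing, top-k retrieval and prompt splicing described in claims 5 and 6 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the word-count `encode` function stands in for the pre-trained vector model, exact cosine ranking stands in for approximate nearest neighbor search, and the knowledge entries, function names and prompt layout are all hypothetical.

```python
import math
from collections import Counter


def encode(text):
    """Toy stand-in for the pre-trained vector model: a word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical knowledge sub-bases keyed by examination task type.
KNOWLEDGE_BASES = {
    "term": [
        "design flood level: the highest water level reached by the design flood",
        "dead storage: reservoir volume below the dead water level",
    ],
    "calculation": [
        "discharge Q equals flow velocity v times cross-section area A",
        "runoff depth equals runoff volume divided by catchment area",
    ],
}


def retrieve(task_type, query, k=1):
    """Route to the sub-base for task_type and return the top-k blocks."""
    q = encode(query)
    blocks = KNOWLEDGE_BASES[task_type]
    ranked = sorted(blocks, key=lambda blk: cosine(q, encode(blk)), reverse=True)
    return ranked[:k]


def build_prompt(instruction, text_block, retrieved):
    """Splice instruction, text under examination and retrieved knowledge."""
    knowledge = "\n".join(f"- {r}" for r in retrieved)
    return (
        f"Instruction: {instruction}\n"
        f"Text to examine: {text_block}\n"
        f"Retrieved knowledge:\n{knowledge}"
    )


text = "The discharge Q was computed from velocity v and area A of the channel"
hits = retrieve("calculation", text, k=1)
prompt = build_prompt("Check the calculation for errors.", text, hits)
print(prompt)
```

In a real deployment the encoder would be an embedding model and the ranking would be delegated to a vector index; the sketch keeps both in plain Python so that the data flow from task type to prompt stays visible.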

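The rule-engine constraints of claim 2 (granularity control and structural integrity) can be illustrated with a small merging routine. The sketch below is a simplified Python example under stated assumptions: the `Block` type, the `min_chars` threshold and the choice to merge short units into the preceding unit are hypothetical, and the hierarchical topological priority named in the claim is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str   # "title" or "text" (hypothetical labels)
    text: str


def build_units(blocks, min_chars=40):
    """Merge blocks into examination units with two simple rules.

    1. Structural integrity: a title is never left alone; it opens a unit
       and absorbs the text that follows it.
    2. Granularity control: a unit shorter than min_chars is force-merged
       into the preceding unit, if there is one.
    """
    units = []
    pending_title = None
    for block in blocks:
        if block.kind == "title":
            # Consecutive titles are chained so that none is dropped.
            pending_title = (
                block.text if pending_title is None
                else f"{pending_title}\n{block.text}"
            )
            continue
        if pending_title is None:
            units.append(block.text)
        else:
            units.append(f"{pending_title}\n{block.text}")
            pending_title = None
    if pending_title is not None:
        # A trailing title with no body is attached to the last unit.
        if units:
            units[-1] += f"\n{pending_title}"
        else:
            units.append(pending_title)

    merged = []
    for unit in units:
        if merged and len(unit) < min_chars:
            merged[-1] += f"\n{unit}"
        else:
            merged.append(unit)
    return merged


blocks = [
    Block("title", "3.2 Flood routing"),
    Block("text", "The reservoir routing uses the water balance equation over each time step."),
    Block("text", "See Table 3."),
    Block("title", "3.3 Results"),
    Block("text", "Peak outflow is reduced relative to peak inflow after routing through the reservoir."),
]
units = build_units(blocks)
print(len(units))
```

Here the short fragment "See Table 3." is folded into the unit for section 3.2, and each title stays attached to its body text.
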
Description

Automatic examination system and method for water conservancy science and technology reports based on an agent architecture

Technical Field

The invention relates to the technical field of information processing in the water conservancy industry, in particular to an automatic examination system and method for water conservancy science and technology reports based on an agent architecture.

Background

Water conservancy science and technology reports are core documents of project planning, construction and acceptance; their compliance with standards, logical rigor and consistency with contract clauses bear on the quality of project implementation. In the water conservancy field, report examination still relies mainly on manual review, and research on intelligent report examination methods is at an early stage. Manual review secures professional quality to a certain extent, but it can be inefficient and apply standards inconsistently when facing long, multi-dimensional examination tasks. An efficient and accurate automated examination method is therefore needed.

Current automatic report examination technology is built around three stages, namely document digitization, information extraction and intelligent examination, and its core technical path has become a common industry paradigm. In the document digitization stage, a dedicated examination vector knowledge base and rule base are constructed, the rule entries being processed by an embedding model and stored in a vector database; carriers such as PDF, Word and scanned copies are handled by a document format parsing module, with OCR used to extract text/table content and page layout information; a layout analysis model then parses structural units such as paragraphs, tables and titles; and, after text cleaning and redundancy elimination, a structured document is built from layout and semantics.

In the information extraction stage, techniques such as segment-level semantic classification, keyword triggering, sequence labeling models and template matching are combined to identify the examination target data in the report, such as project background, basic attributes, usage scenarios, key parameters and important indicators. In the intelligent examination stage, based on a retrieval-augmented generation (RAG) mechanism, the large model retrieves from the knowledge base according to the structured parameters; the report fragments, structured parameters and retrieved rule entries are input to the LLM as a combined context; and, together with rule judgment logic, examination conclusions are output, covering parameter completeness, limit-value compliance, logical conflicts and compliance assessment.

However, existing intelligent examination methods are difficult to adapt to water conservancy science and technology reports, which are structurally complex, highly specialized and subject to examination tasks in multiple dimensions. They have the following shortcomings in long-text processing, multi-task coordination and domain suitability.

First, document parsing and segmentation are weak. General OCR tools struggle to preserve explicit structures of professional documents such as hierarchical titles and cross-page table relations, and fixed-length or simple semantic segmentation tends to cut through tightly coupled semantic blocks such as formula derivations and clause logic, so that the context needed for examination is lost and logical chains are broken. Second, the examination flow is rigid. A statically fixed task sequence and text binding strategy is adopted, so the examination path cannot be adjusted dynamically according to changes in document structure or anomalies in upstream steps (such as chapter identification); this wastes computing power and easily leads to misplaced tasks and missed key nodes. Third, a single architecture is difficult to adapt to the multi-dimensional, heterogeneous examination tasks of the water conservancy field, and professional depth and extensibility are insufficient. The prior art relies on a single mode of "general LLM + plug-in knowledge base or simple rules", is mostly limited to basic examination such as parameter completeness, and can hardly meet the deeper, strongly differentiated requirements of term standardization, calculation logic and contract consistency. Lacking adaptation to the knowledge system of the water conservancy vertical domain, a general model is hard to extend to newly added professional examination dimensions and is highly prone to hallucination for want of domain knowledge, so that examination conclusions lack professional interpretability and credibility.

Disclosure of Invention

In view of the above problems, the invention provides an automatic water conserva