CN-121981104-A - Method, device, equipment and storage medium for analyzing and checking long text
Abstract
The invention provides a long-text-oriented analysis and verification method, device, equipment and storage medium that improve verification efficiency and accuracy through intelligent slicing and hybrid scheduling. According to the semantic relevance among the fragments of the text, fragments with reference or dependency relationships are distributed to serial auditing nodes for deep, context-coherent verification, and independent fragments are distributed to parallel auditing nodes for concurrent processing, so that auditing time is shortened from the order of hours to the order of minutes while the semantic fragmentation of long text is avoided. In addition, a predefined structured instruction set loaded in the auditing node constrains the output behavior of the large language model, which suppresses the hallucinations and misjudgments that arise when the model processes complex tables and nested clauses and improves the reliability of the checking result. The method is flexible and extensible, and is suitable for automated, intelligent checking of professional long texts such as contracts and bidding documents.
Inventors
- TANG CHENG
- YANG YANQUAN
- LIANG JIANWEN
- HUANG QIANG
- CAI ZHI
- WU YIQI
- TANG JIECHEN
Assignees
- 广东联合电子服务股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260408
Claims (10)
- 1. A long-text-oriented analysis and verification method, characterized by comprising the following steps: receiving a long text to be checked; identifying and dividing paragraph boundaries of the long text to be checked to generate a plurality of text fragments; distributing the text fragments to a target auditing node according to the semantic relevance among the text fragments; and, in the target auditing node, loading and applying a predefined structured instruction set to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, wherein the structured instruction set is used for constraining the output behavior of the long text auditing model in the process of generating the auditing result; wherein the target auditing node comprises a serial auditing node or a parallel auditing node, the serial auditing node is used for processing text fragments with reference relations or contextual relevance, and the parallel auditing node is used for processing text fragments with mutually independent contents.
- 2. The long-text-oriented analysis and verification method of claim 1, wherein before the step of loading and applying a predefined structured instruction set in the target auditing node to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, the method further comprises: when a text fragment contains a table or content having a nested hierarchy, flattening the table or nested-hierarchy content and converting it into a plain-text entry format with hierarchical numbering.
- 3. The long-text-oriented analysis and verification method of claim 1, wherein after the step of loading and applying a predefined structured instruction set in the target auditing node to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, the method further comprises: acquiring feedback information submitted by a user on the text checking result; structuring the collected feedback information, and storing it in a system knowledge base in association with the corresponding text fragments, auditing results and applied structured instruction set; periodically calling the long text auditing model to analyze the feedback information accumulated in the system knowledge base, and generating a structured feedback report containing a specific problem description, a preliminary cause analysis and optimization suggestions; and updating the structured instruction set or the text slicing logic based on the structured feedback report so as to optimize subsequent text checking.
- 4. The long-text-oriented analysis and verification method according to claim 1, wherein the step of identifying and dividing paragraph boundaries of the long text to be checked to generate a plurality of text fragments specifically comprises: performing coarse segmentation on the long text to be checked, identifying format features in the document based on predefined regular-expression matching rules so as to determine initial segmented paragraphs; performing semantic verification and fine-grained division on the initial segmented paragraphs obtained by the coarse segmentation, so as to merge related content that was split by mistake; filtering out auxiliary text content that does not require semantic auditing according to the auditing task requirements; and generating the text fragments according to the result of the fine-grained division and filtering.
- 5. The long-text-oriented analysis and verification method according to claim 1, wherein the step of distributing the text fragments to a target auditing node according to the semantic relevance among the text fragments specifically comprises: analyzing the semantic relations among the text fragments, and identifying a first group of fragments having direct reference, indirect reference or contextual logical dependency relations and a second group of fragments whose contents are independent; distributing the first group of fragments to the serial auditing node for deep cross-validation using its ability to maintain continuous context; and distributing the second group of fragments to the parallel auditing nodes so as to improve auditing throughput using their parallel processing capacity.
- 6. The long-text-oriented analysis and verification method of claim 1, wherein the step of distributing the second group of fragments to the parallel auditing nodes so as to improve auditing throughput using their parallel processing capacity comprises: acquiring, in real time, state indexes of all parallel auditing nodes in the system, wherein the state indexes at least comprise the current computing power utilization rate, the length of the pending task queue and the available memory; and calculating a real-time load score for each parallel auditing node based on the state indexes, and preferentially distributing each independent text fragment in the second group of fragments to the parallel auditing node with the lowest current load score for processing.
- 7. The long-text-oriented analysis and verification method of claim 1, wherein the structured instruction set comprises: a role definition instruction for defining the role of the long text auditing model as a text check executor; a task boundary instruction for defining the input, output and processing range of the text auditing task; an output format instruction for forcing the long text auditing model to output the checking result in structured form according to a preset template; and a constraint instruction for prohibiting the long text auditing model from rewriting, optimizing or interpreting the long text to be checked and the template when the text auditing task is a template-comparison audit.
- 8. A long-text-oriented analysis and verification device, comprising: a text receiving module for receiving a long text to be checked; a text segmentation module for identifying and dividing paragraph boundaries of the long text to be checked to generate a plurality of text fragments; a text distribution module for distributing the text fragments to a target auditing node according to the semantic relevance among the text fragments; and a text checking module for loading and applying a predefined structured instruction set in the target auditing node to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, wherein the structured instruction set is used for constraining the output behavior of the long text auditing model in the process of generating the auditing result; wherein the target auditing node comprises a serial auditing node or a parallel auditing node, the serial auditing node is used for processing text fragments with reference relations or contextual relevance, and the parallel auditing node is used for processing text fragments with mutually independent contents.
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the long-text-oriented analysis and verification method of any one of claims 1 to 7.
- 10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the long text oriented analysis and verification method according to any one of claims 1 to 7.
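The distribution steps recited in claims 5 and 6 can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than the patented implementation: `Fragment`, `NodeStatus`, the `REFERENCE` pattern, the score weights and the caps are all assumptions. The sketch detects only explicit clause references, whereas the claims also cover indirect references and contextual logical dependencies, which would need semantic analysis.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for explicit cross-references such as "Clause 1.3".
REFERENCE = re.compile(r"\b(?:Clause|Section|Article)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


@dataclass
class Fragment:
    clause_id: str
    text: str


@dataclass
class NodeStatus:
    name: str
    compute_utilization: float  # 0.0 to 1.0
    queue_length: int           # pending tasks
    free_memory_gb: float


def split_by_dependency(fragments):
    """Return (dependent, independent) fragments based on explicit references."""
    known = {f.clause_id for f in fragments}
    linked = set()
    for f in fragments:
        for target in REFERENCE.findall(f.text):
            if target in known and target != f.clause_id:
                # Both the referencing and the referenced fragment need shared context.
                linked.update({f.clause_id, target})
    dependent = [f for f in fragments if f.clause_id in linked]
    independent = [f for f in fragments if f.clause_id not in linked]
    return dependent, independent


def load_score(node, max_queue=10, max_memory_gb=64.0):
    """Weighted load score; the weights and caps are illustrative assumptions."""
    queue_term = min(node.queue_length / max_queue, 1.0)
    memory_term = 1.0 - min(node.free_memory_gb / max_memory_gb, 1.0)
    return 0.5 * node.compute_utilization + 0.3 * queue_term + 0.2 * memory_term


def assign_parallel(independent, nodes):
    """Send each independent fragment to the node with the lowest current score."""
    assignment = {}
    for fragment in independent:
        target = min(nodes, key=load_score)
        assignment[fragment.clause_id] = target.name
        target.queue_length += 1  # assumed local bookkeeping until metrics refresh
    return assignment
```

In this sketch the dependent fragments would be handed to a single serial auditing node in document order, while `assign_parallel` spreads the remainder. A real deployment would refresh `NodeStatus` from live metrics instead of incrementing the queue length locally.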
Description
Method, device, equipment and storage medium for analyzing and checking long text
Technical Field
The present invention relates to the field of natural language processing, and in particular to a long-text-oriented analysis and verification method, device, equipment and storage medium.
Background
In professional fields such as bidding documents, contracts, and policies and regulations, compliance and consistency auditing of texts that run to tens of thousands of words or more is a critical but extremely laborious task. Traditionally, this work has relied heavily on manual effort: auditors check the text word by word and sentence by sentence, which is very time-consuming (for example, auditing a five-thousand-word document takes about two hours on average), and fatigue or oversight easily leads to missed issues, so efficiency and accuracy are hard to achieve together. In recent years, with the development of large language models, schemes that use a model directly for full-text auditing have appeared. These schemes are limited by the length of the model's context window: a very long text often has to be forcibly segmented, so semantic association information between chapters is lost, and cross-chapter clause comparison and reference auditing cannot be completed effectively. Meanwhile, when facing complex tables and multi-level nested clauses, a large model is prone to "hallucination", making wrong judgments or generating false content, and its false-positive rate is high. Furthermore, although automated tools based on predefined rules can improve efficiency to some extent, they lack deep semantic understanding, fail to recognize terms that are expressed differently but are semantically identical (e.g., "three years of experience" and "3 years of experience"), and are insufficiently flexible and adaptable.
In view of the foregoing, the drawbacks of the prior art need to be addressed.
Disclosure of Invention
The invention provides a long-text-oriented analysis and verification method, device, equipment and storage medium, which address the defects of the prior art and enable automated, intelligent checking of ultra-long professional texts.
The invention provides a long-text-oriented analysis and verification method, comprising the following steps: receiving a long text to be checked; identifying and dividing paragraph boundaries of the long text to be checked to generate a plurality of text fragments; distributing the text fragments to a target auditing node according to the semantic relevance among the text fragments; and, in the target auditing node, loading and applying a predefined structured instruction set to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, wherein the structured instruction set is used for constraining the output behavior of the long text auditing model in the process of generating the auditing result. The target auditing node comprises a serial auditing node or a parallel auditing node, wherein the serial auditing node is used for processing text fragments with reference relations or contextual relevance, and the parallel auditing node is used for processing text fragments with mutually independent contents.
According to the long-text-oriented analysis and verification method provided by the invention, before the step of loading and applying a predefined structured instruction set in the target auditing node to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, the method further comprises: when a text fragment contains a table or content having a nested hierarchy, flattening the table or nested-hierarchy content and converting it into a plain-text entry format with hierarchical numbering.
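The flattening step can be sketched in Python as follows. This is only one plausible reading of "plain-text entry format with hierarchical numbering". The function names `flatten_nested` and `flatten_table`, the `(title, children)` input shape and the `column: value` row layout are assumptions introduced for illustration and are not specified by the source.

```python
def flatten_nested(items, prefix=""):
    """Flatten nested (title, children) clauses into hierarchically numbered lines."""
    lines = []
    for index, (title, children) in enumerate(items, start=1):
        number = f"{prefix}{index}"
        lines.append(f"{number} {title}")
        # Children inherit the parent's number as their prefix, e.g. "1." -> "1.1".
        lines.extend(flatten_nested(children, prefix=f"{number}."))
    return lines


def flatten_table(header, rows, prefix):
    """Turn each table row into one numbered entry of 'column: value' pairs."""
    lines = []
    for index, row in enumerate(rows, start=1):
        cells = "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        lines.append(f"{prefix}.{index} {cells}")
    return lines


# Hypothetical input: a nested clause tree and a small table from section 3.
clauses = [
    ("Qualification requirements", [
        ("Project manager", [("At least 3 years of experience", [])]),
        ("Certificates", []),
    ]),
    ("Payment terms", []),
]
flat_clauses = flatten_nested(clauses)
flat_table = flatten_table(["Item", "Quantity"], [["Server", "2"], ["Switch", "4"]], prefix="3")
```

Here `flat_clauses` begins with `"1 Qualification requirements"` followed by `"1.1 Project manager"`, and each table row becomes a single line such as `"3.1 Item: Server; Quantity: 2"`, so the auditing model receives the hierarchy as explicit numbers rather than as layout.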
According to the long-text-oriented analysis and verification method provided by the invention, after the step of loading and applying a predefined structured instruction set in the target auditing node to call a long text auditing model to execute a text auditing task so as to obtain a text checking result, the method further comprises: acquiring feedback information submitted by a user on the text checking result; structuring the collected feedback information, and storing it in a system knowledge base in association with the corresponding text fragments, auditing results and applied structured instruction set; periodically calling the long text auditing model to analyze the feedback information accumulated in the system knowledge base, and generating a structured feedback report containing a specific problem description, a preliminary cause analysis and optimization suggestions; and updating the structured instruction set or