CN-122021618-A - Structured text segmentation verification method and system based on logarithmic linear tolerance budget

Abstract

The application provides a structured text segmentation verification method and system based on a log-linear tolerance budget. The method comprises: accessing the text blocks output by an external segmentation module, and reading and recording the total length of the original document and the length of each text block as input for the subsequent tolerance calculation; calculating the relative tolerance of the current document based on a log-linear mathematical model; applying an absolute loss lower limit and a relative tolerance upper limit to the preliminarily calculated tolerance as upper and lower bound correction; independently calculating the content loss and the content repetition introduced by text segmentation and outputting each as an absolute value; and comparing the calculated content loss and content repetition separately with the allowable error after boundary protection adjustment. The method thereby adaptively handles content loss and content repetition in documents of different scales.

Inventors

LI HUI
CHEN YINCHAO
SUN SHAOSHAN
CUI XIAOJING
SHI KE
YAN PINGPING

Assignees

中国航空工业集团公司成都飞机设计研究所

Dates

Publication Date: 20260512
Application Date: 20251227

Claims (10)

1. A structured document segmentation validation method based on a log-linear tolerance budget, the method comprising: step 1, accessing the text blocks output by an external segmentation module, and reading and recording the total length of the original document and the length of each text block as input for the subsequent tolerance calculation; step 2, calculating the relative tolerance of the current document based on a log-linear mathematical model, so that the tolerance decays smoothly with document scale and balances precision for short documents against robustness for long documents; step 3, performing upper and lower bound correction on the preliminarily calculated tolerance by applying an absolute loss lower limit and a relative tolerance upper limit, to prevent abrupt tolerance changes at very small or very large document lengths; step 4, independently calculating the content loss and the content repetition introduced by text segmentation, and outputting each as an absolute value to ensure the accuracy of the downstream tolerance comparison; step 5, comparing the calculated content loss and content repetition separately with the allowable error after boundary protection adjustment, and judging whether the segmentation quality passes or a review needs to be triggered; and step 6, according to the decision result, routing the text blocks that pass verification to downstream components such as a question-answering system and a vector index, and triggering an alarm or rollback mechanism when verification fails, thereby completing the whole verification pipeline.
2. The method according to claim 1, wherein step 2 comprises: calculating the log-linear tolerance using equation (1), where $n$ is the length of the source document, $n_{min}$ is the reference length, $T_0$ is the reference tolerance, $s$ is the log-decay slope, and the result is the calculated tolerance.
3. The method according to claim 2, wherein step 3 comprises: after the initial tolerance is calculated, performing boundary protection adjustment to ensure that the final tolerance neither exceeds the maximum relative tolerance upper limit $C_{rel}$ when the document length is too small, nor falls below the minimum absolute loss constant $C_{abs}$ when the document length is too large, thereby smoothing the tolerance curve, avoiding abrupt changes at extreme lengths, and ensuring the stability and consistency of the segmentation validation results.
4. The method according to claim 3, wherein step 4 comprises: independently calculating the absolute metric values of content loss and content repetition to provide accurate input for the subsequent tolerance comparison, specifically as follows: when the sum of all text block lengths is smaller than the original document length, the content loss amount $loss_{abs}$ is calculated using equation (2): $loss_{abs} = n - \sum_i L_i$ (2), where $n$ is the original document length and $\sum_i L_i$ is the sum of all text block lengths; when the sum of all text block lengths is greater than the original document length, the content repetition amount is calculated using equation (3) as $\sum_i L_i - n$ (3).
5. The method according to claim 4, wherein step 5 comprises: converting the relative tolerance $T_{final}$ obtained by the upstream calculation into the corresponding absolute allowable error Allowance, and comparing the content loss amount $loss_{abs}$ and the content repetition amount against it separately to produce a verification result, specifically as follows: converting to the absolute allowable error; comparing the content loss amount $loss_{abs}$ and the content repetition amount independently to ensure that both are within the acceptable range; and generating an explicit decision signal for subsequent processing: if both amounts are within the allowable error, the verification is judged as passed; if not, the verification is judged as failed, and an alarm or rollback mechanism is triggered.
6. A structured document segmentation validation system based on a log-linear tolerance budget, the system comprising: a text block data acquisition module, used for accessing the text blocks output by an external segmentation module, and reading and recording the total length of the original document and the length of each text block as input for the subsequent tolerance calculation; a log-linear tolerance calculation module, used for calculating the relative tolerance of the current document based on a log-linear mathematical model, so that the tolerance decays smoothly with document scale and balances precision for short documents against robustness for long documents; a boundary protection module, used for performing upper and lower bound correction on the preliminarily calculated tolerance by applying an absolute loss lower limit and a relative tolerance upper limit, to prevent abrupt tolerance changes at very small or very large document lengths; a loss and repetition measurement module, used for independently calculating the content loss and the content repetition introduced by text segmentation, and outputting each as an absolute value to ensure the accuracy of the downstream tolerance comparison; a threshold comparison decision module, used for comparing the calculated content loss and content repetition separately with the allowable error after boundary protection adjustment, and judging whether the segmentation quality passes or a review needs to be triggered; and a verification result routing module, used for routing, according to the decision result, the text blocks that pass verification to downstream components such as a question-answering system and a vector index, and triggering an alarm or rollback mechanism when verification fails, thereby completing the whole verification pipeline.
7. The system of claim 6, wherein the log-linear tolerance calculation module is further configured to calculate the log-linear tolerance using equation (1), where $n$ is the length of the source document, $n_{min}$ is the reference length, $T_0$ is the reference tolerance, $s$ is the log-decay slope, and the result is the calculated tolerance.
8. The system of claim 7, wherein the boundary protection module is further configured to, after the initial tolerance is calculated, perform boundary protection adjustment to ensure that the final tolerance neither exceeds the maximum relative tolerance upper limit $C_{rel}$ when the document length is too small, nor falls below the minimum absolute loss constant $C_{abs}$ when the document length is too large, thereby smoothing the tolerance curve, avoiding abrupt changes at extreme lengths, and ensuring the stability and consistency of the segmentation validation results.
9. The system of claim 8, wherein the loss and repetition measurement module is further configured to provide accurate input for the subsequent tolerance comparison by independently calculating the absolute metric values of content loss and content repetition, specifically as follows: when the sum of all text block lengths is smaller than the original document length, the content loss amount $loss_{abs}$ is calculated using equation (2): $loss_{abs} = n - \sum_i L_i$ (2), where $n$ is the original document length and $\sum_i L_i$ is the sum of all text block lengths; when the sum of all text block lengths is greater than the original document length, the content repetition amount is calculated using equation (3) as $\sum_i L_i - n$ (3).
10. The system of claim 9, wherein the threshold comparison decision module is configured to convert the relative tolerance $T_{final}$ obtained by the upstream calculation into the corresponding absolute allowable error Allowance, and to compare the content loss amount $loss_{abs}$ and the content repetition amount against it separately to produce a verification result, specifically as follows: converting to the absolute allowable error; comparing the content loss amount $loss_{abs}$ and the content repetition amount independently to ensure that both are within the acceptable range; and generating an explicit decision signal for subsequent processing: if both amounts are within the allowable error, the verification is judged as passed; if not, the verification is judged as failed, and an alarm or rollback mechanism is triggered.
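
The claimed steps 2 to 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the form of equation (1) (here a base-10 logarithm, T_raw = T_0 - s * log10(n / n_min)), the form of the boundary protection (a cap at C_rel and a floor of C_abs / n), the conversion Allowance = T_final * n, all parameter values, and all names (`Budget`, `raw_tolerance`, `final_tolerance`, `loss_and_repetition`, `verify`) are assumptions introduced for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    # All values below are illustrative assumptions, not taken from the claims.
    n_min: int = 100      # reference length n_min (characters)
    t0: float = 0.05      # reference tolerance T_0 at n_min
    s: float = 0.01       # log-decay slope s per decade of length
    c_rel: float = 0.05   # upper limit C_rel on the relative tolerance
    c_abs: float = 2.0    # lower limit C_abs on the absolute allowance


def raw_tolerance(n: int, b: Budget) -> float:
    # Assumed form of equation (1): T_raw = T_0 - s * log10(n / n_min).
    return b.t0 - b.s * math.log10(n / b.n_min)


def final_tolerance(n: int, b: Budget) -> float:
    # Assumed boundary protection: never above C_rel, and never so low
    # that the absolute allowance n * T would drop below C_abs.
    if n <= 0:
        raise ValueError("document length must be positive")
    return min(b.c_rel, max(raw_tolerance(n, b), b.c_abs / n))


def loss_and_repetition(n: int, block_lengths: list[int]) -> tuple[int, int]:
    # Equations (2) and (3): loss when the blocks sum to less than n,
    # repetition when they sum to more; at most one of the two is non-zero.
    total = sum(block_lengths)
    return max(n - total, 0), max(total - n, 0)


def verify(n: int, block_lengths: list[int], b: Budget = Budget()) -> bool:
    # Assumed conversion: Allowance = T_final * n, compared against both metrics.
    allowance = final_tolerance(n, b) * n
    loss_abs, repeat_abs = loss_and_repetition(n, block_lengths)
    return loss_abs <= allowance and repeat_abs <= allowance


if __name__ == "__main__":
    print(verify(100, [50, 47]))   # 3 characters lost, allowance 5
    print(verify(100, [50, 40]))   # 10 characters lost, allowance 5
```

With these assumed parameters, a 100-character document is allowed about 5 characters of loss or repetition, while a 1,000,000-character document is allowed about 1% of its length. A `True` result would correspond to routing the text blocks downstream, and `False` to triggering the alarm or rollback mechanism of step 6.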

Description

Structured text segmentation verification method and system based on logarithmic linear tolerance budget

Technical Field

The application belongs to the technical field of text processing, and in particular relates to a structured text segmentation verification method and system based on a log-linear tolerance budget.

Background

Many systems that process long documents must first split the document into small text blocks. This splitting can lose content or duplicate it. Existing detection methods either miss errors in short documents or raise excessive false alarms in long documents. Specifically:

1. Fixed absolute threshold content integrity monitoring

The basic idea is to set in advance a fixed upper limit on the character (or word) difference (for example, 1000 characters), calculate the absolute difference between the original document length and the sum of the lengths of all text blocks after segmentation, and trigger an alarm or rollback if the difference exceeds the threshold. The main defect is insensitivity to document length: a short document raises no alarm even when a high proportion of its content is lost, while a long document is falsely flagged even when very little is lost. These missed detections and false alarms degrade the subsequent retrieval and answer extraction tasks.

2. Fixed percentage threshold validation mechanism

The basic idea is to define the tolerable error as a fixed proportion (for example, 1%) of the document length, calculate the allowable error from that proportion after segmentation, and raise an alarm if the measured error exceeds it.
The main defect is that a fixed linear proportion cannot serve documents of different scales: a small punctuation difference in a short document triggers a false alarm, while a large document can lose several pages of content without triggering any alarm. The model is therefore too tight for short documents and too loose for long ones, and cannot adapt to both.

3. Linear attenuation adaptive model

The basic idea is to use a linear function so that the tolerance decreases linearly with document length; after segmentation, the relative tolerance is calculated from the current length and then converted into an absolute threshold for comparison. The main defects are a cliff effect, in which the tolerance changes abruptly at a certain critical length even when the length differs by only 1 character, making verification results discontinuous; in addition, each document type needs separate tuning, the parameters are hard to reuse across domains, maintenance cost is high, and results are unstable.

4. Multi-level threshold content verification framework

The basic idea is to divide documents into several intervals by length (such as short, medium, and oversized), configure an independent absolute or relative threshold for each interval, determine which interval a document belongs to after segmentation, and then load the corresponding threshold for verification. The main defects are that interval boundaries cause abrupt tolerance changes, since documents on either side of a boundary differ by only a few characters yet use completely different thresholds, so fault tolerance is incoherent; and the configuration and management complexity grows exponentially as the number of intervals increases.
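
The failure modes of the first two schemes can be illustrated with a short Python sketch. It applies the example values mentioned above (a fixed limit of 1000 characters and a fixed proportion of 1%) to four made-up documents. The function names and the document sizes are hypothetical and chosen only for illustration.

```python
def fixed_absolute_ok(n: int, total: int, limit: int = 1000) -> bool:
    # Scheme 1: pass when the absolute length difference is within a fixed limit.
    return abs(n - total) <= limit


def fixed_percentage_ok(n: int, total: int, ratio: float = 0.01) -> bool:
    # Scheme 2: pass when the difference is within a fixed share of the length.
    return abs(n - total) <= ratio * n


# A 1,200-character document that lost 900 characters (75%) still passes scheme 1.
short_pass = fixed_absolute_ok(1_200, 300)
# A 5,000,000-character document that lost 1,500 characters (0.03%) fails scheme 1.
long_fail = fixed_absolute_ok(5_000_000, 4_998_500)
# A 200-character document that differs by 3 characters (1.5%) fails scheme 2.
tiny_fail = fixed_percentage_ok(200, 197)
# A 5,000,000-character document that lost 40,000 characters still passes scheme 2.
huge_pass = fixed_percentage_ok(5_000_000, 4_960_000)

print(short_pass, long_fail, tiny_fail, huge_pass)  # True False False True
```

Scheme 1 misses a severe loss in the short document and flags a negligible one in the long document, while scheme 2 does the reverse: it flags a 3-character difference in the small document and accepts a 40,000-character loss in the large one.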
The four schemes each have advantages and disadvantages, but none can simultaneously satisfy the requirements of 'high precision for short documents' and 'low false alarms for long documents' across documents ranging from the hundred-word level to the million-word level.

Disclosure of Invention

The invention aims to provide a unified monitoring framework for a structured text segmentation pipeline that, by introducing a log-linear tolerance budget mechanism, adaptively handles content loss and content repetition in documents of different scales.

In a first aspect, the present application provides a structured document segmentation validation method based on a log-linear tolerance budget, the method comprising: step 1, accessing the text blocks output by an external segmentation module, and reading and recording the total length of the original document and the length of each text block as input for the subsequent tolerance calculation; step 2, calculating the relative tolerance of the current document based on a log-linear mathematical model, so that the tolerance decays smoothly with document scale and balances precision for short documents against robustness for long documents; step 3, p