CN-122019731-A - Question and answer data synthesis method and system based on plan driving

Abstract

The invention provides a plan-driven question-answer data synthesis method and system in the technical field of data processing. The method comprises: compiling an input user configuration to generate a synthesis specification; generating, based on the synthesis specification and a document set, an execution plan set comprising a plurality of execution plan items with mutually independent generation intentions; generating a question-answer record for each execution plan item; performing quality verification on each question-answer record and storing it into a final record set if the acceptance condition is met; if the acceptance condition is not met and the retry upper limit has not been reached, repeating the generation and verification steps based on correction suggestions; and finally aggregating the question-answer records to generate a final data set. By compiling the user configuration into a synthesis specification, the invention removes uncertainty at the source; by generating an execution plan set whose items target independent intentions, it decomposes the synthesis target precisely and avoids distribution drift; and by establishing a closed-loop governance mechanism that retries based on correction suggestions, it improves the pass rate and produces structured question-answer data that conforms to the initial multidimensional constraints with highly consistent quality.

Inventors

WANG PEIYANG
WANG YIWEI
TAN CHANG

Assignees

安徽飞数信息科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260410

Claims (20)

1. A plan-driven question-answer data synthesis method, characterized by comprising the following steps: compiling an input user configuration to generate a synthesis specification; generating an execution plan set based on the synthesis specification and a preset document set, wherein the execution plan set comprises a plurality of execution plan items with mutually independent generation intentions; generating a question-answer record corresponding to each execution plan item; performing quality verification on the question-answer record corresponding to any execution plan item, wherein when the quality of the question-answer record is judged to meet a preset acceptance condition, the question-answer record is stored into a final record set; and aggregating all question-answer records in the final record set to generate a final data set.
2. The plan-driven question-answer data synthesis method according to claim 1, further comprising, before compiling the input user configuration: generating a global random seed; and deriving, based on the global random seed, a corresponding local random seed for each sub-module and/or each execution plan item through a deterministic hash function, wherein the local random seeds are independent of execution order and concurrency.
3. The plan-driven question-answer data synthesis method according to claim 2, wherein deriving a corresponding local random seed for each sub-module and/or each execution plan item through a deterministic hash function comprises: concatenating the global random seed, a target module identifier, a target task identifier and a retry count, and performing a deterministic hash operation on the concatenated result to obtain the local random seed.
4. The plan-driven question-answer data synthesis method according to claim 1, wherein generating a question-answer record corresponding to each execution plan item comprises: loading the corresponding original text paragraph according to the source document identifier and paragraph identifier specified in the execution plan item; splitting the original text paragraph into an ordered sentence sequence, and adding a position label to each sentence; selecting a corresponding instruction template according to the target question type and the constraint conditions specified in the execution plan item; assembling the ordered sentence sequence with the position labels and the instruction template into a prompt; and inputting the prompt into a large language model to generate the question-answer record, wherein the question-answer record comprises at least a question stem, an answer, an analytical reasoning process, and an evidence field referencing the position labels.
5. The plan-driven question-answer data synthesis method according to claim 1, wherein performing quality verification on the question-answer record corresponding to any execution plan item comprises: a structural format checking step of checking the field integrity and field types of the question-answer record, and checking whether the position labels cited in the evidence field fall within the sentence number range of the original text paragraph; a semantic consistency checking step of taking the set of original sentences pointed to by the evidence field of the question-answer record as a premise text, taking the answer and the analysis in the question-answer record as a hypothesis text, and performing semantic entailment judgment on the premise text and the hypothesis text to obtain a faithfulness judgment result; and a logical specification checking step of scoring the difficulty of the question in the question-answer record, and comparing the difficulty score with a target difficulty interval specified in the execution plan item to obtain a difficulty matching result.
6. The method of claim 5, wherein the semantic consistency checking step further comprises: extracting entities and keywords from the question stem of the question-answer record to obtain a first entity set; extracting entities and keywords from the original sentences pointed to by the evidence field to obtain a second entity set; calculating the ratio of the size of the intersection of the first entity set and the second entity set to the size of the first entity set to obtain a question anchoring score; and marking the question-answer record as a question hallucination when the question anchoring score is lower than a preset threshold.
7. The plan-driven question-answer data synthesis method according to claim 1, further comprising, when it is determined that the quality of the question-answer record does not meet the acceptance condition and the number of times the question-answer record has been regenerated has not reached a preset retry upper limit: calculating an uncertainty score of the question-answer record; when the uncertainty score exceeds a preset uncertainty threshold, invoking, at the next regeneration of the question-answer record, a model tier with higher capability than the current model tier; and when the uncertainty score does not exceed the uncertainty threshold, keeping the current model tier unchanged.
8. The method of claim 7, wherein calculating an uncertainty score of the question-answer record comprises: acquiring component values from at least two uncertainty signal sources, wherein the uncertainty signal sources comprise at least two of model self-evaluation confidence, generation probability entropy, self-consistency divergence and external verification confidence; and normalizing the component values and then fusing them by weighting to obtain the uncertainty score.
9. The plan-driven question-answer data synthesis method according to claim 1, further comprising: when the quality of the question-answer record is judged not to meet the acceptance condition and the number of times the question-answer record has been regenerated has not reached a preset retry upper limit, regenerating the question-answer record corresponding to the execution plan item based on correction suggestions generated by the verification, and performing quality verification on the regenerated question-answer record, until the regenerated question-answer record meets the acceptance condition or the number of regenerations reaches the retry upper limit; when the quality of the question-answer record does not meet the acceptance condition and the number of regenerations reaches the retry upper limit, marking the execution plan item as being in a rejected state, and reporting a failure signal to a core coordinator; and executing, by the core coordinator, a governance policy based on the current global state, wherein the global state comprises at least a cumulative failure rate, the number of remaining unexecuted execution plan items, and the consumed cost.
10. The plan-driven question-answer data synthesis method according to claim 9, wherein executing the governance policy comprises: when the cumulative failure rate exceeds a preset failure rate threshold and the number of remaining unexecuted execution plan items is greater than zero, triggering a rescheduling operation, discarding the execution plan items marked as being in the rejected state, and generating new execution plan items to supplement the execution plan set; when the cumulative failure rate does not exceed the failure rate threshold, confirming the rejected state without triggering global intervention; and terminating the current task when the consumed cost exceeds a preset cost threshold.
11. The plan-driven question-answer data synthesis method according to claim 1, wherein storing the question-answer record into a final record set comprises: performing normalized serialization on the question-answer record to generate a record fingerprint; determining a storage path based on the record fingerprint and performing an existence check; and when no corresponding file exists at the storage path, writing the normalized question-answer record into a temporary file and, after performing integrity verification on the temporary file, converting the temporary file into a formal file through an atomic rename operation.
12. The plan-driven question-answer data synthesis method according to claim 11, wherein performing normalized serialization on the question-answer record to generate a record fingerprint comprises: selecting the fields related to the semantic result from the question-answer record, and excluding volatile fields; sorting the key names of the selected fields in lexicographic order, and performing encoding normalization on the character strings; and performing a hash operation on the resulting byte stream to obtain the record fingerprint.
13. The plan-driven question-answer data synthesis method according to claim 1, further comprising, before generating a question-answer record corresponding to each execution plan item, executing a cache query on the current execution plan item: performing normalized serialization on the current execution plan item and the static configuration, and then performing a hash operation to generate a plan fingerprint; querying a first-level cache with the plan fingerprint, wherein the first-level cache is a Bloom filter; when the first-level cache returns a "not present" result, determining a cache miss, and executing the step of generating a question-answer record for the current execution plan item; when the first-level cache returns a "possibly present" result, querying a second-level cache with the plan fingerprint for an exact lookup, wherein the second-level cache is a key-value mapping table; and when the second-level cache hits and the state of the corresponding entry is accepted, reading the existing question-answer record associated with the entry for reuse, and skipping the steps of generating the question-answer record and performing the quality verification for the current execution plan item.
14. The plan-driven question-answer data synthesis method according to claim 1, further comprising, after generating the final data set: generating a task-level lineage record file, wherein the task-level lineage record file comprises at least a task identifier, a synthesis specification fingerprint, a global random seed, a plan fingerprint set, a record fingerprint set and a metrics summary.
15. A plan-driven question-answer data synthesis system, comprising: a synthesis specification compiling module, configured to compile an input user configuration to generate a synthesis specification; an execution plan generation module, configured to generate an execution plan set based on the synthesis specification and a preset document set, wherein the execution plan set comprises a plurality of execution plan items; a question-answer synthesis module, configured to generate a question-answer record corresponding to each execution plan item; a quality verification and gating module, configured to perform quality verification on the question-answer record corresponding to any execution plan item, wherein when the quality of the question-answer record is judged to meet a preset acceptance condition, the question-answer record is stored into a final record set; and a delivery aggregation module, configured to aggregate all question-answer records in the final record set to generate a final data set.
16. The plan-driven question-answer data synthesis system of claim 15, further comprising: a storage and cache module, configured to perform normalized serialization on the question-answer record to generate a record fingerprint; wherein the storage and cache module is further configured to determine a storage path based on the record fingerprint and perform an existence check; and the storage and cache module is further configured to, when no corresponding file exists at the storage path, write the normalized question-answer record into a temporary file and, after performing integrity verification on the temporary file, convert the temporary file into a formal file through an atomic rename operation.
17. The plan-driven question-answer data synthesis system of claim 15, further comprising: a core coordinator, configured to, when the cumulative failure rate exceeds a preset failure rate threshold and the number of remaining unexecuted execution plan items is greater than zero, trigger a rescheduling operation, discard the execution plan items marked as being in the rejected state, and generate new execution plan items to supplement the execution plan set; wherein the core coordinator is further configured to confirm the rejected state without triggering global intervention when the cumulative failure rate does not exceed the failure rate threshold; and the core coordinator is further configured to terminate the current task when the consumed cost exceeds a preset cost threshold.
18. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the plan-driven question-answer data synthesis method according to any one of claims 1 to 14.
19. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the plan-driven question-answer data synthesis method according to any one of claims 1 to 14.
20. A computer program product comprising a computer program which, when executed by a processor, implements the plan-driven question-answer data synthesis method according to any one of claims 1 to 14.
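
By way of non-limiting illustration, the two hash-based operations recited in claims 3 and 12 might be sketched in Python as follows. The claims do not name a hash algorithm, a separator, or any field names, so SHA-256, the "|" separator, Unicode NFC normalization, and the names `derive_local_seed`, `record_fingerprint` and `VOLATILE_FIELDS` are assumptions introduced here for the example only.

```python
import hashlib
import json
import unicodedata

# Hypothetical names of fields that vary between runs without changing meaning.
VOLATILE_FIELDS = {"created_at", "latency_ms", "model_tier"}


def derive_local_seed(global_seed: int, module_id: str, task_id: str,
                      retry_count: int) -> int:
    """Concatenate the four inputs and hash them (claim 3).

    The result depends only on the inputs, never on execution order or
    concurrency. SHA-256 and the "|" separator are assumptions.
    """
    material = f"{global_seed}|{module_id}|{task_id}|{retry_count}"
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # keep 64 bits as the seed


def _normalize(value):
    """Apply NFC normalization to every string, recursing into containers."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    return value


def record_fingerprint(record: dict) -> str:
    """Drop volatile fields, sort keys, normalize strings, hash (claim 12)."""
    kept = {k: _normalize(v) for k, v in record.items()
            if k not in VOLATILE_FIELDS}
    # sort_keys gives a lexicographic key order; compact separators avoid
    # whitespace differences in the serialized byte stream.
    payload = json.dumps(kept, sort_keys=True, ensure_ascii=False,
                         separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under these assumptions, two records that differ only in a volatile field or in key order yield the same fingerprint, and a retry of the same execution plan item receives a different local seed because the retry count is part of the hashed material.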

Description

Question and answer data synthesis method and system based on plan driving

Technical Field

The invention relates to the technical field of data processing, and in particular to a plan-driven question-answer data synthesis method and system.

Background

With the continuous expansion of large language model application scenarios, the demand for high-quality, distribution-controllable, large-scale question-answer data sets for training vertical-domain models has increased sharply. Existing question-answer data synthesis pipelines generally expand the data scale through end-to-end automatic generation by a large language model. Such a scheme mainly segments the original input documents directly, feeds the segments into the large language model together with a fixed, general-purpose prompt, and outputs large amounts of question-answer pairs continuously and randomly in pipelined batches, relying entirely on the free divergence and probability sampling capabilities of the large language model.

This end-to-end generation mode, which depends on probability sampling, lacks overall planning of the macroscopic synthesis target and fine-grained constraints on the microscopic generation process, so the question type and difficulty distributions of the output data are prone to unpredictable drift. Moreover, when an individual question-answer item output by the model is below standard, such a scheme can only resort to mechanical blind resampling or post-hoc truncation and filtering; it cannot apply targeted local corrections that address the specific cause of the error. This wastes considerable computing power and makes it difficult to stably deliver a high-quality structured data set that strictly conforms to the initial multidimensional constraints.

Disclosure of Invention

The invention provides a plan-driven question-answer data synthesis method and system, which overcome the above defects in the prior art and generate structured question-answer data that strictly conforms to the initial multidimensional constraints and has highly consistent quality.

The invention provides a plan-driven question-answer data synthesis method, comprising the following steps: compiling an input user configuration to generate a synthesis specification; generating an execution plan set based on the synthesis specification and a preset document set, wherein the execution plan set comprises a plurality of execution plan items with mutually independent generation intentions; generating a question-answer record corresponding to each execution plan item; performing quality verification on the question-answer record corresponding to any execution plan item, wherein when the quality of the question-answer record is judged to meet a preset acceptance condition, the question-answer record is stored into a final record set; and aggregating all question-answer records in the final record set to generate a final data set.

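
As a minimal, non-limiting sketch, the five steps of the method (compile, plan, generate, verify, aggregate) could be wired together in Python as shown below. Every name here (`PlanItem`, `compile_spec`, `build_plan`, `synthesize`, and the configuration keys `total`, `question_types` and `difficulty`) is hypothetical and not taken from the disclosure; the generator and verifier are passed in as plain callables so that the example stays independent of any particular large language model, and the retry loop is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PlanItem:
    """One execution plan item carrying its own generation intent."""
    item_id: str
    doc_id: str
    question_type: str
    difficulty: str


def compile_spec(user_config: dict) -> dict:
    """Step 1: turn a loose user configuration into an explicit specification."""
    return {
        "total": int(user_config.get("total", 4)),
        "question_types": list(user_config.get("question_types",
                                               ["short_answer"])),
        "difficulty": user_config.get("difficulty", "medium"),
    }


def build_plan(spec: dict, documents: Dict[str, str]) -> List[PlanItem]:
    """Step 2: emit one plan item per record to be produced."""
    if not documents:
        raise ValueError("the document set must not be empty")
    doc_ids = sorted(documents)  # sorted so the plan is reproducible
    types = spec["question_types"]
    return [
        PlanItem(f"item-{i}", doc_ids[i % len(doc_ids)],
                 types[i % len(types)], spec["difficulty"])
        for i in range(spec["total"])
    ]


def synthesize(user_config: dict, documents: Dict[str, str],
               generate: Callable[[PlanItem, str], dict],
               verify: Callable[[PlanItem, dict], bool]) -> dict:
    """Steps 3 to 5: generate per item, gate on verification, aggregate."""
    spec = compile_spec(user_config)
    plan = build_plan(spec, documents)
    final_records = []
    for item in plan:
        record = generate(item, documents[item.doc_id])  # step 3
        if verify(item, record):                         # step 4
            final_records.append(record)
    return {"spec": spec, "records": final_records}      # step 5
```

Because each plan item fixes the source document, question type and difficulty before generation starts, the distribution of the final data set is determined by the plan rather than by sampling, which is the design point the method emphasizes.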
The invention also provides a plan-driven question-answer data synthesis system, comprising the following modules: a synthesis specification compiling module, configured to compile an input user configuration to generate a synthesis specification; an execution plan generation module, configured to generate an execution plan set based on the synthesis specification and a preset document set, wherein the execution plan set comprises a plurality of execution plan items; a question-answer synthesis module, configured to generate a question-answer record corresponding to each execution plan item; a quality verification and gating module, configured to perform quality verification on the question-answer record corresponding to any execution plan item, wherein when the quality of the question-answer record is judged to meet a preset acceptance condition, the question-answer record is stored into a final record set; and a delivery aggregation module, configured to aggregate all question-answer records in the final record set to generate a final data set.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements any of the plan-driven question-answer data synthesis methods described above.

The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the plan-driven question-answer data synthesis methods described above.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the plan-driven question-answer data synthesis methods described above.

In summary, one or more technical solut