CN-121997941-A - Task processing method, device, equipment and medium based on multi-source data synthesis
Abstract
The invention relates to the technical field of artificial intelligence and provides a task processing method, device, equipment and medium based on multi-source data synthesis. A unified configuration file is acquired and a multi-source data synthesis task is generated from it, which ensures controllability and reproducibility. A local corpus synthesis task is executed based on an evidence constraint synthesis mechanism to obtain a local data set, an open data synthesis task is executed based on an open data set detection screening mechanism to obtain an open data set, and a distillation data synthesis task is executed based on a distillation data synthesis mechanism to obtain a distillation data set, thereby achieving high-quality data synthesis from multiple data sources. The local data set, the open data set and the distillation data set are combined to train a model, and the trained model is used to execute a target task, so that the synthesized high-quality multi-source data set helps improve the task execution effect.
Inventors
- XU CHENGJIN
- JIANG XUHUI
- ZHOU HAO
Assignees
- 数创弧光(深圳)科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260202
Claims (10)
- 1. A task processing method based on multi-source data synthesis, characterized by comprising: acquiring a unified configuration file, and generating a multi-source data synthesis task according to the configuration file, wherein the multi-source data synthesis task comprises a local corpus synthesis task, an open data synthesis task and a distillation data synthesis task; executing the local corpus synthesis task based on an evidence constraint synthesis mechanism to obtain a local data set, executing the open data synthesis task based on an open data set detection screening mechanism to obtain an open data set, and executing the distillation data synthesis task based on a distillation data synthesis mechanism to obtain a distillation data set; combining the local data set, the open data set and the distillation data set to obtain a training sample set; training an initial model by using the training sample set to obtain a target model; acquiring data to be processed in response to an execution instruction of a target task; and processing the data to be processed by using the target model to obtain a target task processing result.
- 2. The task processing method based on multi-source data synthesis according to claim 1, wherein the executing the local corpus synthesis task based on the evidence constraint synthesis mechanism to obtain a local data set comprises: traversing a local document directory to obtain a document to be processed; performing structured parsing on the document to be processed to obtain a parsed text; dividing the parsed text into a plurality of corpus units, and writing the corpus units into a local corpus in the form of a structured file; acquiring a task instruction, an optional example and a domain boundary from the configuration file, and generating a retrieval intention according to the task instruction, the optional example and the domain boundary; executing recall in the local corpus according to the retrieval intention to obtain a reference fragment set; acquiring an input-output structure constraint, an answer marking rule and an example mode from the configuration file, and constructing a joint prompt according to the task instruction, the input-output structure constraint, the answer marking rule, the example mode and the reference fragment set; calling a generation model, and performing data synthesis based on the joint prompt by using the generation model; and acquiring output data of the generation model to construct the local data set.
- 3. The task processing method based on multi-source data synthesis according to claim 2, wherein the performing data synthesis based on the joint prompt by using the generation model comprises: splitting the joint prompt into a plurality of batches, and performing data synthesis for each batch in parallel using multithreading; and recording intermediate results and progress information while data synthesis is performed, wherein the intermediate results and the progress information are used to support resuming from a breakpoint and repeated execution of data synthesis.
- 4. The task processing method based on multi-source data synthesis according to claim 2, wherein the executing the open data synthesis task based on the open data set detection screening mechanism to obtain an open data set comprises: generating a search keyword set according to the task instruction and the domain boundary; searching with the search keyword set through a designated open data platform interface to obtain a candidate data set comprising a plurality of subsets; acquiring available configuration and data division information from the candidate data set; extracting a preset number of sample rows from the candidate data set as detection samples according to the available configuration and the data division information; identifying a field set of each detection sample, and determining an input field and an output field of each detection sample through field mapping; scoring each detection sample for consistency and quality according to the task instruction and the input field and the output field of the detection sample, to obtain a quality quantization value of each detection sample; taking each detection sample whose quality quantization value is higher than a preset threshold as a candidate sample, and taking the subset corresponding to each candidate sample as a candidate subset according to the data division information; configuring an extraction quota for each candidate subset; extracting samples from the corresponding candidate subset according to the extraction quota; and performing format conversion on the extracted samples according to the input-output structure constraint, and constructing the open data set from the converted data.
- 5. The task processing method based on multi-source data synthesis according to claim 4, wherein the extracting a preset number of sample rows from the candidate data set as detection samples according to the available configuration and the data division information comprises: screening available data from the candidate data set according to the available configuration; and sampling sample rows from each subset corresponding to the available data according to the data division information to serve as the detection samples; wherein the sum of the numbers of samples extracted from the subsets is equal to the preset number.
- 6. The task processing method based on multi-source data synthesis according to claim 2, wherein the executing the distillation data synthesis task based on the distillation data synthesis mechanism to obtain a distillation data set comprises: generating a prompt word according to the task instruction, the input-output structure constraint and the optional example; invoking a strong teacher model to generate a high-quality sample generation mode based on the prompt word; acquiring format constraints and diversity constraints, and constructing a distillation generation prompt according to the format constraints, the diversity constraints and the high-quality sample generation mode; inputting the distillation generation prompt to the strong teacher model in batches using multithreading to obtain structured samples; and performing structured parsing, format verification and consistency stabilization on the structured samples, and constructing the distillation data set from the processed data.
- 7. The task processing method based on multi-source data synthesis according to claim 1, wherein the method further comprises: acquiring modal data, acquiring context information corresponding to the modal data, and establishing a mapping relation between the modal data and the context information; calling a multi-modal generation model to generate structured target samples based on the mapping relation, and marking the modality field reference of each target sample; and adding the marked target samples to the training sample set.
- 8. A task processing device based on multi-source data synthesis, characterized by comprising: a generation unit, configured to acquire a unified configuration file and generate a multi-source data synthesis task according to the configuration file, wherein the multi-source data synthesis task comprises a local corpus synthesis task, an open data synthesis task and a distillation data synthesis task; an execution unit, configured to execute the local corpus synthesis task based on an evidence constraint synthesis mechanism to obtain a local data set, execute the open data synthesis task based on an open data set detection screening mechanism to obtain an open data set, and execute the distillation data synthesis task based on a distillation data synthesis mechanism to obtain a distillation data set; a combining unit, configured to combine the local data set, the open data set and the distillation data set to obtain a training sample set; a training unit, configured to train an initial model by using the training sample set to obtain a target model; an acquisition unit, configured to acquire data to be processed in response to an execution instruction of a target task; and a processing unit, configured to process the data to be processed by using the target model to obtain a target task processing result.
- 9. A computer device, characterized in that the computer device comprises: a memory storing at least one instruction; and a processor executing the instruction stored in the memory to implement the task processing method based on multi-source data synthesis according to any one of claims 1 to 7.
- 10. A computer-readable storage medium having stored therein at least one instruction for execution by a processor in a computer device to implement the multi-source data synthesis based task processing method of any one of claims 1 to 7.
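By way of illustration only, the detection sampling and threshold screening recited in claims 4 and 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the round-robin allocation across subsets, and the toy scoring function are all hypothetical. The claims only require that the per-subset sample counts sum to the preset number and that a subset is retained when one of its detection samples scores above the preset threshold; they do not prescribe how rows are distributed or how the quality quantization value is computed.

```python
import random
from typing import Callable, Dict, List


def allocate_probe_counts(subset_sizes: Dict[str, int], preset_number: int) -> Dict[str, int]:
    """Distribute preset_number detection rows over subsets (hypothetical round-robin).

    The returned counts always sum to preset_number and never exceed a subset's size.
    """
    if preset_number > sum(subset_sizes.values()):
        raise ValueError("preset_number exceeds the rows available in all subsets")
    names = sorted(subset_sizes)
    counts = {name: 0 for name in names}
    remaining = preset_number
    # Hand out one row at a time to every subset that still has unsampled rows.
    while remaining > 0:
        for name in names:
            if remaining == 0:
                break
            if counts[name] < subset_sizes[name]:
                counts[name] += 1
                remaining -= 1
    return counts


def draw_detection_samples(
    subsets: Dict[str, List[dict]], preset_number: int, seed: int = 0
) -> Dict[str, List[dict]]:
    """Sample rows from each subset; a fixed seed keeps the draw reproducible."""
    rng = random.Random(seed)
    counts = allocate_probe_counts({k: len(v) for k, v in subsets.items()}, preset_number)
    return {name: rng.sample(subsets[name], counts[name]) for name in sorted(subsets)}


def select_candidate_subsets(
    samples: Dict[str, List[dict]],
    score_fn: Callable[[dict], float],
    threshold: float,
) -> List[str]:
    """Keep each subset that has at least one detection sample scoring above threshold."""
    return [
        name
        for name, rows in samples.items()
        if any(score_fn(row) > threshold for row in rows)
    ]


def toy_score(row: dict) -> float:
    """Placeholder scorer (assumption): 1.0 if both mapped fields are non-empty, else 0.0."""
    return float(bool(row.get("input")) and bool(row.get("output")))


if __name__ == "__main__":
    demo = {
        "clean": [{"input": f"q{i}", "output": f"a{i}"} for i in range(5)],
        "noisy": [{"input": "", "output": ""} for _ in range(5)],
    }
    detection = draw_detection_samples(demo, preset_number=4)
    print(select_candidate_subsets(detection, toy_score, threshold=0.5))
```

In practice the scoring function would judge consistency between the task instruction and the mapped input and output fields; the placeholder above merely stands in for that step so that the sampling and screening flow can be exercised end to end.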
Description
Task processing method, device, equipment and medium based on multi-source data synthesis
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a task processing method, apparatus, device, and medium based on multi-source data synthesis.
Background
In recent years, large language models (LLMs) and multi-modal large models have made remarkable progress in tasks such as text generation, dialogue, and visual understanding, and have been explored in vertical domains such as medicine, finance, law, and education. To give a model specific domain knowledge and task capability, it is generally necessary to construct high-quality training data matched to the target task and then perform supervised fine-tuning, alignment training, or post-training. However, in many industrial scenarios, real annotated data is limited by acquisition cost, copyright and privacy compliance restrictions, insufficient coverage, update lag, and other factors, and can hardly meet the data scale and diversity that model iteration requires. To alleviate the shortage of real data and the problem of long-tail coverage, synthetic data techniques have been widely studied and applied. However, existing data synthesis schemes still have the following disadvantages:
(1) Multi-source data construction pipelines are fragmented and lack a unified workflow abstraction and a unified sample specification.
Existing schemes typically use different scripts or tool chains for local corpora, open data sets, and teacher-model-generated data. Data structures, field naming, and metadata recording are not unified, so synthesized data is hard to reuse and manage within one framework, and downstream quality control, screening, export, and training integration must be adapted repeatedly. Engineering cost is therefore high, and inconsistent configuration is easily introduced.
(2) In local-corpus-driven synthesis, context or evidence is not fully utilized or the constraint mechanism is weak, so factual inconsistency and sample homogenization easily occur. Moreover, when diversified sampling, de-duplication, and coverage control of evidence fragments are absent, generated samples tend to concentrate on a small number of high-frequency patterns or expressions, so that samples are too similar, repetition is high, and diversity is insufficient, which harms training coverage and generalization.
(3) Data sets on open data platforms are highly heterogeneous, field mapping and task matching depend heavily on manual work, and quality screening is costly. Public data sets differ greatly in field naming, structural organization, sample format, and label system, and many do not naturally conform to the input or output definitions of the target task. Existing construction approaches often require engineers to manually interpret a data set's structured metadata definitions and write mapping and cleaning logic, which is inefficient and hard to scale. Data that does not match the task or is highly noisy is also easily introduced, degrading the final training effect.
(4) Distillation-style synthesis with a teacher model suffers from unstable output and format drift, and lacks reproducible generation constraints and consistency checks. Data generated by a teacher model is usually random: output formats may be inconsistent, answer marks may be inconsistent, and reasoning processes may be redundant or missing. The quality of the synthesized data therefore fluctuates widely, making it difficult to stably support subsequent training and comparison experiments.
(5) Multi-modal data synthesis and training data organization lack a universal adaptation interface, so extending to a new modality is costly. Existing schemes are usually customized for a specific modality (such as images), so the processing logic is hard to reuse for other modalities such as audio. The way modal data is bound to its context information (subtitles, transcripts, and metadata) is also not unified, so a large part of the pipeline must be rewritten when extending to a task in a new modality, which harms engineering iteration efficiency and system extensibility.
(6) Controllability of data synthesis across different domains or tasks is insufficient, and it is difficult to stably generate data that follows the instructed domain knowledge boundaries and data forms. Existing synthetic data schemes are often customized for a single domain or a single task template, so that when switching to a new domain (