
CN-122020154-A - Model training-oriented data set construction method and device

CN 122020154 A

Abstract

The invention discloses a data set construction method and device for model training, in the technical field of data processing. The method comprises: constructing a data carding agent based on a large language model; receiving multi-source raw data and target-domain business requirements through the agent; screening out a high-relevance effective data set using deep semantic analysis and vectorization matching; feeding that data set into an automated core processing pipeline that sequentially performs multi-modal data standardization, high-quality question-answer pair construction, and safety and quality enhancement, yielding a standard data set for training a large model in the target domain; training and evaluating the model on this data set; and collecting the model's performance feedback on business tasks to build a closed-loop iterative optimization mechanism that traces defects back to the data production stage and dynamically adjusts screening criteria or processing rules until model performance meets preset requirements. The invention effectively addresses the problems of missing data semantics and difficult quality iteration.

Inventors

  • Yang Junhan
  • Zu Xia
  • Liu Fukuo
  • Sun Yue
  • Duan Linjia
  • Liao Guoyuan

Assignees

  • Chongqing Digital Resources Group Co., Ltd. (重庆数字资源集团有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-12-18

Claims (10)

  1. A method for constructing a data set for model training, the method comprising: constructing a data carding agent based on a large language model, receiving multi-source raw data and target-domain business requirements through the data carding agent, performing deep semantic analysis and vectorization matching on the multi-source raw data, and screening out a high-relevance effective data set; inputting the high-relevance effective data set into an automated core processing pipeline that sequentially performs multi-modal data standardization, high-quality question-answer pair construction, and safety and quality enhancement, and generating a standard target-domain data set for training a target-domain large model; training and evaluating the target-domain large model on the standard target-domain data set, collecting performance feedback data of the model on business tasks, and constructing a closed-loop iterative optimization mechanism; and tracing defects in the data production stage back from the performance feedback data, and dynamically adjusting the screening criteria of the data carding agent or the processing rules of the core processing pipeline until the performance of the target-domain large model meets preset requirements.
  2. The data set construction method for model training according to claim 1, wherein constructing a data carding agent based on a large language model, receiving multi-source raw data and target-domain business requirements through the data carding agent, performing deep semantic analysis and vectorization matching on the multi-source raw data, and screening out a high-relevance effective data set comprises: presetting the specific capability requirements of the target-domain business and taking them as a target semantic benchmark; extracting meta-information and content features from the multi-source raw data with the large language model, and mapping the target semantic benchmark and the content features into a requirement vector and a data vector, respectively, in a high-dimensional semantic space; computing the cosine similarity between the requirement vector and the data vector to obtain a quantitative relevance score; and judging whether the relevance score exceeds a preset decision threshold: if so, the corresponding data is judged to be high-relevance effective data and incorporated into the high-relevance effective data set; otherwise, the data is marked as low-relevance and isolated (a minimal sketch of this screening step follows the claims).
  3. The data set construction method for model training according to claim 1, wherein the multi-modal data standardization performed on the high-relevance effective data set in the automated core processing pipeline comprises: for unstructured document data, extracting text, table, and layout-structure information with a deep parsing tool and converting it into standardized text; for table-structured data in a business database, performing a business-logic-preserving conversion: parsing the field definitions of each data table and the association relations among tables, and, according to preset business rules, reconstructing the discrete field information into natural-language paragraphs that retain the logical dependencies among fields and the business context; and for semi-structured data from portal websites, parsing the web-page structure, organizing the information associations, and converting the data into a text corpus in a unified format.
  4. The data set construction method for model training according to claim 3, wherein the business-logic-preserving conversion further comprises: identifying key business entities and process nodes in a data table; analyzing the precondition and follow-up relations among the process nodes; and converting the process nodes and their transition relations into descriptive text containing causal logic, so that the business-chain semantics of the raw data are retained and the business-logic context is not lost during conversion to text (see the table-to-text sketch after the claims).
  5. The data set construction method for model training according to claim 3, wherein the high-quality question-answer pair construction comprises: designing a standardized question-answering framework covering single-choice, multiple-choice, fill-in-the-blank, and essay question types; based on the standardized text corpus produced by the multi-modal data standardization, having domain experts manually annotate initial question-answer pairs that fit actual business scenarios; and introducing a cross-checking mechanism in which domain experts from different groups verify and correct the accuracy and logic of the initial question-answer pairs to produce high-quality question-answer pair data.
  6. The data set construction method for model training according to claim 5, wherein the safety and quality enhancement comprises: detecting sensitive data in the standardized text corpus and the high-quality question-answer pairs with a sensitive-information recognition algorithm and desensitizing it; performing fine-grained data cleaning to remove invalid characters, advertisement links, and redundant duplicate data; and packaging and storing the cleaned data in the specific format required for training the target-domain large model (a sketch of the desensitization and cleaning pass follows the claims).
  7. The data set construction method for model training according to claim 1, wherein training and evaluating the target-domain large model on the standard target-domain data set, collecting performance feedback data of the model on business tasks, and constructing a closed-loop iterative optimization mechanism comprises: during model training or testing, monitoring the output of the target-domain large model on specific normative-document understanding or business-process reasoning tasks; analyzing the error types in the output, the error types including factual errors and logical reasoning errors; and feeding the error types and the corresponding error samples back to the data production stage to trigger the corresponding optimization strategy.
  8. The data set construction method for model training according to claim 7, wherein tracing defects in the data production stage back from the performance feedback data and dynamically adjusting the screening criteria of the data carding agent or the processing rules of the core processing pipeline comprises: when factual errors are detected to occur frequently, judging this to be a basic data-quality defect and generating a first optimization instruction to correct the data cleaning rules or standardization conversion rules in the core processing pipeline; and when logical reasoning errors occur frequently or the target-domain large model cannot handle subdivided scenarios, judging this to be missing data logic chains or missing scenario coverage and generating a second optimization instruction to adjust the annotation strategy for the high-quality question-answer pairs or to tune the decision threshold of the data carding agent (the feedback-to-instruction mapping is sketched after the claims).
  9. The data set construction method for model training according to claim 1, further comprising: during the closed-loop iterative optimization, recording the data set version and the corresponding model performance indicators of each round; and establishing a mapping between data versions and target-domain large model capability, and continuously updating a preset knowledge base of the data carding agent by comparing the influence of different data set versions on the model's performance (see the version-log sketch after the claims).
  10. A data set construction apparatus for model training, the apparatus comprising: a data carding agent module, configured to construct a data carding agent based on a large language model, receive multi-source raw data and target-domain business requirements through the data carding agent, perform deep semantic analysis and vectorization matching on the multi-source raw data, and screen out a high-relevance effective data set; a core pipeline processing module, configured to input the high-relevance effective data set into an automated core processing pipeline that sequentially performs multi-modal data standardization, high-quality question-answer pair construction, and safety and quality enhancement, and to generate a standard target-domain data set for training a target-domain large model; a closed-loop iterative optimization module, configured to train and evaluate the target-domain large model on the standard target-domain data set, collect performance feedback data of the model on business tasks, and construct a closed-loop iterative optimization mechanism; and a dynamic adjustment module, configured to trace defects in the data production stage back from the performance feedback data and dynamically adjust the screening criteria of the data carding agent or the processing rules of the core processing pipeline until the performance of the target-domain large model meets preset requirements.
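
The vectorization matching and threshold decision of claim 2 can be illustrated with a minimal sketch. The patent does not name an embedding model or a threshold value, so `embed` below is a hypothetical stand-in for the large-language-model encoder, and 0.75 is an arbitrary example threshold.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical encoder: in practice this would call the large
    language model (or an embedding model) to map text into the
    high-dimensional semantic space described in claim 2."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(768)
    return v / np.linalg.norm(v)

def screen(requirement: str, records: list[str], threshold: float = 0.75):
    """Split records into a high-relevance effective set and an isolated
    low-relevance set by cosine similarity against the requirement."""
    req_vec = embed(requirement)  # requirement vector
    effective, isolated = [], []
    for rec in records:
        score = float(req_vec @ embed(rec))  # cosine similarity (unit vectors)
        (effective if score >= threshold else isolated).append((rec, score))
    return effective, isolated
```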
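Claims 3 and 4 describe rewriting table rows and their inter-table relations as natural-language paragraphs that keep the causal business logic explicit. A minimal sketch, assuming an invented work-order schema; the field names, relation descriptions, and escalation rule are illustrative, not taken from the patent.

```python
# Business-logic-preserving conversion: a row plus its table relations
# is rewritten as a natural-language paragraph so that cross-table
# dependencies and process transitions survive the conversion to text.
def row_to_paragraph(row: dict, relations: dict) -> str:
    parts = [f"Work order {row['order_id']} was created by {row['creator']}"]
    # Follow foreign-key style relations so field associations do not
    # stay hidden in the table-structure definitions.
    for field, description in relations.items():
        if field in row:
            parts.append(f"{description} {row[field]}")
    # Encode a process-node transition as causal text (claim 4).
    if row.get("status") == "escalated":
        parts.append("because the response deadline was exceeded, "
                     "the order was escalated to the supervising department")
    return ", ".join(parts) + "."

print(row_to_paragraph(
    {"order_id": "WO-1024", "creator": "ops_desk", "status": "escalated",
     "department": "road maintenance"},
    {"department": "and is assigned to the department of"},
))
```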
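Claim 6's safety and quality enhancement can be sketched as a desensitization-plus-cleaning pass. The regular expressions below (a mainland-China mobile-number pattern and a generic URL pattern) are example stand-ins for the sensitive-information recognition algorithm the claim refers to.

```python
import re

PHONE = re.compile(r"\b1[3-9]\d{9}\b")  # example sensitive-data pattern
URL = re.compile(r"https?://\S+")        # example advertisement-link pattern

def desensitize_and_clean(corpus: list[str]) -> list[str]:
    seen, cleaned = set(), []
    for text in corpus:
        text = PHONE.sub("[PHONE]", text)          # desensitize
        text = URL.sub("", text)                   # strip ad links
        text = re.sub(r"\s+", " ", text).strip()   # drop invalid characters
        if text and text not in seen:              # remove duplicates
            seen.add(text)
            cleaned.append(text)
    return cleaned
```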
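Claims 7 and 8 map error-type statistics to optimization instructions. A minimal sketch, assuming a simple frequency threshold; the threshold value and instruction payloads are invented for illustration.

```python
from collections import Counter

def diagnose(error_samples: list[dict], min_count: int = 10) -> list[dict]:
    """Tally error samples by type and emit the first or second
    optimization instruction when a type crosses the threshold."""
    counts = Counter(s["error_type"] for s in error_samples)
    instructions = []
    if counts["factual"] >= min_count:
        # Frequent factual errors -> basic data-quality defect.
        instructions.append({"target": "core_pipeline",
                             "action": "revise_cleaning_or_conversion_rules"})
    if counts["logical_reasoning"] >= min_count:
        # Frequent reasoning errors -> missing logic chains or coverage.
        instructions.append({"target": "qa_annotation_and_agent",
                             "action": "adjust_labeling_or_decision_threshold"})
    return instructions
```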
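Claim 9's mapping between data set versions and model capability can be as simple as an append-only log; the metric names below are placeholders.

```python
# Per-round version log linking data set versions to model indicators.
version_log: list[dict] = []

def record_round(dataset_version: str, metrics: dict) -> None:
    version_log.append({"version": dataset_version, **metrics})

record_round("v1.3", {"doc_understanding": 0.81, "process_reasoning": 0.74})
# Comparing successive entries shows which data changes helped; those
# comparison results feed the agent's preset knowledge base.
```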

Description

Model training-oriented data set construction method and device

Technical Field

The invention relates to the technical field of data processing, and in particular to a data set construction method and device for model training.

Background

With the deepening of digital transformation, vertical industries (target fields such as smart cities, industrial manufacturing, and financial risk control) have accumulated massive operational data. These data exhibit markedly multi-source, heterogeneous characteristics, including structured database records from business systems, continuous time-series data from Internet of Things devices, and large volumes of unstructured normative documents, operation manuals, images, and video.

Currently, data processing for such model-oriented training relies primarily on traditional "pipeline cleaning" or "data lake" aggregation architectures. The prior art generally uses Extract-Transform-Load (ETL) tools to aggregate scattered data at the physical layer according to preset rules and load it into a data warehouse after deduplication, regular-expression filtering, and simple format conversion. The core goal of this processing model is usually to serve Business Intelligence (BI) report statistics or traditional supervised learning tasks.

However, the prior art exposes significant limitations in the face of current cognitive-intelligence applications based on large language models. First, it focuses on the "physical stacking" of data rather than "semantic screening": in the aggregation stage it cannot analyze the deep semantic associations between data content and target-domain business requirements, so large amounts of low-value or irrelevant data are mixed into the training set, diluting the density of key knowledge. Second, the data produced by existing processes are often fragmented "records" lacking the coherence and causal logic required for large-model training; the field associations in a business database are often hidden in the table-structure definitions, and the business-logic context is easily lost during conversion to text, making it hard for a large model to learn how events evolve and prone to hallucination during reasoning. Finally, existing data processing flows are mostly unidirectional, open-loop "pipeline" jobs: once the cleaning rules and screening criteria are set, they are difficult to adjust dynamically in response to the actual performance of the downstream model. When the target-domain large model performs poorly on a specific task, there is no automatic feedback mechanism to diagnose problems in the upstream data link, so operations staff must spend large amounts of time on manual investigation, which severely constrains the industrial deployment of large models.

Disclosure of Invention

The invention provides a data set construction method and device for model training, which solve the prior-art problems of coarse data screening, loss of business-logic semantics, and the lack of a closed-loop optimization mechanism.
To achieve the above purpose, embodiments of the present invention adopt the following technical scheme. In a first aspect, an embodiment of the present invention provides a method for constructing a data set for model training, the method including: constructing a data carding agent based on a large language model, receiving multi-source raw data and target-domain business requirements through the data carding agent, performing deep semantic analysis and vectorization matching on the multi-source raw data, and screening out a high-relevance effective data set; inputting the high-relevance effective data set into an automated core processing pipeline that sequentially performs multi-modal data standardization, high-quality question-answer pair construction, and safety and quality enhancement, and generating a standard target-domain data set for training a target-domain large model; training and evaluating the target-domain large model on the standard target-domain data set, collecting performance feedback data of the model on business tasks, and constructing a closed-loop iterative optimization mechanism; and tracing defects in the data production stage back from the performance feedback data, and dynamically adjusting the screening criteria of the data carding agent or the processing rules of the core processing pipeline until the performance of the target-domain large model meets preset requirements.

Preferably, a data carding agent based on a large language model is constructed; the data carding agent receives multi-source raw data and target-domain business requirements, performs deep semantic analysis and vectorization matching on the multi-source raw data, and screens out a high-relevance effective data set.
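
Read end to end, the method amounts to a four-stage closed loop: agent screening, core pipeline processing, training and evaluation, and feedback-driven adjustment. The sketch below shows only this control flow, with trivial stubs standing in for each stage; all names and numbers are illustrative, not the patent's implementation.

```python
# Runnable control-flow sketch of the closed loop; every stage function
# is a trivial stub for the corresponding component described above.

def screen_with_agent(sources, criteria):
    # Stand-in for the data carding agent's relevance screening.
    return [s for s in sources if len(s) > criteria["min_len"]]

def run_core_pipeline(records):
    # Stand-in for standardization, QA construction, and enhancement.
    return [r.strip() for r in records]

def train_and_evaluate(dataset):
    # Stand-in "model quality": more retained data scores higher.
    return min(1.0, 0.5 + 0.1 * len(dataset))

def build_dataset_loop(sources, target=0.9, max_rounds=5):
    criteria = {"min_len": 40}            # screening criteria to adjust
    dataset, score = [], 0.0
    for _ in range(max_rounds):
        dataset = run_core_pipeline(screen_with_agent(sources, criteria))
        score = train_and_evaluate(dataset)
        if score >= target:               # preset requirement met
            break
        criteria["min_len"] -= 10         # loosen screening (claim 8 spirit)
    return dataset, score
```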