CN-121980134-A - Structured data cleaning and predicting system based on MCP and large model co-evolution
Abstract
The invention relates to the technical field of data processing and artificial intelligence, and in particular to a structured data cleaning and prediction system based on co-evolution of MCP and a large model. It aims to remove the data processing bottleneck caused by the separation of machine learning, large models and data protocols. The system comprises a data access and protocol adaptation module, a machine learning cleaning engine, a large model collaborative prediction module, a co-evolution control center, and a data output and feedback module. Through a full-stack co-evolution mechanism spanning data access, cleaning, prediction and protocol feedback, the system can adapt to changes in data distribution and shifts in business targets. It continuously improves data processing quality and prediction accuracy while maintaining overall operating efficiency and long-term reliability, thereby providing a systematic technical solution for the intelligent governance and value mining of complex structured data.
Inventors
- WANG QIUJU
- WANG CONGCONG
- HAN GUOQUAN
- CHU YI
- WANG TENGTENG
- GUO XIAOYAN
Assignees
- 太极计算机股份有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-09
Claims (10)
- 1. A structured data cleaning and prediction system based on co-evolution of MCP and a large model, comprising: a data access and protocol adaptation module, which receives original structured data streams from a plurality of heterogeneous data sources and performs standardized encapsulation and parsing of the data streams according to a preset MCP protocol specification to generate standardized data message packets; a machine learning cleaning engine, which receives the standardized data message packets output by the data access and protocol adaptation module and executes targeted data quality repair operations; a large model collaborative prediction module, which receives the data preliminarily cleaned by the machine learning cleaning engine and the original data context provided by the data access and protocol adaptation module, and executes deep semantic understanding and prediction tasks; a co-evolution control center, which implements deep collaboration and dynamic evolution between the machine learning cleaning engine and the large model collaborative prediction module; and a data output and feedback module, which receives and integrates the cleaned data from the machine learning cleaning engine and the prediction results and interpretations from the large model collaborative prediction module, generates a final high-quality data set and a prediction analysis report, collects application effect indexes of a downstream business system as final business effect feedback, and sends the final business effect feedback to an effect evaluator of the co-evolution control center.
- 2. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 1, wherein: the machine learning cleaning engine contains a plurality of parallel cleaning sub-modules, including a missing value filling sub-module, an outlier detection and correction sub-module, a consistency checking and repair sub-module, and a noise smoothing sub-module; the output of the large model collaborative prediction module comprises a prediction result, a semantic interpretation, and a correction suggestion vector directed at data quality problems; and the co-evolution control center comprises an evolution strategy maker, an effect evaluator and a parameter optimizer, wherein the evolution strategy maker continuously monitors changes in the statistical characteristics of the data streams and the attainment of business prediction targets, the effect evaluator receives the semantic interpretation and the correction suggestion vector from the large model collaborative prediction module and the final business effect feedback from downstream, and the parameter optimizer executes closed-loop optimization based on the output of the effect evaluator to generate model parameter adjustment instructions for the machine learning cleaning engine and prompt template adjustment instructions for the large model collaborative prediction module.
- 3. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 2, wherein the data access and protocol adaptation module performs the following operations: identifying the interface types and data formats of the different data sources, and converting unstructured or semi-structured data into a unified two-dimensional table form; checking and mapping the column names, data types and constraints of the two-dimensional table against the data schema description defined in the MCP protocol; and serializing and encapsulating the checked and mapped data together with its metadata according to the message format specified by the MCP protocol, generating the standardized data message packets, and distributing them to the machine learning cleaning engine and the large model collaborative prediction module.
- 4. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 2, wherein, in the machine learning cleaning engine: the missing value filling sub-module predicts the values at missing positions using a random-forest-based regression model; the outlier detection and correction sub-module uses an isolation forest algorithm to identify outliers in the data and corrects them using statistical characteristics of neighboring data; the consistency checking and repair sub-module detects and repairs contradictions between records according to predefined business rules and logical constraints; the noise smoothing sub-module uses a wavelet transform to filter out high-frequency random fluctuations in the data; and the model parameters and decision thresholds of each cleaning sub-module are dynamically configured and optimized by the parameter optimizer of the co-evolution control center.
- 5. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 4, wherein the parameter optimizer optimizes the machine learning cleaning engine as follows: parsing the correction suggestion vector output by the large model collaborative prediction module, and converting its natural language descriptions of missing-value patterns into a feature importance weight adjustment instruction and a subtree sampling ratio adjustment instruction, which guide the random forest model to pay more attention to the relevant features and to optimize the diversity of the forest in the next training iteration.
- 6. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 2, wherein the large model collaborative prediction module operates as follows: the structured tabular data, its column descriptions and its business type metadata are converted, by means of a prompt template, into serialized text input that the large model can understand; the large model performs two core tasks, data quality enhancement prediction and business target prediction, based on its pre-trained knowledge and semantic analysis of the input text; the data quality enhancement prediction identifies hidden errors, semantic inconsistencies or context-dependent missing values missed by the machine learning cleaning engine and generates the correction suggestion vector; and the business target prediction directly outputs the predicted value of a user-specified target from the cleaned data.
- 7. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 6, wherein: the prompt template consists of a fixed part and a variable part; the variable part comprises a statistical summary of the current batch of data, a summary of recent data distribution trends, and hints on the feature combinations that performed best in historical prediction tasks; and the parameter optimizer uses a reinforcement learning strategy to dynamically generate and replace the content of the variable part according to the business prediction accuracy fed back by the effect evaluator.
- 8. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 2, wherein: the effect evaluator uses a multi-objective evaluation function that considers both the degree of data quality improvement and the business prediction accuracy; the degree of data quality improvement is calculated by comparing the data before and after cleaning on preset completeness, consistency and accuracy indexes; the business prediction accuracy is calculated from the key performance indicator attainment rate fed back by the downstream business system; and the parameter optimizer performs a joint parameter search whose optimization objective is to maximize the weighted sum of the multi-objective evaluation function.
- 9. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 1, wherein: the MCP protocol specification is adaptively extended to define a dedicated message field carrying evolution metadata; the dedicated message field encapsulates the model parameter adjustment summaries, data distribution drift alerts and optimization suggestions generated by the co-evolution control center; and the data access and protocol adaptation module reads the evolution metadata field when parsing data and fine-tunes its data mapping rules and check thresholds accordingly.
- 10. The structured data cleaning and prediction system based on co-evolution of MCP and a large model according to claim 2, wherein: the system adopts an architecture combining an asynchronous pipeline with micro-batch processing; the data access and protocol adaptation module, the machine learning cleaning engine and the large model collaborative prediction module process the data streams asynchronously in a pipelined manner; the co-evolution control center operates in micro-batch mode, in which, at each window of a preset number of records or a preset duration, it collects all intermediate results and feedback signals within the micro-batch, performs evolution evaluation and parameter optimization centrally, and synchronizes the optimized parameters to each module; and the co-evolution control center also encapsulates the key parameter adjustment records and effect metrics generated during evolution according to the extended MCP metadata specification and feeds them back to the data access and protocol adaptation module to optimize the parsing and preprocessing strategies for subsequent data packets.
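The multi-objective evaluation and joint parameter search recited in claim 8 can be illustrated with a minimal, non-limiting Python sketch. The claims do not fix any formula, so several things here are assumptions made only for illustration: the three quality indexes are taken to lie in [0, 1] and are averaged with equal weight, the two objectives are weighted 0.5 each, an exhaustive grid search stands in for whatever search the parameter optimizer actually uses, and the names `QualityIndexes`, `evaluation_score`, `joint_parameter_search`, `outlier_threshold` and `prompt_variant` are hypothetical. The `toy_pipeline` function returns synthetic numbers and does not represent measured results.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class QualityIndexes:
    """Preset data quality indexes, each assumed to lie in [0, 1]."""
    completeness: float
    consistency: float
    accuracy: float


def quality_improvement(before: QualityIndexes, after: QualityIndexes) -> float:
    """Mean change of the three indexes after cleaning (equal weighting is an assumption)."""
    deltas = (
        after.completeness - before.completeness,
        after.consistency - before.consistency,
        after.accuracy - before.accuracy,
    )
    return sum(deltas) / len(deltas)


def evaluation_score(before: QualityIndexes, after: QualityIndexes,
                     kpi_attainment_rate: float,
                     w_quality: float = 0.5, w_prediction: float = 0.5) -> float:
    """Weighted sum of data quality improvement and business prediction accuracy."""
    return (w_quality * quality_improvement(before, after)
            + w_prediction * kpi_attainment_rate)


# One pipeline run maps a parameter setting to
# (indexes before cleaning, indexes after cleaning, KPI attainment rate).
PipelineRun = Callable[[Dict[str, float]], Tuple[QualityIndexes, QualityIndexes, float]]


def joint_parameter_search(grid: Dict[str, Sequence[float]],
                           run_pipeline: PipelineRun) -> Tuple[Dict[str, float], float]:
    """Try every combination in the grid and keep the one with the highest score."""
    best_params: Optional[Dict[str, float]] = None
    best_score = float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        before, after, kpi = run_pipeline(params)
        score = evaluation_score(before, after, kpi)
        if score > best_score:
            best_params, best_score = params, score
    if best_params is None:
        raise ValueError("parameter grid is empty")
    return best_params, best_score


def toy_pipeline(params: Dict[str, float]) -> Tuple[QualityIndexes, QualityIndexes, float]:
    """Synthetic stand-in for a real cleaning-plus-prediction run."""
    before = QualityIndexes(0.70, 0.80, 0.75)
    # Pretend the quality gain peaks when the cleaning threshold equals 0.1.
    gain = 0.2 - abs(params["outlier_threshold"] - 0.1)
    after = QualityIndexes(before.completeness + gain,
                           before.consistency + gain,
                           before.accuracy + gain)
    # Pretend prompt variant 1 yields the best downstream KPI attainment.
    kpi = {0: 0.70, 1: 0.90, 2: 0.40}[params["prompt_variant"]]
    return before, after, kpi


grid = {"outlier_threshold": [0.05, 0.1, 0.2], "prompt_variant": [0, 1, 2]}
best_params, best_score = joint_parameter_search(grid, toy_pipeline)
print(best_params, round(best_score, 3))
```

In this sketch one cleaning-engine parameter and one prompt-template choice are searched together, which mirrors the joint nature of the search in claim 8. A real optimizer would replace `toy_pipeline` with an actual micro-batch run and would likely use a more economical search than a full grid.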
Description
Structured data cleaning and predicting system based on MCP and large model co-evolution
Technical Field
The invention belongs to the technical field of data processing and artificial intelligence, and particularly relates to a structured data cleaning and prediction system based on co-evolution of MCP and a large model.
Background
In the age of big data and artificial intelligence, data-driven decision-making has become central to many industries. Structured data is a main carrier of information, and its quality directly determines the accuracy of subsequent analysis and prediction. Structured data cleaning and prediction technology aims to identify and correct errors, fill in missing values and eliminate inconsistencies in massive, multi-source, heterogeneous data, thereby providing a reliable data foundation for accurate modeling and intelligent decision-making. Combining machine learning, large models and a data interaction protocol for cooperative processing is an important technical direction for improving structured data governance and value mining. This direction seeks to integrate the targeted strength of machine learning in pattern recognition, the semantic understanding and generalization ability of large models, and the ability of standardized protocols to ensure consistent and interoperable data flow, in order to build a more intelligent and more efficient data processing system.
The prior art faces significant challenges in processing structured data. Traditional methods based on rules or simple machine learning models are inflexible, have difficulty capturing complex dynamic patterns and semantic context in data, and clean noisy, anomalous and highly incomplete data poorly. Relying solely on a large model creates a gap of its own: the natural language processing paradigm does not match the structured nature of tabular data, the computational cost is high, real-time requirements are hard to meet, and the prediction process lacks interpretability. Although some schemes attempt to combine machine learning, large models and the MCP protocol, the three are usually only loosely stacked and lack a deep co-evolution mechanism. The output of the machine learning module does not fully take into account the interaction specification of MCP or the input characteristics of the large model, the large model cannot effectively feed its semantic insights back to optimize the upstream cleaning strategy, and MCP serves only as a transmission channel that does not take part in the intelligent processing loop. This disconnect leaves the overall system unable to adapt to changes in data distribution and creates bottlenecks in the accuracy, efficiency and reliability of cleaning and prediction. Therefore, how to achieve deep collaboration and dynamic optimization among machine learning, a large model and the MCP protocol, so as to improve the intelligent processing of structured data, is a technical problem to be solved.
Disclosure of Invention
The invention aims to provide a structured data cleaning and prediction system based on co-evolution of MCP and a large model, which solves the problems in the prior art that machine learning, the large model and the data interaction protocol are disconnected from one another, that deep collaboration and dynamic optimization cannot be achieved, and that structured data processing is bottlenecked in accuracy, efficiency and reliability.
The technical solution of the invention is a structured data cleaning and prediction system based on co-evolution of MCP and a large model, comprising: a data access and protocol adaptation module, which receives original structured data streams from a plurality of heterogeneous data sources and performs standardized encapsulation and parsing of the data streams according to a preset MCP protocol specification to generate standardized data message packets; a machine learning cleaning engine, which receives the standardized data message packets output by the data access and protocol adaptation module and executes targeted data quality repair operations; a large model collaborative prediction module, which receives the data preliminarily cleaned by the machine learning cleaning engine and the original data context provided by the data access and protocol adaptation module, and executes deep semantic understanding and prediction tasks; a co-evolution control center, which implements deep collaboration and dynamic evolution between the machine learning cleaning engine and the large model collaborative prediction module; and a data output and feedback module, which receives and integrates the cleaned data from the machine learning cleaning engine and the prediction results and interpretations from the large model collaborative predict