CN-122019645-A - Data processing method and system based on model learning

Abstract

The invention discloses a data processing method and system based on model learning, and relates to the technical field of data processing. The method comprises: accessing multi-source heterogeneous raw data through a distributed framework, and performing format recognition and preprocessing with a deep learning model to obtain standardized data; inputting the data into a generative AI core layer, where pre-trained and fine-tuned models perform metadata management, quality detection and repair, standard fusion, and security protection in parallel; storing the resulting high-quality data in an integrated lakehouse architecture and providing data services; and collecting event data in real time and performing closed-loop optimization and incremental model training based on a rule engine and a reinforcement learning algorithm. By combining generative AI with reinforcement learning, the invention makes the whole data governance flow intelligent and self-adaptive, and addresses the low degree of automation, the difficulty of identifying hidden anomalies, and the lack of a closed-loop optimization mechanism in the prior art.

Inventors

LI YINGYING
Liu Fukuo
Zu Xia
ZHOU WENWEN
XIONG NENG
LIU JUNLIANG

Assignees

重庆数字资源集团有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-22

Claims (10)

1. A data processing method based on model learning, the method comprising: accessing multi-source heterogeneous raw data through a distributed data acquisition framework, and performing format recognition and preprocessing on the raw data by using a preset deep learning classification model and a data cleaning algorithm library to obtain standardized preprocessed data; inputting the preprocessed data into a generative AI core layer, and performing intelligent metadata management, intelligent data quality detection and repair, automated data standard fusion, and data security and privacy protection in parallel by using a plurality of pre-trained and fine-tuned generative artificial intelligence models to obtain governed high-quality data; storing the governed high-quality data in an integrated lakehouse storage architecture, constructing a full-text retrieval index and data version snapshots, and converting the high-quality data into data services through an application interface layer for provision to users; and collecting, in real time, service call data of the application interface layer and event data arising during governance, performing closed-loop optimization analysis based on a rule engine and a reinforcement learning algorithm, generating an optimization strategy, and performing incremental training on the models in the generative AI core layer to obtain an updated data processing model.
2. The data processing method based on model learning according to claim 1, wherein performing format recognition and preprocessing on the raw data by using a preset deep learning classification model and a data cleaning algorithm library to obtain standardized preprocessed data comprises: adopting a deep learning classification model that combines a convolutional neural network (CNN) and a long short-term memory (LSTM) network to extract and fuse features of the file header and content sequence of the raw data, and identifying the data format of the raw data; invoking, according to the identified data format, a corresponding parsing strategy to convert the raw data into a unified data stream; calculating fingerprint features of the data stream by using the SimHash algorithm and performing de-duplication, and performing preliminary outlier screening and filtering on numerical data based on the 3-sigma rule to obtain the standardized preprocessed data; and pushing the preprocessed data to the generative AI core layer through a message queue.
3. The data processing method based on model learning according to claim 1, wherein performing the intelligent metadata management in parallel by using a plurality of pre-trained and fine-tuned generative artificial intelligence models comprises: performing syntax tree analysis on scripts in the preprocessed data by using an ANTLR4 parser, and extracting physical metadata comprising table structures, field names, and data types; invoking a pre-trained natural language processing model, ERNIE 3.0, and performing semantic mapping analysis on the extracted physical metadata in combination with a preset enterprise business dictionary to generate standardized metadata tags containing business meanings and data sources; collecting read-write operations and interface call relations from data processing logs, constructing a directed graph of data nodes and processing nodes by using a graph neural network (GNN), and generating a data lineage graph; and when a change in a data link is detected, refreshing the data lineage graph within a preset time window by using an incremental update algorithm.
4. The data processing method based on model learning according to claim 1, wherein performing the intelligent data quality detection and repair comprises: constructing a dynamic reference model based on a generative adversarial network (GAN), in which a generator with a DCGAN architecture generates simulated normal data from historical normal data and a discriminator with a CNN architecture distinguishes real normal data from the simulated normal data until the loss function converges, thereby determining a normal data reference; performing dimensionality reduction on real-time incoming data by using a principal component analysis (PCA) algorithm, extracting key features, and calculating the cosine similarity between the key features and the normal data reference; when the cosine similarity is lower than a preset threshold, determining that the data is abnormal and triggering an early warning, and invoking a domain knowledge graph to perform contextual correlation analysis on the abnormal data so as to locate the source of the anomaly; and generating repair rules based on entity relations in the domain knowledge graph, automatically completing or correcting simple anomalies, and generating a visual report containing repair suggestions for complex anomalies.
5. The data processing method based on model learning according to claim 1, wherein performing the automated data standard fusion comprises: inputting enterprise industry attributes, business scope, and historical data samples into a pre-trained BERT model for semantic encoding by using a few-shot learning framework, and generating a draft data standard containing data formats, coding rules, and value ranges; converting the existing data standards of different departments into semantic vectors by using a Sentence-BERT model, calculating the semantic similarity of cross-department standards, identifying points of divergence between the standards, and generating a difference comparison matrix; searching for a fusion scheme for the identified points of divergence based on a genetic algorithm, iteratively optimizing with the highest data consistency and the smallest business impact as objective functions, and outputting an optimal data standard fusion strategy; and performing standardized mapping and conversion on the preprocessed data according to the optimal data standard fusion strategy.
6. The data processing method based on model learning according to claim 1, wherein performing the data security and privacy protection comprises: performing semantic-level analysis on the data by using a RoBERTa-BiLSTM-CRF deep learning model, identifying encrypted data and semantically implied sensitive information in the data, determining the identified information to be sensitive data, and determining the sensitivity level of the sensitive data; constructing a desensitization strategy decision tree according to the usage scenario, user role, and permission level of the data; based on the desensitization strategy decision tree, automatically selecting partial masking, field replacement, data generalization, or a differential privacy algorithm to dynamically desensitize the sensitive data; and binding the desensitized data to user attributes based on an attribute-based access control (ABAC) model to implement dynamic authorization and access auditing.
7. The data processing method based on model learning according to claim 1, wherein converting the high-quality data into data services through the application interface layer for provision to users comprises: deploying a natural language processing gateway that integrates a ChatGLM model, and receiving a natural language query request from a user; converting the natural language query request into a standardized structured query language (SQL) statement or an application programming interface (API) call instruction, and issuing the SQL statement or API call instruction to the integrated lakehouse storage architecture for data retrieval; obtaining the retrieval result, generating a report or a real-time dashboard by using a preset visualization component library, and returning it to the user; and collecting call frequency, response time, and error rate metrics of the interface in real time by using a service monitoring module, packaging the metrics as service effect data, and pushing the service effect data to the closed-loop optimization analysis flow.
8. The data processing method based on model learning according to claim 1, wherein performing closed-loop optimization analysis based on a rule engine and a reinforcement learning algorithm, generating an optimization strategy, and performing incremental training on the models in the generative AI core layer comprises: automatically handling routine abnormal events by using the rule engine; when an unexpected event for which no plan exists is encountered, activating an intelligent plan generation module and retrieving similar cases from a historical event database by using a case-based reasoning (CBR) algorithm; in combination with a standard plan library, iteratively optimizing the retrieved cases by using a reinforcement learning (RL) algorithm to generate an optimal handling plan, and allocating tasks through service mesh technology; and after the event has been handled, generating a post-incident review report, extracting feature data, feeding the feature data back into the metadata management, quality detection, standard management, and security protection models by using an incremental training algorithm, and updating the model parameters and the rule base.
9. The data processing method based on model learning according to claim 8, wherein the optimization process of the reinforcement learning (RL) algorithm comprises: constructing a reinforcement learning environment, defining the state space as the current event features and the system resource state, and defining the action space as a set of handling measures; defining a reward function constructed based on handling time, resource consumption, and degree of service recovery; and performing trial-and-error learning in the reinforcement learning environment by using a reinforcement learning agent, updating a policy network by maximizing the cumulative reward until the policy converges, and outputting an optimal handling sequence as the optimal handling plan.
10. A data processing system based on model learning, the system comprising: a data acquisition and preprocessing module, configured to access multi-source heterogeneous raw data through a distributed data acquisition framework, and to perform format recognition and preprocessing on the raw data by using a preset deep learning classification model and a data cleaning algorithm library to obtain standardized preprocessed data; a generative AI core processing module, configured to input the preprocessed data into a generative AI core layer, and to perform intelligent metadata management, intelligent data quality detection and repair, automated data standard fusion, and data security and privacy protection in parallel by using a plurality of pre-trained and fine-tuned generative artificial intelligence models to obtain governed high-quality data; a storage and service management module, configured to store the governed high-quality data in an integrated lakehouse storage architecture, to construct a full-text retrieval index and data version snapshots, and to convert the high-quality data into data services through an application interface layer for provision to users; and a closed-loop optimization module, configured to collect, in real time, service call data of the application interface layer and event data arising during governance, to perform closed-loop optimization analysis based on a rule engine and a reinforcement learning algorithm, to generate an optimization strategy, and to perform incremental training on the models in the generative AI core layer to obtain an updated data processing model.
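
By way of illustration only, the de-duplication and preliminary outlier screening recited in claim 2 might be sketched as follows. This is a minimal sketch under stated assumptions and does not form part of the claims: the function names, the whitespace tokenization, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the Hamming-distance threshold of 3 are all hypothetical choices that the source does not specify.

```python
import hashlib
import statistics


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from whitespace-separated tokens.

    Assumption: tokens are hashed with MD5 (first 8 bytes); the source
    does not name a tokenizer or a per-token hash function.
    """
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(records, max_distance: int = 3):
    """Keep a record only if no kept record has a fingerprint within
    max_distance bits (hypothetical threshold)."""
    kept, fingerprints = [], []
    for record in records:
        fp = simhash(record)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(record)
            fingerprints.append(fp)
    return kept


def three_sigma_filter(values):
    """Drop values farther than three standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= 3 * std]
```

For example, `deduplicate(["order 42 shipped", "order 42 shipped"])` keeps a single record, and `three_sigma_filter([10.0] * 20 + [1000.0])` removes the value 1000.0. Note that with very small samples a single extreme value cannot exceed three population standard deviations, so the 3-sigma screen is only meaningful on reasonably sized batches.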

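The threshold test recited in claim 4 can likewise be illustrated with a short sketch. It assumes that the key features and the normal data reference are already available as equal-length numeric vectors; the GAN-based reference model and the PCA step are omitted, and the function names and the threshold value of 0.8 are hypothetical, since the source only refers to a preset threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def is_anomalous(features, reference, threshold: float = 0.8):
    """Flag data whose similarity to the normal reference is below the
    threshold (0.8 is an illustrative value only)."""
    return cosine_similarity(features, reference) < threshold
```

A feature vector pointing in the same direction as the reference is treated as normal regardless of its magnitude, while an orthogonal vector is flagged as abnormal and would trigger the early warning described in the claim.
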
Description

Data processing method and system based on model learning

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a data processing method and system based on model learning.

Background

As enterprise digital transformation deepens, data has become a core asset driving business innovation and decision optimization. Faced with exponentially growing multi-source heterogeneous data, enterprises find it a major challenge to govern that data efficiently and accurately. The traditional data governance flow generally covers data acquisition, cleaning, storage, quality detection, and security control, and aims to ensure the consistency, accuracy, and security of data.

In the prior art, data governance relies mainly on manually defined rules and traditional extract, transform, and load (ETL) tools. In the data acquisition and preprocessing stage, because data sources are varied and formats are complex, engineers often have to write parsing scripts by hand for each data source, which is time-consuming and labor-intensive and offers limited capacity for handling unstructured data. In metadata management and quality control, existing solutions depend heavily on manual labeling and on rule engines based on fixed thresholds. Operators must manually trace data lineage and define quality check rules (such as null checks and range constraints). This approach copes poorly with complex business logic conflicts or implicit timing anomalies, so anomalies are discovered late, and the repair process typically requires manual analysis and has a long response cycle. In addition, data standards are often formulated in isolation within each business department, with no unified semantic alignment mechanism, which hinders cross-department data fusion.
In the field of data security, existing sensitive data identification techniques mostly rely on regular expressions or keyword matching. They can identify only explicit features such as identity card numbers and telephone numbers, and have a low recognition rate for semantically implied sensitive information (such as text containing salary descriptions). Desensitization rules, once set, are difficult to adjust dynamically according to the usage scenario, so data tends to be either over-desensitized or insufficiently protected. Finally, traditional data governance architectures are mostly unidirectional linear flows and lack an automatic optimization mechanism based on business feedback. When the business changes, the governance strategy must be readjusted manually, maintenance costs are high, and the continued adaptability of the system is difficult to ensure.

In summary, existing data governance methods have a low degree of automation, a weak ability to identify hidden problems, and no closed-loop evolution mechanism, and therefore struggle to meet the requirements of large-scale, highly complex data governance. A data processing scheme is therefore needed that integrates deep learning and generative technology and makes the whole flow intelligent and self-adaptive.

Disclosure of Invention

The invention provides a data processing method and system based on model learning, which solve the problems of a low degree of automation, difficulty in identifying implicit anomalies, and the lack of a closed-loop evolution mechanism in the prior art.
To achieve the above purpose, the embodiments of the present invention adopt the following technical scheme. In a first aspect, an embodiment of the present invention provides a data processing method based on model learning, the method comprising: accessing multi-source heterogeneous raw data through a distributed data acquisition framework, and performing format recognition and preprocessing on the raw data by using a preset deep learning classification model and a data cleaning algorithm library to obtain standardized preprocessed data; inputting the preprocessed data into a generative AI core layer, and performing intelligent metadata management, intelligent data quality detection and repair, automated data standard fusion, and data security and privacy protection in parallel by using a plurality of pre-trained and fine-tuned generative artificial intelligence models to obtain governed high-quality data; storing the governed high-quality data in an integrated lakehouse storage architecture, constructing a full-text retrieval index and data version snapshots, and converting the high-quality data into data services through an application interface layer for provision to users; service call data of the application interface layer and event data arising during governance are collected in real time, and closed-loop optimization analysis is carried out based on a rule engine a