CN-122019642-A - Multisource heterogeneous data acquisition system
Abstract
The invention discloses a multi-source heterogeneous data acquisition system. The system comprises a multi-source access subsystem, an intelligent preprocessing subsystem, an edge-cloud collaborative computing subsystem, a secure encryption subsystem and a data adaptation output subsystem. The multi-source access subsystem generates an initial multi-source heterogeneous data set; the intelligent preprocessing subsystem generates a preprocessed data set; the edge-cloud collaborative computing subsystem generates a global processing data set; the secure encryption subsystem verifies the access rights of users and encrypts and stores the global processing data set using a differential privacy method; and the data adaptation output subsystem acquires the target features of a user and, based on the user's access rights and the target features, retrieves the corresponding data in the global processing data set from the secure encryption subsystem using a homomorphic encryption protocol. The invention supports user-customized data output, adapts to the requirements of different application scenarios, achieves efficient, secure and flexible data acquisition and management, and provides high-quality data support for subsequent data fusion, analysis and application.
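The abstract states only that the global processing data set is stored "by adopting a differential privacy method" and does not name a mechanism. As a minimal sketch, assuming the commonly used Laplace mechanism on numeric values, noise could be added as follows. The function names, the sensitivity and the epsilon value are hypothetical and are not taken from the patent.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def add_dp_noise(values, sensitivity, epsilon, rng=None):
    """Return a noisy copy of `values`.

    Assumption: each value is treated as a separate numeric query and is
    perturbed with Laplace noise of scale sensitivity / epsilon.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical temperature readings from the global processing data set.
readings = [21.5, 22.0, 21.8]
noisy = add_dp_noise(readings, sensitivity=1.0, epsilon=0.5, rng=random.Random(0))
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy. How the sensitivity and the privacy budget would be chosen for a whole data set is not specified in the source.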
Inventors
- LIU XIAOYU
- MA GUOXUAN
- WANG ZHE
- ZHANG CHAOYING
- MA YONG
- WU YANG
Assignees
- 陕西电器研究所
Dates
- Publication Date
- 20260512
- Application Date
- 20251120
Claims (6)
- 1. A multi-source heterogeneous data acquisition system, comprising: a multi-source access subsystem for interfacing with different types of data sources, acquiring multi-source heterogeneous data using the communication protocol corresponding to each data source, and generating an initial multi-source heterogeneous data set; an intelligent preprocessing subsystem for preprocessing the data in the initial multi-source heterogeneous data set to generate a preprocessed data set; an edge-cloud collaborative computing subsystem comprising an edge computing node and a cloud computing node, wherein the edge computing node performs local feature extraction and analysis on data in the preprocessed data set whose real-time processing requirement exceeds a first threshold to generate an edge processing result; and a data adaptation output subsystem that acquires the target features of a user and, based on the user's access rights and the target features, retrieves the corresponding data in the global processing data set from the secure encryption subsystem using a homomorphic encryption protocol.
- 2. The system of claim 1, wherein the multi-source access subsystem comprises a protocol adaptation module, a device management module and a data receiving module; the protocol adaptation module is used for interfacing with different Internet of Things (IoT) devices, stores the MQTT, HTTP, OPC UA and TCP/IP protocols, and matches the protocol corresponding to the data of each IoT device through an automatic protocol identification algorithm; the device management module is used for identity registration management, state monitoring and firmware updating of the IoT devices, and also stores the identifiers and public keys of the IoT devices; and the data receiving module is used for collecting multi-source heterogeneous data through the protocols corresponding to the data of the IoT devices and generating the initial multi-source heterogeneous data set.
- 3. The system of claim 1, wherein the intelligent preprocessing subsystem comprises an anomaly detection module, a missing value processing module and a format standardization module; the anomaly detection module identifies anomalous data in the initial multi-source heterogeneous data set through visual analysis and statistical methods, identifies duplicate data through hash comparison, and deletes the anomalous data and the duplicate data; the missing value processing module determines the filling method for missing data based on the data type, filling with the mean or median, with the mode, or by linear interpolation; and the format standardization module converts the data into unified formats, converting text data into TXT format, image data into JPEG format at a unified resolution, voice data into WAV format at a unified sampling rate, and structured data and time-series data into JSON format.
- 4. The system of any of claims 1-3, wherein the edge-cloud collaborative computing subsystem comprises a task allocation module, an edge computing node, a cloud computing node and a model optimization module; the task allocation module determines the real-time processing requirement of the data in the preprocessed data set, allocates data whose real-time processing requirement exceeds a first threshold to the edge computing node, and allocates data whose real-time processing requirement does not exceed the first threshold to the cloud computing node; the edge computing node is deployed with an edge feature extraction module, which performs local feature extraction and analysis, through a lightweight convolutional neural network, on the data in the preprocessed data set whose real-time processing requirement exceeds the first threshold to generate an edge processing result, and sends the edge processing result to the cloud computing node; the cloud computing node is deployed with a cloud multimodal fusion module and the model optimization module, wherein the cloud multimodal fusion module acquires the edge processing result sent by the edge computing node and performs deep feature fusion and global analysis, through a multimodal neural network model, on the edge processing result and the data in the preprocessed data set whose real-time processing requirement does not exceed the first threshold, to generate a global processing data set; the multimodal neural network model comprises an LSTM, a CNN and a Transformer and introduces an attention mechanism; and the model optimization module evaluates the performance of each model in the multimodal neural network model and the overall performance of the multimodal neural network model, and uses a stochastic gradient descent optimizer to adjust the parameters of each model in the multimodal neural network model.
- 5. The system of claim 4, wherein the secure encryption subsystem comprises a key management module, an access verification module, a storage encryption module and a transmission encryption module; the key management module is used for creating a plurality of public-private key pairs and a symmetric encryption key; the access verification module is used for verifying the access rights of the user and/or the IoT devices, wherein verifying the access rights of an IoT device comprises, when the IoT device connects to the multi-source access subsystem, looking up the public key of the IoT device by the identifier of the IoT device and verifying the private-key signature of the IoT device; the storage encryption module adds noise to the global processing data set through a differential privacy technique and stores the noise-added global processing data set in an XML database; and the transmission encryption module, before the data in the global processing data set corresponding to the target features is sent, generates a session key through the homomorphic encryption protocol, takes the data in the global processing data set corresponding to the target features as transmission data, and encrypts the transmission data.
- 6. The system of claim 5, wherein the data adaptation output subsystem comprises a requirement parsing module, a data screening module, a format conversion module and a data pushing module; the requirement parsing module receives requirement information input by a user and determines the target features of the user based on the requirement information; the data screening module, based on the user's access rights and the target features, retrieves the data in the global processing data set corresponding to the target features from the secure encryption subsystem using the homomorphic encryption protocol, as a target data set; the format conversion module converts the target data set into an adapted format based on the output requirements of the user, wherein the target data set is converted into CSV or Excel format when output to an enterprise-level application, into TFRecord format when output to a machine learning model, and into a PDF report format when output to an end user; and the data pushing module pushes the converted target data set in the adapted format to the user.
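The task allocation rule of claim 4 can be illustrated with a short sketch: data whose real-time processing requirement exceeds the first threshold is routed to the edge computing node, and all other data is routed to the cloud computing node. The claims do not say how the real-time requirement is quantified, so the `Record` structure, the numeric score and the threshold value below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One item of the preprocessed data set (hypothetical structure)."""
    source: str
    realtime_requirement: float  # assumed score; higher means more urgent
    payload: dict = field(default_factory=dict)


def allocate(records, first_threshold):
    """Split records into (edge, cloud) lists.

    Records whose requirement strictly exceeds the threshold go to the
    edge computing node; all others, including those equal to the
    threshold, go to the cloud computing node.
    """
    edge, cloud = [], []
    for record in records:
        if record.realtime_requirement > first_threshold:
            edge.append(record)
        else:
            cloud.append(record)
    return edge, cloud


records = [
    Record("vibration-sensor", 0.9),
    Record("daily-report", 0.1),
    Record("camera-feed", 0.5),
]
edge, cloud = allocate(records, first_threshold=0.5)
print([r.source for r in edge])   # ['vibration-sensor']
print([r.source for r in cloud])  # ['daily-report', 'camera-feed']
```

The strict comparison mirrors the wording of the claim ("exceeding" versus "not exceeding"), so a record exactly at the threshold is handled by the cloud computing node.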
Description
Multi-source heterogeneous data acquisition system
Technical Field
The invention relates to the technical field of data processing, and in particular to a multi-source heterogeneous data acquisition system.
Background
With the rapid development of information technology, data sources have become increasingly diverse, giving rise to multi-source heterogeneous data. Such data typically comes from Internet of Things (IoT) devices, enterprise databases, file systems (such as local documents and cloud storage files), third-party APIs and the like, and its formats cover structured data (such as database tables), unstructured data (such as text, images and voice) and time-series data (such as temperature and humidity readings collected by sensors in real time).
In practice, multi-source heterogeneous data acquisition faces three core problems:
Poor access compatibility. The communication protocols used by different data sources differ widely, and traditional acquisition systems support only one or a few protocols, so unified access for multi-device, multi-format data is difficult to achieve. In addition, IoT devices come in many types, and the lack of a unified identity authentication and management mechanism easily leads to disorderly device access or access by unauthorized devices.
Imbalance between processing efficiency and real-time performance. The volume of multi-source heterogeneous data is huge, and some data (such as industrial equipment fault monitoring data) has extremely high real-time requirements. The traditional centralized cloud processing mode suffers from data transmission delay and cannot meet these requirements, while a pure edge processing mode has limited computing power and struggles to complete complex multimodal data fusion analysis. As a result, processing efficiency and real-time performance are difficult to reconcile.
Insufficient security. Data is easily stolen or tampered with during transmission, and there is a risk of personal data privacy leakage during storage. Traditional encryption approaches (such as symmetric encryption alone) offer low security, and the lack of fine-grained access control means that security cannot be ensured across the full data life cycle.
In the prior art, some multi-source data acquisition schemes focus only on data access and simple cleaning and do not address the balance between real-time performance and processing efficiency. Other schemes introduce edge computing or encryption technology but do not form a full-flow integrated design of "access-processing-computing-encryption-output", so they struggle to meet the requirements of efficient, secure and flexible data acquisition in complex scenarios.
Therefore, there is a need for a multi-source heterogeneous data acquisition system that can achieve unified access, efficient processing, secure transmission and customized output of multi-source data.
Disclosure of Invention
In view of the above, the present invention provides a multi-source heterogeneous data acquisition system that can solve the above technical problems. To solve the above technical problems, the present invention is implemented as follows.
A multi-source heterogeneous data acquisition system comprising: a multi-source access subsystem for interfacing with different types of data sources, acquiring multi-source heterogeneous data using the communication protocol corresponding to each data source, and generating an initial multi-source heterogeneous data set; an intelligent preprocessing subsystem for preprocessing the data in the initial multi-source heterogeneous data set to generate a preprocessed data set; an edge-cloud collaborative computing subsystem comprising an edge computing node and a cloud computing node, wherein the edge computing node performs local feature extraction and analysis on data in the preprocessed data set whose real-time processing requirement exceeds a first threshold to generate an edge processing result; a secure encryption subsystem for verifying the access rights of users and for encrypting and storing the global processing data set using a differential privacy method; and a data adaptation output subsystem for acquiring the target features of a user and, based on the user's access rights and the target features, retrieving the corresponding data in the global processing data set from the secure encryption subsystem using a homomorphic encryption protocol.
The multi-source access subsystem comprises a protocol adaptation module, a device management module and a data receiving module, wherein the protocol adaptation module is used for interfacing with different Internet of Things devices, storing MQTT, HTTP, OPC UA and TCP/IP protocols, matching protocols correspo