CN-116775763-B - Data braiding system for decentralized distributed symbiotic sharing

CN 116775763 B

Abstract

The invention discloses a data braiding system for decentralized distributed symbiotic sharing, comprising the following steps: 1) constructing a foundation support platform, Data Fabric; 2) establishing an asynchronous data ingestion flow through Apache Kafka to ingest/receive data from a variety of data sources; 3) data extraction, conversion, and loading; 4) metadata management, comprising metadata extraction and metadata ID generation; 5) ontology-based data governance; and 6) constructing an industrial data tensor, DIKube. The industrial data tensor DIKube constructed by the invention can be generated to order from user requirements and supplied to scenario-oriented industrial data applications, so the system can be quickly audited, optimized, integrated, and iterated when user demands change. The metadata index ID created by the invention is tied to industrial data semantics; it enables information query and sharing across heterogeneous data, supports traceability and full-lifecycle management of the data, automatically generates IDs for incremental data, and is adaptively extensible.

Inventors

  • CHEN GANG
  • ZHAO KAI
  • WANG MINGHAO
  • WANG XUFEI

Assignees

  • 中云开源数据技术(上海)有限公司

Dates

Publication Date
2026-05-12
Application Date
2023-06-25

Claims (5)

  1. A data braiding system for decentralized distributed symbiotic sharing, comprising the steps of: 1) constructing a foundation support platform, Data Fabric; 2) establishing an asynchronous data ingestion flow via Apache Kafka to ingest/receive data from a variety of data sources; 3) data extraction, data conversion, and data loading; 4) metadata management, including metadata extraction and generation of metadata IDs; 5) ontology-based data governance; 6) constructing an industrial data tensor, DIKube.

     The industrial data tensor DIKube is constructed as follows: 1) DIKube generation: according to the industrial data catalog classification, an industrial data tensor DIKube is formed from metadata labels of different dimensions, where DIKube is a formalized semantic data space. 2) DIKube generation based on user demand: based on the industrial mechanism and user requirements, a DIKube meeting those requirements can be pre-generated for category, application, rule, and formula scenarios; through big-data governance and analysis, the pre-generated DIKube enables AI and the knowledge graph to pre-generate an optimal proposal that meets the user's application scenario, with the characteristics of multiple choices and comprehensive consideration. 3) Generation of the DIKube of industrial open data: the industrial open-data DIKube is formed by classifying open data in the industrial field by content, concern point, genre, and format, and satisfies the following characteristics: it can cover open data of the global industry; it accords with the objective existence of open data and is easy for people to accept; and its metadata classification labels are machine-readable, enabling automation.

     The Data Fabric uses the extracted metadata to perform data lineage analysis, data quality management, and data security auditing, so as to better understand the data, provide sufficient information support for subsequent data processing and application, and improve the value and utilization efficiency of the data.

     The metadata ID is generated with an industry data identification coding system and encrypted, based on the industrial data catalog and the industrial knowledge graph, into a unique metadata ID; the metadata ID not only guarantees unique identification of the metadata but also carries semantics, lineage information, and the industrial mechanism. Because a unique data ID makes data-set information easier to identify, search, and track, data from different sources is guaranteed not to collide. (An illustrative metadata-ID sketch follows the claims.)

     The ontology-based data governance comprises the following steps: 1) Construction of the industrial knowledge graph: building the industrial knowledge graph requires extracting entities, and the relationships between entities, from metadata. Entity extraction: structured, semi-structured, and unstructured data are integrated according to the meaning of the industrial data and expert experience, and the entities in the data, including people, places, organizations, and terms, are identified and labeled. Relationship extraction: the relationships among entities include membership, similarity, and association relationships. The extracted entities and relationships are combined and fused to construct the industrial knowledge graph; a graph database is used to store and manage the data in the knowledge graph, and the knowledge graph is queried and reasoned over through a knowledge-graph query language or a reasoning engine so as to support multi-scenario applications. (An illustrative knowledge-graph sketch also follows the claims.) 2) Building the metadata index: from the constructed industrial knowledge graph, the industrial data catalog and zyxID form an index, so as to meet the efficiency and scalability requirements of Data Fabric governance and data queries; index maintenance comprises data update, data rebuild, and fault-tolerant processing operations so as to ensure the consistency and availability of the index and the data. 3) Data source and metadata lineage management. Data source management: recording the source, format, type, and collection time of the data; labeling the data collector and the person responsible for the data; recording the lineage of the data source; and comprehensively documenting and metadata-labeling the data source, where the comprehensive record covers the operations performed on the data source and their results. Management of processed data: the source and target of each data flow are described by metadata, thereby determining the lineage information of the data. Lineage management of metadata: metadata is data describing data, including data structures, field definitions, data types, and data quality information; managing the lineage of metadata helps identify and track data derivation, alteration, and version changes.
  2. The data braiding system for decentralized distributed symbiotic sharing of claim 1, wherein the foundation support platform Data Fabric is constructed as follows: 1.1) Deploying the distributed base platform: on the basis of supporting storage of structured and unstructured data, the problem of HDFS storing massive numbers of small files is mitigated by storing small files through MinIO, thereby improving storage efficiency and enabling unified metadata management of multi-source heterogeneous data. 1.2) Installing file management and migration components on the distributed base platform: MinIO is selected as the component for file management and migration; when a file is retrieved, its specific location in the file system need not be supplied, and a uniform resource locator is instead obtained by requesting the object storage service. (An illustrative MinIO sketch follows the claims.) 1.3) Deploying Spark on Kubernetes streaming computation on the distributed base platform: Spark on Kubernetes streaming computation is a Spark distributed big-data computing framework based on Docker containers; by deploying the Spark cluster on a Kubernetes platform for big-data stream computing, the Spark cluster can be rapidly deployed and horizontally scaled, Spark nodes can elastically expand and contract with load, Docker container resources are monitored, container resource usage on each node is collected, and responsive scale-out and scale-in actions are executed on the Spark nodes according to real-time load.
  3. The data braiding system for decentralized distributed symbiotic sharing of claim 1, wherein the flow of step 2) is as follows: 2.1) Installing the Kafka components, which provide messaging capability through Kafka's publish/subscribe pattern and partitioned-message mechanism; 2.2) Manually or automatically synchronizing data from various databases, message queues, and file stores, with unified management of the data through Apache Kafka into the Data Fabric. (An illustrative Kafka ingestion sketch follows the claims.)
  4. The data braiding system for decentralized distributed symbiotic sharing of claim 1, wherein the data extraction obtains data from a source system and transmits it to the Data Fabric for processing, and the ETL of the Data Fabric provides multiple data extraction modes, including: file import, supporting data import in a variety of file formats; database connection, supporting multiple database types and connection modes; and Web API, supporting data capture through a Web API interface. The data conversion refers to cleaning, processing, and converting the extracted data to fit subsequent analysis and application requirements; the Data Fabric ETL provides multiple data conversion modes, including: data cleaning, i.e. removing duplicate data, filling blank or erroneous data, and adjusting data formats; data preprocessing, i.e. aggregating, computing, classifying, and filtering the data; and conversion of the original data, including date conversion and string-format conversion. The data loading refers to re-importing the converted data into a target data warehouse or business system; the Data Fabric ETL provides multiple data loading modes, including: writing the converted data back to the source database or file to ensure the integrity and consistency of the source data; storing the converted data into the Data Fabric's internal data lakehouse for convenient subsequent query and analysis; and data export, i.e. exporting the converted data to other systems. (An illustrative ETL sketch follows the claims.)
  5. The data braiding system for decentralized distributed symbiotic sharing of claim 1, wherein the metadata extraction is performed in both an automated manner, in which metadata information from the various data sources is scanned and extracted using a self-developed metadata extraction tool, and a manual manner, which refers to manual entry of data types, field names, and data formats for the different data sources. (An illustrative schema-scanning sketch follows the claims.)
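
The following is a minimal sketch of how claim 1's unique metadata ID could be derived. It is purely illustrative: the patent does not specify its industry data identification coding system, so a SHA-256 digest over the catalog position, data source, and semantic labels stands in for it here, and every name in the snippet (make_metadata_id, the catalog paths, the field values) is a hypothetical assumption.

```python
import hashlib
import json

def make_metadata_id(catalog_path: str, source: str, semantics: dict) -> str:
    """Hypothetical stand-in for the patent's identification coding system.

    The ID is a SHA-256 digest over the catalog position, the data source,
    and the semantic labels, so metadata from different sources cannot
    collide and the inputs that define the record are fixed by the digest.
    """
    payload = json.dumps(
        {"catalog": catalog_path, "source": source, "semantics": semantics},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two sources sharing a catalog position still receive distinct IDs.
a = make_metadata_id("industry/steel/quality", "plant-A/mes", {"unit": "MPa"})
b = make_metadata_id("industry/steel/quality", "plant-B/mes", {"unit": "MPa"})
assert a != b
```

A digest is only one way to realize the claim's collision-free requirement; the full semantic and lineage payload the claim attaches to each ID would live in the metadata store rather than in the ID string itself.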
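Claim 1's ontology-based governance extracts entities and typed relationships into a graph store. The sketch below uses networkx purely for illustration; the patent calls for a graph database queried through a knowledge-graph query language or reasoning engine, and all entity and relation instances here are invented examples.

```python
import networkx as nx

# Entities (people, places, organizations, terms) become nodes; the
# claim's membership, similarity, and association relations become
# typed edges.
g = nx.MultiDiGraph()
g.add_node("Rolling Mill 3", kind="place")
g.add_node("Plant A", kind="organization")
g.add_node("yield strength", kind="term")
g.add_edge("Rolling Mill 3", "Plant A", relation="membership")
g.add_edge("yield strength", "tensile strength", relation="similarity")
g.add_edge("Rolling Mill 3", "yield strength", relation="association")

# A toy query standing in for a knowledge-graph query language:
# every entity associated with a given term.
associated = [
    u for u, v, d in g.edges(data=True)
    if v == "yield strength" and d["relation"] == "association"
]
print(associated)  # ['Rolling Mill 3']
```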
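Claim 2 stores small files in MinIO and hands callers a URL instead of a filesystem path. Below is a minimal sketch using the official minio Python client; the endpoint, credentials, bucket, and object names are placeholders, not values from the patent.

```python
from datetime import timedelta
from minio import Minio

# Placeholder endpoint and credentials for a hypothetical deployment.
client = Minio("minio.example.internal:9000",
               access_key="ACCESS_KEY", secret_key="SECRET_KEY", secure=False)

bucket = "small-files"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Store a small file as an object rather than as an HDFS block.
client.fput_object(bucket, "sensor/2023-06-25/readings.csv",
                   "/tmp/readings.csv")

# Callers never supply the file's physical location: they request a
# uniform resource locator from the object storage service instead.
url = client.presigned_get_object(bucket, "sensor/2023-06-25/readings.csv",
                                  expires=timedelta(hours=1))
print(url)
```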
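Claim 3's ingestion flow rests on Kafka's publish/subscribe pattern. A minimal sketch with the kafka-python client; the broker address, topic name, and record contents are assumptions for illustration only.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["kafka.example.internal:9092"]  # placeholder broker address
TOPIC = "fabric-ingest"                    # hypothetical topic name

# Publish side: a source system pushes records asynchronously.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"source": "mes-db", "table": "orders", "op": "insert"})
producer.flush()

# Subscribe side: the Data Fabric consumes and unifies the stream.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="data-fabric",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # hand off to the ETL stage of claim 4
    break
```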
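Claim 4 enumerates concrete extraction, conversion, and loading modes. The pandas sketch below walks one batch through them; the file paths and column names are invented, and a Parquet file merely stands in for the claim's internal data lakehouse.

```python
import pandas as pd

# Extract: file import, one of the extraction modes the claim names.
df = pd.read_csv("/tmp/orders.csv")

# Transform: the cleaning operations the claim enumerates.
df = df.drop_duplicates()                                 # remove repeats
df["quantity"] = df["quantity"].fillna(0)                 # fill blanks
df["order_date"] = pd.to_datetime(df["order_date"])       # date conversion
df["customer"] = df["customer"].str.strip().str.upper()   # string format

# Load: store into the internal lakehouse for later query and analysis;
# write-back to the source and export to other systems are analogous.
df.to_parquet("/tmp/lakehouse/orders.parquet", index=False)
```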
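Claim 5's automated path scans data sources for field names, types, and formats. The patent does not describe its self-developed extraction tool, so the sketch below uses SQLAlchemy's inspector as a plausible stand-in for the relational case; the DSN and database contents are placeholders.

```python
from sqlalchemy import create_engine, inspect

# Placeholder DSN; any SQLAlchemy-supported database scans the same way.
engine = create_engine("postgresql://user:pass@db.example.internal/mes")
inspector = inspect(engine)

# Collect per-table field names and types: the same metadata the claim's
# manual path would otherwise enter by hand.
catalog = {}
for table in inspector.get_table_names():
    catalog[table] = [
        {"field": col["name"], "type": str(col["type"]),
         "nullable": col["nullable"]}
        for col in inspector.get_columns(table)
    ]
print(catalog)  # candidate input for metadata ID generation (claim 1)
```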

Description

Data braiding system for decentralized distributed symbiotic sharing

Technical Field

The invention relates to the field of intelligent manufacturing, and in particular to a data braiding system for decentralized distributed symbiotic sharing.

Background

The rapid development of information technology has penetrated the industrial sector, producing explosive growth of multi-source heterogeneous industrial data and, with it, three problems: data is difficult to use within enterprises, difficult to share up and down the industrial chain, and valuable open data is difficult to obtain on the Internet. Existing techniques for managing multi-source heterogeneous data generally fall into two categories: big-data techniques and data space techniques. A big-data base platform built with big-data technology is a data sharing platform, generally oriented to many industries and many types of clients when realizing data management. Data space technology, by contrast, must be combined with an application: a specific data space is designed according to the application's requirements to manage the multi-source heterogeneous data that the application needs, mainly providing a unified view over heterogeneous data sources and intelligent decision support for the user.

The existing big-data base platform is usually deployed one-stop when managing multi-source heterogeneous data. This deployment mode eliminates compatibility problems among different software and hardware, saves debugging time, and creates value for clients, but makes on-demand customization difficult. Moreover, when user demands change, the platform can only be upgraded step by step on the basis of the existing product and cannot be optimized and iterated quickly. The data space, as an alternative solution, has been applied in fields such as complex scientific data management, ecological data analysis, environmental observation and prediction, social networks, and intelligent manufacturing, where it overcomes the poor scalability and generality of existing database management systems, data integration systems, desktop search systems, and search engines. However, some data spaces remain generic data sharing platforms that cannot be directly transplanted into the industrial field with its industrial mechanisms, while others have been validated in various industrial application scenarios but still leave a gap between industrial data-space research and real industrial information systems: data flow problems, data security design problems, access management dilemmas, and potential conflict and feedback problems in system evolution have not been effectively solved. Accordingly, the prior art needs improvement to overcome the above deficiencies.

Disclosure of Invention

The invention aims to provide a data braiding system for decentralized distributed symbiotic sharing, which constructs an industrial knowledge graph and an industrial data tensor DIKube based on an industrial data identification system and by means of an industrial mechanism.
DIKube only stores metadata and the corresponding identifications, so universal storage of industrial data can be realized, technical guarantees are provided for data validation and data security, and real-time, on-demand, scenario-driven use oriented to industry can be realized. The technical aim of the invention is achieved by the following technical scheme: a data braiding system for decentralized distributed symbiotic sharing, comprising the steps of: 1) constructing a foundation support platform, Data Fabric; 2) establishing an asynchronous data ingestion flow via Apache Kafka to ingest/receive data from a variety of data sources; 3) data extraction, conversion, and loading; 4) metadata management, including metadata extraction and generation of metadata IDs; 5) ontology-based data governance; 6) constructing an industrial data tensor, DIKube. Further, the construction process of the foundation support platform Data Fabric is as follows: 1.1) Deploying the distributed base platform: on the basis of supporting storage of structured and unstructured data, the problem of HDFS storing massive numbers of small files is mitigated by storing small files through MinIO, thereby improving storage efficiency and enabling unified metadata management of multi-source heterogeneous data; 1.2) Installing file management and migration components on the distributed base platform: MinIO is selected as the component for file management and migration, and when a file is retrieved its specific location in the file system need not be supplied; a uniform resource locator is instead obtained by requesting the object storage service.