CN-122020501-A - Dynamic fusion processing method and system for multi-source multi-domain heterogeneous data

Abstract

The invention discloses a dynamic fusion processing method and system for multi-source multi-domain heterogeneous data. The method comprises the following steps: S1, dynamically collecting multi-source heterogeneous data; S2, achieving deep semantic fusion of multi-domain data; S3, executing distributed intelligent parallel computation; and S4, implementing hierarchical storage and life cycle management. Through its data acquisition, fusion, computation, and management methods, the invention breaks through the bottleneck of multi-source multi-domain heterogeneous data processing in the prior art, achieves efficient and accurate processing of complex data, improves the utilization value of the data, and provides reliable technical support for data-based decision making and applications in each field.

Inventors

  • LI YAN
  • ZHOU JIAN
  • YANG ZHIGAO
  • TIAN SHUO
  • GUO XIANFEI

Assignees

  • 中科卫星(安徽)数据科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-09-25

Claims (9)

  1. A dynamic fusion processing method for multi-source multi-domain heterogeneous data, characterized by comprising the following steps:
     S1, dynamically collecting multi-source heterogeneous data: identifying the data source type, the data format, and the transmission protocol in real time; dynamically generating a data acquisition strategy based on the identification result, the data acquisition strategy comprising an acquisition frequency, a transmission mode, and an analysis rule; and executing data acquisition and converting the heterogeneous data into a unified intermediate format through a self-adaptive analyzer;
     S2, achieving deep semantic fusion of multi-domain data: extracting entity and attribute information from the multi-source heterogeneous data and annotating it with semantic categories; constructing knowledge graphs for the different fields, which represent the semantic relations among the data in graph form; and comparing the entities, attributes, and relations in the knowledge graphs of the different fields, mining semantic associations, and using those associations to fuse the multi-source heterogeneous data into a semantically unified data set;
     S3, performing distributed intelligent parallel computing: decomposing the processing task into mutually independent yet interrelated subtasks according to data type and processing logic; monitoring the resource state of the computing nodes in real time and dynamically distributing the subtasks to the distributed computing nodes; and executing the subtasks with a parallel strategy of data partition parallelism or task pipeline parallelism;
     S4, implementing hierarchical storage: selecting a storage medium according to the degree of data structuring and the access frequency, wherein structured data is stored in a relational database and semi-structured or unstructured data is stored in a non-relational database or a distributed file system; establishing a data catalog and an indexing mechanism; and migrating data across the hierarchical storage media according to the timeliness of the data, and performing regular cleaning and backup.
  2. The method of claim 1, wherein the data source types in step S1 include satellite remote sensing data, unmanned aerial vehicle data, ground sensor device data, and Internet data.
  3. The method according to claim 2, wherein dynamically generating the data acquisition strategy in step S1 comprises: for satellite remote sensing data, dynamically adjusting the acquisition frequency according to the orbital period of the satellite, the data update frequency, and the monitoring requirements, adopting a data transmission protocol that ensures fast data transmission, and accurately analyzing data of different wave bands and resolutions with a dedicated satellite data analyzer; for unmanned aerial vehicle data, when a new flight mission or a change of monitoring area is detected, automatically adjusting the acquisition strategy by means of an intelligent perception module, adjusting the data transmission rate in real time to suit network conditions, and automatically identifying and screening valid image data using image recognition technology; for ground sensor device data, when a new device is connected or device parameters change, automatically configuring the communication parameters and generating corresponding data analysis rules according to the sensor type, so as to ensure accurate data acquisition; and for Internet data, bypassing website anti-crawling mechanisms by means of an intelligent algorithm and selecting an acquisition tool and an analysis mode according to the data type, so as to ensure the legality and integrity of the data.
  4. The method according to claim 2, wherein the semantic category annotation in step S2 specifically comprises: for satellite remote sensing data, performing spectral feature analysis and establishing semantic relations between the data of different wave bands and environmental elements; for ground sensor data, performing semantic annotation according to the correspondence between the monitored parameters and environmental concepts; for unmanned aerial vehicle image data, identifying ground features using image recognition technology and assigning them corresponding semantic tags; and for Internet text data, extracting key information and performing semantic classification through natural language processing.
  5. The method of claim 1, wherein dynamically distributing the subtasks in step S3 comprises assigning computation-intensive subtasks to computing nodes with high CPU performance and data-intensive subtasks to computing nodes with ample storage resources.
  6. The method of claim 1, wherein the parallel strategy in step S3 comprises employing data partition parallelism for large-scale spatial data and task pipeline parallelism for image processing tasks.
  7. The method of claim 1, further comprising full lifecycle management, which comprises: in the raw data acquisition stage, ensuring the integrity and accuracy of the data and performing preliminary data cleaning; in the data processing and analysis stage, computing and analyzing the data efficiently to generate valuable information; and in the data archiving stage, migrating data that has not been used for a long time or is no longer frequently accessed to a low-cost storage medium while retaining a small amount of key index information in a relational database, so that the data can be quickly located and restored when required, and performing data cleaning and backup regularly to ensure the security and availability of the data; wherein the data that has not been used for a long time or is no longer frequently accessed is historical order data, and the low-cost storage medium is a tape library.
  8. A system based on the method of any one of claims 1-7, comprising: a multi-source heterogeneous data self-adaptive acquisition module, configured to identify the type, format, and transmission protocol of a data source in real time and to dynamically generate a data acquisition strategy; a multi-domain data semantic deep fusion module, configured to perform semantic annotation and association-based fusion of data through a cross-domain general semantic model library; a distributed intelligent optimization parallel computing framework, which executes parallel computing based on task decomposition and resource scheduling strategies; and a hybrid data storage management system, which uses a relational database, a non-relational database, and a distributed file system to store data hierarchically.
  9. The system according to claim 8, wherein: the multi-source heterogeneous data self-adaptive acquisition module comprises an intelligent sensing unit, a dynamic strategy library, an adaptive analyzer, and a data processing unit, wherein the intelligent sensing unit identifies data source attributes through protocol detection and file header analysis; the multi-domain data semantic deep fusion module comprises a semantic annotation unit, a knowledge graph construction unit, a semantic mapping unit, and a data processing unit, wherein the semantic annotation unit extracts environmental element entities using named entity recognition technology; the distributed parallel computing framework comprises a task decomposition unit, an intelligent scheduler, a parallel optimization unit, and data partition parallel and task pipeline parallel strategies, wherein the task decomposition unit splits processing tasks according to data type and logic; and the hybrid data storage management system comprises a storage selection unit, an index management unit, and a life cycle management unit, wherein the storage selection unit allocates storage media according to data structure and access frequency, the index management unit establishes a composite index for relational and non-relational data, and the life cycle management unit migrates data across the hierarchical storage media according to data timeliness.
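
The rule-based decisions in claim 5 (subtask assignment) and in step S4 together with claim 7 (storage selection and archiving) can be illustrated with a minimal Python sketch. This is not the patented implementation. The `Node` fields, the load-discounted CPU score, the tier names, and the 365-day archive threshold are all hypothetical assumptions chosen for illustration; the claims state only which kind of node or storage medium each kind of subtask or data is sent to.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    COMPUTE_INTENSIVE = "compute"
    DATA_INTENSIVE = "data"


@dataclass
class Node:
    name: str
    cpu_score: float        # hypothetical relative CPU performance
    free_storage_gb: float  # hypothetical free storage reported by monitoring
    load: float = 0.0       # assumed scale: 0.0 idle .. 1.0 saturated


def assign_subtask(workload: Workload, nodes: list[Node]) -> Node:
    """Claim 5 sketch: route a subtask by the resource it stresses most."""
    if not nodes:
        raise ValueError("no compute nodes available")
    if workload is Workload.COMPUTE_INTENSIVE:
        # Assumption: CPU performance is discounted by the current load.
        return max(nodes, key=lambda n: n.cpu_score * (1.0 - n.load))
    # Data-intensive subtasks go to the node with the most free storage.
    return max(nodes, key=lambda n: n.free_storage_gb)


def select_storage(structured: bool, days_since_access: int,
                   archive_after_days: int = 365) -> str:
    """Step S4 sketch: pick a storage tier by structure and access recency."""
    if days_since_access >= archive_after_days:
        # Low-cost medium; claim 7 names a tape library as one example.
        return "archive"
    if structured:
        return "relational_db"
    # Semi-structured or unstructured data.
    return "nosql_or_dfs"


# Hypothetical cluster state and data items, for illustration only.
nodes = [
    Node("cpu-1", cpu_score=9.0, free_storage_gb=200.0, load=0.5),
    Node("store-1", cpu_score=3.0, free_storage_gb=8000.0, load=0.1),
]
print(assign_subtask(Workload.COMPUTE_INTENSIVE, nodes).name)
print(assign_subtask(Workload.DATA_INTENSIVE, nodes).name)
print(select_storage(structured=False, days_since_access=10))
```

In a real deployment, the node metrics would come from the real-time resource monitoring described in step S3, and the archive threshold would follow from the data timeliness policy rather than a fixed constant.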

Description

Dynamic fusion processing method and system for multi-source multi-domain heterogeneous data

Technical Field

The invention relates to the technical field of data processing, and in particular to a dynamic fusion processing method and system for multi-source multi-domain heterogeneous data.

Background

In the current digital age, data in every field is growing explosively and is characteristically multi-source, multi-domain, and heterogeneous. Taking environmental monitoring as an example, the data sources include large-area macroscopic environmental data acquired by satellite remote sensing equipment, local high-resolution data acquired by unmanned aerial vehicles in low-altitude flight, point data monitored in real time by ground sensor devices, and various related texts, images, and statistical data on the Internet. Satellite remote sensing data has wide coverage and strong periodicity and can provide environmental information for global or large regions, such as vegetation coverage, land use, and water body distribution, but its resolution is relatively low, so local detail is difficult to capture accurately. Unmanned aerial vehicle data offers high maneuverability and high resolution: a specific area can be observed flexibly to obtain detailed information such as landforms and vegetation health. However, the observation range is limited, and data acquisition is strongly constrained by flight conditions. Ground sensor devices can accurately monitor environmental parameters at a point, such as temperature, humidity, and air quality, but reflect only local conditions and lack overall spatial information.
Internet data comes from a wide range of sources, including environment-related pictures and text that users publish on social media and research reports and statistical data from professional websites; its formats are varied, its quality is uneven, and it differs considerably from professional monitoring data in semantics and structure.

Conventional data processing techniques exhibit a number of problems when faced with such complex multi-source multi-domain heterogeneous data. In the data acquisition stage, there is no unified, efficient, and adaptive acquisition mechanism for the different types of data sources. For example, satellite remote sensing data receiving equipment is usually designed for specific satellites and data formats and is difficult to adapt quickly to new satellite data sources or format changes; unmanned aerial vehicle data acquisition depends on manual operation or specific flight planning software and cannot respond in real time to complex, changing monitoring requirements; ground sensor data acquisition is limited by communication protocols and device compatibility, so data from devices of different manufacturers is difficult to acquire simultaneously; and Internet data acquisition faces challenges such as the legality of data crawling, website anti-crawling mechanisms, and the diversity of data formats. As a result, data acquisition is inefficient and completeness is difficult to guarantee.

In the data fusion stage, existing fusion methods mostly operate through simple data structure or format conversion and cannot deeply mine the inherent semantic associations among data from different fields.
Data from different fields differs in how the same concept is defined, described, and expressed. In environmental monitoring, for example, satellite remote sensing data describes vegetation through spectral features, unmanned aerial vehicle data analyzes vegetation condition from image texture, ground sensor data measures vegetation through indicators such as biomass, and Internet text data refers to vegetation-related information in natural language. Traditional fusion methods have difficulty fusing data expressed in such different ways, so the fused data cannot realize its full potential value and cannot provide comprehensive, accurate support for decision making.

In terms of data processing efficiency, as data volumes increase dramatically, conventional stand-alone or simple parallel computing modes struggle to handle large-scale multi-source multi-domain heterogeneous data. The sheer scale of multi-source heterogeneous data makes data storage and transmission bottlenecks, and different types of data have different processing requirements; traditional computing modes cannot make full use of distributed computing resources or achieve efficient parallel processing tailored to different data types, so processing times are long and application scenarios with high real-time requirements are difficult to serve. For example, in disaster early warning, multi-source heterogeneous data must be analyzed in time to quickly make