CN-121996929-A - Normalized data acquisition system based on big data

CN121996929A

Abstract

The invention relates to the technical field of big data acquisition and processing, and in particular to a big data-based normalized data acquisition system comprising a probe sensing module, a portrait generation module, an acquisition scheduling module, a data conversion module, and a distribution routing module. Deployed distributed probes sense multi-source heterogeneous data in real time to generate a raw data stream; the portrait generation module parses the stream and extracts multi-dimensional features to construct a dynamic portrait of each data source; the acquisition scheduling module dynamically establishes and adjusts a data acquisition logic framework from the portraits to guide the cooperative acquisition behavior of the probes; under the guidance of that framework, the data conversion module performs pattern recognition and structural remodeling on the raw data through a unified pipeline to output a data set with a unified schema specification; and the distribution routing module injects the data set into a data distribution network, which delivers it to a designated warehouse. The system realizes intelligent perception and adaptive scheduling of dynamic data sources, and completes real-time, efficient normalization of data at the source.

Inventors

  • Li Xiangyang
  • Yao Qiuhui
  • Zhang Jiaming
  • Li Ailin

Assignees

  • 深圳中科研人工智能大数据有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-01-26

Claims (10)

  1. A big data-based normalized data acquisition system, comprising: a probe sensing module for deploying a plurality of data probes in a distributed network environment, the data probes monitoring and sensing data sources of different protocols and structures in real time to form a raw perceived data stream; a portrait generation module for parsing the raw perceived data stream and capturing its features, extracting multi-dimensional feature information characterizing the intrinsic characteristics of the data sources, and generating a dynamic portrait of each data source from the multi-dimensional feature information; an acquisition scheduling module for establishing a dynamically adjusted data acquisition logic framework according to the dynamic portrait of the data source, the data acquisition logic framework guiding the cooperative acquisition behavior of the data probes; a data conversion module under whose guidance by the data acquisition logic framework the plurality of data probes execute synchronous or asynchronous acquisition actions and send the collected raw data into a unified data conversion pipeline, the unified data conversion pipeline performing pattern recognition and structural remodeling on the raw data to convert heterogeneous raw data into a data set with a unified schema specification; and a distribution routing module for injecting the data set with the unified schema specification into a data distribution network with buffering and routing functions, the data distribution network finally delivering the data to a designated data warehouse.
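The five modules of claim 1 form a linear pipeline from sensing to delivery. A minimal sketch, with each module injected as a callable (all names and the pass-through behaviors here are illustrative assumptions, not from the patent):

```python
class NormalizedAcquisitionSystem:
    """High-level sketch of the five claimed modules as a linear pipeline.
    Each stage is supplied as a callable; field names are illustrative only."""

    def __init__(self, sense, profile, schedule, convert, route):
        self.sense = sense        # probe sensing: data sources -> raw stream
        self.profile = profile    # portrait generation: raw stream -> portraits
        self.schedule = schedule  # acquisition scheduling: portraits -> framework
        self.convert = convert    # data conversion: framework + stream -> unified set
        self.route = route        # distribution routing: unified set -> warehouse

    def run(self, sources):
        stream = self.sense(sources)
        portraits = self.profile(stream)
        framework = self.schedule(portraits)
        unified = self.convert(framework, stream)
        return self.route(unified)
```

Wiring trivial stand-in stages through `run` shows the claimed data flow: probes feed the portrait generator, portraits drive scheduling, the framework steers conversion, and the router delivers the result.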
  2. The big data-based normalized data acquisition system of claim 1, wherein parsing the raw perceived data stream and capturing its features, extracting multi-dimensional feature information characterizing the intrinsic characteristics of a data source, and generating a dynamic portrait of the data source from the multi-dimensional feature information specifically comprises: separating the header information and payload information of each data packet from the raw perceived data stream, performing protocol decoding on the header information, and performing shallow syntactic analysis on the payload information to obtain basic descriptors of the data stream; performing time-serialization processing on the basic descriptors to construct time-series signals describing the evolution over time of data arrival intervals, packet-size changes, and data content entropy; filtering the time-series signals with a signal-processing method to suppress noise, and detecting periodic patterns, trend components, and mutation-point information in the time-series signals; fusing the detected periodic patterns, trend components, and mutation-point information to construct a composite feature vector reflecting the activity, stability, and burstiness tendency of the data source; meanwhile, analyzing the data-generation context implicit in the raw perceived data stream, the data-generation context comprising the physical deployment position of the data source, its logical network topology, and its interaction traces with the application layer; and associating and mapping the composite feature vector with the data-generation context to form a data-source description model containing both static attributes and dynamic behavioral features, the data-source description model being the dynamic portrait of the data source.
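The three time-series signals named in claim 2 (arrival intervals, packet sizes, content entropy) and their fusion into a composite feature vector can be sketched as follows. This is a minimal illustration under assumed definitions: the activity/stability/burstiness formulas are one plausible choice, not the patent's:

```python
import math
import statistics
from collections import Counter

def shannon_entropy(payload: bytes) -> float:
    """Entropy (bits/byte) of a packet payload; stands in for the claimed
    data-content-entropy signal."""
    if not payload:
        return 0.0
    n = len(payload)
    return -sum((c / n) * math.log2(c / n) for c in Counter(payload).values())

def build_time_series(packets):
    """packets: list of (arrival_time, payload_bytes). Returns the three
    series named in the claim: arrival intervals, packet sizes, entropies."""
    times = [t for t, _ in packets]
    intervals = [b - a for a, b in zip(times, times[1:])]
    sizes = [len(p) for _, p in packets]
    entropies = [shannon_entropy(p) for _, p in packets]
    return intervals, sizes, entropies

def composite_feature_vector(intervals, sizes, entropies):
    """Fuse the series into a vector reflecting activity, stability and
    burstiness tendency (illustrative definitions)."""
    mean_iv = statistics.mean(intervals) if intervals else 0.0
    activity = 1.0 / mean_iv if mean_iv else 0.0   # packets per unit time
    stability = -statistics.pstdev(sizes)          # higher = steadier sizes
    burstiness = max(entropies) - min(entropies)   # entropy swing over time
    return (activity, stability, burstiness)
```

A steady one-packet-per-second source with uniform sizes yields high stability and activity 1.0, matching the intuition the claim describes.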
  3. The big data-based normalized data acquisition system of claim 2, wherein filtering the time-series signal with a signal-processing method to suppress noise and detecting the periodic patterns, trend components, and mutation-point information in the time-series signal specifically comprises: smoothing the time-series signal with an adaptive filter whose parameters are automatically adjusted according to the local statistical characteristics of the signal; performing spectral analysis on the smoothed time-series signal and finding hidden periodic patterns by identifying significant peaks in the spectrum; meanwhile, separating the time-series signal into a long-term trend component, a seasonal component, and a residual component with a trend-decomposition algorithm; applying a mutation-detection algorithm to the residual component to identify short, severe change points in the signal, the change points corresponding to sudden events at the data source; and structurally recording the identified periodic-pattern parameters, the mathematical expressions of the long-term trend and seasonal components obtained by the trend-decomposition algorithm, and the position and intensity information of the mutation points.
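A minimal sketch of claim 3's three detection steps, with deliberately simple substitutes: a moving average stands in for the adaptive filter, autocorrelation for spectral peak-finding, and a z-score rule for the mutation-detection algorithm (all substitutions are mine, not the patent's):

```python
import statistics

def moving_average(signal, window=3):
    """Simple stand-in for the claimed adaptive smoothing filter."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def dominant_period(signal, max_lag=None):
    """Detect a periodic pattern as the lag with maximal autocorrelation,
    standing in for the claimed spectral-peak identification."""
    n = len(signal)
    max_lag = max_lag or n // 2
    mean = statistics.mean(signal)
    centered = [x - mean for x in signal]
    denom = sum(c * c for c in centered) or 1.0
    best_lag, best_r = 0, float("-inf")
    for lag in range(1, max_lag + 1):
        r = sum(centered[i] * centered[i - lag] for i in range(lag, n)) / denom
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag

def detect_change_points(residual, z_threshold=3.0):
    """Flag short, severe deviations in the residual component
    (the claimed mutation points) by z-score."""
    mean = statistics.mean(residual)
    sd = statistics.pstdev(residual) or 1.0
    return [i for i, x in enumerate(residual) if abs(x - mean) / sd > z_threshold]
```

On an alternating signal the autocorrelation peaks at lag 2, and a single spike in an otherwise flat residual is flagged as the lone mutation point.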
  4. The big data-based normalized data acquisition system of claim 2, wherein establishing the dynamically adjusted data acquisition logic framework according to the dynamic portrait of the data source, the data acquisition logic framework guiding the collaborative acquisition behavior of the data probes, specifically comprises: analyzing the data-source description model contained in the dynamic portrait of the data source, and extracting from it quantitative indicators of the expected lifetime of the data source, its estimated data value density, and its data acquisition urgency; designing, based on the quantitative indicators, a multi-objective optimization function that aims to balance the completeness and timeliness of data acquisition against the load pressure on the network and the data source; solving the multi-objective optimization function to obtain a set of optimal acquisition strategy parameters for the dynamic portraits of the different data sources, the optimal acquisition strategy parameters comprising acquisition frequency, acquisition depth, and acquisition parallelism; matching the optimal acquisition strategy parameters against the real-time state information of the data probes, the real-time state information comprising processing-capacity margin, network-bandwidth occupancy, and current task-queue length, assigning a specific acquisition instruction set to each data probe according to the matching result, and prescribing a communication protocol and data format for task coordination and state synchronization among the data probes; all data probes assigned acquisition instruction sets form, under the constraint of the communication protocol and data format, a logically unified and physically dispersed acquisition cluster whose overall behavior rules constitute the dynamically adjusted data acquisition logic framework.
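Claim 4's multi-objective step can be sketched as a scalarized objective over a small grid of candidate (frequency, depth, parallelism) strategies. The weighting and the way the portrait indicators enter the objective are illustrative assumptions; the patent does not specify them:

```python
def acquisition_objective(params, portrait, weights=(1.0, 1.0, 0.5)):
    """Score one candidate (frequency, depth, parallelism) against a portrait
    holding the claim's quantitative indicators. Illustrative model only:
    completeness rises with depth, timeliness with frequency, and the load
    penalty grows with all three and shrinks for long-lived sources."""
    freq, depth, par = params
    w_complete, w_timely, w_load = weights
    completeness = depth * portrait["value_density"]
    timeliness = freq * portrait["urgency"]
    load = freq * depth * par / max(portrait["expected_lifetime"], 1e-9)
    return w_complete * completeness + w_timely * timeliness - w_load * load

def solve_strategy(portrait, freq_opts=(1, 5, 10),
                   depth_opts=(1, 2, 3), par_opts=(1, 2, 4)):
    """Exhaustive grid search standing in for a real multi-objective solver."""
    candidates = [(f, d, p) for f in freq_opts for d in depth_opts for p in par_opts]
    return max(candidates, key=lambda c: acquisition_objective(c, portrait))
```

A long-lived source tolerates deep, frequent acquisition, while a short-lived source pushes the solver toward shallower settings, which is the balancing the claim describes.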
  5. The big data-based normalized data acquisition system of claim 4, wherein matching the optimal acquisition strategy parameters against the real-time state information of the data probes, the real-time state information comprising processing-capacity margin, network-bandwidth occupancy, and current task-queue length, and assigning a specific acquisition instruction set to each data probe according to the matching result, specifically comprises: establishing a resource matching matrix whose rows represent data probes and whose columns represent the acquisition-task demand dimensions defined by the optimal acquisition strategy parameters; quantifying the real-time state information of each data probe into a resource supply vector, and quantifying the optimal acquisition strategy parameters of each acquisition task into a resource demand vector; computing a fitness score between each resource supply vector and all resource demand vectors, the fitness score jointly considering capability match, load balance, and geographic adjacency; distributing the acquisition tasks to the most suitable data probes according to the fitness scores with a task-allocation algorithm, to form a preliminary task-allocation scheme; evaluating how well the preliminary task-allocation scheme achieves the overall goal of the data acquisition logic framework, and, if the achievement degree is below a threshold, adjusting the parameters of the task-allocation algorithm or adding constraints and iterating until the achievement degree meets the requirement; and compiling the finalized task-allocation scheme into an acquisition instruction set executable by each data probe, the acquisition instruction set including a target data-source identifier, an acquisition protocol, a parameter configuration, and a task schedule.
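The supply/demand vectors and the three-part fitness score of claim 5 can be sketched as below. The vector layouts and component formulas are assumptions made for illustration; the patent leaves them unspecified:

```python
def fitness(supply, demand, weights=(0.5, 0.3, 0.2)):
    """Fitness between one probe's resource-supply vector and one task's
    resource-demand vector. Components mirror the claim: capability match,
    load balance, geographic adjacency. Assumed vector layouts:
    supply = (capacity_margin, free_bandwidth, queue_slack, location)
    demand = (cpu_need,        bandwidth_need, urgency,     location)"""
    w_cap, w_load, w_geo = weights
    capability = min(supply[0] / max(demand[0], 1e-9),
                     supply[1] / max(demand[1], 1e-9), 1.0)
    load_balance = supply[2] / (supply[2] + demand[2] + 1e-9)
    adjacency = 1.0 if supply[3] == demand[3] else 0.0
    return w_cap * capability + w_load * load_balance + w_geo * adjacency

def fitness_matrix(probes, tasks):
    """Rows are probes, columns are tasks, as in the claimed matching matrix."""
    return [[fitness(p, t) for t in tasks] for p in probes]
```

A probe with ample margin in the same location as the task scores strictly higher than a capacity-starved remote probe, which is the ranking the task-allocation step relies on.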
  6. The big data-based normalized data acquisition system of claim 5, wherein distributing the acquisition tasks to the most suitable data probes according to the fitness scores with the task-allocation algorithm to form the preliminary task-allocation scheme specifically comprises: arranging the fitness scores between the resource supply vector of each data probe and the resource demand vector of each acquisition task into a fitness-score matrix; in the fitness-score matrix, selecting for each acquisition task the data probe with the highest fitness score as the candidate execution probe for that task; checking the number of acquisition tasks allocated to each candidate execution probe and, if that number exceeds the upper limit its processing-capacity margin can bear, marking the task with the lowest fitness score on that probe as a task to be reallocated; searching, for each task to be reallocated, the data probe with the next-highest fitness score and, if that probe currently has no task or its number of assigned tasks has not reached the upper limit, assigning the task to that probe; and repeating the checking and reallocation process until every acquisition task is assigned to exactly one data probe and no data probe exceeds the upper limit of its processing-capacity margin, at which point the preliminary task-allocation scheme is obtained.
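The check-and-reallocate loop of claim 6 is essentially greedy assignment with capacity-driven displacement. A minimal sketch (assuming, as the claim implicitly does, that aggregate probe capacity covers all tasks):

```python
def assign_tasks(scores, capacity):
    """scores[p][t]: fitness of probe p for task t; capacity[p]: max tasks the
    probe's processing margin allows. Implements the claim's loop: pick the
    best probe per task, then bump the lowest-fitness task off any overloaded
    probe to its next-best probe until every probe is within capacity.
    Assumes sum(capacity) >= number of tasks."""
    n_probes, n_tasks = len(scores), len(scores[0])
    # ranked probe preference per task, highest fitness first
    prefs = {t: sorted(range(n_probes), key=lambda p: -scores[p][t])
             for t in range(n_tasks)}
    next_choice = {t: 0 for t in range(n_tasks)}
    assignment = {t: prefs[t][0] for t in range(n_tasks)}  # task -> probe
    while True:
        overloaded = [p for p in range(n_probes)
                      if sum(1 for a in assignment.values() if a == p) > capacity[p]]
        if not overloaded:
            return assignment
        p = overloaded[0]
        tasks_on_p = [t for t, a in assignment.items() if a == p]
        victim = min(tasks_on_p, key=lambda t: scores[p][t])  # lowest fitness leaves
        next_choice[victim] += 1
        assignment[victim] = prefs[victim][next_choice[victim]]
```

With two probes of capacity 2 and three tasks that all prefer probe 0, the weakest-scoring task is displaced to probe 1, yielding a feasible preliminary scheme.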
  7. The big data-based normalized data collection system of claim 4, wherein the plurality of data probes executing synchronous or asynchronous acquisition actions under the guidance of the data acquisition logic framework and sending the collected raw data into the unified data conversion pipeline specifically comprises: each data probe establishing a connection session with its target data source according to its assigned acquisition instruction set, and performing data extraction at the specified acquisition frequency, acquisition depth, and acquisition parallelism; during the data extraction, the data probe continuously monitoring the session state and data-stream quality and feeding that information back in real time to the dynamically adjusted data acquisition logic framework; the data acquisition logic framework dynamically adjusting the acquisition instruction sets of the relevant data probes according to the fed-back session state and data-stream quality information, forming closed-loop control; the data probe locally caching each successfully extracted raw data block and attaching a metadata tag recording the block's source, acquisition time, session identifier, and data-format fingerprint; when the locally cached data blocks reach a threshold or a transmission instruction is received, the data probe packaging the tagged data blocks into transmission units according to the communication protocol and data format; and the transmission units generated by the data probes being pushed in order to the entrance buffer of the unified data conversion pipeline through an asynchronous message mechanism or a synchronous call interface.
  8. The big data-based normalized data collection system of claim 7, wherein the unified data conversion pipeline performing pattern recognition and structural remodeling on the raw data to convert heterogeneous raw data into a data set with a unified schema specification specifically comprises: the entrance buffer of the unified data conversion pipeline receiving the transmission units from the plurality of data probes and unpacking them to separate the raw data blocks and their attached metadata tags; invoking a preset or dynamically loaded format parser according to the data-format fingerprint in each metadata tag, deeply parsing the raw data block, and identifying the internal structure, fields, types, and constraint relations of the data; matching and mapping the identified structure, fields, types, and constraint relations against a global unified schema definition library that defines the data model of the target specification; according to the matching and mapping result, cleaning, transcoding, splitting, merging, or computing derived fields on the data in the raw data block to eliminate ambiguity, fill in missing values, correct errors, and satisfy the constraints of the target data model; reorganizing the processed data into new data records according to the target data model and generating a globally unique identifier and a version stamp for each new data record; and batch-assembling all new data records conforming to the target data model into data blocks that together constitute the data set with the unified schema specification.
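Claim 8's unpack-parse-map-stamp flow can be sketched with a parser registry keyed by format fingerprint and a field-mapping table standing in for the global unified schema definition library. Using a literal `"json"` fingerprint and the `uid`/`userName` field names is purely illustrative:

```python
import json
import uuid

# parser registry keyed by format fingerprint; "json" is the only entry here
PARSERS = {"json": lambda raw: json.loads(raw.decode("utf-8"))}

# stand-in for the global unified schema definition library:
# target field -> source field, per known format
SCHEMA_MAP = {"json": {"user_id": "uid", "name": "userName"}}

def convert(transmission_unit, schema_version="v1"):
    """Unpack a transmission unit, parse each block with the parser matching
    its fingerprint, map fields onto the unified schema, and stamp each record
    with a globally unique identifier and a version stamp, per the claim."""
    records = []
    for block in transmission_unit["blocks"]:
        fp = block["tag"]["format_fingerprint"]
        parsed = PARSERS[fp](block["data"])             # deep parse by format
        mapping = SCHEMA_MAP[fp]                         # match against library
        record = {target: parsed[src] for target, src in mapping.items()}
        record["_id"] = str(uuid.uuid4())                # globally unique id
        record["_version"] = schema_version              # version stamp
        records.append(record)
    return records
```

A real pipeline would dispatch on genuine fingerprints and handle unknown formats; the registry pattern is what makes "preset or dynamically loaded" parsers possible.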
  9. The big data-based normalized data collection system of claim 8, wherein cleaning, transcoding, splitting, merging, or computing derived fields on the data in the raw data block according to the matching and mapping result specifically comprises: for data with missing fields or abnormal value ranges, completing the data according to default-value rules defined in the global unified schema definition library or a filling model learned from historical data; for text data with inconsistent encodings, uniformly converting the text into a specified standard encoding via code tables or character-set detection; for composite data that is too deeply nested or too complex in structure, splitting it into multiple associated simple data records according to the flattening requirements of the target data model; for data that logically belongs to the same entity but is physically dispersed across different raw data blocks, merging it according to entity-resolution rules to generate a complete entity record; generating new derived fields from the basic fields of the raw data block by computing predefined functions or scripts according to business rules or statistical-analysis requirements; and recording all cleaning, transcoding, splitting, merging, and derived-field operations in a data trace log associated with the new data records that are ultimately generated.
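Two of claim 9's operations, default filling with derived fields plus a trace log, and flattening of nested composites, can be sketched as below. The rule dictionaries stand in for what would come from the schema definition library:

```python
def normalize_record(record, defaults, derived, trace):
    """Apply two of the claim's cleaning steps to one record: fill missing
    fields from default-value rules, compute derived fields, and append every
    operation to the data-trace log. `defaults` and `derived` stand in for
    rules from the unified schema definition library."""
    out = dict(record)
    for field, value in defaults.items():
        if out.get(field) is None:
            out[field] = value
            trace.append(("fill_default", field))   # logged, per the claim
    for field, fn in derived.items():
        out[field] = fn(out)
        trace.append(("derive", field))
    return out

def flatten(record, prefix=""):
    """Split nested composite data into flat dotted keys, per the claim's
    flattening requirement for the target data model."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat
```

The trace list is the data-trace log the claim associates with each generated record; a real system would persist it alongside the record's identifier.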
  10. The big data-based normalized data collection system of claim 9, wherein injecting the data set with the unified schema specification into the data distribution network with buffering and routing functions, the data distribution network delivering the data to the designated data warehouse, specifically comprises: dividing the data set with the unified schema specification into moderately sized data fragments and attaching to each fragment a routing tag containing the target data-warehouse address, data priority, and delivery-timeliness requirements; distributing the data fragments to transmission queues of different service levels in the data distribution network according to the data priority and delivery-timeliness requirements in the routing tag; routing nodes in the data distribution network continuously probing the state of the network paths to the target data warehouse, the path state comprising delay, packet-loss rate, and available bandwidth; the routing node dynamically selecting, for each data fragment, the optimal next-hop node or transmission link by combining the service level of its transmission queue with the path state probed in real time; during transmission, if network congestion or a path failure is detected, the routing node rerouting the data fragment to a standby path according to a preset strategy, or temporarily storing it in local persistent storage to await recovery; and when a data fragment successfully arrives at the receiving gateway of the target data warehouse, the receiving gateway checking its integrity, sending an acknowledgement receipt, and, after unpacking and reassembly, storing the data in the designated storage location of the data warehouse.
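Claim 10's fragmentation-with-routing-tags and path selection can be sketched as follows. The tag keys, the loss ceiling, and the latency-first tie-break are illustrative choices, not the patent's:

```python
def shard(dataset, shard_size, warehouse, priority, deadline_s):
    """Split a data set into fragments and attach the claim's routing tag:
    target warehouse address, data priority, delivery-timeliness requirement."""
    return [{"rows": dataset[i:i + shard_size],
             "tag": {"warehouse": warehouse,
                     "priority": priority,
                     "deadline_s": deadline_s}}
            for i in range(0, len(dataset), shard_size)]

def pick_path(paths, max_loss=0.05):
    """Choose the next hop from the probed path states (delay, packet-loss
    rate, available bandwidth). Paths above the loss ceiling are treated as
    failed, so the fragment is rerouted to a standby path; returns None when
    no path is usable and the fragment should be parked in local persistent
    storage to await recovery, as the claim describes."""
    usable = [p for p in paths if p["loss"] <= max_loss]
    if not usable:
        return None
    return min(usable, key=lambda p: (p["latency_ms"], -p["bandwidth_mbps"]))["name"]
```

A lossy but fast path is skipped in favour of a clean one, and when every path is down the caller gets `None` as the signal to buffer locally.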

Description

Normalized data acquisition system based on big data

Technical Field

The invention relates to the technical field of big data acquisition and processing, in particular to a normalized data acquisition system based on big data.

Background

In the current big data environment, data originates from numerous heterogeneous systems, devices and applications, which differ significantly in protocol, structure and real-time behavior. Conventional data acquisition schemes typically rely on statically pre-configured data-source connection parameters and acquisition rules; the acquisition logic is either fixed or periodically triggered, and lacks the ability to perceive, at run time, the intrinsic characteristics and state changes of the data source. This static acquisition mode is rigid and laggy in the face of data-source traffic bursts, protocol behavior changes or data-structure evolution. These drawbacks directly disconnect the acquisition process from the actual state of the data source: the system cannot accurately identify and adapt to the real-time characteristics of different data sources, so acquisition may miss key data or generate large amounts of invalid data, harming the completeness and timeliness of data acquisition. Distributed acquisition points with different configurations are also difficult to coordinate globally, so resource allocation cannot be dynamically optimized according to real-time load, easily causing uneven load or resource waste across acquisition nodes. Meanwhile, for lack of a deep understanding of data-source characteristics, the subsequent integration and normalization of heterogeneous data often face format confusion and semantic ambiguity, increasing the complexity and delay of the data-processing chain.
A technology is therefore needed that can sense the dynamic characteristics of a data source in real time and intelligently schedule acquisition behavior accordingly, so as to overcome the limitations of static configuration, achieve more efficient, accurate and adaptive data acquisition and front-end normalization, and provide high-quality, consistent data input for downstream data storage and analysis.

Disclosure of Invention

The invention aims to remedy the defects of the prior art and provides a normalized data acquisition system based on big data. To this end, the invention adopts the following technical scheme. The big data-based normalized data acquisition system comprises: a probe sensing module for deploying a plurality of data probes in a distributed network environment, the data probes monitoring and sensing data sources of different protocols and structures in real time to form a raw perceived data stream; a portrait generation module for parsing the raw perceived data stream and capturing its features, extracting multi-dimensional feature information characterizing the intrinsic characteristics of the data sources, and generating a dynamic portrait of each data source from that information; an acquisition scheduling module for establishing a dynamically adjusted data acquisition logic framework according to the dynamic portrait of the data source, the framework guiding the cooperative acquisition behavior of the data probes; a data conversion module under whose guidance by the data acquisition logic framework the plurality of data probes execute synchronous or asynchronous acquisition actions and send the collected raw data into a unified data conversion pipeline, the pipeline performing pattern recognition and structural remodeling on the raw data to convert heterogeneous raw data into a data set with a unified schema specification; and a distribution routing module for injecting the data set with the unified schema specification into a data distribution network with buffering and routing functions, the data distribution network finally delivering the data to a designated data warehouse. As a further aspect of the invention, parsing the raw perceived data stream and capturing its features, extracting multi-dimensional feature information characterizing the intrinsic characteristics of a data source, and generating a dynamic portrait of the data source from that information specifically includes: separating the header information and payload information of each data packet from the raw perceived data stream, performing protocol decoding on the header information, and performing shallow syntactic analysis on the payload information to obtain basic descriptors of the data stream; performing time-serialization processing