
CN-121350330-B - High-precision time-series data extraction method and system based on modular design


Abstract

A high-timeliness time-series data extraction method and system for multi-source heterogeneous networks. The system uses an asynchronous message queue to achieve loosely coupled, distributed cooperation among five functional modules: web crawling, structured parsing, data cleaning and mapping, task scheduling, and data warehousing. A multi-mode crawling engine integrating static fetching, dynamic rendering, and API calls solves the adaptability problem posed by heterogeneous data sources. The structured parsing module combines a configurable rule base with a BERT-based natural language processing model in a hybrid parsing scheme, markedly improving the intelligence and accuracy of entity extraction from unstructured text. The data cleaning module performs dynamic outlier detection on time-series data using the statistical Z-Score algorithm, ensuring the reliability of the data written to storage. The task scheduling module supports incremental updates and a failure-retry mechanism, greatly improving system throughput and operational stability. Through this modular architecture, AI-driven parsing, and statistical data quality control, the invention achieves automated, highly reliable, and highly timely acquisition, cleaning, and structured storage of time-series data.

Inventors

  • GUO JIANMIN
  • LI YAZHOU
  • DAI PENGJIE
  • PAN KE
  • DU HAIJUN
  • LIU JIAYUAN

Assignees

  • 金智东博(北京)教育科技股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-10-20

Claims (7)

  1. A high-timeliness time-series data extraction system based on a modular design, the system comprising:
     a web crawling module for automatically acquiring raw data from multi-source heterogeneous web pages on the internet according to task instructions issued by the task scheduling module, wherein the crawling module dynamically selects a static HTML crawling mode, a dynamic page rendering mode, or an API calling mode according to the type of the data source;
     a structured parsing module for parsing the raw data acquired by the web crawling module, wherein the structured parsing module combines a configurable rule base with a BERT-based natural language processing model to perform hybrid parsing of key entity fields in unstructured bulletin text and generate structured data objects, significantly improving the intelligence and accuracy of data parsing;
     a data cleaning and mapping module that computes a hash over the key fields of each record for deduplication, unifies date formats into the ISO 8601 standard time format to ensure data consistency, and detects outliers with the statistical Z-Score algorithm, wherein Z-Score detection comprises: calculating the mean μ and standard deviation σ of the historical time-series sequence, substituting the current value X to be tested into the formula Z = (X − μ) / σ, and, when |Z| exceeds a preset threshold of 3, marking the record as abnormal and attaching an anomaly score, thereby actively identifying and warning of abnormal data;
     a task scheduling module for automatically generating timed tasks from Cron expressions and achieving incremental updates by reading the timestamp of the last successful crawl from the database, the module being provided with an automatic failure-retry mechanism whereby a failed task is placed into a delay queue and re-executed after a preset interval, ensuring the continuity and integrity of data capture;
     a data warehousing module for writing the cleaned time-series data into a time-series database and a relational database and, after warehousing, automatically generating a data quality report covering the total number of stored records, the amount of abnormal data, and the field-missing rate;
     wherein the modules exchange data through an asynchronous message queue, forming a high-cohesion, low-coupling, top-to-bottom data processing pipeline (illustrative sketches of the Z-Score check, the cleaning steps, the retry queue, and the quality report follow the claims).
  2. The system of claim 1, wherein the web crawling module comprises a static HTML crawling sub-module, a dynamic page rendering sub-module, and an API calling sub-module, and automatically selects the optimal crawling mode to accommodate multi-source heterogeneous data sources.
  3. The system of claim 1, wherein the structured parsing module performs hybrid parsing based on XPath, regular expressions, and a BERT named-entity recognition model to automatically extract data-source identifiers, key personnel information, information release times, and core time-series value fields from unstructured text (a hybrid-parsing sketch appears in the Description below).
  4. A time-series data extraction method based on the modular design, characterized by comprising the following steps:
     S1, task scheduling and task generation: the task scheduling module triggers data capture tasks at fixed times based on a timing rule, reads the timestamp of the last successful update from a metadata database before task generation, and generates task instructions only for incremental data after that timestamp;
     S2, data capture: the web crawling module executes the strategy corresponding to the crawling mode in the task instruction, wherein static pages are parsed directly from the HTML structure using requests and lxml, and dynamic pages are fetched by simulating user behavior in a headless browser and waiting for JavaScript rendering to complete;
     S3, structured parsing: the structured parsing module combines the rule base with the BERT named-entity recognition model to perform field recognition and text extraction, generates a structured JSON object, and pushes it to the downstream module;
     S4, data cleaning and anomaly detection: the data cleaning and mapping module receives the structured JSON object, performs deduplication and format unification, and carries out statistical outlier detection through the following sub-steps (sketched in code after the claims):
     S41, deduplicating records based on a hash of the key fields;
     S42, unifying numerical values and time formats into standardized units and the ISO 8601 format;
     S43, detecting outliers with the Z-Score algorithm, which comprises calculating the mean μ and standard deviation σ of the historical time-series sequence, substituting the current value X to be tested into the formula Z = (X − μ) / σ, and marking the record as abnormal and attaching an anomaly score when |Z| exceeds a preset threshold of 3, thereby actively identifying abnormal data and preventing erroneous values from polluting the database;
     S5, data warehousing and quality audit: the cleaned time-series data are written into the time-series database and the relational database; after warehousing, the data warehousing module generates a data quality report counting the number of stored records, the abnormal-data rate, and the field-missing rate;
     S6, manual audit and final data confirmation: records marked as abnormal are automatically pushed to a manual audit interface, where an administrator verifies and corrects them; all manual operations are logged to ensure data traceability.
  5. The method of claim 4, wherein after a task execution failure the task scheduling module places the task into a delay queue and automatically retries up to three times after a preset time interval (a delay-queue sketch follows the claims).
  6. The method of claim 4, wherein the data warehousing module pushes the data quality report to a monitoring system or an administrator terminal after generating it, and records a manual-audit operation log to achieve full-process data traceability.
  7. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the time-series data extraction method according to any of claims 4 to 6.
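
The Z-Score check specified in claims 1 and 4 (S43) can be written in a few lines. A minimal Python sketch, assuming the history is a plain list of floats and using the claimed threshold of 3; the function name and the zero-spread guard are ours, not the patent's:

    import statistics

    def zscore_flag(history: list[float], x: float, threshold: float = 3.0):
        """Flag x as abnormal when |Z| = |x - mu| / sigma exceeds threshold."""
        mu = statistics.fmean(history)
        sigma = statistics.pstdev(history)  # population std dev of the history
        if sigma == 0:                      # constant history: no spread to judge by
            return False, 0.0
        z = (x - mu) / sigma
        return abs(z) > threshold, z

    # A value far outside the historical band is marked abnormal.
    flagged, z = zscore_flag([10.0, 10.2, 9.8, 10.1, 9.9], 14.0)
    print(flagged, round(z, 2))  # True 28.28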
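
Sub-steps S41 and S42 of claim 4 call for key-field hash deduplication and ISO 8601 date normalization. A sketch under the assumption that records are dicts; the key fields and the input date formats tried here are illustrative:

    import hashlib
    from datetime import datetime

    SEEN: set[str] = set()
    KEY_FIELDS = ("source", "timestamp", "metric")  # assumed key fields

    def record_hash(rec: dict) -> str:
        """Hash of the concatenated key fields (S41)."""
        key = "|".join(str(rec.get(f, "")) for f in KEY_FIELDS)
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def is_duplicate(rec: dict) -> bool:
        h = record_hash(rec)
        if h in SEEN:
            return True
        SEEN.add(h)
        return False

    def to_iso8601(raw: str) -> str:
        """Normalize a few assumed input formats to ISO 8601 (S42)."""
        for fmt in ("%Y/%m/%d %H:%M", "%d.%m.%Y %H:%M", "%Y-%m-%dT%H:%M:%S"):
            try:
                return datetime.strptime(raw, fmt).isoformat()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date format: {raw!r}")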
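
Claim 5's failure handling (re-queue with a delay, at most three retries) can be sketched with an in-process delay queue; the retry interval, the task shape, and the heap-based queue are assumptions standing in for whatever queueing infrastructure an implementation would use:

    import heapq
    import time

    RETRY_LIMIT = 3      # claim 5: retry three times
    RETRY_DELAY = 60.0   # assumed interval in seconds

    delay_queue: list[tuple[float, int, str]] = []  # (ready_at, attempts, task)

    def run_task(task: str) -> bool:
        """Stand-in for the real crawl job; returns success/failure."""
        return False  # always fail here, to exercise the retry path

    def execute(task: str, attempts: int = 0) -> None:
        if run_task(task):
            return
        if attempts < RETRY_LIMIT:
            heapq.heappush(delay_queue, (time.time() + RETRY_DELAY, attempts + 1, task))
        else:
            print("giving up on", task)

    def drain_due_tasks() -> None:
        """Called periodically by the scheduler loop."""
        now = time.time()
        while delay_queue and delay_queue[0][0] <= now:
            _, attempts, task = heapq.heappop(delay_queue)
            execute(task, attempts)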
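
Claim 1's data warehousing module emits a quality report with the total record count, the amount of abnormal data, and the field-missing rate. A minimal computation over a batch of already-cleaned records; the schema and the "abnormal" flag field are assumed for illustration:

    REQUIRED_FIELDS = ("source", "timestamp", "value")  # assumed schema

    def quality_report(records: list[dict]) -> dict:
        total = len(records)
        abnormal = sum(1 for r in records if r.get("abnormal"))
        cells = total * len(REQUIRED_FIELDS)
        missing = sum(1 for r in records for f in REQUIRED_FIELDS
                      if r.get(f) is None)
        return {
            "total_records": total,
            "abnormal_records": abnormal,
            "field_missing_rate": missing / cells if cells else 0.0,
        }

    print(quality_report([
        {"source": "a", "timestamp": "2025-10-20T00:00:00", "value": 1.0},
        {"source": "b", "timestamp": None, "value": 2.0, "abnormal": True},
    ]))  # {'total_records': 2, 'abnormal_records': 1, 'field_missing_rate': 0.1666...}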

Description

High-precision time-series data extraction method and system based on modular design

Technical Field

The invention relates to the technical field of computer data processing, high-throughput data acquisition, and big-data quality control, and more particularly to a method and system for automatically extracting, cleaning, verifying, and structurally storing time-series data from a multi-source heterogeneous network.

Background

With the rapid development of information technology and distributed computing, demand across industries for highly timely time-series data is growing explosively. Such data (e.g., sensor readings in industrial control, network device performance metrics, or medical health-monitoring data, collectively referred to as time-series data) are typically distributed across the network in diverse formats, and the prior art faces a series of technical bottlenecks in processing these complex data sources.

First, target data sources include traditional static HTML, dynamic pages rendered client-side by JavaScript (AJAX-loaded content), and content returned by structured API interfaces. Traditional crawlers and parsing scripts adapt to only a single format, so they handle multi-source heterogeneous data poorly, carry high maintenance costs, and break easily when a website's structure changes.

In addition, directly captured raw data often suffer from redundancy, inconsistent formats and units, and erroneous or outlying values. Existing methods generally lack an efficient, systematic data quality verification process and adapt poorly to time-series data with large fluctuations, so traditional static-threshold detection lets erroneous data pollute the database and degrades the accuracy of downstream applications.

Again, the lack of efficient task scheduling and incremental-update mechanisms forces systems to rely on full re-crawls, wasting resources and delaying data updates. Meanwhile, simplistic failure handling cannot guarantee the integrity and continuity of data under network fluctuations and similar conditions.

Finally, traditional extraction pipelines generally lack systematic monitoring and auditing of data quality: when a data problem is found, the source of the error is hard to locate quickly, and manual interventions in the processing flow are hard to trace, reducing the reliability of the data as a whole.

In view of the foregoing, the data-service field urgently needs a data extraction method and system that is finely designed, technically advanced, stable, intelligent, and highly scalable: one that automatically extracts key information from complex, diverse web documents with high precision and subjects the data to strict cleaning, verification, and standardization, providing a solid and reliable data foundation for downstream applications such as high-frequency computing, quantitative analysis, and risk modeling.
Disclosure of Invention

The invention aims to systematically solve a series of key technical problems in existing data acquisition: poor parsing adaptability caused by differing data-source forms, low raw-data quality, inconsistent formats, insufficient timeliness caused by rigid update strategies, and missing data-quality monitoring and tracing mechanisms. Through a specific modular architecture, intelligent data-parsing algorithms, and dynamic statistical verification, the invention achieves the targeted technical effects and system performance improvements described below.

Specifically, the invention provides a high-timeliness time-series data extraction system based on a modular design, comprising: a web crawling module for automatically acquiring raw data from multi-source heterogeneous web pages on the internet according to task instructions issued by the task scheduling module, wherein the crawling module dynamically selects a static HTML crawling mode, a dynamic page rendering mode, or an API calling mode according to the type of the data source; a structured parsing module for parsing the raw data acquired by the web crawling module, wherein the structured parsing module combines a configurable rule base with a BERT-based natural language processing (NER) model to perform hybrid parsing of key entity fields in unstructured bulletin text and generate structured data objects, significantly improving the intelligence and accuracy of data parsing; and a data cleaning and mapping module for performing deduplication, format unification, and unit normalization on the structured data objects.
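
As claim 1 states, the modules above exchange data through an asynchronous message queue. The following minimal sketch wires stand-in stages together with Python's asyncio.Queue; the stage names, payloads, and the choice of asyncio rather than a dedicated message broker are illustrative assumptions, not part of the patent:

    import asyncio

    async def crawler(out_q: asyncio.Queue) -> None:
        # Web crawling module stand-in: emits raw pages, then a sentinel.
        for page in ("<html>page-1</html>", "<html>page-2</html>"):
            await out_q.put(page)
        await out_q.put(None)

    async def parser(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
        # Structured parsing module stand-in: wraps raw pages as records.
        while (page := await in_q.get()) is not None:
            await out_q.put({"source": "demo", "raw": page})
        await out_q.put(None)

    async def warehouser(in_q: asyncio.Queue) -> None:
        # Cleaning and warehousing stages, collapsed into one consumer here.
        while (record := await in_q.get()) is not None:
            print("stored:", record)

    async def main() -> None:
        q1: asyncio.Queue = asyncio.Queue()
        q2: asyncio.Queue = asyncio.Queue()
        await asyncio.gather(crawler(q1), parser(q1, q2), warehouser(q2))

    asyncio.run(main())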
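
The hybrid parsing described for the structured parsing module (rule base plus BERT named-entity recognition, per claim 3) might look like the sketch below, assuming lxml for XPath and the Hugging Face transformers library for NER; the rule entry and the model name are placeholders, not choices the patent makes:

    import re
    from lxml import html
    from transformers import pipeline

    # Rule base tried first; the single entry here is illustrative.
    RULES = {
        "release_time": r"released on (\d{4}-\d{2}-\d{2})",
    }
    # BERT NER fallback; the model name is a placeholder.
    ner = pipeline("ner", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")

    def parse_bulletin(page_html: str) -> dict:
        text = html.fromstring(page_html).xpath("string(//body)")
        result = {}
        for field, pattern in RULES.items():
            m = re.search(pattern, text, re.IGNORECASE)
            if m:
                result[field] = m.group(1)
        # Entities (e.g., key personnel names) recognized by the NER model;
        # the text is truncated to stay within the model's input limit.
        result["entities"] = [
            {"label": e["entity_group"], "text": e["word"]}
            for e in ner(text[:512])
        ]
        return result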