CN-121996645-A - Real-time cleaning and fusion method for financial heterogeneous data
Abstract
The invention discloses a real-time cleaning and fusion method for financial heterogeneous data. The method comprises: S1, acquiring multi-source, multi-time-dimension and multi-granularity financial heterogeneous raw data, establishing a data access channel and performing format standardization; and S2, constructing a cross-source, cross-time and cross-granularity financial index consistency constraint space, and dynamically learning, based on the constraint space, the systematic deviation distribution of each data source under different market states to generate an implicit consistency deviation field. By learning these source-specific deviations within the constraint space, the method addresses the inconsistency among multi-source heterogeneous data and improves the accuracy and reliability of the data.
Inventors
- SHEN HEPING
- HUANG CHANGFA
- WU SHUBIAO
- YANG RUI
- ZHANG XIAOLIANG
Assignees
- 深圳市紫金支点技术股份有限公司
Dates
- Publication Date: 20260508
- Application Date: 20260126
Claims (10)
- 1. A real-time cleaning and fusion method for financial heterogeneous data, characterized by comprising the following steps: S1, acquiring multi-source, multi-time-dimension and multi-granularity financial heterogeneous raw data, establishing a data access channel and performing format standardization; S2, constructing a cross-source, cross-time and cross-granularity financial index consistency constraint space, and dynamically learning, based on the constraint space, the systematic deviation distribution of each data source under different market states to generate an implicit consistency deviation field; S3, during real-time data inflow, calculating in real time the offset vector of each individual financial data record within the implicit consistency deviation field; S4, performing continuous adaptive correction on each individual financial data record according to its offset vector to complete real-time data cleaning; and S5, performing feature alignment and fusion on the cleaned financial heterogeneous data, and outputting standardized fused data.
- 2. The real-time cleaning and fusion method for financial heterogeneous data according to claim 1, wherein step S1 comprises: synchronously collecting financial heterogeneous raw data from a securities trading system, a futures quotation system, a bank financing platform, a financial information terminal and a third-party data service provider through a distributed, high-concurrency data access interface, wherein the data cover stocks, bonds, funds and derivatives; the data types comprise structured transaction data, order queue data, account holdings data, semi-structured financial report summary data, industry research report keyword data, unstructured news text data and policy interpretation audio transcription text data; and the data granularity covers multiple time scales, namely high-frequency tick level, millisecond-level sampling, minute level, hour level, daily level, weekly level and monthly level.
- 3. The real-time cleaning and fusion method for financial heterogeneous data according to claim 2, wherein in step S1 the format standardization specifically comprises: uniformly converting the non-standard timestamps of different data sources into the ISO 8601 standard time format by regular expression matching and protocol parsing; normalizing the precision of all numerical data by uniformly converting it into a floating-point representation with six decimal places, while handling abnormally formatted characters in the data; and extracting core financial information from unstructured text data using the natural language processing techniques of entity recognition, keyword extraction and semantic structuring, converting the core financial information into structured data in key-value pair form, and finally forming a standardized raw data set.
- 4. The real-time cleaning and fusion method for financial heterogeneous data according to claim 1, wherein step S2 comprises: before constructing the financial index consistency constraint space, screening out, based on the core business scenario requirements of the financial market, a set of core financial indexes that have clear business meaning and are comparable across data sources; the core financial indexes comprise the daily return, price change rate, annualized volatility, price-to-earnings ratio and price-to-book ratio of stocks, the yield to maturity, duration and convexity of bonds, and the unit net asset value, Sharpe ratio and maximum drawdown of funds.
- 5. The real-time cleaning and fusion method for financial heterogeneous data according to claim 4, wherein in step S2: the financial index consistency constraint space has a three-dimensional integrated structure comprising a source constraint dimension, a time constraint dimension and a granularity constraint dimension; the three dimensions establish constraint rules at the data source level, the time dimension level and the granularity level respectively; and each dimension determines its constraint thresholds through statistical analysis of historical data, so that the constraint rules are reasonable and adaptable.
- 6. The real-time cleaning and fusion method for financial heterogeneous data according to claim 5, wherein in step S2, generating the implicit consistency deviation field comprises: based on a historical standardized data set, capturing the long-term and short-term temporal dependencies of the data with a temporal attention mechanism; modeling the deviation distribution of each data source with a Gaussian mixture model, so as to dynamically learn the systematic deviation distribution of the different data sources under different market states; and mapping the discrete systematic deviation distributions into a continuous, smooth implicit consistency deviation field by kernel density estimation.
- 7. The real-time cleaning and fusion method for financial heterogeneous data according to claim 1, wherein in step S3, calculating the offset vector comprises: extracting the observed value of each core financial index from each individual standardized data record flowing in in real time; querying the implicit consistency deviation field for the deviation distribution parameters corresponding to the record; calculating the expected value of each core index in combination with the constraint rules of the consistency constraint space, and thereby obtaining the observed deviation of each core index; and normalizing the observed deviations to construct the offset vector, wherein the norm (magnitude) of the offset vector reflects the overall degree of offset of the individual record.
- 8. The real-time cleaning and fusion method for financial heterogeneous data according to claim 1, wherein in step S4: the adaptive correction is realized by constructing a dynamic adaptive correction function that adjusts the correction strength according to the norm of the offset vector; during correction, the correction coefficient is adjusted using a hyperbolic tangent function, which avoids discrete rejection of abnormal data and preserves the temporal continuity and integrity of the data; and the corrected data are substituted into the three-dimensional consistency constraint space for consistency verification across all dimensions.
- 9. The real-time cleaning and fusion method for financial heterogeneous data according to claim 8, wherein the consistency verification comprises source constraint verification, time constraint verification and granularity constraint verification; and if the corrected data fail to satisfy some of the constraint conditions, the smoothing coefficient of the correction function is dynamically adjusted according to the degree of constraint deviation, and the correction is iterated until the data satisfy all constraint conditions or a preset maximum number of correction iterations is reached.
- 10. The real-time cleaning and fusion method for financial heterogeneous data according to claim 1, wherein in step S5: the feature alignment comprises establishing globally unified feature dimensions according to the core financial index types, each feature dimension explicitly defining its data type, calculation standard, precision requirement and business meaning; the fusion adopts a weighted fusion strategy based on data source reliability, wherein the reliability weight of each data source is calculated by combining the variance of its systematic deviation distribution over the past 180 days with its stability coefficient over the past 90 days, and the fused index value is obtained by weighted averaging; and format standardization is performed on the fused data to generate standardized fused data containing complete traceability information, supporting real-time query and batch export in multiple formats.
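As an illustration of steps S3 to S5 as described in claims 7 to 10, the following Python sketch shows one way the offset vector, the hyperbolic-tangent correction, the iterative re-correction and the reliability-weighted fusion could fit together. It is not the claimed implementation. The function names, the per-index `scale` used for normalization, the form `strength = tanh(norm / smooth)`, the halving of the smoothing coefficient, the single deviation threshold standing in for the three-dimensional verification, and the weight `stability / variance` are all assumptions made for this example, because the claims do not fix these formulas.

```python
import math


def offset_vector(observed, expected, scale):
    # S3 (assumed form): per-index deviation from the expected value,
    # normalized by a per-index scale such as a standard deviation.
    return {k: (observed[k] - expected[k]) / scale[k] for k in observed}


def magnitude(vec):
    # Euclidean norm: the overall degree of offset of one record.
    return math.sqrt(sum(v * v for v in vec.values()))


def correct(observed, expected, scale, smooth=1.0):
    # S4 (assumed form): strength lies in [0, 1). It is near 0 for small
    # offsets and approaches 1 for large ones, so records are pulled
    # toward the expected value continuously instead of being rejected.
    strength = math.tanh(magnitude(offset_vector(observed, expected, scale)) / smooth)
    return {k: observed[k] - strength * (observed[k] - expected[k]) for k in observed}


def correct_until_consistent(observed, expected, scale,
                             threshold=1.0, smooth=1.0, max_iter=5):
    # Claim 9 style loop (assumed rule): if any normalized deviation still
    # exceeds the threshold, halve the smoothing coefficient (a stronger
    # pull) and correct again, up to max_iter rounds.
    current = dict(observed)
    for _ in range(max_iter):
        current = correct(current, expected, scale, smooth)
        residual = offset_vector(current, expected, scale)
        if all(abs(v) <= threshold for v in residual.values()):
            break
        smooth *= 0.5
    return current


def fuse(values, variances, stabilities, eps=1e-9):
    # S5 (assumed weight): a source counts more when its deviation variance
    # is low and its stability coefficient is high.
    weights = [s / (v + eps) for v, s in zip(variances, stabilities)]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    observed = {"daily_return": 0.05}
    expected = {"daily_return": 0.01}
    scale = {"daily_return": 0.02}
    print(correct_until_consistent(observed, expected, scale))
    print(fuse([10.0, 20.0], [1.0, 3.0], [1.0, 1.0]))
```

In this sketch a record that already matches its expected values is left unchanged, because `tanh(0)` is 0, while a record far from expectation is moved almost all the way back. That is the continuous behavior claim 8 contrasts with discrete outlier rejection.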
Description
Real-time cleaning and fusion method for financial heterogeneous data
Technical Field
The invention relates to the field of financial data processing, and in particular to a real-time cleaning and fusion method for financial heterogeneous data.
Background
In the financial field, with the rapid development of information technology and the increasing complexity of financial markets, financial data have become multi-source, multi-time-dimension and multi-granularity. The data sources are broad, including securities trading systems, futures quotation systems, bank financing platforms, financial information terminals and third-party data service providers. The time dimensions vary, spanning high-frequency tick level, millisecond-level sampling, minute level, hour level, daily level, weekly level and monthly level, so the data granularity differs. The data types are rich, including structured transaction data, order queue data, account holdings data, semi-structured financial report summary data, industry research report keyword data, unstructured news text data and policy interpretation audio transcription text data.
However, because different data sources differ in collection method, storage format and calculation standard, financial heterogeneous data are inconsistent in format, precision and semantics. Using such data directly for analysis and decision-making may introduce bias and affect the accuracy and reliability of the results. In addition, the dynamic nature of financial markets imposes timeliness requirements on the data: acquiring and processing data in real time is important for tracking market dynamics promptly and making accurate decisions.
Existing financial data processing methods have difficulty meeting the real-time cleaning and fusion requirements of multi-source, multi-time-dimension and multi-granularity heterogeneous data at the same time. Some methods focus on a single data source or a specific type of data and cannot handle complex and diverse financial heterogeneous data. Other methods lack effective constraints on data consistency during processing, so that deviations remain in the cleaned data. Still others are insufficiently real-time and cannot keep up with rapid changes in the financial market. A method for cleaning and fusing financial heterogeneous data efficiently, accurately and in real time is therefore needed.
Disclosure of Invention
The invention aims to provide a real-time cleaning and fusion method for financial heterogeneous data, which solves the problem that existing financial data processing methods have difficulty meeting the real-time cleaning and fusion requirements of multi-source, multi-time-dimension and multi-granularity heterogeneous data at the same time.
The invention achieves this aim through the following technical scheme. The real-time cleaning and fusion method for financial heterogeneous data comprises the following steps:
S1, acquiring multi-source, multi-time-dimension and multi-granularity financial heterogeneous raw data, establishing a data access channel and performing format standardization;
S2, constructing a cross-source, cross-time and cross-granularity financial index consistency constraint space, and dynamically learning, based on the constraint space, the systematic deviation distribution of each data source under different market states to generate an implicit consistency deviation field;
S3, during real-time data inflow, calculating in real time the offset vector of each individual financial data record within the implicit consistency deviation field;
S4, performing continuous adaptive correction on each individual financial data record according to its offset vector to complete real-time data cleaning; and
S5, performing feature alignment and fusion on the cleaned financial heterogeneous data, and outputting standardized fused data.
Further, step S1 comprises: synchronously collecting financial heterogeneous raw data from a securities trading system, a futures quotation system, a bank financing platform, a financial information terminal and a third-party data service provider through a distributed, high-concurrency data access interface, wherein the data cover stocks, bonds, funds and derivatives; the data types comprise structured transaction data, order queue data, account holdings data, semi-structured financial report summary data, industry research report keyword data, unstructured news text data and policy interpretation audio transcription text data; and the data granularity covers multiple time scales, namely high-frequency tick level, millisecond-level sampling, minute level, hour level, daily level, weekly level and monthly level.
Further, in step S1, the format standardization specifically comprises: uniformly converting nonstandard timestamps of different data sources