CN-121981239-A - Cross-organization health data space-time correlation analysis and visualization system and method
Abstract
The invention discloses a cross-institution health data spatiotemporal correlation analysis and visualization system and method comprising a data acquisition and privacy module, a distributed spatiotemporal cube construction module, a medical knowledge graph module, a spatiotemporal constraint reasoning engine module, a dynamic rule engine module, and a visualization analysis and decision support front-end module. A differential privacy mechanism embedded at each data source constructs a distributed spatiotemporal cube with differential privacy protection, and, in cooperation with an event-driven streaming incremental method, upgrades cross-institution health data from batch offline summarization to minute-level real-time aggregation, breaking down medical data islands under a strict mathematical guarantee of differential privacy. Further, spatiotemporal constraint reasoning that fuses the medical knowledge graph, together with unsupervised dynamic rule mining based on time-lag mutual information, automatically discovers and verifies lead-lag relations, complex causal relationship chains, and dynamic rules among multidimensional indicators.
Inventors
- QIU NING
- WANG WEN
- ZHENG XINLONG
- XU ZHICHENG
- LIAO RUIYI
- WANG JINHUA
- HUANG ZHENTAO
- ZENG YUCONG
- WANG HUI
- ZHOU YUAN
- WEI JIMING
Assignees
- 广东省城乡规划设计研究院科技集团股份有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-04
Claims (10)
- 1. A cross-institution health data spatiotemporal correlation analysis and visualization system, characterized by comprising a data acquisition and privacy module, a distributed spatiotemporal cube construction module, a medical knowledge graph module, a spatiotemporal constraint reasoning engine module, a dynamic rule engine module and a visualization analysis and decision support front-end module, wherein: the data acquisition and privacy module is deployed on a local server of each cooperating institution and is used for monitoring in real time, or periodically pulling, newly added health event data, adding noise to the statistics in the newly added health event data that need to be aggregated, and transmitting the resulting noisy data stream to the distributed spatiotemporal cube construction module; the distributed spatiotemporal cube construction module is used for maintaining the spatiotemporal cube with the noisy data stream from the data acquisition and privacy module; the medical knowledge graph module is used for storing a large amount of prior medical-domain knowledge in structured form as a medical knowledge graph; the spatiotemporal constraint reasoning engine module is used for performing anomaly detection on the spatiotemporal cube, extracting abnormal events, fusing the abnormal events with the knowledge graph query results provided by the medical knowledge graph module, and outputting causal relationship chains; the dynamic rule engine module is used for monitoring the time-series data of the spatiotemporal cube to discover lead-lag relations between different time series, and thereby to discover dynamic rules; and the visualization analysis and decision support front-end module is used for interaction between the system and a user, associating the causal relationship chains output by the spatiotemporal constraint reasoning engine module with the dynamic rules discovered by the dynamic rule engine module, presenting them to the user in visual form, and providing analysis functions.
- 2. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 1, wherein the method of adding noise to the statistics to be aggregated is the Laplace mechanism, and applying the Laplace mechanism to the statistics to be aggregated before transmitting them to the distributed spatiotemporal cube construction module comprises the steps of: capturing each newly occurring health event record in real time with the data acquisition and privacy module, and recording the event as a counting reference value C = 1; generating random noise N obeying the Laplace distribution Lap(0, 1/ε) from the privacy budget parameter ε, adding the random noise N to the counting reference value C, and outputting the noisy datum C', where the expression is: C' = C + N; transmitting the noisy datum C' to the distributed spatiotemporal cube construction module; and incrementally aggregating, by the distributed spatiotemporal cube construction module, the noisy datum C' into the cell of the corresponding cube.
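The Laplace-mechanism step of claim 2 can be sketched as follows (an illustrative Python sketch using inverse-transform sampling; the function name and parameter values are not part of the claim):

```python
import math
import random


def noisy_count(c: float, epsilon: float) -> float:
    """Add Laplace noise Lap(0, 1/epsilon) to a count with sensitivity 1.

    Inverse-transform sampling: for u ~ U(-1/2, 1/2),
    N = -(1/epsilon) * sgn(u) * ln(1 - 2|u|) follows Lap(0, 1/epsilon).
    """
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return c + noise  # C' = C + N
```

Each new health event contributes C = 1; only the noisy value C' leaves the institution, so downstream aggregation ever sees only differentially private counts.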
- 3. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 2, wherein the privacy budget parameter is set according to a preset privacy-protection strength, under the following rules: for personal information, the privacy budget parameter ε is limited to at most 1.0, so that the processed information can neither identify a specific natural person nor be restored; for analysis tasks, the privacy budget parameter is adjusted to the precision requirement of the task: for macroscopic trend monitoring tasks ε takes values in 0.1–0.5, for fine-grained causal reasoning tasks ε takes values in 0.8–1.5, and for emergency response tasks ε is greater than 1.0; for non-analysis tasks, high-sensitivity, medium-sensitivity and low-sensitivity data are mapped to different privacy budget intervals according to sensitivity levels preset by the system, with ε in 0.1–0.5 for high-sensitivity data, 0.5–1.0 for medium-sensitivity data, and 1.0–2.0 for low-sensitivity data; the privacy budget allocation policy is as follows: the total budget is allocated to each institution based on its data scale, sensitivity level and historical contribution, and within the total privacy budget, sub-quotas are further divided by event type and by time period, the sum of the quotas along each of these two dimensions being at most the total privacy budget; during system operation, every query or stream update consumes a certain amount of privacy budget, and the consumption is simultaneously added to the following key indicators: total budget consumption, the budget consumption of the corresponding event type, and the budget consumption of the corresponding time period; budget control is implemented by monitoring these consumption indicators in real time, and a budget exhaustion protection mechanism is triggered when any of the following occurs: the total budget consumption reaches the allocated total budget cap, the budget consumption of any event type reaches that event type's sub-quota cap, or the budget consumption of any time period reaches that time period's sub-quota cap; the exhaustion protection mechanism comprises: suspending uploads, i.e., the institution stops uploading new data until the next budget period; a degradation mode, i.e., switching to a lower-precision processing mode with a smaller privacy budget and uploading noisy data only for the most critical indicators; and budget reset, i.e., automatically resetting the privacy budgets of all institutions on a preset period.
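The budget-monitoring logic of claim 3 can be sketched as follows (a minimal Python sketch; the class name, cap values and indicator keys are illustrative, not claimed):

```python
class BudgetLedger:
    """Track privacy-budget consumption along the three monitored indicators:
    total consumption, per-event-type consumption, per-time-period consumption."""

    def __init__(self, total_cap, type_caps, period_caps):
        self.total_cap = total_cap
        self.type_caps = dict(type_caps)
        self.period_caps = dict(period_caps)
        self.total = 0.0
        self.by_type = {k: 0.0 for k in type_caps}
        self.by_period = {k: 0.0 for k in period_caps}

    def spend(self, event_type, period, eps):
        """Record one query/stream update; return True when any cap is hit and
        exhaustion protection (suspend, degrade, or reset) must be triggered."""
        self.total += eps
        self.by_type[event_type] += eps
        self.by_period[period] += eps
        return (self.total >= self.total_cap
                or self.by_type[event_type] >= self.type_caps[event_type]
                or self.by_period[period] >= self.period_caps[period])
```

The per-event-type and per-time-period caps passed in would each sum to at most the institution's total budget, matching the two-dimensional quota division in the claim.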
- 4. The system of claim 1, wherein the spatiotemporal cube is a multi-dimensional data structure comprising dimensions and measures, the dimensions comprising a time dimension, a space dimension and an event dimension, and the measures comprising an event count, a growth rate and a number of unique users; each spatiotemporal cube cell is uniquely identified by a triple comprising a time mark, a geographic location code and an event code, and the value stored in the cell is the corresponding measure value; maintaining the spatiotemporal cube with the noisy data stream, combined with a streaming incremental computation method, comprises the steps of: partitioning the noisy data stream by the geographic location code of each data record and dispatching records with the same geographic location code to the same computing node; within each computing node, maintaining the cube shard of that node's geographic partition in memory or in a cache using a stateful operator, with the (time mark, event code) pair as the key and the count measure as the value; when a new data record enters the stateful operator of the corresponding partition, performing the following incremental update: a. computing from the record's timestamp the time marks at different granularities, including hour and day; b. combining each time mark with the record's event code, looking up the corresponding entry in the state table and, if absent, creating a new entry whose counter value is initialized to 0; c. performing an atomic addition on the entry, updating its current count current_count to current_count + noise_count, where noise_count is the noisy count value of the new data record; and windowing the incremental updates by a preset time window, outputting the cube cells updated during the window period to the spatiotemporal constraint reasoning engine module and the dynamic rule engine module when the window closes.
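The keyed-state incremental update of claim 4 (steps a–c) can be sketched as follows (an illustrative single-partition Python sketch; a real deployment would use a stream processor's keyed state rather than a plain dictionary):

```python
from collections import defaultdict
from datetime import datetime


class CubeShard:
    """Cube fragment for one geographic partition:
    key = (time mark, event code), value = accumulated noisy count."""

    def __init__(self):
        # create-on-miss with counter value 0 corresponds to step b
        self.state = defaultdict(float)

    def update(self, ts: datetime, event_code: str, noise_count: float) -> None:
        # step a: derive time marks at hour and day granularity
        marks = (ts.strftime("%Y-%m-%d %H"), ts.strftime("%Y-%m-%d"))
        # step c: current_count <- current_count + noise_count per granularity
        for mark in marks:
            self.state[(mark, event_code)] += noise_count
```

A window trigger would then emit the cells touched during the window period to the reasoning and rule-engine modules.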
- 5. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 1, wherein the medical knowledge graph module is constructed from existing standard medical knowledge sources, comprising one or more of the Unified Medical Language System (UMLS) and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT); the constructed medical knowledge graph comprises entities and relationships, the entities comprising diseases, symptoms, medicines and geographical areas, and the relationships comprising disease-symptom, medicine-disease, medicine-symptom and geographical area-geographical area relations; and the medical knowledge graph is stored using a graph database or the Resource Description Framework (RDF).
- 6. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 5, wherein outputting a high-confidence causal relationship chain with the spatiotemporal constraint reasoning engine module comprises the steps of: precisely mapping events in the distributed spatiotemporal cube to entities in the medical knowledge graph module to obtain an event-entity mapping; performing anomaly detection on the spatiotemporal cube and automatically identifying and triggering an abnormal event whenever the fluctuation of a cube cell's count exceeds a preset threshold; querying the medical knowledge graph for the abnormal event, via the event-entity mapping, to obtain the candidate relations corresponding to the abnormal event; performing forward reasoning and backward reasoning over the candidate relations to generate, on the knowledge graph, a candidate hypothesis path set comprising precursor events and successor events; screening the candidate hypothesis path set against the temporal constraints and spatial constraints defined in the medical knowledge graph together with the data consistency constraints defined by the spatiotemporal cube, to obtain the candidate path set; feeding the candidate path set into a preset weighted scoring model for confidence evaluation and computing the path confidence score Score(Path), where the expression is: Score(Path) = w1 * C(E1, E2) + w2 * T(E1, E2) + w3 * S(E1, E2), where C(E1, E2) is the data correlation strength of events E1 and E2, T(E1, E2) is the degree to which the observed delay between E1 and E2 matches the delay expected from medical knowledge, S(E1, E2) is the plausibility score of the spatial relationship between E1 and E2, and (w1, w2, w3) is the weight vector, determined by an analytic hierarchy process or a machine learning method; and comparing each path's confidence score with a preset threshold, outputting the paths whose confidence exceeds the threshold as the causal relationship chains.
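The weighted scoring and thresholding of claim 6 can be sketched as follows (a minimal Python sketch; the weight values, threshold and dictionary keys are illustrative placeholders for the AHP- or learning-derived values):

```python
def path_score(c: float, t: float, s: float, w=(0.5, 0.3, 0.2)) -> float:
    """Score(Path) = w1*C(E1,E2) + w2*T(E1,E2) + w3*S(E1,E2)."""
    return w[0] * c + w[1] * t + w[2] * s


def causal_chains(candidate_paths, threshold=0.6):
    """Keep only candidate paths whose confidence exceeds the preset threshold."""
    return [p for p in candidate_paths
            if path_score(p["corr"], p["delay_match"], p["spatial"]) > threshold]
```

With the illustrative weights, a path strong on all three criteria survives while a weakly supported one is filtered out before the causal relationship chain is emitted.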
- 7. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 6, wherein determining the weight vector with the analytic hierarchy process comprises the steps of: pairwise-comparing the data correlation C, the temporal matching degree T and the spatial plausibility S between two events to construct an importance judgment matrix, inviting K domain experts to form a review panel, each expert filling in the judgment matrix independently; computing the row geometric means of the judgment matrix by the eigenvector method and normalizing them to obtain an initial weight vector W_k = [w_1k, w_2k, w_3k], while solving for the largest eigenvalue of the matrix; computing a consistency ratio CR from the largest eigenvalue, accepting the judgment matrix if CR < 0.1 and returning it to the expert for refilling if CR >= 0.1; and aggregating all W_k that pass the consistency test by the geometric mean method to obtain the final weight vector W = [w1, w2, w3]; and wherein determining the weight vector with the machine learning method comprises the steps of: collecting N validated causal chain examples from historical analysis results as positive samples and M examples confirmed by experts to be spurious associations as negative samples, constructing a training sample set, and computing for each sample the feature vector X = [C(E1, E2), T(E1, E2), S(E1, E2)]; standardizing the feature vector as X' = (X - μ)/σ, where μ is the mean of each feature over the training set and σ is the standard deviation of the corresponding feature, and handling class imbalance with SMOTE oversampling or random undersampling; selecting logistic regression or a gradient-boosted decision tree as the base model, where the logistic regression model is P(Y|X) = 1 / (1 + exp(-(β0 + β1·C + β2·T + β3·S))), where Y is the binary label indicating whether the path is a real causal chain, X is the feature vector consisting of C, T and S, and exp is the natural exponential function; training the model on the standardized feature vectors and outputting the coefficients [β1, β2, β3]; performing K-fold cross-validation and tuning hyperparameters by grid search or Bayesian optimization, taking the model that maximizes the validation-set F1-score as the optimal model; and normalizing the optimal model's coefficients [β1, β2, β3] to obtain the final weight vector W = [w1, w2, w3].
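The AHP branch of claim 7 (row geometric means, largest eigenvalue, consistency ratio) can be sketched as follows for the 3×3 case (an illustrative Python sketch; the example judgment matrix in the test is hypothetical, and the row-geometric-mean rule approximates the principal eigenvector):

```python
import math

RANDOM_INDEX_3 = 0.58  # Saaty's random consistency index for a 3x3 matrix


def ahp_weights(judgment):
    """Return (weight vector, consistency ratio CR) for an n x n pairwise
    judgment matrix over the criteria (C, T, S)."""
    n = len(judgment)
    # row geometric means, normalized to the initial weight vector
    gm = [math.prod(row) ** (1.0 / n) for row in judgment]
    w = [g / sum(gm) for g in gm]
    # approximate the largest eigenvalue lambda_max from the ratios (A w)_i / w_i
    aw = [sum(judgment[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam_max = sum(aw[i] / w[i] for i in range(n)) / n
    cr = (lam_max - n) / (n - 1) / RANDOM_INDEX_3
    return w, cr
```

A matrix with CR < 0.1 is accepted; otherwise it is returned to the expert for refilling, matching the acceptance rule in the claim.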
- 8. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 1, wherein discovering the dynamic rules with the dynamic rule engine module comprises the steps of: extracting window data from the k key event time series in the distributed spatiotemporal cube with a sliding-window mechanism, setting a window length W and a step S on each series, advancing the sliding window step by step, and outputting the series segment in each window; within each sliding window, computing the time-lag mutual information between any two event time series segments TS_A(t) and TS_B(t), where the expression is: TLMI(Δt) = MI(TS_A(t), TS_B(t + Δt)) = Σ_{a,b} P(a, b) · log( P(a, b) / (p(a) · p(b)) ), where Δt is the time-lag parameter, MI denotes the mutual information function, TS_A(t) is the value of series A at time t, TS_B(t + Δt) is the value of series B at time t + Δt, P(a, b) denotes the joint probability distribution of TS_A(t) and TS_B(t + Δt), and p(a) and p(b) denote their respective marginal probability distributions; scanning the time-lag parameter Δt ∈ [-Δt_bound, +Δt_bound] to obtain a mutual information curve and recording the maximum mutual information value max(TLMI) and the corresponding optimal time lag Δt_max; determining a peak significance threshold for max(TLMI) by a random permutation test and filtering out the values of max(TLMI) below the threshold; and if, over several consecutive sliding windows, the fluctuation of the optimal time lag Δt_max associated with the surviving max(TLMI) values is at most a preset threshold ε, generating a human-readable dynamic rule for TS_A(t) and TS_B(t) from the corresponding event time series segments.
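The time-lag mutual information scan of claim 8 can be sketched as follows (an illustrative Python sketch using a simple histogram MI estimator; the bin count and helper names are assumptions, not part of the claim):

```python
import math
from collections import Counter


def histogram_mi(xs, ys, bins=4):
    """MI(X, Y) = sum over bins of P(a,b) * log( P(a,b) / (p(a) * p(b)) )."""
    def bucket(v, lo, hi):
        return min(bins - 1, int((v - lo) / (hi - lo + 1e-12) * bins))
    bx = [bucket(v, min(xs), max(xs)) for v in xs]
    by = [bucket(v, min(ys), max(ys)) for v in ys]
    n = len(xs)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # (c/n) / ((px/n)*(py/n)) simplifies to c*n / (px*py)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def tlmi_scan(a, b, lag_bound):
    """Scan dt in [-lag_bound, +lag_bound], pairing A(t) with B(t + dt);
    return (max(TLMI), optimal lag dt_max)."""
    best = (-1.0, 0)
    for dt in range(-lag_bound, lag_bound + 1):
        if dt >= 0:
            xs, ys = a[:len(a) - dt], b[dt:]
        else:
            xs, ys = a[-dt:], b[:len(b) + dt]
        mi = histogram_mi(xs, ys)
        if mi > best[0]:
            best = (mi, dt)
    return best
```

When B simply repeats A with a delay, the scan recovers that delay as the optimal lag, which is exactly the lead-lag relation the dynamic rule engine is looking for.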
- 9. The cross-institution health data spatiotemporal correlation analysis and visualization system of claim 8, characterized in that the peak significance threshold for the maximum mutual information value max(TLMI) is determined by a random permutation test comprising the steps of: constructing a null hypothesis from the two time series A and B; keeping the temporal order of series A unchanged, randomly permuting the temporal order of series B M times, and outputting M data sets {A, B_m} under the null hypothesis, where m = 1, 2, ..., M; computing the time-lag mutual information of each sample pair {A, B_m}, recording the maximum time-lag mutual information value produced by each permuted sample, and forming the null-hypothesis empirical distribution from this set of maxima; and sorting the null-hypothesis empirical distribution in ascending order and outputting its (1 - α) quantile as the significance threshold, where α is a preset significance level.
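The permutation test of claim 9 can be sketched as follows (a minimal Python sketch; the `statistic` argument stands in for the maximum time-lag mutual information, and any dependence measure can be plugged in for illustration):

```python
import math
import random


def permutation_threshold(a, b, statistic, m=200, alpha=0.05, seed=0):
    """Null-hypothesis threshold: keep series A fixed, randomly permute series B
    m times, collect statistic(A, B_m) for each permuted copy, sort ascending,
    and return the (1 - alpha) quantile as the significance threshold."""
    rng = random.Random(seed)
    b_perm = list(b)
    null_dist = []
    for _ in range(m):
        rng.shuffle(b_perm)                      # destroy temporal coupling
        null_dist.append(statistic(a, b_perm))
    null_dist.sort()
    return null_dist[min(m - 1, math.ceil((1 - alpha) * m) - 1)]
```

Any observed max(TLMI) above this threshold is unlikely under the null hypothesis of no temporal coupling and therefore survives the filtering step of claim 8.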
- 10. A cross-institution health data spatiotemporal correlation analysis and visualization method, applied to the cross-institution health data spatiotemporal correlation analysis and visualization system of any of claims 1-9, characterized by comprising the steps of: monitoring in real time, or periodically pulling, the newly added health event data of each cooperating institution, and adding noise to the statistics in the data that need to be aggregated to obtain a noisy data stream; maintaining the spatiotemporal cube with the noisy data stream; storing a large amount of prior medical-domain knowledge in structured form as a medical knowledge graph; performing anomaly detection on the spatiotemporal cube, extracting abnormal events, fusing them with the query results of the medical knowledge graph, and outputting causal relationship chains; monitoring the time-series data of the spatiotemporal cube to discover lead-lag relations between different time series, and thereby discovering dynamic rules; and associating the causal relationship chains with the dynamic rules, presenting them to the user in visual form, and providing analysis functions.
Description
Cross-organization health data space-time correlation analysis and visualization system and method
Technical Field
The invention belongs to the technical field of information processing, and in particular relates to a cross-institution health data spatiotemporal correlation analysis and visualization system and method.
Background
Currently, medical and health data are stored in a scattered manner across different hospitals, public health institutions, pharmacies and the like, forming data islands. The problem manifests specifically as: (1) physical and logical isolation: the information systems of each institution (HIS, LIS and the like) have different architectures and different data standards (for example, different ICD coding versions), so the data are difficult to interconnect and integrate directly; (2) policy and regulatory limitations: out of consideration for patient privacy and data security, the relevant regulations strictly limit cross-institution flows of raw medical data, and valuable resources in government databases in particular are difficult for researchers to obtain; (3) technical and cooperation barriers: cross-institution data sharing is often hindered by difficulties in system interfacing, complex interface integration, and a lack of unified technical support and personnel training. The harm done by data islands is serious: they directly lead to insufficient analysis sample sizes, and especially when studying diseases with low prevalence, data from multiple clinical centers must be integrated to obtain statistically significant results. More importantly, they prevent macroscopic and continuous observation of dynamic processes such as disease transmission and drug efficacy, making dynamic pattern mining and causal reasoning difficult to carry out effectively. To break up the data islands, researchers have proposed various methods. One is to build a centralized health data center that aggregates and analyzes the data of each institution.
However, this approach faces serious data privacy and security challenges, as centralized storage of sensitive patient information is prone to leakage. In addition, the data formats and standards of different institutions differ, so direct aggregation requires complex cleaning and standardization processes that are costly and introduce update delays. Another approach is to train models without moving the raw data, using distributed machine learning techniques such as federated learning. Federated learning has great potential in the medical field and can enable large-scale multi-center medical research and collaboration while preserving data privacy and security. However, existing federated learning schemes remain inadequate in coping with the non-independent and identically distributed (Non-IID) data specific to medicine: differences in patient populations and clinical practice among institutions degrade the performance of the aggregated global model. Meanwhile, most federated learning research remains at an experimental stage, with methodological defects and biases in practical clinical application, such as insufficient privacy protection, poor model generalization, and large communication overhead, which seriously affect its effectiveness. At the data analysis level, most existing systems rely on static, descriptive statistical analysis and lack the capability to mine dynamic spatiotemporal correlations in depth. For example, they can display an epidemic map, but can hardly reveal automatically that an epidemic propagated from place A to place B, or the time delay of that propagation. In addition, existing medical knowledge graphs contain rich relations among medical concepts but are mostly static; they are not fully fused with temporal and spatial information and thus can hardly support dynamic causal reasoning in the presence of confounding factors.
Distinguishing real causal relationships from spurious statistical correlations in observational data is a core challenge facing the current technology. The prior art with publication number CN114048515B discloses a medical big data sharing method based on federated learning and blockchain: a data user selects nodes according to node trust, applies for data and pays usage fees to a smart contract; after receiving and approving the application, a data provider publishes its computing power and data sample size on the chain; the data user sends a model and convergence conditions to the smart contract, and the data provider downloads the model; when all nodes are ready, federated learning starts, and the smart contract simultaneously starts a timeout counter; the smart contract performs the aggregation computation and judges whether convergence has been reached, updates the trust values of all nodes, and issues rewards according to contribution; the scheme has a trust mechanism and a model trai