CN-122025188-A - Multi-dimensional correlation analysis method for community diagnosis data based on multi-dimensional data cube

Abstract

The application discloses a multidimensional correlation analysis method for community diagnostic data based on a multidimensional data cube, and relates to the technical field of data processing. Through weighted aggregation, hierarchical aggregation and cell index construction on the data cube, multidimensional data are organized and summarized efficiently, serving both statistical analysis and cross-dimensional combination analysis. Association clusters are built by calculating a comprehensive association strength, and local neighborhood references are constructed to identify abnormal association patterns, so that complex cross-dimensional associations and anomalies are captured. Dimension contribution values are then calculated in combination with the abnormal association patterns, and association evolution paths are generated, so that the analysis results are output accurately and provide comprehensive decision support for community diagnosis. The method addresses the difficulty the prior art has in adapting to community diagnostic data that is multi-source, heterogeneous and related in complex ways across dimensions.

Inventors

LI YAPING
QIN SHUMIN
LIU XIAOSHENG
YANG DONGWEI

Assignees

山东维克特信息技术有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-10

Claims (10)

1. A method for multidimensional correlation analysis of community diagnostic data based on a multidimensional data cube, comprising: receiving heterogeneous community diagnostic data, preprocessing the heterogeneous data to generate multidimensional member codes, and obtaining community diagnostic base data; mapping each record in the community diagnostic base data to a data cube cell based on the multidimensional member codes, performing weighted aggregation and hierarchical aggregation, and establishing a cell index; calculating a comprehensive association strength between data cube cells according to the aggregate metrics of the data cube cells and the relationships between the data cube cells, and constructing association clusters, each composed of a plurality of data cube cells, according to the comprehensive association strength; constructing local neighborhood references for the data cube cells and for the association clusters respectively, calculating anomaly scores according to the deviation of the aggregate metrics of the corresponding data cube cells from the local neighborhood references, and identifying abnormal association patterns; and calculating a contribution value for each preset dimension in combination with the abnormal association patterns, generating association evolution paths, and fusing the contribution values, the anomaly scores and the association evolution path results for ranked output.
2. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 1, wherein receiving the heterogeneous community diagnostic data, preprocessing the heterogeneous data to generate multidimensional member codes, and obtaining the community diagnostic base data comprises: collecting heterogeneous data related to community diagnosis from community health service centers, primary medical institutions, examination and testing systems, follow-up systems and resident health record systems, and representing each original record in a unified form: [formula not reproduced], whose terms denote, in order: the community diagnostic record with a given index; the resident identifier; the region unit identifier; the institution identifier; the time information; the basic attribute vector; the examination data vector; the diagnostic event vector; and the original quality information; and performing cleaning, missing-value correction, quality weighting and standardization on the heterogeneous data, mapping the processed data into multidimensional member-coded data according to a preset dimension system, and generating a corresponding multidimensional embedded representation based on the member information of each dimension and its hierarchical path information, wherein the preset dimension system comprises at least a time dimension, a region dimension, an institution dimension, a population dimension, a diagnosis dimension, a symptom dimension, an examination dimension and a follow-up outcome dimension.
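The unified record of claim 2 can be pictured as a typed container with one field per listed component. The sketch below is only an illustration: the class name, field names, dictionary keys and the `unify` helper are all hypothetical, since the claim lists the components but the record formula itself is not reproduced in this text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiagnosisRecord:
    """Unified record; every field name here is illustrative, not from the patent."""
    resident_id: str
    region_id: str
    institution_id: str
    timestamp: str  # assumed to be an ISO-8601 string
    basic_attrs: List[float] = field(default_factory=list)
    exam_values: List[float] = field(default_factory=list)
    diagnosis_events: List[str] = field(default_factory=list)
    quality_info: Dict[str, float] = field(default_factory=dict)


def unify(raw: dict, source: str) -> DiagnosisRecord:
    """Map one source-specific dict (hypothetical keys) onto the unified record."""
    return DiagnosisRecord(
        resident_id=str(raw["resident"]),
        region_id=str(raw.get("region", "unknown")),
        institution_id=source,
        timestamp=str(raw["time"]),
        basic_attrs=list(raw.get("basic", [])),
        exam_values=list(raw.get("exam", [])),
        diagnosis_events=list(raw.get("diagnoses", [])),
        quality_info={"missing_rate": float(raw.get("missing_rate", 0.0))},
    )
```

Each source system would need its own key mapping; a single `unify` function is used here only to keep the example short.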
3. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 2, wherein performing cleaning, missing-value correction, quality weighting and standardization on the heterogeneous data, mapping the processed data into multidimensional member-coded data according to the preset dimension system, and generating the corresponding multidimensional embedded representation based on the member information of each dimension and its hierarchical path information comprises: performing quality-enhanced composite normalization on each feature of each record to obtain an enhanced normalized feature value: [formula not reproduced], whose terms denote: the feature value after enhanced normalization; the original feature value; the median of the feature; the interquartile range of the feature; a small constant that prevents a zero denominator; the mean; the mixing coefficient of the robust normalization; the standard deviation; the coefficient of each basis function; the individual basis functions; and the number of basis functions; then generating a comprehensive quality weight for each record: [formula not reproduced], whose terms denote: the comprehensive quality weight of the record; the trustworthiness of the source; the missing rate; the time decay function; the time interval between the data and the current analysis time point; the time decay constant; the weight coefficients; and the Sigmoid function; after normalization and quality weighting are completed, mapping each record into a multidimensional member-coded vector covering the time, region, institution, population, diagnosis, symptom, examination and follow-up outcome dimensions: [formula not reproduced], whose terms denote: the multidimensional member-coded vector of the record; the member value of the record in each dimension; and the total number of dimensions; and finally representing each record as a multidimensional embedded vector, so that the encoding contains both member information and hierarchical path information: [formula not reproduced], whose terms denote: the multidimensional embedded vector of the record; the member embedding matrix of the kth dimension; the hierarchical embedding matrix of the kth dimension; the hierarchical path code of the member; and the vector concatenation operator.
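Claim 3 names the ingredients of the normalization (median, interquartile range, mean, standard deviation, a mixing coefficient) and of the quality weight (source trustworthiness, missing rate, time decay, a Sigmoid), but the formulas are not reproduced in this text. The sketch below shows one plausible way those ingredients could combine. The blend form, the sign conventions, the parameter names (`lam`, `tau`, `coef`) and the omission of the basis-function term are all assumptions, not the patent's formula.

```python
import math
import statistics


def blended_normalize(values, lam=0.5, eps=1e-9):
    """Blend a robust (median/IQR) z-score with a classic (mean/std) z-score.

    Assumed form: lam * robust + (1 - lam) * classic. The basis-function
    correction mentioned in the claim is deliberately left out.
    """
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [
        lam * (x - med) / (iqr + eps) + (1 - lam) * (x - mean) / (std + eps)
        for x in values
    ]


def quality_weight(trust, missing_rate, age_days, tau=30.0, coef=(1.0, 1.0, 1.0)):
    """Sigmoid of a weighted sum: trust raises the weight, missingness lowers it,
    and an exponential decay exp(-age/tau) favors recent data (assumed form)."""
    decay = math.exp(-age_days / tau)
    z = coef[0] * trust - coef[1] * missing_rate + coef[2] * decay
    return 1.0 / (1.0 + math.exp(-z))
```

With these assumptions, a record from the same source with the same missing rate receives a lower weight as it ages, and the weight always stays strictly between 0 and 1.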
4. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 3, wherein mapping each record in the community diagnostic base data to a data cube cell based on the multidimensional member codes, performing weighted aggregation and hierarchical aggregation, and establishing the cell index comprises: mapping each record to a corresponding data cube cell based on the multidimensional member-coded vector, the cell identifier being expressed as: [formula not reproduced], whose terms denote: the identifier of the cube cell corresponding to the record; and the multidimensional mapping function; calculating the membership probability of the record for each cell through a soft allocation model: [formula not reproduced], whose terms denote: the soft allocation probability that the record is assigned to a given cell; the center vector of the cell; the allocation temperature parameter; the set of cells; and the squared 2-norm; after the soft allocation probabilities are obtained, performing tensor aggregation on each cell to generate a cell-level higher-order representation: [formula not reproduced], whose terms denote: the aggregate tensor of the cell; the total number of records; the population attribute vector; the diagnosis/symptom attribute vector; the examination/follow-up attribute vector; and the tensor outer product operator; further performing low-rank decomposition on the cell tensor to extract latent structure from the higher-order tensor: [formula not reproduced], whose terms denote: the decomposition rank; the strength of each latent factor; and the factor vectors on the different modes; then generating a comprehensive aggregate metric vector for each cell: [formula not reproduced], whose terms denote: the quality-weighted sample size of the cell; the average severity, computed from the severity score of each record; the average abnormality intensity, computed from the examination abnormality intensity; the proportion of associated follow-ups or re-examinations, computed from a follow-up/re-diagnosis flag; and the latent structure strength; and finally performing roll-up and drill-down aggregation on the aggregate metrics of each cell according to the time hierarchy, the region hierarchy, the institution hierarchy and the diagnosis hierarchy, and establishing a sparse cell index structure in combination with sample size, distribution complexity, time freshness and graph neighborhood difference.
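The soft allocation in claim 4 is described through a cell center vector, a temperature parameter, the set of cells and a squared 2-norm, which matches the usual shape of a softmax over negative squared distances. The sketch below assumes exactly that form and assumes the quality-weighted sample size is the sum of quality weight times allocation probability. The function names and the dictionary layout of `centers` are hypothetical.

```python
import math


def soft_assign(embedding, centers, temperature=1.0):
    """Softmax over negative squared distances to each cell center (assumed form)."""
    logits = {
        cid: -sum((e - m) ** 2 for e, m in zip(embedding, mu)) / temperature
        for cid, mu in centers.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {cid: math.exp(v - top) for cid, v in logits.items()}
    total = sum(exps.values())
    return {cid: v / total for cid, v in exps.items()}


def weighted_sample_sizes(embeddings, weights, centers, temperature=1.0):
    """Quality-weighted sample size per cell: sum of weight * allocation probability."""
    sizes = {cid: 0.0 for cid in centers}
    for emb, w in zip(embeddings, weights):
        for cid, p in soft_assign(emb, centers, temperature).items():
            sizes[cid] += w * p
    return sizes
```

A lower temperature pushes each record toward a single cell (close to hard assignment), while a higher temperature spreads it across neighboring cells.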
5. The method of claim 4, wherein finally performing roll-up and drill-down aggregation on the aggregate metrics of each cell according to the time hierarchy, the region hierarchy, the institution hierarchy and the diagnosis hierarchy, and establishing the sparse cell index structure in combination with sample size, distribution complexity, time freshness and graph neighborhood difference comprises: performing roll-up aggregation on the child-level cells according to the hierarchical relationships of the time dimension, the region dimension, the institution dimension and the diagnosis dimension to obtain the parent-level cell metric: [formula not reproduced], whose terms denote: the parent-level cell metric vector; the set of child-level cells of the parent-level cell; the aggregation weight of a child-level cell with respect to the parent-level cell; the graph regularization coefficient; the graph connection weight between parent and child nodes; and the mean vector of the child-level cell metrics; one term of the above formula being further expressed as: [formula not reproduced], whose terms denote: the sample size sensitivity coefficient; the hierarchical distance from the child level to the parent level; and the hierarchical decay constant; and establishing a sparse index activation score for each cell, a higher score giving the cell higher priority for indexing so as to support subsequent fast retrieval and multi-granularity calculation: [formula not reproduced], whose terms denote: the index activation score of the cell; the maximum sample size; the composition entropy within the cell; the time freshness interval; the time decay constant; the graph neighborhood gradient; and the weight coefficients.
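The roll-up of claim 5 combines child-cell metrics into a parent metric using weights that depend on sample size and hierarchical distance. As a rough illustration, the sketch below assumes each child's weight is proportional to its sample size raised to a sensitivity exponent times an exponential decay in hierarchical distance, then normalized. The graph regularization term is omitted, and `alpha`, `tau` and the tuple layout are assumptions rather than the patent's definitions.

```python
import math


def roll_up(children, alpha=0.5, tau=1.0):
    """Weighted roll-up of child metric vectors into one parent vector.

    children: list of (metric_vector, sample_size, level_distance) tuples.
    Assumed weight: sample_size ** alpha * exp(-level_distance / tau),
    normalized over the children of the parent.
    """
    raw = [(n ** alpha) * math.exp(-d / tau) for _, n, d in children]
    total = sum(raw)
    dim = len(children[0][0])
    return [
        sum(w / total * metrics[j] for w, (metrics, _, _) in zip(raw, children))
        for j in range(dim)
    ]
```

For example, two children with single-valued metrics 0.0 and 1.0, sample sizes 1 and 9, and equal distance get normalized weights 0.25 and 0.75 when `alpha` is 0.5, so the parent metric is 0.75.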
6. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 1, wherein calculating the comprehensive association strength between data cube cells according to the aggregate metrics of the data cube cells and the relationships between the data cube cells, and constructing association clusters composed of a plurality of data cube cells according to the comprehensive association strength comprises: after the data cube has been organized, calculating, for any two analysis objects, a comprehensive association strength that jointly considers their co-occurrence relationship, time-lag relationship, population overlap relationship, semantic similarity relationship, graph diffusion relationship and conditional dependency relationship: [formula not reproduced], whose terms denote: the comprehensive association strength between the two objects; a normalized mutual information term, defined from the marginal probabilities of the two objects, their joint probability, a small constant that prevents a zero denominator, the number of co-occurrences, a co-occurrence stabilization constant and the natural logarithm; a time-lag association term, defined from a set of candidate time lags, a correlation coefficient function, the metric sequences of the two objects over time, a time decay constant and a time-lag penalty constant; a population overlap term; a semantic similarity term, computed from the vector representations of the two objects; a graph diffusion term, defined from the adjacency matrix of the object association graph, the path diffusion matrix for each path length, the diffusion weight of each order and the maximum diffusion order; a conditional dependency term, defined from the conditional mutual information under a set of conditions, the joint entropy of the object pair and a set of control dimensions; and the fusion coefficients; and after the comprehensive association strengths between objects are obtained, constructing higher-order hyperedges, each containing a plurality of analysis objects, based on the comprehensive association strength, and then clustering and scoring the higher-order hyperedges to generate association clusters reflecting the multidimensional linkage characteristics of the community diagnostic data.
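Two of the six terms in claim 6 can be illustrated with standard statistics. The sketch below uses normalized pointwise mutual information with a co-occurrence shrinkage factor as a stand-in for the "normalized mutual information term", and a lag-penalized Pearson correlation as a stand-in for the "time-lag association term". Both forms, and the parameters `kappa` and `tau`, are assumptions chosen to match the listed ingredients; the patent's own formula is not reproduced in this text.

```python
import math


def npmi(p_a, p_b, p_ab, n_ab, kappa=5.0, eps=1e-12):
    """Normalized PMI, shrunk toward 0 when the co-occurrence count n_ab is small."""
    pmi = math.log((p_ab + eps) / (p_a * p_b + eps))
    norm = -math.log(p_ab + eps)
    return (n_ab / (n_ab + kappa)) * pmi / (norm + eps)


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_lagged_correlation(x, y, lags, tau=3.0):
    """Return (score, lag) maximizing |corr(x[t], y[t + lag])| * exp(-lag / tau).

    Assumes equal-length series and non-negative lags; lags that leave
    fewer than three aligned points are skipped.
    """
    best_score, best_lag = 0.0, 0
    for lag in lags:
        xs, ys = x[: len(x) - lag], y[lag:]
        if len(xs) < 3:
            continue
        score = abs(pearson(xs, ys)) * math.exp(-lag / tau)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag
```

The exponential penalty makes a short lag win over a long one when both correlate equally well, which is one simple reading of the "time-lag penalty constant".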
7. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 6, wherein constructing the higher-order hyperedges containing a plurality of analysis objects based on the comprehensive association strength, and then clustering and scoring the higher-order hyperedges to generate association clusters reflecting the multidimensional linkage characteristics of the community diagnostic data comprises: constructing higher-order hyperedges each consisting of multiple objects, the weight of a hyperedge e containing a given number of objects being defined as: [formula not reproduced], whose terms denote: the weight of the hyperedge; two objects in the hyperedge, identified by their indices; the saliency score of an object; the saliency compensation coefficient; and the comprehensive association strength of the two objects; and forming an association cluster C based on the hyperedges and scoring the association cluster comprehensively: [formula not reproduced], whose terms denote: the comprehensive score of the association cluster C; the average association weight; the cluster density; the cross-dimensional coverage; the time stability; the cluster redundancy; the scoring coefficients; the number of observation time windows, t; the state of the association cluster at each time; the indicator function; and the vector representations of the objects.
8. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 1, wherein constructing the local neighborhood references for the data cube cells and the association clusters respectively, calculating the anomaly scores according to the deviation of the aggregate metrics of the corresponding data cube cells from the local neighborhood references, and identifying the abnormal association patterns comprises: after the association clusters and the cell structure are obtained, constructing a local neighborhood reference for any cell or analysis object, the local neighborhood comprising temporally adjacent units, spatially adjacent units, units of similar institutions or units of similar populations; for the r-th aggregate metric, first calculating its normalized deviation value relative to the local neighborhood: [formula not reproduced], whose terms denote: the normalized deviation value of object c on the r-th aggregate metric; the value of the r-th aggregate metric of object c; the mean and the standard deviation of the corresponding metric over the local neighborhood; and a small constant that prevents a zero denominator; and based on the normalized deviation values, constructing a joint anomaly score from the degree of deviation, distribution mutation, time drift, graph neighborhood inconsistency and latent space offset: [formula not reproduced], whose terms denote: the comprehensive anomaly score of object c; the number of aggregate metrics participating in the score, R; the KL divergence of the current distribution relative to a reference distribution; a time drift term, defined from the sample size or aggregate strength of object c at adjacent times; a graph neighborhood inconsistency term, defined from the set of neighbors of object c in the graph structure, the edge weight between the object and each neighbor, the corresponding aggregate metric vectors and the squared 2-norm; a latent space offset term, defined from the latent embedded vector of the object, a reference center vector and an embedding scale parameter; and the corresponding weight coefficients.
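The normalized deviation in claim 8 is described as the gap between a cell's metric and its local neighborhood mean, scaled by the neighborhood standard deviation plus a small constant. The sketch below implements that reading and combines the mean absolute deviation with a KL divergence term into a partial anomaly score. The time drift, graph inconsistency and latent offset terms are left out, and the additive weighting (`w_dev`, `w_kl`) is an assumption.

```python
import math
from statistics import fmean, pstdev


def normalized_deviation(value, neighborhood, eps=1e-9):
    """(value - neighborhood mean) / (neighborhood std + eps)."""
    return (value - fmean(neighborhood)) / (pstdev(neighborhood) + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def partial_anomaly_score(metrics, neighborhoods, current, reference,
                          w_dev=1.0, w_kl=1.0):
    """Mean absolute deviation over the metrics plus a distribution-shift term.

    metrics[r] is compared against neighborhoods[r]; only two of the five
    components named in the claim are modeled here.
    """
    devs = [abs(normalized_deviation(m, nb)) for m, nb in zip(metrics, neighborhoods)]
    return w_dev * fmean(devs) + w_kl * kl_divergence(current, reference)
```

A cell whose metrics match its neighborhood mean and whose distribution equals the reference scores 0 under this sketch; the score grows as either the deviation or the distribution shift grows.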
9. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 1, wherein calculating the contribution value of each preset dimension in combination with the abnormal association patterns, generating the association evolution paths, and fusing the contribution values, the anomaly scores and the association evolution path results for ranked output comprises: letting the contribution value of the kth dimension to the abnormal object c be: [formula not reproduced], whose terms denote: the marginal contribution value of the kth dimension to the abnormal object c; the set of dimension permutations; one permutation from that set; the subset of dimensions that precede dimension k in the permutation; the anomaly evaluation function; a joint hierarchy-and-order correction factor, defined from the hierarchy depth at which the kth dimension is located, a hierarchy decay constant, the size of the dimension subset and the total number of dimensions K; normalizing the contribution values so that the contributions of different dimensions can be compared with one another: [formula not reproduced], whose terms denote: the normalized dimension contribution value; and a small constant that prevents a zero denominator; then constructing multidimensional association evolution paths that describe the propagation logic and evolution trend of the abnormal association patterns, and defining a path score: [formula not reproduced], whose terms denote: the comprehensive score of the path; an edge e in the path; the comprehensive association strength of the object pair corresponding to the edge; the optimal time lag of edge e; the anomaly score of the endpoint object; a transfer efficiency term, defined from the directed edge from one node to another, the edge weight and the node degree; the path length; a path loop penalty, defined from the set of nodes on the path and the number of times each node occurs on the path; and the weight coefficients; and finally fusing the association cluster score, the comprehensive anomaly score, the marginal contribution value of each analysis dimension, the evolution path score and the result freshness, ranking the abnormal association objects, and outputting the multidimensional correlation analysis results of the community diagnostic data, the key influencing dimensions and the corresponding evolution paths.
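The contribution value in claim 9 is defined over dimension permutations and the subset of dimensions preceding dimension k, which is the structure of a permutation-based Shapley value. The sketch below computes the plain Shapley value by enumerating all permutations and then normalizes by the sum of absolute contributions. It omits the hierarchy-and-order correction factor named in the claim, and `score_fn` is a hypothetical stand-in for the anomaly evaluation function.

```python
from itertools import permutations


def shapley_contributions(dims, score_fn):
    """Exact permutation Shapley values.

    score_fn takes a frozenset of dimension names and returns the anomaly
    evaluation on that subset. Cost grows factorially with len(dims), so
    this is only practical for a handful of dimensions.
    """
    contrib = {d: 0.0 for d in dims}
    perms = list(permutations(dims))
    for perm in perms:
        seen = frozenset()
        for d in perm:
            with_d = seen | {d}
            contrib[d] += score_fn(with_d) - score_fn(seen)
            seen = with_d
    return {d: v / len(perms) for d, v in contrib.items()}


def normalize_contributions(contrib, eps=1e-9):
    """Scale absolute contributions so they sum to (almost exactly) 1."""
    total = sum(abs(v) for v in contrib.values())
    return {d: abs(v) / (total + eps) for d, v in contrib.items()}
```

With the eight preset dimensions there are 40,320 permutations, which is still enumerable; for larger dimension sets a sampled subset of permutations would be the usual substitute.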
10. The multidimensional correlation analysis method of community diagnostic data based on a multidimensional data cube according to claim 9, wherein fusing the association cluster score, the comprehensive anomaly score, the marginal contribution value of each analysis dimension, the evolution path score and the result freshness, ranking the abnormal association objects, and outputting the multidimensional correlation analysis results of the community diagnostic data, the key influencing dimensions and the corresponding evolution paths comprises: [formula not reproduced], whose terms denote: the score of the association cluster in which the object is located; the anomaly score; the normalized dimension contribution value of the object; the path score; the result freshness, defined from the time interval between the object and the current analysis time point and a freshness decay constant; and the fusion coefficients; and finally ranking the abnormal association objects according to the resulting value, and outputting the key abnormal association patterns, the dominant dimension contribution values and their evolution paths.
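The final ranking of claim 10 fuses five quantities with fusion coefficients, one of which is a freshness term built from a time interval and a decay constant. The sketch below assumes a linear fusion and an exponential freshness decay. Both forms are assumptions (the claim's formula is not reproduced in this text and could equally be multiplicative), and the dictionary keys are hypothetical.

```python
import math


def freshness(age_days, tau=7.0):
    """Exponential freshness decay, assumed form exp(-age / tau)."""
    return math.exp(-age_days / tau)


def fused_score(obj, coef=(0.2, 0.3, 0.2, 0.2, 0.1), tau=7.0):
    """Linear fusion of cluster score, anomaly score, normalized dimension
    contribution, path score and freshness (assumed additive form)."""
    parts = (
        obj["cluster"],
        obj["anomaly"],
        obj["contribution"],
        obj["path"],
        freshness(obj["age_days"], tau),
    )
    return sum(c * p for c, p in zip(coef, parts))


def rank_objects(objects, **kwargs):
    """Sort abnormal association objects from highest to lowest fused score."""
    return sorted(objects, key=lambda o: fused_score(o, **kwargs), reverse=True)
```

Under this sketch, two objects with identical scores are ordered by recency, because only the freshness term differs.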

Description

Multi-dimensional correlation analysis method for community diagnosis data based on multi-dimensional data cube

Technical Field

The application relates to the technical field of data processing, and in particular to a multidimensional correlation analysis method for community diagnostic data based on a multidimensional data cube.

Background

With the ongoing construction of community health service centers, primary medical institutions, examination and testing systems, follow-up management systems and resident electronic health record systems, community diagnosis data has expanded from single visit records into composite data sets covering multidimensional attributes such as time, region, institution, population characteristics, diagnosis category, examination results and follow-up results. Such data not only comes from heterogeneous sources with obvious structural differences, but must also serve both statistical analysis and cross-dimensional combination analysis. Existing research has shown that data warehouses and online analytical processing (OLAP) can support interactive analysis of medical data. For example, Hristovski et al. proposed building outpatient data into a data warehouse and using OLAP to support public health data exploration, which shows that medical and health data can be organized and queried using data warehouse and multidimensional analysis techniques. Kim et al. further noted that an electronic medical record data cube is essentially a structure that aggregates and summarizes statistics over all combinations of attributes, which shows that data cubes are suitable for multidimensional aggregate analysis of electronic medical records or similar medical data.
In the prior art, one type of scheme focuses on organizing medical data into dimension tables and fact tables, and supports multi-angle viewing and statistical analysis through OLAP operations such as slicing, dicing, drill-down, roll-up and pivoting. For example, Chinese patent CN108962394B proposes constructing a medical OLAP model through dimension table and fact table design, and slicing, dicing, drilling and pivoting the multidimensionally organized data to support business analysis of hospital drug information, doctor information and the like. Another type of scheme combines public health data with spatial data to construct an OLAP cube for community health assessment. The research of Scotch et al. on SOVAT shows that community health assessment needs to process large-scale numerical data and spatial data at the same time; the system builds a community health OLAP cube and combines GIS with OLAP to provide operations such as drill-down, drill-up, and slice and dice, thereby improving the efficiency of community health analysis. As the above shows, existing solutions already enable multidimensional organization, hierarchical browsing and visual analysis of medical or public health data. However, the prior art mainly addresses the organization, querying and conventional statistical analysis of multidimensional data, and its analysis objects are concentrated on hospital management information, overall public health indicators or spatial display results. It is therefore difficult to adapt to community diagnosis data, which is at once multi-source, heterogeneous and related in complex ways across dimensions. How to analyze and process community diagnostic data that is heterogeneous and has complex relationships is thus a technical problem to be solved in the field.
Disclosure of Invention

To solve the above technical problem, the application provides the following technical solution: In a first aspect, an embodiment of the present application provides a method for multidimensional correlation analysis of community diagnostic data based on a multidimensional data cube, including: receiving heterogeneous community diagnostic data, preprocessing the heterogeneous data to generate multidimensional member codes, and obtaining community diagnostic base data; mapping each record in the community diagnostic base data to a data cube cell based on the multidimensional member codes, performing weighted aggregation and hierarchical aggregation, and establishing a cell index; calculating a comprehensive association strength between data cube cells according to the aggregate metrics of the data cube cells and the relationships between the data cube cells, and constructing association clusters, each composed of a plurality of data cube cells, according to the comprehensive association strength; constructing local neighborhood references for the data cube cells and for the association clusters respectively, calculating anomaly scores according to the deviation of the aggregate metrics of the corresponding data cube cells from the local neighborhood references, and identifying abnormal asso