CN-121997284-A - Core prescription mining method and system based on improved association and compound clustering

Abstract

The invention provides a core prescription mining method and system based on improved association and compound clustering, relating to the technical field of data processing. The method comprises: counting traditional Chinese medicine frequencies in a normalized traditional Chinese medicine data set; calculating four-degree association rule analysis indexes of binary drug groups from their co-occurrence frequencies; screening a plurality of strongly associated binary drug groups from the binary drug groups against preset four-degree association rule analysis index thresholds; standardizing and vectorizing each strongly associated binary drug group to construct structured data containing meridian tropism attributes and efficacy attributes; clustering the structured data with a fourth-order chained compound clustering algorithm to determine an expanded cluster structure; correcting the expanded cluster structure according to traditional Chinese medicine constraint theory and outputting a traditional Chinese medicine constrained clustering result; and optimizing that result according to a preset core drug screening rule to generate a hierarchical core prescription structure comprising a core layer, an association layer and a special layer.

Inventors

YANG YANG
LI YUANBAI
LIU FANGZHOU
LI MENG
DU YU
LI YIHAO
QIN QIN

Assignees

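YANG YANG (杨阳)
LI YUANBAI (李园白)
Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences (中国中医科学院中医药信息研究所)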

Dates

Publication Date: 20260508
Application Date: 20251231

Claims (10)

1. A core prescription mining method based on improved association and compound clustering, characterized by comprising the following steps:
S1, collecting original prescription data;
S2, preprocessing the original prescription data to obtain a normalized traditional Chinese medicine data set;
S3, counting traditional Chinese medicine frequencies in the normalized traditional Chinese medicine data set, wherein the traditional Chinese medicine frequencies comprise single-drug frequencies and binary drug group co-occurrence frequencies;
S4, calculating four-degree association rule analysis indexes of each binary drug group according to its co-occurrence frequency;
S5, screening a plurality of strongly associated binary drug groups from the binary drug groups by comparing the four-degree association rule analysis indexes with preset four-degree association rule analysis index thresholds;
S6, standardizing and vectorizing each strongly associated binary drug group to construct structured data containing meridian tropism attributes and efficacy attributes;
S7, clustering the structured data with a fourth-order chained compound clustering algorithm to determine an expanded cluster structure;
S8, correcting the expanded cluster structure according to traditional Chinese medicine constraint theory, and outputting a traditional Chinese medicine constrained clustering result; and
S9, optimizing the traditional Chinese medicine constrained clustering result according to a preset core drug screening rule to generate a hierarchical core prescription structure comprising a core layer, an association layer and a special layer.
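The frequency statistics of step S3 can be sketched as follows. This is a minimal illustration, assuming each prescription is already normalized into a list of herb names; the function name `count_frequencies` and the toy herb names are hypothetical and do not come from the source.

```python
from collections import Counter
from itertools import combinations


def count_frequencies(prescriptions):
    """Return (single-drug counts, unordered drug-pair co-occurrence counts)."""
    single = Counter()
    pairs = Counter()
    for herbs in prescriptions:
        unique = sorted(set(herbs))            # ignore duplicates inside one prescription
        single.update(unique)                  # each drug counted once per prescription
        pairs.update(combinations(unique, 2))  # each unordered pair once, in sorted order
    return single, pairs


# Hypothetical toy data set; the herb names are placeholders.
toy = [
    ["huangqi", "baizhu", "fuling"],
    ["huangqi", "baizhu"],
    ["fuling", "gancao"],
]
single_freq, pair_freq = count_frequencies(toy)
```

Sorting each prescription before pairing means that (A, B) and (B, A) always map to the same key, so co-occurrence counts are not split by drug order.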
2. The core prescription mining method based on improved association and compound clustering according to claim 1, wherein the four-degree association rule analysis indexes comprise a support degree, a confidence degree, a lift degree and a certainty degree (conviction);
the support degree is calculated as:
$$\mathrm{Support}(A, B) = \frac{\mathrm{Freq}(A, B)}{N}$$
wherein $\mathrm{Support}(A, B)$ represents the proportion of all prescriptions in which drug A and drug B occur together, $\mathrm{Freq}(A, B)$ represents the number of prescriptions in the prescription data set in which drug A and drug B occur together, and $N$ represents the total number of prescriptions;
the confidence degree is calculated as:
$$\mathrm{Confidence}(A \Rightarrow B) = \frac{\mathrm{Freq}(A, B)}{\mathrm{Freq}(A)}$$
wherein $\mathrm{Confidence}(A \Rightarrow B)$ represents the conditional probability that drug B also occurs when drug A occurs, and $\mathrm{Freq}(A)$ represents the number of prescriptions in the prescription data set in which drug A occurs;
the lift degree is calculated as:
$$\mathrm{Lift}(A \Rightarrow B) = \frac{\mathrm{Support}(A, B)}{\mathrm{Support}(A) \times \mathrm{Support}(B)}$$
wherein $\mathrm{Lift}(A \Rightarrow B)$ represents the degree to which the occurrence of drug A raises the probability that drug B occurs, $\mathrm{Support}(A, B)$ represents the support degree of drug A and drug B occurring together, $\mathrm{Support}(A)$ represents the support degree of drug A alone, and $\mathrm{Support}(B)$ represents the support degree of drug B alone;
the certainty degree is calculated as:
$$\mathrm{Conviction}(A \Rightarrow B) = \frac{1 - \mathrm{Support}(B)}{1 - \mathrm{Confidence}(A \Rightarrow B)}$$
wherein $\mathrm{Conviction}(A \Rightarrow B)$ reflects the risk that the rule "drug A occurs, therefore drug B occurs" is wrong, and $\mathrm{Confidence}(A \Rightarrow B)$ represents the conditional probability that drug B also occurs when drug A occurs.
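The four indexes of claim 2 can be computed from raw counts as in the sketch below. It assumes that the certainty degree is the standard conviction measure of association rule mining; the function name `rule_metrics` and the example counts are hypothetical.

```python
import math


def rule_metrics(n_ab, n_a, n_b, n_total):
    """Support, confidence, lift and conviction of the rule A => B from counts."""
    support = n_ab / n_total                      # share of prescriptions with both drugs
    confidence = n_ab / n_a                       # P(B | A)
    support_a = n_a / n_total
    support_b = n_b / n_total
    lift = support / (support_a * support_b)      # 1.0 means A and B look independent
    # Assumed definition: standard conviction; infinite when the rule never fails.
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - support_b) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Hypothetical counts: 100 prescriptions, A in 40, B in 50, both in 20.
m = rule_metrics(n_ab=20, n_a=40, n_b=50, n_total=100)
```

With these counts the lift and conviction both come out at the value that indicates independence, which is a useful sanity check before applying the screening thresholds of step S5.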
3. The core prescription mining method based on improved association and compound clustering according to claim 1, wherein S6 specifically comprises:
S601, constructing a symmetric matrix for each strongly associated binary drug group;
S602, merging drug pairs recorded repeatedly in both directions into a single record based on the symmetric matrix, to obtain a plurality of strongly associated binary groups;
S603, performing frequency normalization on each strongly associated binary group to determine a plurality of relative frequencies;
S604, adding vectorization attributes to each traditional Chinese medicine according to the relative frequencies, in combination with a preset meridian numbering system and a preset traditional Chinese medicine efficacy standard table, to obtain the structured data, wherein the vectorization attributes comprise meridian tropism vectorization attributes and efficacy vectorization attributes.
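Steps S602 to S604 can be illustrated with the sketch below. It makes several assumptions that the source does not specify: relative frequency is taken as the pair count divided by the largest pair count, the meridian numbering system is a short hypothetical list, and all function names are illustrative.

```python
# Hypothetical meridian numbering system; the source only says that one is preset.
MERIDIANS = ["lung", "spleen", "heart", "liver", "kidney"]


def merge_bidirectional(records):
    """Collapse (A, B) and (B, A) records into one unordered pair record."""
    merged = {}
    for (a, b), count in records:
        key = tuple(sorted((a, b)))
        # Both directions describe the same co-occurrence, so keep one count.
        merged[key] = max(merged.get(key, 0), count)
    return merged


def relative_frequencies(merged):
    """Scale each pair count by the largest pair count (assumed normalization)."""
    top = max(merged.values())
    return {pair: count / top for pair, count in merged.items()}


def meridian_vector(meridians):
    """Binary vector over the assumed meridian numbering."""
    return [1 if m in meridians else 0 for m in MERIDIANS]


records = [
    (("huangqi", "baizhu"), 12),
    (("baizhu", "huangqi"), 12),
    (("fuling", "gancao"), 6),
]
merged = merge_bidirectional(records)
relative = relative_frequencies(merged)
```

An efficacy vector could be built the same way as `meridian_vector`, using the entries of the efficacy standard table as dimensions.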
4. The core prescription mining method based on improved association and compound clustering according to claim 1, wherein S7 specifically comprises:
S701, identifying the structured data with an improved Apriori algorithm based on a dynamic support threshold and a lift threshold, to obtain high-frequency strongly associated drug pairs;
S702, based on the high-frequency strongly associated drug pairs, fusing co-occurrence intensity and meridian tropism similarity using a Ward hierarchical clustering method to construct a distance matrix:
$$\mathrm{Distance}(A, B) = \alpha \left(1 - \frac{\mathrm{Cooccur}(A, B)}{\mathrm{MaxCooccur}}\right) + (1 - \alpha)\left(1 - \mathrm{MeridianSim}(A, B)\right)$$
wherein $\mathrm{Distance}(A, B)$ represents the integrated distance between drugs A and B, $\alpha$ represents the weight of the co-occurrence frequency, $\mathrm{Cooccur}(A, B)$ represents the co-occurrence frequency of drugs A and B, $\mathrm{MaxCooccur}$ represents the maximum co-occurrence frequency over all drug pairs, and $\mathrm{MeridianSim}(A, B)$ represents the meridian tropism similarity of drugs A and B;
S703, according to the distance matrix and based on the principle of minimizing the variance increment, merging adjacent drugs layer by layer to form a plurality of initial core clusters:
$$d(AB, C) = \frac{(n_A + n_C)\, d(A, C) + (n_B + n_C)\, d(B, C) - n_C\, d(A, B)}{n_A + n_B + n_C}$$
wherein $n_A$, $n_B$ and $n_C$ represent the number of drugs contained in cluster A, cluster B and cluster C respectively, $d(A, C)$, $d(B, C)$ and $d(A, B)$ represent the original distances between clusters A and C, clusters B and C, and clusters A and B respectively, and $d(AB, C)$ represents the distance between the new cluster AB and another cluster C after merging;
S704, calculating an average efficacy vector of each of the initial core clusters;
S705, calculating, with an improved DBSCAN algorithm and based on the average efficacy vectors, the functional similarity between each low-frequency drug pair in the strongly associated binary drug groups and the initial core clusters, and classifying low-frequency drug pairs whose functional similarity is greater than a functional similarity threshold as associated drugs, to determine the expanded cluster structure.
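The distance construction of S702 and the cluster-merge update of S703 can be sketched as below. Both functions are assumptions about the intended formulas: the combined distance is taken as a weighted sum of one minus the normalized co-occurrence and one minus the meridian similarity, and the update is the Lance-Williams recurrence for Ward's method, which is conventionally applied to squared distances. Function and parameter names are hypothetical.

```python
def combined_distance(cooccur, max_cooccur, meridian_sim, alpha=0.5):
    """Weighted distance: high co-occurrence and high meridian similarity give a small distance."""
    cooccur_term = 1 - cooccur / max_cooccur   # 0 when the pair is the most frequent one
    meridian_term = 1 - meridian_sim           # 0 when meridian tropism is identical
    return alpha * cooccur_term + (1 - alpha) * meridian_term


def ward_update(d_ac, d_bc, d_ab, n_a, n_b, n_c):
    """Lance-Williams update for Ward's method: distance from merged cluster AB to C."""
    total = n_a + n_b + n_c
    return ((n_a + n_c) * d_ac + (n_b + n_c) * d_bc - n_c * d_ab) / total


# Hypothetical values: a pair seen 5 times (maximum 10) with meridian similarity 0.5.
d = combined_distance(cooccur=5, max_cooccur=10, meridian_sim=0.5, alpha=0.5)

# Merging singleton clusters A and B, then measuring the distance to singleton C.
d_ab_c = ward_update(d_ac=4.0, d_bc=6.0, d_ab=2.0, n_a=1, n_b=1, n_c=1)
```

In practice a library routine for Ward linkage would replace the hand-written update; the function is shown only to make the role of the cluster sizes explicit.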
5. The core prescription mining method based on improved association and compound clustering according to claim 4, wherein S701 specifically comprises:
S7011, setting a dynamic support threshold and a lift threshold;
S7012, calculating the support degree and the lift degree of each binary drug group to be compared with the dynamic support threshold and the lift threshold;
S7013, determining as the high-frequency strongly associated drug pairs those binary drug groups whose support degree is greater than the dynamic support threshold and whose lift degree is greater than the lift threshold, or whose support degree is equal to the dynamic support threshold and whose lift degree is greater than the lift threshold.
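The two alternative conditions in S7013 reduce to a single test: support at or above the support threshold, and lift strictly above the lift threshold. A minimal sketch follows, assuming the thresholds are supplied by the caller (the source does not say how the dynamic support threshold is derived); the function name and the toy metrics are hypothetical.

```python
def select_high_frequency_pairs(metrics, support_threshold, lift_threshold):
    """Keep pairs with support >= support threshold and lift > lift threshold."""
    selected = []
    for pair, (support, lift) in metrics.items():
        if support >= support_threshold and lift > lift_threshold:
            selected.append(pair)
    return selected


# Hypothetical (support, lift) values per drug pair.
metrics = {
    ("a", "b"): (0.30, 1.5),  # above both thresholds
    ("a", "c"): (0.20, 1.2),  # support exactly at the threshold, lift above
    ("b", "c"): (0.20, 1.0),  # lift not strictly above the threshold
    ("c", "d"): (0.10, 2.0),  # support below the threshold
}
strong_pairs = select_high_frequency_pairs(
    metrics, support_threshold=0.20, lift_threshold=1.0
)
```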
6. The core prescription mining method based on improved association and compound clustering according to claim 4, wherein S705 specifically comprises:
S7051, setting parameters of the improved DBSCAN algorithm;
S7052, with the improved DBSCAN algorithm after parameter setting and in combination with a preset traditional Chinese medicine efficacy standard table, performing efficacy vectorization on each drug in the low-frequency drug pairs and on each initial core cluster, to obtain low-frequency drug efficacy vectors and initial core cluster efficacy vectors;
S7053, calculating the functional similarity between each low-frequency drug pair and each initial core cluster from the low-frequency drug efficacy vectors and the initial core cluster efficacy vectors using the cosine similarity formula:
$$\mathrm{Sim}(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
wherein $\mathrm{Sim}(A, B)$ represents the functional similarity of A and B based on their efficacy vectors, $n$ represents the total dimension of the efficacy vectors, $A_i$ represents the $i$-th component of the efficacy vector of the cluster, and $B_i$ represents the $i$-th component of the efficacy vector of the drug to be assigned across clusters; and
S7054, classifying low-frequency drug pairs whose functional similarity is greater than a functional similarity threshold as associated drugs, and determining the expanded cluster structure.
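The similarity-based assignment of S7053 and S7054 can be sketched as below. This covers only the cosine comparison against cluster mean vectors, not a full DBSCAN procedure, and it assumes one efficacy vector per drug; the cluster names, vectors and threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average, used here as the average efficacy vector of a cluster."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def assign_to_clusters(drug_vector, cluster_vectors, threshold):
    """Names of all clusters whose similarity to the drug exceeds the threshold."""
    return [
        name
        for name, vector in cluster_vectors.items()
        if cosine_similarity(drug_vector, vector) > threshold
    ]


# Hypothetical efficacy dimensions: [tonify qi, strengthen spleen, clear heat].
clusters = {
    "qi_cluster": mean_vector([[1, 1, 0], [1, 1, 0]]),
    "heat_cluster": mean_vector([[0, 0, 1]]),
}
assigned = assign_to_clusters([1, 0, 0], clusters, threshold=0.6)
```

Because the function returns every cluster above the threshold rather than only the best match, a drug can be attached to more than one cluster, which matches the multi-role assignment the description aims for.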
7. The core prescription mining method based on improved association and compound clustering according to claim 1, wherein the traditional Chinese medicine constraint theory comprises hard constraints and soft constraints, and S8 specifically comprises:
S801, performing a hard constraint check on the expanded cluster structure based on preset traditional Chinese medicine compatibility contraindication rules;
S802, based on the result of the hard constraint check and in combination with the soft constraints, correcting drug combinations in the expanded cluster structure that violate the traditional Chinese medicine compatibility contraindication rules;
S803, calculating the main meridian tropism ratio of each drug based on the corrected expanded cluster structure:
$$\mathrm{MainRatio} = \frac{\mathrm{Freq}_{\mathrm{main}}}{\mathrm{Freq}_{\mathrm{total}}}$$
wherein $\mathrm{MainRatio}$ represents the main meridian tropism ratio of the drug, $\mathrm{Freq}_{\mathrm{main}}$ represents the frequency of the main meridian tropism, and $\mathrm{Freq}_{\mathrm{total}}$ represents the total frequency over all meridians;
S804, screening out candidate drugs with scattered meridian tropism by comparing the main meridian tropism ratio with a preset main meridian tropism ratio threshold;
S805, calculating the functional similarity between each candidate drug and each expanded cluster, and assigning candidate drugs whose functional similarity is greater than or equal to the functional similarity threshold to the corresponding expanded clusters across clusters;
S806, outputting, according to the assignment result, the traditional Chinese medicine constrained clustering result corrected by the hard constraints and the soft constraints.
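The hard constraint check of S801 and the ratio screening of S803 and S804 can be sketched as below. The sketch assumes that the main meridian is the one with the highest frequency for a given drug and that a drug is "scattered" when its ratio falls below the threshold; the forbidden pair and all herb names are placeholders, not real compatibility rules.

```python
from itertools import combinations

# Placeholder contraindicated combination; not real pharmacology.
FORBIDDEN_PAIRS = {frozenset(("herb_x", "herb_y"))}


def hard_constraint_violations(cluster, forbidden_pairs):
    """Drug pairs inside one cluster that match a forbidden combination."""
    return [
        pair
        for pair in combinations(sorted(cluster), 2)
        if frozenset(pair) in forbidden_pairs
    ]


def main_meridian_ratio(meridian_counts):
    """Share of the most frequent meridian among all meridian counts of one drug."""
    total = sum(meridian_counts.values())
    if total == 0:
        return 0.0
    return max(meridian_counts.values()) / total


def scattered_candidates(drug_meridians, ratio_threshold):
    """Drugs whose main meridian ratio is below the threshold (assumed criterion)."""
    return [
        drug
        for drug, counts in drug_meridians.items()
        if main_meridian_ratio(counts) < ratio_threshold
    ]


violations = hard_constraint_violations({"herb_x", "herb_y", "herb_z"}, FORBIDDEN_PAIRS)
drug_meridians = {
    "herb_z": {"spleen": 3, "lung": 1},
    "herb_w": {"spleen": 1, "lung": 1, "heart": 1, "liver": 1},
}
candidates = scattered_candidates(drug_meridians, ratio_threshold=0.5)
```

Candidates returned by `scattered_candidates` would then be compared with each expanded cluster by functional similarity, as S805 describes.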
8. The core prescription mining method based on improved association and compound clustering according to claim 1, wherein the preset core drug screening rule comprises a preset double-threshold rule and preset monarch drug and minister drug definition rules, and S9 specifically comprises:
S901, screening core drugs out of each drug cluster based on the preset double-threshold rule, wherein the double-threshold rule requires that the number of high-frequency drug pairs in which a drug participates is not lower than a first threshold and that the main meridian tropism ratio of the drug is not lower than a second threshold;
S902, defining monarch drugs and minister drugs among the core drugs according to the preset monarch drug and minister drug definition rules, to form the core layer;
S903, calculating the functional similarity between each non-core-layer drug and the drug cluster, and merging drugs whose functional similarity is not lower than a fifth threshold into the associated drugs of the drug cluster, to form the association layer;
S904, retaining drug pairs whose functional similarity is lower than the fifth threshold and which are not included in any drug cluster as special combinations, to form the special layer; and
S905, outputting the hierarchical core prescription structure comprising the core layer, the association layer and the special layer.
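The layering logic of S901 to S904 can be sketched as below. It is a simplification under stated assumptions: each drug carries precomputed values for the number of high-frequency pairs it participates in, its main meridian tropism ratio and its functional similarity to the cluster, and the monarch and minister labelling of S902 is omitted because the source does not state its rules. All names and numbers are hypothetical.

```python
def build_layers(drugs, pair_threshold, ratio_threshold, similarity_threshold):
    """Split drugs into core, association and special layers by threshold tests."""
    layers = {"core": [], "association": [], "special": []}
    for name, info in drugs.items():
        passes_double_threshold = (
            info["high_freq_pairs"] >= pair_threshold
            and info["main_ratio"] >= ratio_threshold
        )
        if passes_double_threshold:
            layers["core"].append(name)          # double-threshold rule of S901
        elif info["similarity"] >= similarity_threshold:
            layers["association"].append(name)   # merged as associated drug (S903)
        else:
            layers["special"].append(name)       # kept as special combination (S904)
    return layers


# Hypothetical precomputed statistics for one drug cluster.
drugs = {
    "herb_a": {"high_freq_pairs": 5, "main_ratio": 0.8, "similarity": 0.9},
    "herb_b": {"high_freq_pairs": 1, "main_ratio": 0.7, "similarity": 0.75},
    "herb_c": {"high_freq_pairs": 0, "main_ratio": 0.3, "similarity": 0.2},
}
layers = build_layers(drugs, pair_threshold=3, ratio_threshold=0.6, similarity_threshold=0.7)
```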
9. A core prescription mining system based on improved association and compound clustering, comprising: a processor; and a memory storing computer-readable instructions which, when executed by the processor, implement the core prescription mining method based on improved association and compound clustering of any one of claims 1 to 8.
10. A readable storage medium, wherein a program or instructions are stored on the readable storage medium, and the program or instructions, when executed by a processor, implement the steps of the core prescription mining method based on improved association and compound clustering of any one of claims 1 to 8.

Description

Core prescription mining method and system based on improved association and compound clustering

Technical Field

The invention relates to the technical field of data processing, and in particular to a core prescription mining method and system based on improved association and compound clustering.

Background

Traditional Chinese medicine is a distinctive health resource in China, and mining prescription compatibility rules is a key link in inheriting and innovating its diagnosis and treatment experience. The prescription is the core carrier of treatment based on syndrome differentiation, and its efficacy arises from the synergistic effect of the drugs it contains. Accurately mining the effective core drug combinations (i.e. core prescriptions) in prescriptions is a key foundation for understanding prescription rules in depth, inheriting the experience of traditional Chinese medicine physicians past and present, and assisting the creation of new prescriptions. The core prescription mining method deeply fuses core theories of traditional Chinese medicine, such as the monarch-minister-assistant-guide compatibility logic and meridian tropism and efficacy, with the advantages of modern algorithms such as improved association rules and compound clustering, and removes the superficial dependence of traditional methods on high-frequency statistics and single clustering. The method not only accords with the holistic compatibility thinking of traditional Chinese medicine and ensures the theoretical rationality and clinical interpretability of the core combinations, but also overcomes the omission of effective low-frequency drugs and the limitation of assigning each drug to a single cluster, captures effective low-frequency compatibilities for special symptoms, and supports assigning a drug to multiple clusters in multiple roles.
The method provides solid support, both professional and practical, for in-depth decoding of traditional Chinese medicine prescription rules, differentiated optimization of clinical medication, and research and development of innovative prescriptions. However, existing techniques for mining core prescription combinations from traditional Chinese medicine big data mainly rely on shallow frequency statistics and traditional clustering algorithms (such as K-means). Shallow frequency statistics focus only on high-frequency single drugs, and the analysis is too shallow to reveal the core associations of the prescription structure. Traditional clustering algorithms (such as K-means) group drugs by the similarity of their features and ignore the actual association strength, so effective combinations are mechanically split across different clusters and the mined "combinations" lose their practical clinical or prescription significance.

Disclosure of Invention

The invention provides a core prescription mining method and system based on improved association and compound clustering, aiming to solve the technical problems that the prior art depends excessively on high-frequency statistics, so that effective low-frequency drug pairs are omitted, and that a drug can belong to only a single cluster category, so that its multiple roles cannot be reflected and drug assignment is unidirectional.

The technical scheme provided by the embodiments of the invention is as follows:

In a first aspect, an embodiment of the invention provides a core prescription mining method based on improved association and compound clustering, comprising the following steps:
S1, collecting original prescription data.
S2, preprocessing the original prescription data to obtain a normalized traditional Chinese medicine data set.
S3, counting traditional Chinese medicine frequencies in the normalized traditional Chinese medicine data set, wherein the traditional Chinese medicine frequencies comprise single-drug frequencies and binary drug group co-occurrence frequencies.
S4, calculating four-degree association rule analysis indexes of each binary drug group according to its co-occurrence frequency.
S5, screening a plurality of strongly associated binary drug groups from the binary drug groups by comparing the four-degree association rule analysis indexes with preset four-degree association rule analysis index thresholds.
S6, standardizing and vectorizing each strongly associated binary drug group to construct structured data containing meridian tropism attributes and efficacy attributes.
S7, clustering the structured data with a fourth-order chained compound clustering algorithm to determine an expanded cluster structure.
S8, correcting the expanded cluster structure by combining with the Chinese medicine constr