CN-121979864-A - Construction method of structural domain interaction database based on multi-source data fusion
Abstract
A method for constructing a domain-domain interaction (DDI) database based on multi-source data fusion, belonging to the field of bioinformatics. First, for the PINDER data set, domain boundaries are defined by a multi-algorithm consensus mechanism and, in combination with ECOD homology group annotation and multi-dimensional interface feature verification, high-confidence cross-chain DDIs are identified. Second, for the AFDB data set, the TED domain set is taken as input, interface rigidity is evaluated with geometric screening as an aid, and stable intra-chain DDIs are mined. Third, the 3did database is integrated and peptide-mediated interactions are removed. Heterogeneous type keys are then constructed, binding modes are identified by combining geometric vector fingerprints extracted from the principal axes of inertia with density clustering, and sequence pair keys are generated through library-wide sequence-level redundancy removal to serve as unified anchors for cross-source entities. Finally, a standardized storage architecture that preserves the original quality attributes of each source is established and unique non-redundant primary keys are generated. The invention significantly improves the coverage and accuracy of domain interaction identification.
Inventors
- ZHANG GUIJUN
- Ye Enjia
- WEI PENGCHENG
- XIE LEI
- ZHANG TIANYOU
- WANG HAODONG
- LIANG FANG
Assignees
- Zhejiang University of Technology (浙江工业大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260403
Claims (10)
- 1. A method for constructing a domain interaction database based on multi-source data fusion, characterized in that: first, for the PINDER large-scale experimental data set, domain boundaries are defined by a multi-algorithm consensus mechanism and, in combination with ECOD homology group annotation and multi-dimensional interface feature verification, high-confidence cross-chain domain-domain interactions (DDIs) are identified; second, for AFDB prediction data, the TED domain set is taken as input, interface rigidity is evaluated using the AlphaFold PAE matrix with geometric screening as an aid, and stable intra-chain DDIs are mined; third, the 3did database is integrated and peptide-mediated interactions are removed; then a hierarchical fusion and redundancy removal strategy is implemented, namely, heterogeneous type keys compatible with ECOD/CATH/Pfam are constructed, binding modes are identified by combining geometric vector fingerprints extracted from the principal axes of inertia with density clustering, and globally normalized sequence pair keys are generated through library-wide sequence-level redundancy removal to serve as unified anchors for cross-source entities; and finally, a standardized storage architecture that preserves the original quality attributes of each source is established and unique non-redundant primary keys are generated, completing the construction of the database.
- 2. The method for constructing a domain interaction database based on multi-source data fusion according to claim 1, wherein the method comprises the following steps: Step 1) cross-chain domain interface identification based on PINDER data: for the PINDER experimental complex data set, domains are delineated by a multi-algorithm consensus mechanism, ECOD homology group annotation is added, and cross-chain domain interactions are identified by screening on interface geometric features and biological relevance probability; Step 2) intra-chain domain interface identification based on AFDB data: the TED domain set derived from AFDB is taken as input, domains are paired on the basis of CATH annotation, interface rigidity is evaluated using the AlphaFold predicted aligned error (PAE) matrix with geometric screening as an aid, and intra-chain domain interactions are mined; Step 3) multi-source data integration and preprocessing: the 3did database is introduced as a source of structure templates, peptide-mediated interactions are removed, and PINDER-source, AFDB-source and 3did-source instances are integrated into a unified instance-level candidate set; Step 4) redundancy control and sequence redundancy removal at the domain-pair level: heterogeneous interaction type keys compatible with the different classification systems are constructed, binding modes are identified by combining geometric vector fingerprints extracted from the principal axes of inertia with density clustering, and globally normalized sequence pair keys are generated through library-wide sequence-level redundancy removal and used as unified anchors for merging cross-source entities; Step 5) a standardized storage architecture that preserves the original quality attributes of each source is established and unique non-redundant primary keys are generated, completing the construction of the database.
- 3. The method for constructing a domain interaction database based on multi-source data fusion according to claim 2, wherein step 1) proceeds as follows: 1.1) all protein dimer complex structures are obtained from the PINDER database and screened for quality: only complexes with a resolution better than 2.5 Å are retained, structures in which more than 10% of the atomic coordinates in the interface region are missing are excluded, short peptide fragments of fewer than 5 residues are excluded, and redundant entries with identical sequences and structure identifiers are removed; 1.2) the receptor chain and the ligand chain are divided into domains using Merizo, UniDoc and Chainsaw, the non-domain residues output by Merizo are used as a mask to remove disordered regions, and the intersection over union of the predictions of different algorithms is calculated as IoU(A, B) = |A ∩ B| / |A ∪ B|, where A and B denote the domain residue sets predicted by two algorithms; only regions for which the predictions of at least two algorithms satisfy IoU ≥ 0.8 are retained as high-confidence consensus domains.
- 4. The method for constructing a domain interaction database based on multi-source data fusion according to claim 3, wherein step 1) further comprises: 1.3) the consensus domain sequences are aligned against the hidden Markov model library of the ECOD database using HHsearch and assigned ECOD homology group identifiers, and the interactions between the chains of each complex are traversed to establish candidate cross-chain domain pairs; 1.4) interface screening and true-interaction assessment are performed on the candidate cross-chain domain pairs, comprising: calculating the buried interface contact area BSA and requiring it to be no less than 600 Å²; defining the interface heavy-atom pair set C as the pairs of heavy atoms, one from each domain, whose distance is no more than 5.0 Å, and requiring the heavy-atom contact number, that is, the number of pairs in C, to be no less than 10; defining the interface residue pair set accordingly, such that a receptor residue x and a ligand residue y are recorded as an interface residue pair (x, y) if any pair of their heavy atoms satisfies the distance threshold of 5.0 Å; and predicting the biological relevance probability of the interface using PRODIGY-cryst, only domain pairs with a probability greater than 0.5 being retained as validated cross-chain domain interaction instances.
- 5. The method for constructing a domain interaction database based on multi-source data fusion according to claim 4, wherein step 2) proceeds as follows: 2.1) the subset of domains that is non-redundant at 50% sequence identity is selected from the AFDB TED release as the input source, together with the CATH family annotations generated by Foldseek, and the domain instances are quality filtered so that only instances with an average pLDDT of no less than 90, a domain length of 40 to 400 residues and a valid CATH annotation are retained; 2.2) the interface residue pair set of each candidate domain pair is determined on the basis of any heavy-atom distance of no more than 5 Å, the interface average PAE is calculated from the predicted aligned error (PAE) matrix output by AlphaFold, and candidate domain pairs with an interface average PAE of no more than 4 Å are retained as rigid intra-chain domain interaction candidates; 2.3) a geometric review is performed on the intra-chain candidate domain pairs that passed the PAE filter, reusing the geometric constraints of 1.4), namely an interface contact area of no less than 600 Å² and a heavy-atom contact number of no less than 10, and the domain pairs that satisfy these conditions are written to the validated intra-chain domain interaction instance set.
- 6. The method for constructing a domain interaction database based on multi-source data fusion according to any one of claims 1 to 5, wherein step 3) proceeds as follows: 3.1) the validated cross-chain domain interaction instances obtained in step 1) and the validated intra-chain domain interaction instances obtained in step 2) are collected into a unified candidate set, with the instance record as the smallest unit of subsequent processing; 3.2) the 3did database is introduced as a third source and its domain interaction data are obtained: only template entries marked as domain-domain interactions are extracted, and domain-linear peptide or domain-motif interaction records are removed; the template entries are parsed to obtain structure identifiers, chain identifiers and domain ranges, and the corresponding Pfam domain pair identifiers are traced back to construct the interaction type keys of the 3did source; significance screening is performed on the Score based on the InterPreTS empirical potential provided by 3did and on its normalized Z-score, and only template instances that satisfy both Score ≥ 1.0 and Z-score ≥ 1.8 are retained; 3.3) a topological mode level is introduced for the 3did-source data, the instances within the same interaction type are divided into subgroups according to topological mode, and a single representative structure instance is selected from each subgroup for subsequent processing; 3.4) the PINDER-source instances, the AFDB-source instances and the 3did-source representative instances are aligned on unified core fields and merged into a unified instance candidate set, and fields that do not apply to a given source are left blank, so that all instances are stored in a single table and interpreted according to their source.
- 7. The method for constructing a domain interaction database based on multi-source data fusion according to any one of claims 1 to 5, wherein step 4) proceeds as follows: 4.1) uniform interaction type keys are defined for the different data sources, wherein the type key of a PINDER-source instance is obtained by combining the ECOD homology group annotations of the two domains, the type key of an AFDB-source instance is obtained by combining the CATH annotations of the two domains, and the type key of a 3did-source instance is obtained by combining the Pfam family annotations of the two domains, and the annotation elements of the two sides are sorted in lexicographic order to eliminate type redundancy caused by differences in recording direction; 4.2) binding modes within the same type bucket are characterized by geometric vector fingerprints based on the principal axes of inertia, comprising: extracting the backbone atomic coordinates of each domain to form a point cloud; computing the geometric center of the point cloud and translating the coordinate origin to it; computing the radius of gyration for scale normalization; constructing the inertia tensor and performing eigendecomposition to obtain orthogonal principal axis vectors; performing flip disambiguation on the principal axis directions to form a proper rotation matrix; computing the normalized displacement vector of the ligand domain relative to the receptor domain; converting the relative rotation into a unit quaternion; and concatenating the normalized displacement vector and the unit quaternion to form a low-dimensional geometric fingerprint vector.
- 8. The method for constructing a domain interaction database based on multi-source data fusion according to claim 7, wherein step 4) further comprises: 4.3) for AFDB-source and PINDER-source instances, an instance dissimilarity measure is defined in the geometric fingerprint feature space as a weighted combination of a translation distance and a rotation angle distance, HDBSCAN is used to perform density clustering on the instances within the same type bucket to obtain binding mode clusters, and a unique mode identifier is assigned to each binding mode cluster; 4.4) global sequence-level redundancy removal is performed across the full library, comprising: collecting the representative instances of each source to build a global instance candidate pool; extracting the domain sequences on both sides of each instance and normalizing their order lexicographically to obtain a canonical left domain and a canonical right domain; clustering the canonical left domain sequence set and the canonical right domain sequence set separately by homology, using the sequence identity Id with thresholds of 30% sequence identity and 80% coverage, to obtain sequence clusters; mapping the sequences on both sides of each instance to the corresponding sequence cluster identifiers to generate a sequence pair key; normalizing the order of the sequence pair key lexicographically to obtain the globally normalized sequence pair key; and using the globally normalized sequence pair key to establish cross-source sequence equivalence classes and index mappings.
- 9. The method for constructing a domain interaction database based on multi-source data fusion according to any one of claims 1 to 5, wherein step 5) proceeds as follows: 5.1) a globally unique non-redundant primary key is established from the triple of interaction type key, binding mode identifier and globally normalized sequence pair key; a flat index strategy is adopted in which the non-redundant primary key is the main index column of a tab-separated values (TSV) data table, and the set of member instances covered by each non-redundant primary key is recorded to support backtracking; 5.2) the low-dimensional geometric fingerprint vector is stored persistently and configured to support retrieval based on vector cosine similarity, and physicochemical features such as interface contact area, number of interface hydrogen bonds, number of salt bridges and interface shape complementarity score are inherited and stored.
- 10. The method for constructing a domain interaction database based on multi-source data fusion according to claim 9, wherein step 5) further comprises: 5.3) a primary source attribute and a secondary source attribute are assigned to each non-redundant primary key entry, wherein the primary source attribute distinguishes experimentally determined entries from predicted or inferred entries, the secondary source attribute identifies the set of data sources of the entry, and the original quality attribute fields of different dimensions are filled in according to the source label; 5.4) the TSV file is used as the core storage medium and a terminal command-line interface is provided, the interface being configured to receive an index key and a quality threshold, identify the source label of each row by stream scanning, dynamically apply source-specific validation logic, and write the entries that satisfy the conditions to the standard output stream or to a derived sub-data set.
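As a non-limiting illustration of step 1.2) in claim 3, the consensus test can be sketched in Python as follows. The sketch assumes that each algorithm's output has already been parsed into sets of residue indices. The function names, the dictionary layout and the choice to keep the intersection of an agreeing pair as the consensus region are assumptions made for illustration and are not specified in the claims.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two sets of residue indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consensus_domains(predictions, threshold=0.8):
    """Return regions on which at least two algorithms agree with IoU >= threshold.

    predictions maps an algorithm name to its list of predicted domains,
    each domain being a set of residue indices (hypothetical layout).
    The consensus region is taken as the intersection of the agreeing pair,
    which is an assumption of this sketch.
    """
    consensus = []
    for (_, domains_a), (_, domains_b) in combinations(predictions.items(), 2):
        for a in domains_a:
            for b in domains_b:
                if iou(a, b) >= threshold:
                    region = a & b
                    if region not in consensus:
                        consensus.append(region)
    return consensus


# Toy input: two algorithms agree on residues 1-95, the third does not.
predictions = {
    "Merizo": [set(range(1, 101))],
    "UniDoc": [set(range(1, 96))],
    "Chainsaw": [set(range(40, 200))],
}
consensus = consensus_domains(predictions)  # one region: residues 1-95
```

Masking of the non-domain residues reported by Merizo could be applied before this step by subtracting the mask from every predicted set.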
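As a non-limiting illustration of step 1.4) in claim 4, the heavy-atom contact set and the interface residue pair set can be derived from coordinates with a brute-force distance scan. The sketch assumes heavy atoms are supplied as (residue identifier, coordinates) tuples, and that the buried surface area and the PRODIGY-cryst probability have been computed elsewhere and are passed in as plain numbers. All function and parameter names are hypothetical.

```python
import math

CUTOFF = 5.0  # heavy-atom distance threshold in angstroms, from step 1.4)


def interface_contacts(receptor, ligand, cutoff=CUTOFF):
    """Return (heavy-atom pair set, interface residue pair set).

    receptor and ligand are lists of (residue_id, (x, y, z)) tuples,
    one tuple per heavy atom (assumed input layout).
    """
    atom_pairs = set()
    residue_pairs = set()
    for i, (res_x, xyz_a) in enumerate(receptor):
        for j, (res_y, xyz_b) in enumerate(ligand):
            if math.dist(xyz_a, xyz_b) <= cutoff:
                atom_pairs.add((i, j))
                residue_pairs.add((res_x, res_y))
    return atom_pairs, residue_pairs


def passes_interface_filter(bsa, n_contacts, probability,
                            min_bsa=600.0, min_contacts=10, min_prob=0.5):
    """Apply the three thresholds of step 1.4) to precomputed values."""
    return bsa >= min_bsa and n_contacts >= min_contacts and probability > min_prob


# Toy coordinates: two receptor atoms lie within 5 angstroms of one ligand atom.
receptor = [("A10", (0.0, 0.0, 0.0)), ("A10", (1.0, 0.0, 0.0)), ("A11", (20.0, 0.0, 0.0))]
ligand = [("B5", (4.0, 0.0, 0.0)), ("B6", (30.0, 0.0, 0.0))]
atom_pairs, residue_pairs = interface_contacts(receptor, ligand)
```

For real structures a spatial index such as a k-d tree would replace the quadratic loop; the loop is kept here only for clarity.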
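As a non-limiting illustration of steps 2.1) and 2.2) in claim 5, the interface average PAE can be computed from the PAE matrix and the interface residue pairs. The sketch assumes the PAE matrix is a square list of lists indexed by residue position and that both directions of each pair are averaged; the claims do not state how the asymmetric matrix is handled, so the symmetrization is an assumption, as are the function names.

```python
def passes_domain_quality(mean_plddt, length, cath_id):
    """Quality filter of step 2.1): pLDDT >= 90, length 40-400, valid CATH label."""
    return mean_plddt >= 90 and 40 <= length <= 400 and bool(cath_id)


def interface_mean_pae(pae, residue_pairs):
    """Mean PAE over interface residue pairs, averaging PAE[i][j] and PAE[j][i].

    Returns infinity when there are no interface pairs, so that such
    candidates never pass a "no more than" threshold.
    """
    if not residue_pairs:
        return float("inf")
    total = sum(pae[i][j] + pae[j][i] for i, j in residue_pairs)
    return total / (2 * len(residue_pairs))


def is_rigid_interface(pae, residue_pairs, max_pae=4.0):
    """Rigidity test of step 2.2): interface average PAE <= 4 angstroms."""
    return interface_mean_pae(pae, residue_pairs) <= max_pae


# Toy 3-residue PAE matrix and two interface pairs.
pae = [
    [0.0, 2.0, 6.0],
    [4.0, 0.0, 3.0],
    [8.0, 1.0, 0.0],
]
mean_value = interface_mean_pae(pae, [(0, 1), (1, 2)])  # (2 + 4 + 3 + 1) / 4 = 2.5
```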
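As a non-limiting illustration of step 4.2) in claim 7, the principal-axis geometric fingerprint can be sketched with NumPy. Several details are assumptions of this sketch and are not specified in the claims: one backbone atom per residue with unit mass, a flip rule that makes the third moment of the projections non-negative on the first two axes, normalization of the displacement by the receptor radius of gyration, and the quaternion ordering (w, x, y, z) with w ≥ 0.

```python
import numpy as np


def principal_frame(coords):
    """Return (center, radius of gyration, 3x3 matrix whose columns are the axes)."""
    center = coords.mean(axis=0)
    x = coords - center                             # origin moved to the geometric center
    rg = np.sqrt((x ** 2).sum(axis=1).mean())       # radius of gyration
    inertia = (x ** 2).sum() * np.eye(3) - x.T @ x  # unit-mass inertia tensor
    _, axes = np.linalg.eigh(inertia)               # orthonormal principal axes
    for k in range(2):                              # flip disambiguation (assumed rule)
        if ((x @ axes[:, k]) ** 3).sum() < 0:
            axes[:, k] = -axes[:, k]
    axes[:, 2] = np.cross(axes[:, 0], axes[:, 1])   # right-handed, determinant +1
    return center, rg, axes


def rotation_to_quaternion(r):
    """Unit quaternion (w, x, y, z) of a proper rotation matrix, with w >= 0."""
    tr = np.trace(r)
    q = np.zeros(4)
    if tr > 0:
        s = 2.0 * np.sqrt(1.0 + tr)
        q[:] = [0.25 * s, (r[2, 1] - r[1, 2]) / s,
                (r[0, 2] - r[2, 0]) / s, (r[1, 0] - r[0, 1]) / s]
    else:
        i = int(np.argmax(np.diag(r)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + r[i, i] - r[j, j] - r[k, k])
        q[0] = (r[k, j] - r[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (r[j, i] + r[i, j]) / s
        q[1 + k] = (r[k, i] + r[i, k]) / s
    q /= np.linalg.norm(q)
    return q if q[0] >= 0 else -q


def fingerprint(receptor, ligand):
    """7-dimensional fingerprint: normalized displacement (3) + unit quaternion (4)."""
    c_r, rg_r, frame_r = principal_frame(receptor)
    c_l, _, frame_l = principal_frame(ligand)
    displacement = frame_r.T @ (c_l - c_r) / rg_r   # expressed in the receptor frame
    quaternion = rotation_to_quaternion(frame_r.T @ frame_l)
    return np.concatenate([displacement, quaternion])


# Synthetic point clouds standing in for two domains.
rng = np.random.default_rng(0)
receptor = rng.normal(size=(60, 3)) * [8.0, 5.0, 3.0]
ligand = rng.normal(size=(50, 3)) * [6.0, 4.0, 2.0] + [15.0, 3.0, -2.0]
fp = fingerprint(receptor, ligand)
```

Because both the displacement and the relative rotation are expressed in the receptor's own frame, the fingerprint is unchanged when the whole complex is rotated or translated, which is the property that makes instances from different structures comparable.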
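As a non-limiting illustration of step 4.3) in claim 8, the instance dissimilarity measure can be sketched as a weighted sum of a translation distance and a rotation angle. The sketch assumes 7-dimensional fingerprints laid out as three displacement components followed by a unit quaternion; the weights and function names are hypothetical, since the claims do not give the weighting.

```python
import math


def rotation_angle(q1, q2):
    """Angle in radians between two rotations given as unit quaternions.

    The absolute value makes q and -q, which encode the same rotation,
    compare as identical.
    """
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


def instance_distance(f1, f2, w_trans=1.0, w_rot=1.0):
    """Weighted sum of translation distance and rotation angle (assumed weights)."""
    d_trans = math.dist(f1[:3], f2[:3])
    d_rot = rotation_angle(f1[3:], f2[3:])
    return w_trans * d_trans + w_rot * d_rot


def distance_matrix(fingerprints, w_trans=1.0, w_rot=1.0):
    """Symmetric pairwise distance matrix for the instances of one type bucket."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = instance_distance(fingerprints[i], fingerprints[j], w_trans, w_rot)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Two toy fingerprints: shifted by a 3-4-5 triangle and rotated 90 degrees about z.
f1 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
f2 = [3.0, 4.0, 0.0, math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
d12 = instance_distance(f1, f2)  # 5 + pi/2
```

A matrix built this way could be handed to an HDBSCAN implementation that accepts precomputed distances, so that the clustering uses exactly this measure rather than a Euclidean distance on the raw vectors.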
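As a non-limiting illustration of steps 5.1) and 5.2) in claim 9, the non-redundant primary key and the cosine similarity retrieval can be sketched as follows. The separators, the field names of the instance records and the example identifiers are hypothetical; only the triple structure of the key and the lexicographic ordering of each pair come from the claims.

```python
import math
from collections import defaultdict


def ordered_pair_key(left, right, sep="|"):
    """Direction-independent key: the two elements in lexicographic order."""
    return sep.join(sorted((left, right)))


def primary_key(annot_a, annot_b, mode_id, cluster_a, cluster_b):
    """Triple of type key, binding mode identifier and sequence pair key."""
    return "::".join([
        ordered_pair_key(annot_a, annot_b),
        mode_id,
        ordered_pair_key(cluster_a, cluster_b),
    ])


def group_members(instances):
    """Map each primary key to the instance ids it covers, to support backtracking."""
    members = defaultdict(list)
    for inst in instances:
        key = primary_key(inst["annot_a"], inst["annot_b"], inst["mode_id"],
                          inst["cluster_a"], inst["cluster_b"])
        members[key].append(inst["instance_id"])
    return dict(members)


def cosine_similarity(u, v):
    """Cosine similarity of two fingerprint vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two records of the same interaction, written in opposite directions.
instances = [
    {"instance_id": "i1", "annot_a": "PF00069", "annot_b": "PF00017",
     "mode_id": "M1", "cluster_a": "C7", "cluster_b": "C2"},
    {"instance_id": "i2", "annot_a": "PF00017", "annot_b": "PF00069",
     "mode_id": "M1", "cluster_a": "C2", "cluster_b": "C7"},
]
members = group_members(instances)
```

Sorting each pair before joining is what collapses the two directions of a record into a single entry, so that both instances above fall under the same key.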
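As a non-limiting illustration of step 5.4) in claim 10, the stream scan over the TSV table with source-specific checks can be sketched with the Python standard library. The column names, the source labels and the rule that experimental rows are checked against resolution while predicted rows are checked against interface average PAE are assumptions of this sketch; the claims only state that the validation logic depends on the source label.

```python
import csv
import io


def passes_quality(row, max_resolution, max_pae):
    """Apply the check that matches the row's source label (assumed rules)."""
    if row["source"] == "PINDER":          # experimental entry
        return float(row["resolution"]) <= max_resolution
    if row["source"] == "AFDB":            # predicted entry
        return float(row["mean_pae"]) <= max_pae
    return True                            # other sources: no threshold in this sketch


def scan(stream, key, max_resolution=2.5, max_pae=4.0):
    """Yield rows with the given primary key that pass their source-specific check.

    Rows are read one at a time, so the table never has to fit in memory.
    """
    for row in csv.DictReader(stream, delimiter="\t"):
        if row["primary_key"] == key and passes_quality(row, max_resolution, max_pae):
            yield row


# Toy table: fields that do not apply to a source are left blank.
table = io.StringIO(
    "primary_key\tsource\tresolution\tmean_pae\n"
    "K1\tPINDER\t2.1\t\n"
    "K1\tAFDB\t\t5.2\n"
    "K1\tAFDB\t\t3.0\n"
    "K2\tPINDER\t1.8\t\n"
)
hits = list(scan(table, "K1"))
```

In a command-line tool the same generator could read from an open file or standard input, and the selected rows could be written back out with `csv.DictWriter` using the same tab delimiter to produce a sub-data set.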
Description
Construction method of structural domain interaction database based on multi-source data fusion
Technical Field
The invention belongs to the technical field of protein structure and bioinformatics, and particularly relates to a method for constructing a domain interaction database based on multi-source data fusion.
Background
Domains are the fundamental units of proteins that fold and function independently, and interactions between the domains of different proteins play a key role in biological processes such as signal transduction and the assembly of molecular complexes. Several domain interaction databases already exist in the prior art, such as the iPfam database, which records interactions between Pfam domains based on experimentally determined three-dimensional structures in the Protein Data Bank (PDB), and the 3did database, which collects domain-pair interaction templates observed in known high-resolution crystal or cryo-electron microscopy structures. These databases rely mainly on manual annotation or on the limited set of published experimental structures, and therefore suffer from limited coverage and slow updates. On the one hand, constrained by the number of experimentally resolved protein complex structures, traditional domain interaction databases cover only thousands to tens of thousands of domain interaction pairs, far short of the potential interactions at the proteome level. On the other hand, these databases are typically built on predefined sequence domains and lack a mechanism for systematically screening interaction interfaces at the structural level, so atypical domain partitions or potential interactions may be missed. In addition, because unified standards for biological function verification and structural reliability filtering are lacking, the quality of domain interactions from different data sources is uneven, and it is difficult to ensure that the interfaces recorded in a database and the corresponding interactions are biologically meaningful.
In recent years, with breakthroughs in deep learning structure prediction tools such as AlphaFold, the amount of protein structure data has grown explosively. The AlphaFold Protein Structure Database has released high-confidence predicted structural models for more than 180 million protein sequences, covering the vast majority of known proteins. However, current mainstream domain interaction databases do not yet make full use of these large-scale prediction resources, and a screening mechanism that fuses experimental complexes with predicted monomers under a unified framework is still lacking. A new technical scheme is therefore needed that breaks down the barriers between heterogeneous classification systems, covers a wider structural space by exploiting the complementarity of data from different sources, introduces geometric fingerprints based on physical properties and a sequence deduplication mechanism, systematically fuses multi-source data, efficiently screens true domain interactions, and constructs a domain interaction database with high coverage and high reliability, so as to support downstream research such as new drug target discovery and protein function prediction.
Disclosure of Invention
In order to overcome the defects of existing domain interaction databases, namely limited coverage, unsystematic screening and difficulty in integrating the latest structure prediction data, the invention provides a method for constructing a cross-chain and intra-chain domain interaction database based on the fusion of multi-source heterogeneous data. The method fuses multi-source data and applies uniform screening standards during database construction, significantly improving the coverage and accuracy of domain interaction identification.
The technical scheme adopted to solve the technical problem is as follows: A method for constructing a domain interaction database based on multi-source data fusion, in which, first, for the PINDER large-scale experimental data set, domain boundaries are defined by a multi-algorithm consensus mechanism and, in combination with ECOD homology group annotation and multi-dimensional interface feature verification, high-confidence cross-chain DDIs are identified; second, for AFDB prediction data, the TED domain set is taken as input, interface rigidity is evaluated using the AlphaFold PAE matrix with geometric screening as an aid, and stable intra-chain DDIs are mined; third, the 3did database is integrated and peptide-mediated interactions are removed; then a hierarchical fusion and redundancy removal strategy is implemented, namely, heterogeneous type keys compatible with ECOD/CATH/Pfam are constructed, binding modes are identified by combining geometric vector fingerprints extracted from the principal axes of inertia with density clustering, and globally normalized sequence pair keys are generated through library-wide sequence-level redundancy removal; and finally,