Search

CN-115691656-B - Acceleration method and device for large system evolutionary tree

CN115691656BCN 115691656 BCN115691656 BCN 115691656BCN-115691656-B

Abstract

The invention discloses an acceleration method and device of a large system evolutionary tree, the method comprises the steps of obtaining attributes of all nodes in the system evolutionary tree, clustering leaf nodes of the system evolutionary tree based on transverse coordinates and longitudinal coordinates to obtain outliers and node clusters, taking a node as a new leaf node under the condition that all descendant leaf nodes of one node belong to one node cluster and the number of the descendant leaf nodes is not smaller than a number threshold, setting the visible state of the descendant leaf node as the current node to be invisible to obtain a new system evolutionary tree, and aiming at the new system evolutionary tree, obtaining the visible state of each node in the leaf branch by judging whether each leaf branch is invisible due to coverage of other nodes so as to generate an acceleration result of the system evolutionary tree. The invention avoids the phenomenon of bundling and folding during the visualization of the large system evolutionary tree.

Inventors

  • MENG ZHEN
  • ZHANG BO
  • ZHENG LINGLU
  • CHEN YAN
  • HU ZHILONG
  • WANG JIAJIA

Assignees

  • 中国科学院计算机网络信息中心

Dates

Publication Date
20260505
Application Date
20221011

Claims (8)

  1. 1. A method for accelerating a large phylogenetic tree, the method comprising: the method comprises the steps of obtaining attributes of all nodes in a system evolution tree, wherein the attributes comprise transverse coordinates, longitudinal coordinates, father nodes, child nodes, visible states and radiuses, and the initial value of the visible states is that the current node is visible; clustering leaf nodes of the phylogenetic tree based on the transverse coordinates and the longitudinal coordinates to obtain outliers and node clusters; When all the descendant leaf nodes of a node belong to one node cluster and the number of the descendant leaf nodes is not smaller than a number threshold, taking the node as a new leaf node, and setting the visible state of the descendant leaf node as invisible to the current node to obtain a new system evolution tree, wherein the radius of the new leaf node is obtained based on the number of the descendant leaf nodes; Aiming at the new phylogenetic tree, obtaining the visible state of each node in each leaf branch by judging whether each leaf branch is invisible due to coverage of other nodes; Generating an acceleration result of the phylogenetic tree based on the visible state of each node in each leaf branch; the step of obtaining the visible state of each node in each leaf branch by judging whether each leaf branch is invisible due to coverage of other nodes aiming at the new phylogenetic tree comprises the following steps: Traversing all leaf nodes aiming at the new phylogenetic tree, and establishing a sequence coverage TABLE VISIBLE_TABLE of the leaf nodes; for each leaf node, checking the branch from the leaf node to the parent node, and marking the coverage TIMES cover_mes as 0; extending the NODE_SIZE SIZE from the leaf NODE toward the parent NODE; Acquiring a new extension position of the last extension position; Searching leaf nodes in an abscissa opening interval between the last extending position and the new extending position in the sequence coverage TABLE VISIBLE_TABLE, and recording the leaf nodes as neighbor nodes; If at least one of the neighbor nodes is in the coverage of the last extending position or the new extending position, adding 1 to the coverage TIMES cover_TIMES of the corresponding branch, and jumping to judge whether the coverage TIMES cover_TIMES are equal to the actual extending TIMES or not and whether the coverage TIMES cover_TIMES extends beyond the father node or not; If any NODE in the neighbor NODEs is not in the coverage of the last extending position or the new extending position, after the coverage TIMES cover_mes of the corresponding branches are unchanged and all descendant internal NODEs and leaf NODEs under the corresponding branches of the last extending position are set to be invisible, selecting the next leaf NODE in the sequence coverage TABLE visible_table, returning to the step of starting from the leaf NODE, and extending the SIZE of the node_size towards the direction of the father NODE; Judging whether the coverage TIMES cover_mes are equal to the actual extension TIMES or not and whether the coverage TIMES cover_mes are extended beyond a root node or not: When the coverage TIMES cover_mes are equal to the actual extension TIMES and do not extend beyond the root node, taking the new extension position as a last extension position, and returning to the new extension position for acquiring the last extension position so as to obtain a continuous longest invisible path; When the coverage TIMES cover_mes are unequal to the actual extension TIMES or extend beyond the root NODE, setting all descendant internal NODEs and leaf NODEs under the branch corresponding to the last extension position to be invisible, selecting the next leaf NODE in the sequence coverage TABLE visible_table, returning to the step of obtaining, starting from the leaf NODE, and extending the NODE_size towards the direction of the father NODE; Marking the visible states of all offspring internal nodes and leaf nodes in the continuous longest invisible path as invisible to the current node.
  2. 2. The method of claim 1, wherein the obtaining attributes of all nodes in the phylogenetic tree comprises: reading Newick format files of the phylogenetic tree and converting the files into a JSON structure; initializing a system evolution tree based on the JSON structure, traversing all nodes from a root node of the system evolution tree, and recording a father node corresponding to each node; Setting a transverse scaling factor SCALE_X and a longitudinal scaling factor SCALE_Y; Setting the leaf node position as (X, Y) = (total_DISTANCE. SCALE_X, COUNT. SCALE_Y-BIAS_Y), wherein total_DISTANCE represents the sum of path lengths of all paths from the current node to the root node, COUNT is the number of leaf nodes currently traversed, and BIAS_Y is the offset in the Y-axis direction; Backtracking from the leaf node to traverse the parent node to the root node, and the currently traversed node is located at (X, Y) = (total_distance_scale_x, (min_son_y+max_son_y)/2), wherein min_son_y and max_son_y are the minimum coordinate value and the maximum coordinate value of the child node in the Y direction, respectively; And traversing all nodes and branches in a hierarchy, and setting all nodes to be in a visible state.
  3. 3. The method of claim 1, wherein clustering leaf nodes of the phylogenetic tree based on the lateral coordinate and the longitudinal coordinate to obtain outliers and clusters of nodes comprises: Setting a first super parameter MIN_ SAMPLES and a second super parameter EPS, and setting and inputting the first super parameter MIN_ SAMPLES as the minimum sample number in one cluster and the second super parameter EPS as a scanning radius; and executing a DBSCAN algorithm on all the leaf nodes by using the first super parameter MIN_ SAMPLES and the second super parameter EPS to obtain a plurality of node clusters and outliers, wherein each node cluster comprises a plurality of leaf nodes.
  4. 4. The method of claim 1, wherein in a case where all descendant leaf nodes of a node belong to one of the node clusters and the number of descendant leaf nodes is not less than a number threshold, the node is taken as a new leaf node, and the visible state of the descendant leaf node is set to be invisible to a current node, to obtain a new phylogenetic tree, comprising: setting a quantity threshold MIN_NODES, and initializing a LIST TEMP_LIST to be empty; For the NODES in each class, starting from the root node and traversing the hierarchy, if all the descendant leaf NODES of a branch belong to a node cluster and the number of the descendant leaf NODES is greater than or equal to the MIN_NODES number of points, setting the visible state of the node to be changed into a temporary state For the nodes in each node cluster, starting hierarchical traversal from the root node; if all the descendant leaf NODES of a node belong to one node cluster and the number of the descendant leaf NODES is greater than or equal to the number threshold MIN_NODES, modifying the visible state of the node to a temporary state and then adding the node to the LIST TEMP_LIST; traversing the LIST TEMP_LIST, modifying the visible state of the node to be visible to the current node, setting the radius of the node based on the number of the descendant leaf nodes of the node, and taking the node as a new leaf node to obtain a new phylogenetic tree.
  5. 5. The method of claim 1, wherein after generating the acceleration result for the phylogenetic tree based on the visible state of the nodes in each leaf branch, further comprising: And visualizing the acceleration result.
  6. 6. An acceleration apparatus for a large phylogenetic tree, said apparatus comprising: The system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring attributes of all nodes in a system evolution tree, wherein the attributes comprise transverse coordinates, longitudinal coordinates, father nodes, child nodes, visible states and radiuses, and initial values of the visible states are visible to a current node; The clustering module is used for clustering leaf nodes of the phylogenetic tree based on the transverse coordinates and the longitudinal coordinates to obtain outliers and node clusters; The system comprises a first filtering module, a second filtering module and a third filtering module, wherein the first filtering module is used for taking a node as a new leaf node and setting the visible state of the descendant leaf node as invisible to a current node so as to obtain a new system evolutionary tree when all descendant leaf nodes of the node belong to one node cluster and the number of the descendant leaf nodes is not smaller than a number threshold value; the secondary filtering module is used for acquiring the visible state of each node in each leaf branch aiming at the new phylogenetic tree by judging whether each leaf branch is invisible due to coverage of other nodes; The result generation module is used for generating an acceleration result of the phylogenetic tree based on the visible state of each node in each leaf branch; the step of obtaining the visible state of each node in each leaf branch by judging whether each leaf branch is invisible due to coverage of other nodes aiming at the new phylogenetic tree comprises the following steps: Traversing all leaf nodes aiming at the new phylogenetic tree, and establishing a sequence coverage TABLE VISIBLE_TABLE of the leaf nodes; for each leaf node, checking the branch from the leaf node to the parent node, and marking the coverage TIMES cover_mes as 0; extending the NODE_SIZE SIZE from the leaf NODE toward the parent NODE; Acquiring a new extension position of the last extension position; Searching leaf nodes in an abscissa opening interval between the last extending position and the new extending position in the sequence coverage TABLE VISIBLE_TABLE, and recording the leaf nodes as neighbor nodes; If at least one of the neighbor nodes is in the coverage of the last extending position or the new extending position, adding 1 to the coverage TIMES cover_TIMES of the corresponding branch, and jumping to judge whether the coverage TIMES cover_TIMES are equal to the actual extending TIMES or not and whether the coverage TIMES cover_TIMES extends beyond the father node or not; If any NODE in the neighbor NODEs is not in the coverage of the last extending position or the new extending position, after the coverage TIMES cover_mes of the corresponding branches are unchanged and all descendant internal NODEs and leaf NODEs under the corresponding branches of the last extending position are set to be invisible, selecting the next leaf NODE in the sequence coverage TABLE visible_table, returning to the step of starting from the leaf NODE, and extending the SIZE of the node_size towards the direction of the father NODE; Judging whether the coverage TIMES cover_mes are equal to the actual extension TIMES or not and whether the coverage TIMES cover_mes are extended beyond a root node or not: When the coverage TIMES cover_mes are equal to the actual extension TIMES and do not extend beyond the root node, taking the new extension position as a last extension position, and returning to the new extension position for acquiring the last extension position so as to obtain a continuous longest invisible path; When the coverage TIMES cover_mes are unequal to the actual extension TIMES or extend beyond the root NODE, setting all descendant internal NODEs and leaf NODEs under the branch corresponding to the last extension position to be invisible, selecting the next leaf NODE in the sequence coverage TABLE visible_table, returning to the step of obtaining, starting from the leaf NODE, and extending the NODE_size towards the direction of the father NODE; Marking the visible states of all offspring internal nodes and leaf nodes in the continuous longest invisible path as invisible to the current node.
  7. 7. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-5.
  8. 8. A computer device comprising a memory and a processor, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any of claims 1-5.

Description

Acceleration method and device for large system evolutionary tree Technical Field The invention belongs to the technical field of applied bioinformatics, and relates to an acceleration method and device for a large system evolutionary tree. Background Phylogenetic tree is indispensable in biological research, and it relates species or population in hierarchical structure and may be used to clarify the evolutionary form of organism, the relatedness of organisms of different classes and the dynamic change rule of organism. Phylogenetic trees have positive significance on the application level, such as for predicting which species are dying, from which people can make corresponding precautionary measures, help identify relevant members with close relationships of species with pharmacological significance, identify and classify various microorganisms including bacteria, and the like. Currently, more than 8 tens of thousands of species have been analyzed for evolutionary relationships, while up to more than 100 tens of thousands of species have not yet been analyzed, indicating that the market for analyzing species evolutionary relationships remains enormous. Many tools for phylogenetic tree visualization such as iTOL v, phyD, phyloTree and the like exist in the field of bioinformatics, and are excellent in metadata visualization, software portability and function expandability. However, due to the development of high-throughput sequencing technology and artificial intelligence technology, massive biological information data are created, the scale of the phylogenetic tree is continuously increased, and in order to improve the visualization efficiency of the large-scale phylogenetic tree, acceleration methods are needed. How to design the acceleration method is a problem to be solved. When the existing phylogenetic tree tool visualizes the phylogenetic tree, all nodes and branches are always drawn, and the advantage of the existing phylogenetic tree tool is that all the characteristics of the original phylogenetic tree can be reserved and distortion caused by scaling is avoided. However, the efficiency of the method is low, especially in the scene of processing large phylogenetic tree, the size of the phylogenetic tree facing researchers is increased from less than 1000 leaf nodes to more than 10 ten thousand leaf nodes, and the visualization efficiency of the large phylogenetic tree is low, so that the analysis requirement of the researchers on the large phylogenetic tree cannot be met. In order to improve the visualization efficiency of the large phylogenetic tree, the bioinformatics field explores the same for a long time, and the current scheme mainly focuses on folding subtrees, namely, by folding certain larger subtrees of the large phylogenetic tree, the number of nodes and the number of branches to be rendered are reduced. Although the method can realize the acceleration of large-scale evolutionary trees, the folded subtrees often have more characteristics, for example iTOL, the subtrees with more than 200 nodes can be automatically folded, the pricking-pushing folding can enable a large number of characteristics contained in the subtrees to disappear, the leaf nodes are very important for the analysis of the phylogenetic relationship from the global aspect, and the phylogenetic tree after the folding method is accelerated can not embody the characteristics of the leaf nodes of the original phylogenetic tree. In addition, the classification relation among species is inferred by using a clustering method by the biological chemistry, and the method for grouping and accelerating the leaf nodes of the large-scale evolutionary tree by combining the clustering method has wide application prospect. Therefore, in the face of the current situation that the system evolution tree is larger and larger in scale and the large system evolution tree is more and more in visual scene, the method for accelerating the large system evolution tree and keeping the leaf node characteristics as much as possible is significant. Disclosure of Invention Aiming at the situation that the visualization efficiency of the large-scale phylogenetic tree is too low, the invention provides an acceleration method and an acceleration device of the large-scale phylogenetic tree, which are mainly applied to the large-scale phylogenetic tree acceleration method in the related fields of large-scale phylogenetic tree visualization, large-scale phylogenetic relationship exploration, large-scale species division and the like. The technical content of the invention comprises: a method of accelerating a large phylogenetic tree, the method comprising: the method comprises the steps of obtaining attributes of all nodes in a system evolution tree, wherein the attributes comprise transverse coordinates, longitudinal coordinates, father nodes, child nodes, visible states and radiuses, and the initial value of the visible states is that the current node is v