CN-115565603-B - SAXS data-aided two-stage multi-domain protein assembly method
Abstract
Within the framework of an evolutionary algorithm, a population is first initialized and the conformations undergo mutation and crossover in two stages. SAXS experimental data are then used to calculate a model similarity difference, which assists the DEMO energy function in selecting solutions, while a Monte Carlo probabilistic acceptance criterion maintains the diversity of conformations during selection. Finally, simulation and optimization generate models that may contain multiple states. The invention provides a SAXS data-assisted two-stage multi-domain protein assembly method with high sampling efficiency and high assembly prediction accuracy.
Inventors
- LIANG FANG
- PENG CHUNXIANG
Assignees
- 阳泉市三禾氧化锌有限责任公司
- 浙江工业大学
Dates
- Publication Date
- 20260505
- Application Date
- 20220913
Claims (5)
- 1. A SAXS data-assisted two-stage multi-domain protein assembly method, comprising the following steps:
  1) Providing the full-length sequence of a target multi-domain protein and small-angle X-ray scattering (SAXS) experimental data, wherein the SAXS data comprise the scattering vector q_exp, the spectral intensity I_exp(q) and the experimental error e_exp(q) of the protein measured in the small-angle X-ray scattering experiment;
  2) Performing domain segmentation on the target protein sequence using the DomBpred domain segmentation server, modeling each domain structure with AlphaFold from its domain sequence to obtain the single-domain structure models of the multi-domain protein, and linking the N-terminus and C-terminus of the single-domain structure models together in sequence order;
  3) Setting parameters: the first-stage population size NP_1, the second-stage population size NP_2, the first-stage crossover factor CR_1, the second-stage crossover factor CR_2, the first-stage temperature factor β_1, the second-stage temperature factor β_2 and the current generation g, with the generation counter initialized to g=0;
  4) Population initialization: the movement of each domain model is represented by a rotation vector and a translation vector, and a domain assembly solution is expressed as (x_1, y_1, z_1, θ_1, φ_1, ω_1, …, x_n, y_n, z_n, θ_n, φ_n, ω_n), wherein x_n, y_n, z_n is the translation vector of the nth domain and θ_n, φ_n, ω_n is the rotation vector of the nth domain;
  5) Randomly generating NP_1 initial solutions C_i, i={1,2,…,NP_1}, within the solution space S_1;
  6) Performing the following for each solution C_i in the population:
  6.1) Setting C_i as the target solution C_target, and randomly selecting from the population NP_1 three mutually different solutions C_a, C_b and C_c; randomly selecting the rotation-translation matrices of different domains from C_b and C_c respectively, and substituting them into the corresponding positions of C_a to generate a mutant solution C_mutant;
  6.2) Generating a random number pCR, where pCR ∈ (0,1); if pCR < CR_1, randomly selecting the translation and rotation vectors of one domain from C_target and substituting them into the corresponding position of C_mutant to generate a trial solution C_trial'; otherwise, directly taking C_mutant as C_trial';
  6.3) If C_trial' lies within the solution space S_1, denoting C_trial' as C_trial; otherwise, randomly generating a solution within the solution space S_1 as C_trial;
  6.4) Using the FoXS server to calculate, for the full-length models generated by applying the rotations and translations of C_target and C_trial, the scattering vectors q_target and q_trial, the spectral intensities I_target(q) and I_trial(q), and the experimental errors e_target(q) and e_trial(q);
  6.5) Calculating the similarity difference S between the full-length models generated by the rotations and translations of C_target and C_trial according to equation (1), where M is the number of scattering vectors, and I_exp(q_m), I_target(q_m) and I_trial(q_m) denote, for the mth scattering vector, the spectral intensity of the SAXS experiment, the spectral intensity of the full-length model after the C_target transformation as calculated by the FoXS server, and the spectral intensity of the full-length model after the C_trial transformation, respectively;
  6.6) If S ≤ 0, retaining C_target;
  6.7) If S > 0, calculating the energies E_DEMO(C_trial) and E_DEMO(C_target) of the full-length models generated by the C_trial and C_target transformations using the DEMO energy function; if E_DEMO(C_trial) < E_DEMO(C_target), C_trial replaces C_target; otherwise, the conformation is accepted with a probability given by the Monte Carlo criterion;
  7) Setting g=g+1 and running step 6) iteratively until the best solution in the population has not changed for g_tolerance consecutive generations, then performing step 8);
  8) Setting the difference solution space S_2 and selecting the best NP_1/2 individuals of the population NP_1 to form a new population NP_2, then performing the following for each individual C_i in the population NP_2:
  8.1) Setting the solution C_i in the population NP_2 as the target C_target, and randomly selecting from the population NP_2 two mutually different solutions C_a and C_b; randomly selecting the rotation-translation vectors of the same domain from C_a and C_b, and taking their difference to obtain the rotation-translation difference vector for that domain;
  8.2) If the difference vector lies within the difference solution space S_2, adding it to the corresponding position of C_target to generate a trial solution C_trial; otherwise, randomly generating a difference rotation-translation vector within the solution space and adding it to the corresponding position of C_target to generate a trial solution C_trial;
  8.3) Using the FoXS server to calculate, for the full-length models generated by applying the rotations and translations of C_target and C_trial, the scattering vectors q_target and q_trial, the spectral intensities I_target(q) and I_trial(q), and the experimental errors e_target(q) and e_trial(q);
  8.4) Calculating the similarity difference S between the full-length models generated by the rotations and translations of C_target and C_trial according to equation (1); if S ≤ 0, retaining C_target;
  8.5) If S > 0, calculating the energies E_DEMO(C_trial) and E_DEMO(C_target) of C_trial and C_target using the DEMO energy function; if E_DEMO(C_trial) < E_DEMO(C_target), C_trial replaces C_target; otherwise, the solution is accepted with a probability given by the Monte Carlo criterion;
  9) Running step 8) iteratively until the solution with the lowest E_DEMO has not changed for m consecutive generations, and outputting the n solutions with the lowest E_DEMO as the final result.
- 2. The SAXS data-assisted two-stage multi-domain protein assembly method according to claim 1, wherein in step 3), the first-stage population size NP_1=100, the second-stage population size NP_2=NP_1/2, the first-stage crossover factor CR_1=0.5, the second-stage crossover factor CR_2=0.2, the first-stage temperature factor β_1=10, and the second-stage temperature factor β_2=2.
- 3. The SAXS data-assisted two-stage multi-domain protein assembly method according to claim 1 or 2, wherein in step 7), g_tolerance=50.
- 4. The SAXS data-assisted two-stage multi-domain protein assembly method according to claim 1 or 2, wherein in step 5) the upper limit of the solution space S_1 is (100.0, 100.0, 100.0, 2π, 2π, …, 100.0, 100.0, 100.0, 2π, 2π) and the lower limit of the solution space S_1 is (-100.0, -100.0, -100.0, 0, 0, 0, …, -100.0, -100.0, -100.0, 0, 0, 0), and in step 8) the upper limit of the difference solution space S_2 is (1.0, 1.0, 1.0, 0.5, 0.5, 0.25, …, 1.0, 1.0, 1.0, 0.5, 0.5, 0.25) and the lower limit is (-1.0, -1.0, -1.0, 0, 0, 0, …, -1.0, -1.0, -1.0, 0, 0, 0).
- 5. The SAXS data-assisted two-stage multi-domain protein assembly method according to claim 1 or 2, wherein in step 9), m=100 and n=5.
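For illustration only, and not as part of the claims, the first-stage trial generation (steps 6.1 to 6.3) and the selection rule (steps 6.6 and 6.7) might be sketched in Python as below. Several points are assumptions rather than statements of the patented method: all function names are hypothetical; the third rotational upper bound of S_1 is not legible in the text and is taken as 2π; the similarity difference S is passed in as a precomputed number because equation (1) is not reproduced here; and the acceptance probability is assumed to take the common Metropolis form exp(-(E_trial - E_target)/β). The FoXS server and the DEMO energy function are not called; their outputs are supplied by the caller.

```python
import math
import random

# Per-domain bounds of solution space S_1: (x, y, z, theta, phi, omega).
# ASSUMPTION: the third rotational upper bound is not legible in the text; 2*pi is used.
S1_LOWER = (-100.0, -100.0, -100.0, 0.0, 0.0, 0.0)
S1_UPPER = (100.0, 100.0, 100.0, 2 * math.pi, 2 * math.pi, 2 * math.pi)


def random_solution(n_domains, rng):
    """Step 5: draw one solution uniformly from S_1 (one 6-vector per domain)."""
    return [[rng.uniform(lo, hi) for lo, hi in zip(S1_LOWER, S1_UPPER)]
            for _ in range(n_domains)]


def in_s1(solution):
    """Check that every component of every domain lies within the S_1 bounds."""
    return all(lo <= v <= hi
               for domain in solution
               for v, lo, hi in zip(domain, S1_LOWER, S1_UPPER))


def stage1_trial(population, i, cr1, rng):
    """Steps 6.1-6.3: build a trial solution for individual i (needs >= 2 domains)."""
    target = population[i]
    n_domains = len(target)
    # 6.1: three mutually different solutions C_a, C_b, C_c.
    a, b, c = rng.sample(range(len(population)), 3)
    mutant = [list(d) for d in population[a]]
    # Copy one domain from C_b and a different domain from C_c into C_a.
    dom_b, dom_c = rng.sample(range(n_domains), 2)
    mutant[dom_b] = list(population[b][dom_b])
    mutant[dom_c] = list(population[c][dom_c])
    # 6.2: with probability CR_1, copy one random domain of C_target into the mutant.
    if rng.random() < cr1:
        k = rng.randrange(n_domains)
        mutant[k] = list(target[k])
    # 6.3: keep the trial if it lies in S_1, otherwise resample within S_1.
    return mutant if in_s1(mutant) else random_solution(n_domains, rng)


def select(target, trial, s, e_target, e_trial, beta, rng):
    """Steps 6.6-6.7: s is the similarity difference S, supplied by the caller."""
    if s <= 0:
        return target              # 6.6: retain C_target
    if e_trial < e_target:
        return trial               # 6.7: lower DEMO energy wins
    # ASSUMPTION: Metropolis-style acceptance; the exact formula is not given in the text.
    p_accept = math.exp(-(e_trial - e_target) / beta)
    return trial if rng.random() < p_accept else target
```

A caller would build a population with `random_solution`, produce a trial for each individual with `stage1_trial(population, i, 0.5, rng)`, obtain S and the two energies from external tools, and then apply `select` with β_1=10 as in claim 2.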
Description
SAXS data-aided two-stage multi-domain protein assembly method
Technical Field
The invention relates to the fields of bioinformatics and computer applications, and in particular to a SAXS data-aided two-stage multi-domain protein assembly method.
Background
Proteins are the main agents of life activities: they support almost all functions of life, and most reactions occurring within cells depend on them. The form and function of a protein depend on its unique three-dimensional structure, a principle often summarized as "structure determines function". Protein structure prediction is a main research topic of structural bioinformatics. In 2005, the journal Science listed "can protein folding be predicted?" as one of the 125 most challenging frontier scientific questions. How protein molecular machines spontaneously assemble into specific functional structures is one of the most critical open problems in completing the central dogma of biology, and one of the major fundamental scientific problems still unsolved in the life sciences. It is also a major engineering problem for industries such as innovative drug development, vaccine design, precision diagnosis and medical treatment. In 2018, Google entered the field of protein structure prediction for the first time. Facebook, Microsoft, Amazon and other high-tech companies in China and abroad, such as Tencent, Huawei, Baidu and ByteDance, then took on the protein structure prediction problem as they positioned themselves in new drug research and development and competed for the technological high ground of intelligent drug design. In 2020, CB Insights selected "AI protein structure prediction" as one of twelve game-changing industries.
In 2020, AlphaFold, developed by the DeepMind team under Google, took first place by total score in the global protein structure prediction competition (CASP14). AlphaFold2 brought protein structure prediction, a frontier basic research problem, out of the halls of science and into public view, making it a current hot topic. This suggests that deep cross-disciplinary integration of computer technology, information technology and structural biology will effectively drive and accelerate new scientific discoveries and create new points of economic growth. However, methods for predicting and assembling multi-domain protein structures currently face a number of difficulties and challenges. End-to-end prediction methods are based directly on a deep learning model and, after extensive training on the structural information available in protein structure databases, can infer the structure of a target protein directly. For such end-to-end machine learning methods to predict the full-length structure of multi-domain proteins, however, three problems must first be solved. (1) Protein structure databases contain far fewer multi-domain protein structures than single-domain ones, which makes it difficult to meet the training requirements of deep learning methods and directly affects the accuracy of the predicted models. (2) For large multi-domain proteins, end-to-end prediction incurs high computational costs, so ordinary hardware struggles to meet its training and inference requirements. (3) Multi-domain proteins often have multiple states, but end-to-end direct prediction can usually capture the structure of only one state, so the diversity of the results is poor. The inherent complexity of multi-domain protein prediction makes it a very challenging research topic in the field of protein structure prediction.
In order to find usable protein structure models in a huge sampling space with a computer, an efficient conformational space assembly optimization algorithm must be designed so that the task becomes a practical computational problem. Since Price and Storn proposed the differential evolution (DE) algorithm in 1995, it has been widely applied in protein conformational space optimization owing to its simple structure, ease of implementation, strong robustness and fast convergence. However, as amino acid sequences grow longer, the degrees of freedom of the protein molecular system increase as well, and sampling the global optimum of a large-scale protein conformational space with a conventional population algorithm becomes a challenging task. The existing protein structure prediction methods are therefore deficient in sampling efficiency and prediction accuracy and need improvement.
Disclosure of Invention
In order to solve the problems of the existing protein structure prediction methods, namely the imbalance between the exploration and exploitation stages of the sampling process and the low prediction accuracy, the invention provides a SAXS data-assisted two-sta