CN-121999864-A - Protein phosphorylation site function evaluation method based on combination of kinetic model and artificial intelligence algorithm

Abstract

The invention relates to the field of bioinformatics and discloses a method for evaluating the function of protein phosphorylation sites that combines a kinetic model with an artificial intelligence algorithm. The method comprises: S1, obtaining intracellular omics data of cells under N conditions; S2, taking the omics data under each condition as the reference one at a time, substituting the data from S1 into the kinetic model, estimating the model parameters with a Bayesian algorithm, and using a likelihood ratio test to evaluate whether a given protein phosphorylation site is functional under the current reference condition; S3, repeating step S2 N times until every condition has served as the reference, collecting the results for the given protein phosphorylation site over the N iterations, and estimating by maximum likelihood the probability that the potential protein phosphorylation site has a regulatory function under the N conditions. The invention makes full use of the parameter-estimation capability of artificial intelligence algorithms and integrates phosphorylation abundance data in addition to protein abundance data, so that the model solution is more accurate and false positives are reduced.
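The final aggregation step can be illustrated with a minimal sketch. It assumes that the false discovery rate correction is the Benjamini-Hochberg procedure applied to the N p-values of one site (the source does not name the procedure or its scope), and it uses the fact that the maximum likelihood estimate of a binomial success probability is the observed fraction k/N. The function names and the example p-values are hypothetical; the thresholds 0.1 and 0.6 are taken from the claims.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def functional_probability(p_values, alpha=0.1):
    """Binomial maximum likelihood estimate k/N that a site is functional.

    k is the number of corrected p-values below alpha, N the number of
    reference conditions (one p-value per condition).
    """
    adjusted = benjamini_hochberg(p_values)
    hits = sum(1 for q in adjusted if q < alpha)
    return hits / len(adjusted)


# Hypothetical p-values for one site across N = 5 reference conditions.
example_p = [0.01, 0.02, 0.03, 0.50, 0.04]
prob = functional_probability(example_p)
is_candidate = prob > 0.6  # candidate threshold from the claims
```

In this made-up example four of the five corrected p-values fall below 0.1, so the estimated probability is 0.8 and the site would be kept as a candidate.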

Inventors

XIA JIANYE
CHEN MIN
ZHUANG YINGPING
WANG XIAOYI

Assignees

Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences (中国科学院天津工业生物技术研究所)

Dates

Publication Date: 20260508
Application Date: 20241108

Claims (9)

1. A protein phosphorylation site function evaluation method based on a kinetic model and an artificial intelligence algorithm, characterized by comprising the following steps: S1, acquiring intracellular omics data of cells under N conditions, wherein N is greater than or equal to 5; S2, constructing a kinetic model $J_p = f(M_x, E, \Omega)$ that contains no protein phosphorylation site and a kinetic model $J_p = g(M_x, E, P, \Omega)$ that contains a protein phosphorylation site; S3, taking the omics data under each condition as the reference one at a time, substituting the data from S1 into the models of S2, and estimating the model parameters with a Bayesian algorithm; S4, using a likelihood ratio test to evaluate whether introducing a given protein phosphorylation site under the current reference condition significantly improves the model fit, the result being expressed as a p-value; S5, repeating steps S3 and S4 N times, wherein N is the number of experimental groups, and collecting the p-value of each potential protein phosphorylation site from each evaluation, so that after the N traversals a set of p-values is obtained for each potential site indicating whether it significantly improves the model fit; correcting the p-value set by the false discovery rate; and evaluating the probability that the potential protein phosphorylation site is functional under the given N conditions by counting the number of elements in the p-value set smaller than 0.1 and then applying the maximum likelihood method.
2. The method of claim 1, wherein in step S1, the intracellular omics data comprise five or more sets of metabolic flux (fluxomic) data, metabolite concentration data, proteomic data, and phosphorylated protein abundance data.
3. The method of claim 1, wherein in step S2, $J_p$, $M_x$, $E$, $P$, and $\Omega$ are, respectively, the predicted metabolic flux, the metabolite concentration, the enzyme abundance, the phosphorylated protein abundance, and the model parameter set, wherein the parameter set $\Omega$ comprises two parameters, $k'_x$ and $\theta_i$, wherein $k'_x$ represents the contribution of the substrate and product to the change in metabolic flux, and $\theta_i$ is the ratio of the catalytic efficiency of the phosphorylated protein to the catalytic efficiency of the enzyme; $\theta_i > 1$ indicates that phosphorylation activates the activity of the enzyme, and $\theta_i < 1$ indicates that phosphorylation inhibits the activity of the enzyme.
4. The evaluation method according to claim 1, wherein in step S3, a Bayesian method is used to estimate the model parameter set $\Omega$, specifically as follows: $\Pr(\Omega \mid J_o, M_x, P, E) = \Pr(J_o, M_x, P, E \mid \Omega)\,\Pr(\Omega) / \Pr(J_o, M_x, P, E)$, wherein $J_o$ is the experimentally observed metabolic flux, $\Pr(J_o, M_x, P, E \mid \Omega)$ is the likelihood function, $\Pr(\Omega)$ is the prior distribution, and $\Pr(J_o, M_x, P, E)$ is a constant.
5. The evaluation method according to claim 1, wherein the posterior probability of the parameter set $\Omega$ is obtained from the Bayesian formula as follows: (1) setting the prior distribution of the model parameter set $\Omega$, wherein $k'_x$ reflects the contribution of a substrate or product to the change in metabolic flux, positive and negative values indicating that the corresponding substrate or product promotes or inhibits the flux, respectively; (2) setting the likelihood function, assuming that the model-predicted flux $J_p$ and the experimental flux $J_o$ are independent and identically distributed and obey a normal distribution with mean […] and variance 0.01; thus, when […] is a point estimate, the likelihood function of the parameter set $\Omega$ may be expressed as […], where $l$ is the likelihood function and […] is the normal probability density function; (3) calculating the posterior probability of the parameter set $\Omega$, which is solved by the following formula: $\Pr(\Omega \mid J_o, M_x, P, E) = \Pr(J_o, M_x, P, E \mid \Omega)\,\Pr(\Omega) / \Pr(J_o, M_x, P, E)$; after the posterior probability of the parameter set $\Omega$ is determined, a Markov chain Monte Carlo (MCMC) method is used to decide whether each posterior value of the parameter set $\Omega$ is accepted; preferably, the algorithm is executed at least 100,000 times, for example 100,000 to 1,000,000 times, and the corresponding number of posterior values of the parameter set $\Omega$ is taken therefrom.
6. The evaluation method of claim 1, wherein the algorithm is run twice and the Gelman-Rubin diagnostic is used to compare the two Markov chains, to verify that all parameters converge to the same posterior distribution.
7. The method of claim 1, wherein in step S4, the kinetic model $J_p = g(M_x, E, P, \Omega)$ containing protein phosphorylation sites is compared with the kinetic model $J_p = f(M_x, E, \Omega)$ not containing protein phosphorylation sites, and potential functional phosphorylation sites are screened one by one by checking whether introducing the phosphorylation site significantly improves the fit between the model output and the observed values; specifically, the improvement in model fit is assessed with a likelihood ratio test, which yields for each potential protein phosphorylation site a p-value indicating whether it significantly improves the model fit.
8. The evaluation method according to claim 1, wherein in step S5, the p-value set is converted into a set containing only the two elements "possibly functional" and "unlikely to be functional"; assuming that each potential protein phosphorylation site is functional under the given conditions with probability p, and that whether the site is functional is evaluated N times, the number of evaluations in which the site is assessed as functional obeys a binomial distribution; since a sample set of multiple evaluations of each potential protein phosphorylation site has been obtained, p can be estimated by maximum likelihood; potential protein phosphorylation sites with p less than 0.6 are removed, finally yielding a set of candidate protein phosphorylation sites with p > 0.6, meaning that each candidate phosphorylation regulatory site is functional with a probability of 60% or more under the given conditions and is therefore regarded as a functional phosphorylation regulatory site.
9. The method according to claim 1, wherein in step S5, the protein phosphorylation site is deemed functional under the given conditions when P ≥ 0.6 and the change in root mean square error is greater than 5%.
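The parameter estimation and model comparison recited in claims 4 to 7 can be illustrated with a minimal sketch under stated assumptions. The claims do not give the functional forms of f and g, so the two flux functions below are hypothetical stand-ins; the prior is assumed flat on a box, the data are synthetic, and the sampler is a plain random-walk Metropolis algorithm rather than any specific implementation used by the inventors. The likelihood is Gaussian with variance 0.01 as stated in claim 5. The likelihood ratio statistic uses the best log-likelihood visited by each chain as an approximation to the maximum, with one degree of freedom because the site model adds the single parameter theta.

```python
import math
import random

SIGMA2 = 0.01  # likelihood variance stated in claim 5


def log_likelihood(predicted, observed):
    """Gaussian log-likelihood of the observed fluxes around the predictions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * SIGMA2) - (o - p) ** 2 / (2.0 * SIGMA2)
        for p, o in zip(predicted, observed)
    )


def flux_without_site(params, m, e):
    # Hypothetical stand-in for f(M_x, E, Omega): flux = k * M * E.
    (k,) = params
    return [k * mi * ei for mi, ei in zip(m, e)]


def flux_with_site(params, m, e, frac):
    # Hypothetical stand-in for g(M_x, E, P, Omega): the phosphorylated
    # fraction of the enzyme works at theta times the unmodified efficiency.
    k, theta = params
    return [k * mi * ei * (1.0 - f + theta * f) for mi, ei, f in zip(m, e, frac)]


def make_log_post(model, observed, *data, lower=0.0, upper=10.0):
    """Assumed flat prior on (lower, upper) for every parameter, so the
    log-posterior equals the log-likelihood up to a constant inside the box."""
    def log_post(params):
        if any(not lower < x < upper for x in params):
            return -math.inf
        return log_likelihood(model(params, *data), observed)
    return log_post


def metropolis(log_post, start, steps=20000, step_size=0.05, seed=0):
    """Random-walk Metropolis sampler; returns samples and best log-posterior."""
    rng = random.Random(seed)
    current, current_lp = list(start), log_post(start)
    best_lp, samples = current_lp, []
    for _ in range(steps):
        proposal = [x + rng.gauss(0.0, step_size) for x in current]
        lp = log_post(proposal)
        delta = lp - current_lp
        # Accept with probability min(1, exp(delta)).
        if delta >= 0.0 or rng.random() < math.exp(delta):
            current, current_lp = proposal, lp
            best_lp = max(best_lp, lp)
        samples.append(list(current))
    return samples, best_lp


def lrt_p_value(ll_null, ll_alt):
    """Likelihood ratio test p-value for one extra parameter (chi-square, df = 1)."""
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    return math.erfc(math.sqrt(stat / 2.0))


# Synthetic data for five conditions, generated with k = 1 and theta = 2.
m = [1.0, 1.2, 0.8, 1.5, 1.1]
e = [1.0, 0.9, 1.1, 1.0, 1.2]
frac = [0.1, 0.5, 0.9, 0.3, 0.7]
observed = flux_with_site([1.0, 2.0], m, e, frac)

_, ll_null = metropolis(make_log_post(flux_without_site, observed, m, e), [1.0])
samples, ll_alt = metropolis(
    make_log_post(flux_with_site, observed, m, e, frac), [1.0, 1.0]
)

p_value = lrt_p_value(ll_null, ll_alt)
tail = samples[len(samples) // 2:]  # discard the first half as burn-in
theta_mean = sum(s[1] for s in tail) / len(tail)
```

A small `p_value` here means the hypothetical site model fits the synthetic fluxes significantly better than the model without the site, and a posterior mean of theta above 1 corresponds to the activating case of claim 3. Running the sampler a second time with a different seed and comparing the two chains would correspond to the convergence check of claim 6.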

Description

Protein phosphorylation site function evaluation method based on combination of kinetic model and artificial intelligence algorithm

Technical Field

The invention relates to the field of bioinformatics, in particular to a protein phosphorylation site function evaluation method based on a kinetic model combined with an artificial intelligence algorithm.

Background

Protein phosphorylation is a common and critical post-translational modification within cells, in which protein kinases covalently attach the phosphate group of ATP to specific amino acid residues (serine, threonine, and tyrosine), thereby regulating protein structure, activity, stability, and protein-protein interactions. Such modifications are critical to a wide range of cellular physiological functions. To date, about 20% of the proteins in E. coli have been found to carry phosphorylation sites, roughly 2,000 sites in total. In Saccharomyces cerevisiae, more than 10,000 protein phosphorylation sites have been detected, covering about 75% of its proteins, and these are involved in almost all of its physiological and metabolic processes. A detailed understanding of the effect of phosphorylation on protein function may therefore help improve the design building blocks of synthetic biology and thereby support the development of cell factory construction techniques.

Currently, methods for identifying the function of protein phosphorylation sites fall mainly into two groups: methods based on experimental manipulation and methods based on the integration of multi-omics data. The basic idea of the methods based on experimental manipulation is to compare whether the protein properties of phosphorylation site mutants (in which the phosphorylation site is removed or enhanced) differ significantly from those of non-mutants.
On this basis, a number of studies have identified the function of protein phosphorylation sites by comparing in vitro parameters (such as Tm values, 3D structure, and enzyme activity) of phosphorylated and non-phosphorylated mutant proteins, together with physiological and metabolic parameters (such as cell growth and intracellular metabolite concentrations) of mutant and non-mutant strains. The basic idea of the methods based on multi-omics integration is to integrate large amounts of omics data, in particular the abundance of phosphorylated proteins, through specific mathematical expressions (for example, enzyme kinetic equations). The existing multi-omics integration methods for identifying the function of phosphorylation sites are mainly of two kinds: 1) inferring the influence of the phosphorylation of each amino acid residue on protein activity by comparing the correlations among phosphorylated protein abundance, metabolic flux, and metabolite concentration under different conditions; and 2) deriving a phosphorylation analysis method from the idea of hierarchical regulation analysis, in which the influence of the phosphorylation of each amino acid residue on protein activity is inferred by comparing the values of the phosphorylation regulation coefficient and the protein abundance regulation coefficient.
Identifying the function of protein phosphorylation sites by methods based on experimental manipulation or on multi-omics data integration has the following problems. 1) Although experiment-based methods can effectively identify protein phosphorylation sites with regulatory functions, cells contain a large number of phosphorylation sites; removing or enhancing each site one by one to study its function is laborious, and the molecular reagents required are expensive, which hinders high-throughput experiments. 2) Although methods based on multi-omics data integration can identify functional protein phosphorylation sites with high throughput, they still rely on simple mathematical models (only simple correlation analysis and hierarchical regulation analysis between two groups) and do not consider the other, more complex intracellular regulatory networks. These methods therefore introduce a large number of false positives, for example contradictory functional predictions for the same phosphorylation site in different states. How to identify functional protein phosphorylation sites with high throughput while reducing false positives is thus the current bottleneck. To solve this problem, the invention provides a method that combines a kinetic model with an artificial intelligence algorithm, which can identify the function of each phosphorylation site in a high-throughput manner while reducing false positives.

Disclosure of I