CN-121983140-A - Strain-specific medium prediction method, medium and computer equipment


Abstract

The invention discloses a strain-specific culture medium prediction method, a storage medium, and computer equipment. The method comprises: obtaining gene function annotation data for a strain and encoding it into a high-dimensional feature vector; reducing the dimensionality of that vector by principal component analysis; obtaining candidate culture medium component information and encoding it into a binary vector; concatenating the low-dimensional strain feature representation with the culture medium component vector to form a joint input feature vector; inputting the joint input feature vector into a pre-trained deep residual network model; and outputting growth compatibility probability scores and ranking the candidate culture media by those scores. By integrating high-throughput genome annotation data with known culture medium formulations, the invention builds a data-driven prediction model that efficiently recommends the culture medium composition best suited to a specific microbial strain, thereby improving the prediction accuracy and screening speed of microbial isolation and culture.
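
The encoding, component-selection, and concatenation steps summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy vocabularies, the variance values, and the placeholder PCA scores are all assumptions made for illustration, and the PCA fit itself is omitted. Only the rule of keeping the leading components whose cumulative variance share exceeds 95% is taken from the text.

```python
from itertools import accumulate


def encode_binary(present, vocabulary):
    """Return a 0/1 list with 1 where a vocabulary item occurs in `present`."""
    present = set(present)
    return [1 if item in present else 0 for item in vocabulary]


def n_components_for_variance(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative share of the
    total variance is strictly greater than `threshold`."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total > threshold:
            return k
    return len(ordered)


def joint_input(strain_low_dim, medium_binary):
    """Concatenate strain features with the medium component vector."""
    return list(strain_low_dim) + list(medium_binary)


# Hypothetical toy vocabularies (the real vectors are far longer).
ko_vocab = ["K00001", "K00002", "K00003", "K00004"]
component_vocab = ["glucose", "NaCl", "yeast extract"]

strain_vec = encode_binary({"K00001", "K00003"}, ko_vocab)
medium_vec = encode_binary({"glucose", "NaCl"}, component_vocab)

# Made-up per-component variances standing in for a real PCA fit.
k = n_components_for_variance([6.0, 3.0, 0.8, 0.2])

# Placeholder PCA scores of length k; a real pipeline would project
# strain_vec onto the k retained principal components instead.
strain_low_dim = [0.12, -0.40, 0.90]
x = joint_input(strain_low_dim, medium_vec)
```

In this toy case the first three components carry 98% of the variance, so three are kept, and the joint vector has three strain features followed by three medium indicators.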

Inventors

Bao Jiexi
Chen Xingguo
Zhang Ying

Assignees

Nanjing University of Posts and Telecommunications (南京邮电大学)

Dates

Publication Date: 20260505
Application Date: 20260401

Claims (10)

1. A strain-specific medium prediction method, comprising the steps of: S1, acquiring gene function annotation data of a microbial strain to be tested, and encoding the gene function annotation data into a high-dimensional binary feature vector; S2, performing principal component analysis (PCA) dimensionality reduction on the high-dimensional binary feature vector, and retaining the principal components whose cumulative variance contribution rate exceeds 95% to obtain a low-dimensional strain feature representation; S3, acquiring component information of all candidate culture media and encoding it into binary component vectors, wherein each dimension of a binary component vector indicates the presence or absence of a particular culture medium component; S4, concatenating the low-dimensional strain feature representation obtained in S2 with the culture medium component vector obtained in S3 to form a joint input feature vector; S5, inputting the joint input feature vector into a pre-trained deep residual network model, wherein the deep residual network model comprises a plurality of fully connected residual blocks with skip connections and is used to learn the complex nonlinear relationship between strain and culture medium; S6, outputting, by the deep residual network model, a prediction result indicating whether the microbial strain to be tested can grow in the candidate culture medium, wherein the output value, after activation by a Sigmoid function, serves as a probability score of growth compatibility; and S7, ranking the candidate culture media by probability score, and recommending the Top-K culture media most likely to support growth of the microbial strain to be tested.
2. The method of claim 1, wherein in S1 the gene function annotation data comprises presence/absence information for KEGG Orthology identifiers and is encoded into a high-dimensional binary feature vector in which each dimension corresponds to one KEGG Orthology identifier, a value of 1 indicating that the identifier is present in the genome of the strain and a value of 0 indicating that it is absent.
3. The method of claim 1, wherein in S1 the gene function annotation data is derived from the KEGG database, and the high-dimensional binary feature vector has 9747 dimensions covering key metabolic function features of the strain.
4. The method of claim 1, wherein in S2 the low-dimensional strain feature representation obtained after PCA dimensionality reduction has 1598 dimensions, and retaining the principal components whose cumulative variance contribution rate exceeds 95% compresses the feature space while preserving, as far as possible, the key biological information on functional metabolic differences among strains, thereby alleviating the curse of dimensionality.
5. The method of claim 1, wherein in S3 the component information of the candidate culture media is derived from the MediaDive database, the binary component vector has 491 dimensions, and each dimension corresponds to a unique culture medium component, the components comprising at least carbon sources, nitrogen sources, vitamins, inorganic salts, and growth factors.
6. The method of claim 1, wherein in S5 the deep residual network model comprises three residual blocks, each consisting of two fully connected layers, a ReLU activation function, and a Dropout layer, and each provided with a skip connection to alleviate the vanishing gradient problem in deep networks and to enhance the ability to learn from high-dimensional sparse biological data.
7. The method of claim 1, wherein in S5 the deep residual network model is trained using a max-margin loss function that introduces positive and negative sample pairs and sets a margin hyperparameter γ, so as to improve classification robustness by separating compatible from incompatible strain-medium pairs and to optimize the discriminative ability of the model in the embedding space.
8. The method of claim 1, wherein in S7 Top-K accuracy is used as the evaluation metric of the deep residual network model; for a given strain, the strain information is combined with each of the n candidate culture media in turn to generate n data records, the n records are input into the trained deep residual network model to obtain n corresponding m-dimensional output vectors, the L2-norm distance between each of the n m-dimensional output vectors and the m-dimensional all-ones vector is calculated, and the K candidate culture media with the smallest distances are selected as the Top-K recommendation results.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the strain-specific medium prediction method of any one of claims 1-8.
10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the strain-specific medium prediction method of any one of claims 1-8.
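
The residual block of claim 6, the margin-based loss of claim 7, and the distance-based Top-K selection of claim 8 can be illustrated with a small pure-Python sketch. Several points are assumptions rather than details given in the claims: the order of operations inside the block, the hinge form of the loss, the default value of γ, and all function and medium names. Dropout is left out because it is inactive at inference time.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(v, w1, b1, w2, b2):
    """Two fully connected layers with ReLU plus a skip connection.
    Assumed layout: FC -> ReLU -> FC -> add input -> ReLU."""
    hidden = relu(linear(v, w1, b1))
    out = linear(hidden, w2, b2)
    return relu([o + x for o, x in zip(out, v)])


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def margin_loss(pos_score, neg_score, gamma=0.5):
    """Hinge-style max-margin loss (assumed form): zero once the compatible
    pair outscores the incompatible pair by at least gamma."""
    return max(0.0, gamma - (pos_score - neg_score))


def l2_to_ones(output):
    """L2 distance between an output vector and the all-ones vector."""
    return math.sqrt(sum((1.0 - o) ** 2 for o in output))


def top_k_by_distance(outputs, k):
    """Names of the k media whose outputs lie closest to the all-ones vector."""
    return sorted(outputs, key=lambda name: l2_to_ones(outputs[name]))[:k]


# Hypothetical model outputs (m = 2) for three candidate media.
outputs = {
    "medium_A": [0.9, 0.8],
    "medium_B": [0.2, 0.1],
    "medium_C": [0.7, 0.95],
}
recommended = top_k_by_distance(outputs, k=2)

# With all-zero weights the block reduces to the skip path alone.
zeros = [[0.0, 0.0], [0.0, 0.0]]
passthrough = residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

The zero-weight call shows why skip connections help gradient flow: even when the learned layers contribute nothing, the input still reaches the block output unchanged.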

Description

Strain-specific medium prediction method, medium and computer equipment

Technical Field

The invention relates to the interdisciplinary field of bioinformatics and artificial intelligence, and in particular to a strain-specific culture medium prediction method, storage medium, and computer equipment that combine functional features of microbial genomes with deep residual learning.

Background

Microorganisms play an irreplaceable role in drug development, food fermentation, environmental remediation, synthetic biology, and other fields. However, more than 99% of microorganisms in nature cannot be successfully cultured under laboratory conditions, a phenomenon known as the "Great Plate Count Anomaly", which severely restricts the development and utilization of microbial resources.

Traditional medium optimization relies on empirical trial-and-error methods such as one-factor-at-a-time (OFAT) experiments and response surface methodology (RSM). These methods are time-consuming and costly, and they struggle to capture nonlinear interactions between nutrients, especially for uncultured microorganisms with complex nutritional requirements.

In recent years, machine learning techniques have been introduced into this field in an attempt to predict suitable culture conditions from historical experimental data. Models based on random forests, support vector machines, or shallow neural networks have made some progress in specific scenarios. However, these machine learning approaches share the following bottlenecks:

(1) High-dimensional sparsity. Genome feature vectors based on KEGG Orthology (KO) and other functional annotations can reach tens of thousands of dimensions, which readily leads to the curse of dimensionality, causing overfitting and reducing the generalization ability of the model.
(2) Lack of effective modeling of complex relationships. Microbial growth results from highly nonlinear coupling between the metabolic potential of the strain and the environmental nutrient supply, and traditional models have difficulty fully mining the deep associations between the two.

(3) Lack of systematic comparison of dimensionality reduction methods and of support for biological interpretability. Most studies simply apply PCA without demonstrating its superiority.

Disclosure of Invention

The strain-specific medium prediction method, storage medium, and computer equipment provided by the invention can effectively handle high-dimensional genome data and accurately model the strain-medium matching relationship, solving at least one of the technical problems described above.

To solve these technical problems, the invention adopts the following technical solution: a strain-specific medium prediction method comprising the steps of:

S1, acquiring gene function annotation data of a microbial strain to be tested, and encoding the gene function annotation data into a high-dimensional binary feature vector;
S2, performing principal component analysis (PCA) dimensionality reduction on the high-dimensional binary feature vector, and retaining the principal components whose cumulative variance contribution rate exceeds 95% to obtain a low-dimensional strain feature representation;
S3, acquiring component information of all candidate culture media and encoding it into binary component vectors, wherein each dimension of a binary component vector indicates the presence or absence of a particular culture medium component;
S4, concatenating the low-dimensional strain feature representation obtained in S2 with the culture medium component vector obtained in S3 to form a joint input feature vector;
S5, inputting the joint input feature vector into a pre-trained deep residual network model, wherein the deep residual network model comprises a plurality of fully connected residual blocks with skip connections and is used to learn the complex nonlinear relationship between strain and culture medium;
S6, outputting, by the deep residual network model, a prediction result indicating whether the microbial strain to be tested can grow in the candidate culture medium, wherein the output value, after activation by a Sigmoid function, serves as a probability score of growth compatibility; and
S7, ranking the candidate culture media by probability score, and recommending the Top-K culture media most likely to support growth of the microbial strain to be tested.

Further, in step S1, the gene function annotation data includes presence/absence information for KEGG Orthology identifiers, and the gene function annotation data is encoded into a high-dimensional binary feature vector in which each dimension corresponds to one KEGG Orthology identifier, a value of 1 indicates that the KEGG Orthology identifier is present in the geno