CN-122024838-A - Amino acid sequence generation and screening system and method based on machine learning and multi-objective evolution strategy
Abstract
The invention provides a method and system for de novo design of amino acid sequences based on machine learning and a multi-strategy evolutionary algorithm. The method and system are suitable for gene therapy and for antibody or protein drug development, and in particular for the intelligent design of multifunctional viral capsid proteins. Taking the adeno-associated virus (AAV) capsid protein as the preferred embodiment, the method achieves de novo generation and optimization of amino acid sequences with high production fitness and high tissue-targeting specificity through the combined action of a pre-trained language model, a sequence generation model and a dual-stage rational evolution module.
Inventors
- WANG XUHUA
Assignees
- 南湖脑机交叉研究院
Dates
- Publication Date
- 20260512
- Application Date
- 20260105
Claims (16)
- 1. A method for designing an amino acid sequence, characterized in that de novo generation and optimization of the amino acid sequence are achieved by combining machine learning with rational design, the method comprising: learning grammatical features and semantic features of amino acid sequences through a pre-trained sequence characterization model to obtain a prior representation reflecting the internal structural rules and functional associations of amino acid sequences; on this basis, generating a plurality of candidate amino acid sequences through a sequence generation model; inputting the candidate amino acid sequences into a rational design module and predicting the production fitness and the targeting of each candidate amino acid sequence through a multi-objective prediction model; iteratively optimizing the candidate amino acid sequences through a multi-mode evolution strategy based on the predicted production fitness and targeting, so as to obtain a set of amino acid sequences that is balanced and optimized across a plurality of functional indexes; and screening and ranking the set of amino acid sequences to obtain a target amino acid sequence that meets preset functional requirements.
- 2. The method of claim 1, further comprising, prior to generating the amino acid sequences, the step of constructing a pre-training dataset comprising a plurality of unlabeled amino acid sequences from a public database or a self-built database and pre-training the sequence characterization model on it, wherein the model is trained on the unlabeled amino acid sequences to learn generic biological grammatical features and semantic association features of amino acid sequences, so as to provide prior constraints for subsequent sequence generation and optimization.
- 3. The method according to claim 1 or 2, wherein the step of generating a plurality of candidate amino acid sequences comprises performing semantic fine-tuning of the sequence generation model on an amino acid sequence dataset with specific function labels, and, by introducing conditional constraints, enabling the sequence generation model to generate candidate amino acid sequences that are synthesizable and have basic biological activity and potential target functional characteristics, thereby constructing an initial candidate sequence library for subsequent rational design.
- 4. The method of claim 1, wherein the multi-objective prediction model comprises at least a production fitness prediction model and a targeting prediction model, the production fitness prediction model being used to predict how readily an amino acid sequence can be produced as a target protein or viral vector, and the targeting prediction model being used to predict the enrichment capacity of an amino acid sequence in a target tissue or target organ or its binding capacity to a target protein, so as to enable joint assessment of the multifunctional properties of the amino acid sequence.
- 5. The method of claim 1, wherein the multi-mode evolution strategy comprises at least one of a summation mode and a minimum mode, wherein the summation mode scores each candidate amino acid sequence by the sum of its production fitness prediction and its targeting prediction, so as to mine sequences with high overall potential, and the minimum mode scores each candidate amino acid sequence by the smaller of its production fitness prediction and its targeting prediction, so as to screen for amino acid sequences with no obvious weakness on any functional index.
- 6. The method of claim 5, wherein the multi-mode evolution strategy further comprises a novelty constraint mechanism that, based on the similarity or sequence distance between a candidate amino acid sequence and known reference sequences, applies a penalty to candidates that are too close to a known sequence, thereby driving the search to expand into unexplored regions of sequence space while maintaining functional rationality.
- 7. The method of claim 1, wherein the target amino acid sequence is used in the design of a viral capsid protein, the viral capsid protein being an adeno-associated virus capsid protein, and the amino acid sequence comprising a targeting peptide inserted at a specific position in the capsid protein to enhance the tissue targeting, the transduction efficiency or the ability of a viral vector to cross the blood-brain barrier.
- 8. An amino acid sequence design device, characterized by comprising a pre-training module, a sequence generation module, a rational design module and a screening and ranking module, wherein the pre-training module is used for training a sequence characterization model on a large number of unlabeled amino acid sequences, the sequence generation module is used for generating candidate amino acid sequences based on the sequence characterization model, the rational design module is used for optimizing the candidate amino acid sequences based on a multi-objective prediction model and a multi-mode evolution strategy, and the screening and ranking module is used for screening and ranking the optimized amino acid sequences to obtain target amino acid sequences.
- 9. The apparatus of claim 8, wherein the rational design module comprises a production fitness prediction sub-module, a targeting prediction sub-module, and an evolutionary optimization sub-module, wherein the evolutionary optimization sub-module is configured to iteratively optimize candidate amino acid sequences based on multi-objective predictions.
- 10. The apparatus according to claim 8 or 9, wherein the evolution optimization submodule adopts a two-stage evolution mechanism, comprising a functional fusion stage and a discovery diversity stage, wherein the functional fusion stage is used for guiding evolution of candidate amino acid sequences towards a direction with basic functional characteristics, and the discovery diversity stage is used for enhancing diversity and novelty of the amino acid sequences in a sequence space.
- 11. The apparatus of claim 8, wherein the screening and ranking module is configured to comprehensively score amino acid sequences based on a summation mode or a minimum mode and to output, based on the scoring result, preferred amino acid sequences that meet a target functional requirement.
- 12. The apparatus of claim 8, further comprising an algorithm result verification module for performing biosynthesis and experimental verification of the preferred amino acid sequence and feeding the verification result back to the rational design module to achieve closed-loop optimization of the design flow.
- 13. The device of any one of claims 8-12, wherein the device is used for de novo design of adeno-associated viral capsid proteins, and the designed capsid proteins comprise engineered capsid proteins into which targeting peptide fragments are inserted.
- 14. An electronic device comprising at least one processor and a memory communicatively coupled to the processor, wherein the memory has stored therein instructions executable by the processor, which when executed, cause the processor to perform the amino acid sequence design method of any one of claims 1-7.
- 15. A non-transitory computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the amino acid sequence design method of any one of claims 1-7.
- 16. A computer program product comprising a computer program which, when executed by a processor, implements the amino acid sequence design method of any one of claims 1 to 7.
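The summation mode and minimum mode of claim 5, together with the novelty constraint of claim 6, can be illustrated with a minimal Python sketch. It is not the patented implementation. The claims do not specify the distance measure or the form of the penalty, so the Hamming distance, the `min_distance` and `weight` parameters, and all function names below are assumptions chosen only for illustration; the two predictors are passed in as plain callables standing in for the trained production fitness and targeting models.

```python
from typing import Callable, List, Sequence, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count differing positions between two equal-length peptides (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def combined_score(fitness: float, targeting: float, mode: str = "sum") -> float:
    """Combine the two predictions using the summation mode or the minimum mode."""
    if mode == "sum":
        return fitness + targeting      # rewards high overall potential
    if mode == "min":
        return min(fitness, targeting)  # rewards having no weak index
    raise ValueError(f"unknown mode: {mode}")


def novelty_penalty(candidate: str, references: Sequence[str],
                    min_distance: int = 2, weight: float = 1.0) -> float:
    """Hypothetical penalty: grows as the nearest known sequence gets closer."""
    if not references:
        return 0.0
    nearest = min(hamming_distance(candidate, ref) for ref in references)
    return weight * max(0, min_distance - nearest)


def rank_candidates(candidates: Sequence[str],
                    predict_fitness: Callable[[str], float],
                    predict_targeting: Callable[[str], float],
                    references: Sequence[str],
                    mode: str = "sum") -> List[Tuple[str, float]]:
    """Score every candidate, subtract the novelty penalty, sort best first."""
    scored = []
    for seq in candidates:
        score = combined_score(predict_fitness(seq), predict_targeting(seq), mode)
        score -= novelty_penalty(seq, references)
        scored.append((seq, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In this sketch, switching `mode` from `"sum"` to `"min"` changes only how the two predictions are merged, so the same ranking routine can serve both modes; a candidate identical to a reference sequence is pushed down the ranking by the penalty even if both predictors score it highly.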
Description
Amino acid sequence generation and screening system and method based on machine learning and multi-objective evolution strategy
Technical Field
The invention relates to the field of amino acid sequence design, with particular application to gene therapy and to antibody or protein drug development. Specifically, the invention discloses a system named ALICE-X that performs de novo design of amino acid sequences using machine learning. The system uses a dual prediction model (comprising fitness prediction and targeting prediction) combined with a multi-mode evolution strategy (such as a summation mode and a minimum mode) to achieve, even when samples are scarce, the de novo design of novel multifunctional amino acid sequences with high production fitness and high targeting specificity. The system comprises an amino acid sequence generation module, an amino acid sequence rational design module and an algorithm result verification module, and provides a systematic computational framework for multifunctional protein design and optimization.
Background
Gene therapy is a revolutionary breakthrough in modern medicine. Through genetic engineering, it is providing fundamental treatments for intractable diseases such as cystic fibrosis, malignant tumors, autoimmune diseases and neurodegenerative diseases. Its core mechanisms are mainly the introduction of normal genes to repair defects (for example, restoring CFTR gene function) and the regulation of pathology-related gene expression (for example, introducing tumor suppressor genes). However, widespread use in this field is limited by the efficacy and safety of the delivery vector.
Adeno-associated virus (AAV) has become the clinically preferred gene delivery vector by virtue of its non-pathogenicity, low immunogenicity, broad host range and long-lasting expression, yet it still faces a significant "multi-objective optimization" dilemma in practical applications. Specifically, existing natural serotypes and simply engineered AAV vectors often fail to meet all of the key indicators required in the clinic at once: high tissue specificity (e.g., the ability to cross the blood-brain barrier) is often accompanied by a significant decrease in production titer, while high-yield vectors often lack precise targeting ability and tend to accumulate in non-target organs (e.g., the liver), causing toxic side effects. In addition, pre-existing immunity, which is widespread in the population, further impairs the therapeutic effect. Therefore, how to design a "perfect" capsid sequence that combines high targeting specificity and high production fitness with the ability to escape immune surveillance is a technical bottleneck in urgent need of a solution. Artificial intelligence, in the form of deep learning-based protein design, offers a new path toward solving these problems. However, conventional computer-aided design and directed evolution methods still have significant limitations.
First, existing AI models are optimized against a single objective function, so the algorithm tends to converge prematurely to a local optimum, and the generated sequences are often simple imitations of known high-scoring sequences that lack real novelty. Second, when facing the high-dimensional amino acid sequence space, conventional algorithms struggle to balance exploitation and exploration, that is, to mine the unknown sequence landscape in depth while maintaining excellent biological function. Third, massive high-throughput screening data (such as DNA barcode screening data) cannot be mined in depth to construct a robust sequence-function mapping. The current industry pain point is the lack of an intelligent evolutionary system that can simulate the "curriculum learning" mechanism in the biological evolutionary process and resolve multi-objective conflicts (e.g., the conflict between high targeting and high productivity) through complex reward strategies. Developing such a system is of great strategic significance for substantially shortening the vector design cycle, reducing trial-and-error cost and promoting the clinical translation of gene medicines.
Disclosure of Invention
To improve the effectiveness, functional diversity and synthesizability of AI-based amino acid sequence design, the invention differs from traditional amino acid sequence design in that it uses a pre-training model to learn prior knowledge of amino acid sequences before designing them, uses experimentally obtained amino acid sequences to train the model again for a single task, uses rational design to directionally reform the amino acid sequence gener