
CN-122024863-A - Method and device for generating CDRH3 sequence generation model

CN 122024863 A

Abstract

The application discloses a method for generating a generation model of a CDRH3 sequence, comprising the following steps: obtaining an initial training set; calculating the value of each CDRH3 sequence on a limited attribute set to obtain a first training data set with limited-attribute labels; training with the first training data set to generate a Prior GPT model based on a conditional GPT architecture; using the trained Prior GPT model as an initialized agent under a reinforcement learning strategy, combining the limited attribute set with an extended attribute set, and accumulating, over multiple iterations, a second training data set with both limited-attribute and extended-attribute labels; and training with the second training data set to generate an enhanced GPT model. Through this two-stage generation and optimization framework, the application achieves an effective balance among different target attributes while satisfying multiple target attributes simultaneously.
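The first stage of the framework described above (labelling each CDRH3 sequence with limited-attribute values before conditional training) can be sketched as follows. This is an illustrative toy only: the charge table, function names, and example sequences are assumptions, since the patent does not publish code or attribute formulas.

```python
# Toy sketch of stage 1: attach limited-attribute labels to CDRH3 sequences.
# The per-residue charge table is a simplified assumption for illustration.

def label_limited_attributes(seq):
    """Stand-in for the limited-attribute calculators (e.g. net charge)."""
    charge = {"K": 1, "R": 1, "D": -1, "E": -1}
    return {"net_charge": sum(charge.get(aa, 0) for aa in seq)}

def build_first_dataset(sequences):
    # Pair every sequence in the initial set with its limited-attribute labels.
    return [(s, label_limited_attributes(s)) for s in sequences]

first_dataset = build_first_dataset(["ARDYYGSGSYYFDY", "AKDRGYSSGWYFDL"])
```

The resulting labelled pairs would then serve as conditioning targets when training the Prior GPT model.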

Inventors

  • PENG JIAJIE
  • XIE DONGNA

Assignees

  • Northwestern Polytechnical University (西北工业大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-25

Claims (10)

  1. A method for generating a generation model of a CDRH3 sequence, characterized by comprising the following steps: acquiring an initial training set, wherein the initial training set comprises a plurality of CDRH3 sequences; calculating the value of each CDRH3 sequence on a limited attribute set to obtain a first training data set with limited-attribute labels, and training with the first training data set to generate a Prior GPT model based on a conditional GPT architecture; and using the trained Prior GPT model as an initialized agent under a reinforcement learning strategy, combining the limited attribute set and the extended attribute set to accumulate, over multiple iterations, a second training data set with limited-attribute and extended-attribute labels, and training with the second training data set to generate the enhanced GPT model.
  2. The method of claim 1, wherein the step of obtaining an initial training set comprising a plurality of CDRH3 sequences comprises: extracting CDRH3 sequences from the natural-antibody OAS dataset, and obtaining de-duplicated CDRH3 sequences through preprocessing that includes removing sequences containing ambiguous residues and filtering sequences to a predetermined length.
  3. The method of claim 1, wherein the limited attribute set comprises one or more of an antibody variable region net charge, a variable region charge symmetry parameter, and a hydrophobicity index sum, and/or wherein the extended attribute set comprises one or more of HER2 binding specificity and an MHC II minimum percentile rank.
  4. The method of claim 1, wherein the step of using the trained Prior GPT model as an initialized agent, adopting a reinforcement learning strategy, combining the limited attribute set and the extended attribute set, and obtaining a second training data set with limited-attribute and extended-attribute labels through multiple iterations comprises: introducing a reinforcement learning process, scoring the generated sequences against both the limited and the extended attributes, forming reward feedback to optimize the generation strategy, and accumulating the second training data set with limited-attribute and extended-attribute labels over multiple iterations.
  5. The method of claim 1, wherein the step of generating a Prior GPT model based on a conditional GPT architecture by training with the first training data set and/or the step of generating an enhanced GPT model by training with the second training data set each employ a conditional GPT sequence generation model, the conditional GPT sequence generation model being a conditional autoregressive generation model whose inputs include amino acid residue embedding, position embedding, and conditional embedding mapped from the attribute labels, and/or the conditional GPT sequence generation model being based on a Transformer decoder and comprising a plurality of stacked masked attention layers and a feedforward network layer.
  6. The method of claim 5, wherein the reinforcement learning is based on the REINVENT algorithm, and the step of using the trained Prior GPT model as the initialized agent, adopting the reinforcement learning strategy, combining the limited attribute set and the extended attribute set, and obtaining the second training data set with limited-attribute and extended-attribute labels through multiple iterations comprises: generating a CDRH3 sequence with the conditional GPT sequence generation model under the limited-attribute constraints, and calculating the log-likelihood of the CDRH3 sequence under the conditional GPT sequence generation model and its log-likelihood under the Prior GPT model; calculating the scores of the sequence on all attributes, normalizing the variable region charge symmetry parameter and the MHC II minimum percentile rank with a sigmoid function, and normalizing the hydrophobicity index sum and the antibody variable region net charge with a double sigmoid function; weighting and summing the normalized attribute scores to obtain a composite score, wherein the HER2 binding specificity carries the largest weight; and calculating an augmented log-likelihood based on the composite score, and optimizing the conditional GPT sequence generation model with the squared difference between the log-likelihood obtained by the conditional GPT sequence generation model and the augmented log-likelihood as the loss function.
  7. The method of claim 6, wherein in the step of weighting and summing the normalized attribute scores to obtain a composite score, the HER2 binding specificity is weighted 3/7, and the MHC II minimum percentile rank, the antibody variable region net charge, the variable region charge symmetry parameter, and the hydrophobicity index sum are each weighted 1/7.
  8. The method of claim 1, wherein the threshold value of the antibody variable region net charge used when training with the first training data set to generate the Prior GPT model based on a conditional GPT architecture is greater than the threshold value of the antibody variable region net charge used when training with the second training data set to generate the enhanced GPT model.
  9. A system for implementing the generation and optimization of antibody CDRH3 sequences according to any one of claims 1 to 8, comprising a data acquisition module, an attribute evaluation module, a CDRH3 sequence labeling module, a conditional GPT sequence generation model, a conditional GPT model training module, and an enhancement generation model; the data acquisition module is used for acquiring an initial training set comprising a plurality of CDRH3 sequences; the attribute evaluation module comprises a limited-attribute evaluation unit for calculating the values of the limited attributes and an extended-attribute evaluation unit for calculating the values of the extended attributes; the CDRH3 sequence labeling module is used for labeling the calculation results of the attribute evaluation module and generating the corresponding attribute labels, thereby constructing a first training data set carrying limited-attribute labels and a second training data set carrying limited-attribute and extended-attribute labels; the conditional GPT model training module comprises a limited training module and an enhancement training module; the conditional GPT sequence generation model models the CDRH3 sequence in an autoregressive manner, its input consisting of amino acid residue embedding, position embedding, and attribute-related conditional embedding, and predicts the next amino acid residue position by position; in the first stage, the conditional GPT sequence generation model is trained by the limited training module on the CDRH3 data set with limited-attribute labels to learn sequence generation under the limited-attribute constraints, yielding the Prior GPT model; in the second stage, the Prior GPT model serves as the initialized generation agent, sequences are generated through reinforcement learning iterations, and the attribute evaluation module scores the generated sequences in the extended-attribute dimensions to form a feedback signal guiding the generation process, so that CDRH3 sequence data carrying both limited-attribute and extended-attribute labels is gradually accumulated; and the enhancement training module further trains the conditional GPT sequence generation model with this accumulated data to obtain the enhancement generation model.
  10. An apparatus comprising a memory in which at least one program instruction is stored, and a processor that, upon loading and executing the at least one program instruction, implements the method of any one of claims 1 to 8.
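The scoring scheme of claims 6 and 7 (sigmoid and double-sigmoid normalization followed by a weighted sum with HER2 specificity weighted 3/7 and the remaining attributes 1/7 each) can be sketched as follows. The sigmoid parameters, the exact double-sigmoid form, and the attribute key names are illustrative assumptions, since the patent does not publish code.

```python
import math

def sigmoid(x, k=1.0, x0=0.0):
    # Monotone normalization to (0, 1); k and x0 are assumed tuning constants.
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

def double_sigmoid(x, low, high, k=1.0):
    # Rewards values inside an acceptable window [low, high]; the exact
    # functional form used in the patent is not published.
    return sigmoid(x, k, low) * (1.0 - sigmoid(x, k, high))

def composite_score(scores):
    # Weights per claim 7: HER2 binding specificity 3/7, the others 1/7 each.
    weights = {"her2": 3 / 7, "mhc2_rank": 1 / 7, "net_charge": 1 / 7,
               "charge_symmetry": 1 / 7, "hydrophobicity_sum": 1 / 7}
    return sum(weights[name] * scores[name] for name in weights)
```

Because the weights sum to 1, a sequence whose normalized scores are all 1.0 receives a composite score of exactly 1.0, and HER2 binding specificity dominates the ranking as claim 6 requires.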

Description

Method and device for generating a CDRH3 sequence generation model

Technical Field

The invention relates to a method and a device for generating a generation model of a CDRH3 sequence, and belongs to the technical field of biological information processing.

Background

Antibody drugs occupy a central position in modern biological therapy and are widely used for the targeted treatment of cancers, autoimmune diseases, infectious diseases, and the like. CDRH3 (complementarity-determining region 3 of the antibody heavy chain) is the core functional region of an antibody molecule that binds directly to an antigen epitope; the length, composition, and spatial conformation of its amino acid sequence directly determine the antigen-binding specificity and binding affinity of the antibody, while also profoundly affecting many other important parameters of the antibody. Thus, the precise design and optimization of CDRH3 sequences is a central step in the development of therapeutic antibodies. In recent years, conditional generative models have been introduced in the field of antibody sequence design. However, successful training of a conditional generative model depends heavily on a large-scale, high-quality labelled dataset, i.e. each CDRH3 sequence in the training data must be labelled with all of the required target properties simultaneously. In practice, CDRH3 sequences labelled with multiple target properties at once are quite scarce in antibody design tasks.

Disclosure of Invention

The invention aims to provide a generation model, a generation method, and a generation device for CDRH3 sequences that achieve efficient and accurate generation of multi-attribute-constrained antibody CDRH3 sequences without depending on large-scale multi-attribute labelled data.
In order to achieve the above purpose, the present invention provides the following technical solution. A method for generating a generation model of a CDRH3 sequence comprises the following steps: acquiring an initial training set comprising a plurality of CDRH3 sequences; calculating the value of each CDRH3 sequence on a limited attribute set to obtain a first training data set with limited-attribute labels, and training with the first training data set to generate a Prior GPT model based on a conditional GPT architecture; and using the trained Prior GPT model as an initialized agent under a reinforcement learning strategy, combining the limited attribute set and the extended attribute set to accumulate, over multiple iterations, a second training data set with limited-attribute and extended-attribute labels, and training with the second training data set to generate the enhanced GPT model. Preferably, the step of obtaining an initial training set comprising a plurality of CDRH3 sequences comprises: extracting CDRH3 sequences from the natural-antibody OAS dataset, and obtaining de-duplicated CDRH3 sequences through preprocessing that includes removing sequences containing ambiguous residues and filtering sequences to a predetermined length. Preferably, the limited attribute set comprises one or more of an antibody variable region net charge, a variable region charge symmetry parameter, and a hydrophobicity index sum, and/or the extended attribute set comprises one or more of HER2 binding specificity and an MHC II minimum percentile rank.
Preferably, the step of using the trained Prior GPT model as an initialized agent, adopting a reinforcement learning strategy, combining the limited attribute set and the extended attribute set, and obtaining a second training data set with limited-attribute and extended-attribute labels through multiple iterations comprises: introducing a reinforcement learning process, scoring the generated sequences against both the limited and the extended attributes, forming reward feedback from the results to optimize the generation strategy, and accumulating the second training data set with limited-attribute and extended-attribute labels over multiple iterations. Preferably, the step of generating a Prior GPT model based on a conditional GPT architecture by training with the first training data set and/or the step of generating an enhanced GPT model by training with the second training data set each employ a conditional GPT sequence generation model, which is a conditional autoregressive generation model whose inputs include amino acid residue embedding, position embedding, and conditional embedding mapped from the attribute labels, and/or which is based on a Transformer decoder comprising a plurality of stacked masked multi-head attention layers and feedforward network layers. Preferably, the reinforcement learning is based on the REINVENT algorithm, and the step
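The REINVENT-style update invoked above (claim 6) optimizes the squared difference between the agent's log-likelihood and an augmented log-likelihood built from the prior and the composite score. A minimal sketch, assuming the standard REINVENT augmented-likelihood form; the scaling constant `sigma` is an assumption, not a value from the patent:

```python
def reinvent_loss(log_p_agent, log_p_prior, score, sigma=60.0):
    # Augmented log-likelihood: shift the prior's log-likelihood by a
    # score-proportional bonus, then penalize the agent's squared deviation
    # from it, as described in claim 6.
    augmented = log_p_prior + sigma * score
    return (log_p_agent - augmented) ** 2
```

Under this loss, a high composite score pushes the agent to assign the sequence a higher log-likelihood than the prior does, which is what drives the accumulation of high-scoring sequences for the second training data set.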