CN-116153403-B - Lb2Cas12a protein mutant activity prediction method and system

Abstract

The invention discloses a method and a system for predicting the activity of an Lb2Cas12a protein mutant. The method comprises: obtaining the amino acid sequences of Lb2Cas12a protein mutants and the corresponding raw enzyme activity data; digitally encoding each amino acid sequence using the AAindex index numbered KARS160105 to obtain a digitized amino acid sequence; processing the digitized amino acid sequence to obtain an original data set consisting of the average value of the digitized amino acid sequence and the number of the mutation position; randomly dividing the original data set into a training set, a validation set and a test set; training on the training set to obtain a preliminary model; adjusting the preliminary model using the validation set; testing the generalization of the adjusted model on the test set; and processing all single mutants of the Lb2Cas12a protein one by one with the constructed model to obtain predictions of Lb2Cas12a protein activity. The method is simple to operate.
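The encoding and feature extraction steps summarized in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the KARS160105 values are the ones listed in the Description, while the function names, the toy sequence `MKG`, 1-based position numbering, and the choice to skip the wild-type residue when enumerating substitutions are all assumptions made for the example.

```python
# KARS160105 values per amino acid, as listed in the Description.
KARS160105 = {
    "A": 1.00, "L": 5.00, "R": 8.12, "K": 7.00, "N": 5.00,
    "M": 5.40, "D": 5.17, "F": 7.00, "C": 2.33, "P": 4.00,
    "Q": 5.86, "S": 1.67, "E": 6.00, "T": 3.25, "G": 0.00,
    "W": 11.10, "H": 6.71, "Y": 8.88, "I": 3.25, "V": 3.25,
}


def encode(sequence):
    """Map each one-letter amino acid code to its KARS160105 number."""
    return [KARS160105[aa] for aa in sequence]


def features(mutant_sequence, position):
    """Return (mean of the digitized sequence, mutation position)."""
    digitized = encode(mutant_sequence)
    return sum(digitized) / len(digitized), position


def single_mutants(wild_type):
    """Yield (position, mutant sequence) for every single substitution.

    Assumption: positions are 1-based and the wild-type residue itself is
    skipped, giving 19 substitutions per position.
    """
    for i, original in enumerate(wild_type):
        for aa in KARS160105:
            if aa != original:
                yield i + 1, wild_type[:i] + aa + wild_type[i + 1:]


if __name__ == "__main__":
    toy = "MKG"  # hypothetical 3-residue sequence, not Lb2Cas12a
    for position, mutant in list(single_mutants(toy))[:3]:
        print(mutant, features(mutant, position))
```

Each `(mean, position)` pair is one input row for the regression model; the activity value measured for that mutant would be the regression target.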

Inventors

  • YIN LEI
  • LIU SHIQI
  • LIU HUAN
  • ZHOU JIN

Assignees

  • Wuhan University (武汉大学)

Dates

Publication Date
20260508
Application Date
20221207

Claims (7)

  1. A method for predicting the activity of an Lb2Cas12a protein mutant, the method comprising:
     obtaining the amino acid sequence of the Lb2Cas12a protein mutant and the corresponding raw enzyme activity data;
     digitally encoding the amino acid sequence using the AAindex index numbered KARS160105 to obtain a digitized amino acid sequence;
     processing the digitized amino acid sequence to obtain an original data set consisting of a first characteristic value and a second characteristic value, and randomly dividing the original data set into a training set, a validation set and a test set, wherein the average value of the digitized amino acid sequence is taken as the first characteristic value;
     establishing a preliminary data model containing hyperparameters, and repeatedly training and validating the preliminary data model using the data in the training set and the validation set to obtain an adjusted data set processing model, wherein the adjusted data set processing model is a random forest regression model whose hyperparameters are random_state=42, n_estimators=21, max_depth=7, min_samples_split=3, min_samples_leaf=1, with the remaining hyperparameters set to default values;
     testing the adjusted data set processing model using the data in the test set to obtain an Lb2Cas12a protein mutant activity prediction model;
     obtaining a single mutant data set of the Lb2Cas12a protein according to the number of amino acids of the Lb2Cas12a protein and the possible mutation of each amino acid position into 20 amino acids; and
     processing the single mutant data set of the Lb2Cas12a protein one by one using the Lb2Cas12a protein mutant activity prediction model to obtain a prediction result of the single mutant activity of the Lb2Cas12a protein.
  2. The method of claim 1, wherein the ratio of the number of samples in the training set, validation set, and test set is 7:1.5:1.5.
  3. The method of claim 1, wherein the preliminary data model is a random forest regression model whose hyperparameters are random_state=42, n_estimators=21, max_depth=7, min_samples_split=3, min_samples_leaf=1, with the remaining parameters set to default values.
  4. The method for predicting the activity of the Lb2Cas12a protein mutant according to claim 1, wherein the step of processing the single mutant data set of the Lb2Cas12a protein one by one using the Lb2Cas12a protein mutant activity prediction model to obtain a prediction result of the single mutant activity of the Lb2Cas12a protein specifically comprises: digitizing the amino acids of each sequence in the single mutant data set according to the AAindex index numbered KARS160105; then performing characteristic extraction, wherein the first characteristic is the average value of all the digitized amino acids and the second characteristic is the mutation position; and inputting the two characteristic values into the Lb2Cas12a protein mutant activity prediction model to give a prediction result.
  5. A system for predicting the activity of an Lb2Cas12a protein mutant, the system comprising:
     an original data acquisition module for acquiring the amino acid sequence of the Lb2Cas12a protein mutant and the corresponding raw enzyme activity data;
     a digital encoding module for digitally encoding the amino acid sequence using the AAindex index numbered KARS160105 to obtain a digitized amino acid sequence;
     a data processing module for processing the digitized amino acid sequence to obtain an original data set consisting of a first characteristic value and a second characteristic value, and randomly dividing the original data set into a training set, a validation set and a test set, wherein the average value of the digitized amino acid sequence is taken as the first characteristic value;
     an adjusted data set processing model construction module for constructing a preliminary data model containing hyperparameters, and repeatedly training and validating the preliminary data model using the data in the training set and the validation set to obtain an adjusted data set processing model, wherein the adjusted data set processing model is a random forest regression model whose hyperparameters are random_state=42, n_estimators=21, max_depth=7, min_samples_split=3, min_samples_leaf=1, with the remaining hyperparameters set to default values;
     a model test module for testing the adjusted data set processing model using the data in the test set to obtain an Lb2Cas12a protein mutant activity prediction model;
     a single mutant data set obtaining module for obtaining a single mutant data set of the Lb2Cas12a protein according to the number of amino acids of the Lb2Cas12a protein and the possible mutation of each amino acid position into 20 amino acids; and
     an Lb2Cas12a protein mutant activity prediction module for processing the single mutant data set of the Lb2Cas12a protein one by one using the Lb2Cas12a protein mutant activity prediction model to obtain a prediction result of the Lb2Cas12a protein mutant activity.
  6. A system for predicting the activity of an Lb2Cas12a protein mutant, the system comprising: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the system to perform the steps of the method of any one of claims 1-4.
  7. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-4.
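The 7:1.5:1.5 split of claim 2 and the hyperparameters recited in claims 1, 3 and 5 can be sketched as below. This is a hedged illustration only: the claims do not name a software library, so mapping the parameter names onto scikit-learn's `RandomForestRegressor` is an assumption, as are the helper names `split_7_15_15` and `build_model`, the shuffle seed, and the placeholder rows.

```python
import random

# Hyperparameters as recited in the claims. Passing them to scikit-learn's
# RandomForestRegressor is an assumption; the source names no library.
RF_PARAMS = {
    "random_state": 42,
    "n_estimators": 21,
    "max_depth": 7,
    "min_samples_split": 3,
    "min_samples_leaf": 1,
}


def split_7_15_15(rows, seed=42):
    """Shuffle rows and split them 7:1.5:1.5 into train, validation, test.

    The shuffle seed is a hypothetical choice made for reproducibility.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def build_model():
    """Create the regressor; requires scikit-learn to be installed."""
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(**RF_PARAMS)


if __name__ == "__main__":
    # Placeholder rows of (mean encoding, position, activity); not real data.
    rows = [(5.0 + i * 0.01, i + 1, float(i % 7)) for i in range(100)]
    train, val, test = split_7_15_15(rows)
    print(len(train), len(val), len(test))  # 70 15 15
```

With scikit-learn available, the model would be fitted on the training rows with the two characteristic values as inputs and the measured activity as the target, checked on the validation rows while the hyperparameters are adjusted, and scored once on the test rows.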

Description

Lb2Cas12a protein mutant activity prediction method and system

Technical Field

The invention relates to the technical field of biomedicine, and in particular to a method and a system for predicting the activity of an Lb2Cas12a protein mutant.

Background

CRISPR/Cas systems, as the most widely used gene editing systems, are receiving attention from research teams worldwide, who seek to improve them in three respects: reducing off-target rates, improving Cas protein cleavage efficiency, and relaxing PAM restrictions. Among CRISPR/Cas systems, CRISPR/Cas12a has a lower off-target rate than CRISPR/Cas9 with comparable editing efficiency. Lb2Cas12a is one protein of the Cas12a family; compared with the other family members AsCas12a, LbCas12a and FnCas12a, it has a lower off-target effect and cleavage activity comparable to those three. This raised the question of whether the Lb2Cas12a protein could be given higher editing efficiency, and with that idea we began searching for mutants that make Lb2Cas12a editing more efficient.

The traditional search for more efficient protein mutants requires a great deal of expert knowledge: the properties of amino acids and fairly specialized knowledge of the interactions between amino acids are needed to select target sites for experimental verification. The selected target sites cover only a tiny part of the amino acid sequence of the whole protein, so important mutation sites may be missed, which limits the approach.

Therefore, there is a need to develop a method and system for predicting the activity of an Lb2Cas12a protein mutant with simple operation steps and high accuracy.
Disclosure of Invention

The invention aims to provide a method and a system for predicting the activity of an Lb2Cas12a protein mutant, which can be used for rapid preliminary screening when searching for efficient mutants of the Lb2Cas12a protein, thereby saving a large amount of manpower and material resources.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect of the invention, there is provided a method for predicting Lb2Cas12a protein mutant activity, the method comprising:

obtaining the amino acid sequence of the Lb2Cas12a protein mutant and the corresponding raw enzyme activity data;

digitally encoding the amino acid sequence using the AAindex index numbered KARS160105 to obtain a digitized amino acid sequence;

processing the digitized amino acid sequence to obtain an original data set consisting of a first characteristic value and a second characteristic value, and randomly dividing the original data set into a training set, a validation set and a test set, wherein the average value of the digitized amino acid sequence is taken as the first characteristic value;

constructing a preliminary data model containing hyperparameters, and repeatedly training and validating the preliminary data model using the data in the training set and the validation set to obtain an adjusted data set processing model;

testing the adjusted data set processing model using the data in the test set to obtain an Lb2Cas12a protein mutant activity prediction model;

obtaining a single mutant data set of the Lb2Cas12a protein according to the number of amino acids of the Lb2Cas12a protein and the possible mutation of each amino acid position into 20 amino acids; and

processing the single mutant data set of the Lb2Cas12a protein one by one using the Lb2Cas12a protein mutant activity prediction model to obtain a prediction result of the single mutant activity of the Lb2Cas12a protein.

Further, digitally encoding the amino acid sequence using the AAindex index numbered KARS160105 to obtain a digitized amino acid sequence specifically includes assigning the following numbers to the amino acids: A = 1.00, L = 5.00, R = 8.12, K = 7.00, N = 5.00, M = 5.40, D = 5.17, F = 7.00, C = 2.33, P = 4.00, Q = 5.86, S = 1.67, E = 6.00, T = 3.25, G = 0.00, W = 11.10, H = 6.71, Y = 8.88, I = 3.25, V = 3.25.

In a second aspect of the invention, there is provided a system for predicting the activity of an Lb2Cas12a protein mutant, the system comprising:

an original data acquisition module for acquiring the amino acid sequence of the Lb2Cas12a protein mutant and the corresponding raw enzyme activity data;

a digital encoding module for digitally encoding the amino acid sequence using the AAindex index numbered KARS160105 to obtain a digitized amino acid sequence;

a data processing module for processing the digitized amino acid sequence to obtain an original data set consisting of a first characteristic value and a second characteristic value, and randomly dividing the original data set into a training set, a validation set and a test set, wherein the average value of the digitized amino acid sequence is taken as the first characteristic