CN-121122402-B - mRNA sequence scoring method and related equipment

Abstract

The invention discloses an mRNA sequence scoring method and related equipment. The method comprises: obtaining an mRNA sequence to be scored; extracting a key base sequence from the mRNA sequence; and inputting the key base sequence into a target scoring model, which processes it to obtain the score of the mRNA sequence on a target index. The target scoring model comprises a multi-modal RNA sequence coding module and a large language model, and is obtained through unsupervised pre-training on transcription sample data and supervised fine-tuning on a score label data set corresponding to the target index. The invention takes as input the key regions (the start and stop codon contexts) that integrate the non-coding region and the coding region, and processes them with a large language model trained on large-scale data, so that it can comprehensively consider complex factors such as context information and secondary structure. The invention can be widely applied in the technical field of gene sequence analysis.

Inventors

MIAO ZHICHAO
JIANG JIUHONG
Ning Guoquan

Assignees

Guangzhou National Laboratory (广州国家实验室)

Dates

Publication Date: 2026-05-12
Application Date: 2025-09-11

Claims (12)

1. A method for scoring mRNA sequences, the method comprising the steps of: obtaining an mRNA sequence to be scored, wherein the mRNA sequence comprises a non-coding region and a coding region, and the non-coding region comprises a 5' end non-coding region positioned upstream of the starting end of the coding region and a 3' end non-coding region positioned downstream of the ending end of the coding region; extracting a key base sequence from the mRNA sequence, the key base sequence comprising a start codon context and a stop codon context, the start codon context characterizing a codon sequence of a preset window length at a start codon, the stop codon context characterizing a codon sequence of the preset window length at a stop codon; inputting the key base sequence into a target scoring model, and processing to obtain the score of the mRNA sequence on a target index; the target scoring model comprises a multi-modal RNA sequence coding module and a large language model, and is obtained based on unsupervised pre-training on transcription sample data and supervised fine-tuning on a score label data set corresponding to the target index; wherein extracting the key base sequence from the mRNA sequence comprises the step of: taking the first nucleotide position of the start codon and the last nucleotide position of the stop codon as double-center reference points, and performing standardized alignment on the mRNA sequence to obtain an alignment sequence.
2. The method of claim 1, wherein said extracting the key base sequence from said mRNA sequence further comprises the steps of: based on the alignment sequence, for the center corresponding to the start codon, intercepting a sequence of a preset length upstream of the center and a sequence of the preset length downstream of the center to form the start codon context, wherein the preset length is half of the preset window length; based on the alignment sequence, for the center corresponding to the stop codon, intercepting a sequence of the preset length upstream of the center and a sequence of the preset length downstream of the center to form the stop codon context; and marking and truncating the start codon context and the stop codon context to form the key base sequence.
3. The method according to claim 1 or 2, wherein the multi-modal RNA sequence coding module comprises a one-dimensional coding unit and a two-dimensional coding unit, and wherein inputting the key base sequence into the target scoring model and processing to obtain the score of the mRNA sequence on the target index comprises the steps of: inputting the key base sequence into the multi-modal RNA sequence coding module, and obtaining coding information by integrating base composition information and secondary structural features, wherein the base composition information is obtained based on the one-dimensional coding unit, and the secondary structural features are obtained based on the two-dimensional coding unit; and, based on the coding information, processing with the large language model and outputting the score of the mRNA sequence on the target index.
4. The method of claim 3, wherein obtaining the coding information by integrating the base composition information and the secondary structural features comprises the steps of: performing one-hot encoding on the key base sequence using the one-dimensional coding unit to obtain the base composition information; using the two-dimensional coding unit to generate, from the key base sequence and based on a base complementary pairing rule, a two-dimensional base pairing matrix as the secondary structural features; and integrating the base composition information and the secondary structural features to obtain the coding information.
5. The method according to claim 4, wherein integrating the base composition information and the secondary structural features to obtain the coding information comprises the steps of: weighting the secondary structural features using an attention matrix; and fusing the weighted result into the base composition information through matrix multiplication to obtain the coding information.
6. The method according to claim 1, characterized in that the method further comprises the steps of: configuring an initial model based on the multi-modal RNA sequence coding module and the large language model, wherein the multi-modal RNA sequence coding module comprises a one-dimensional coding unit and a two-dimensional coding unit, the one-dimensional coding unit being used to obtain base composition information and the two-dimensional coding unit being used to obtain secondary structural features; based on the transcription sample data, performing unsupervised pre-training on the initial model through a position-aware mask prediction task, so that the initial model learns the sequence rules of mRNA sequences, to obtain a pre-training model; wherein the score label data set corresponding to the target index comprises a plurality of mRNA sequence samples and a score label of each mRNA sequence sample on the target index; and performing supervised fine-tuning on the pre-training model based on the mRNA sequence samples and the score labels to obtain the target scoring model.
7. The method of claim 6, wherein performing unsupervised pre-training on the initial model through the position-aware mask prediction task based on the transcription sample data comprises the steps of: constructing a mask probability based on two Gaussian distributions centered on the start codon and the stop codon; masking a partial region of a transcription sample sequence in the transcription sample data based on the mask probability to obtain Gaussian-weighted mask data; wherein the two-dimensional coding unit integrates the secondary structural features of the sequence through an attention mechanism, gated fusion, and residual connections; and performing unsupervised pre-training on the initial model based on the Gaussian-weighted mask data and the structural features of the two-dimensional coding unit.
8. The method of claim 6, wherein performing supervised fine-tuning on said pre-training model based on said mRNA sequence samples and said score labels comprises the steps of: inputting a key base sequence extracted from the mRNA sequence sample into the pre-training model as input data, so that the pre-training model processes it and outputs a prediction score for the corresponding mRNA sequence sample; and constructing a loss function from the prediction score and the score label corresponding to the mRNA sequence sample, and fine-tuning the parameters of the pre-training model using the loss function.
9. An mRNA sequence scoring apparatus, the apparatus comprising: a sequence acquisition module for acquiring an mRNA sequence to be scored, wherein the mRNA sequence comprises a non-coding region and a coding region, and the non-coding region comprises a 5' end non-coding region positioned upstream of the starting end of the coding region and a 3' end non-coding region positioned downstream of the ending end of the coding region; a sequence extraction module for extracting a key base sequence from the mRNA sequence, the key base sequence comprising a start codon context and a stop codon context, the start codon context characterizing a codon sequence of a preset window length at a start codon, the stop codon context characterizing a codon sequence of the preset window length at a stop codon; a scoring module for inputting the key base sequence into a target scoring model, and processing to obtain the score of the mRNA sequence on a target index; the target scoring model comprises a multi-modal RNA sequence coding module and a large language model, and is obtained based on unsupervised pre-training on transcription sample data and supervised fine-tuning on a score label data set corresponding to the target index; wherein extracting the key base sequence from the mRNA sequence comprises the step of: taking the first nucleotide position of the start codon and the last nucleotide position of the stop codon as double-center reference points, and performing standardized alignment on the mRNA sequence to obtain an alignment sequence.
10. An electronic device comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 8 when the computer program is executed by the processor.
11. A computer storage medium in which a processor executable program is stored, which when executed by a processor is for implementing the method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.
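
As a non-limiting illustration of the extraction steps recited in claims 1 and 2, the following minimal Python sketch cuts one window around each of the two reference points and combines them. The function names, the padding symbol `N`, the separator `|`, and the choice to count the center nucleotide as the first downstream position are assumptions made for illustration only; the claims do not specify them.

```python
PAD = "N"  # assumed padding symbol for positions that fall outside the transcript


def window_around(seq: str, center: int, half: int) -> str:
    """Return `half` bases upstream of `center` followed by `half` bases
    starting at `center` (the center itself is counted as downstream here,
    which is an assumption)."""
    bases = []
    for i in range(center - half, center + half):
        bases.append(seq[i] if 0 <= i < len(seq) else PAD)
    return "".join(bases)


def extract_key_sequence(mrna: str, start_first: int, stop_last: int,
                         window: int = 20, sep: str = "|") -> str:
    """Build a key base sequence from the two codon contexts.

    start_first: 0-based index of the first nucleotide of the start codon.
    stop_last:   0-based index of the last nucleotide of the stop codon.
    window:      preset window length; each side of a center gets window // 2.
    sep:         hypothetical marker placed between the two contexts.
    """
    half = window // 2
    start_context = window_around(mrna, start_first, half)
    stop_context = window_around(mrna, stop_last, half)
    return start_context + sep + stop_context


if __name__ == "__main__":
    # 5' UTR "GGGACC", coding region "AUGGCUUAA", 3' UTR "CCUUU"
    print(extract_key_sequence("GGGACCAUGGCUUAACCUUU", 6, 14, window=8))
```

Padding keeps both contexts at a fixed length when an untranslated region is shorter than half the window, so every transcript yields an input of the same size.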
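
As a non-limiting illustration of the encoding recited in claims 4 and 5, the following minimal Python sketch one-hot encodes a sequence, builds a two-dimensional base pairing matrix, weights it with an attention matrix, and fuses the result with the one-hot encoding by matrix multiplication. Several details are assumptions not stated in the claims: only Watson-Crick pairs are counted, paired positions must be more than `min_loop` bases apart, the attention weighting is element-wise, and the attention matrix is supplied by the caller rather than learned.

```python
BASES = "ACGU"
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def one_hot(seq):
    """One-dimensional unit: an L x 4 one-hot matrix over A, C, G, U."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def pairing_matrix(seq, min_loop=3):
    """Two-dimensional unit: an L x L matrix with 1.0 where two positions
    could pair. `min_loop` is an assumed minimum separation."""
    n = len(seq)
    return [
        [
            1.0 if abs(i - j) > min_loop and (seq[i], seq[j]) in WATSON_CRICK else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def fuse(composition, pairing, attention):
    """Weight the pairing matrix element-wise by the attention matrix, then
    multiply it into the composition matrix: (attention * pairing) @ composition."""
    n = len(composition)
    channels = len(composition[0])
    weighted = [[attention[i][j] * pairing[i][j] for j in range(n)] for i in range(n)]
    return [
        [sum(weighted[i][k] * composition[k][c] for k in range(n)) for c in range(channels)]
        for i in range(n)
    ]


if __name__ == "__main__":
    seq = "GAAAC"
    uniform_attention = [[1.0] * len(seq) for _ in seq]  # placeholder weights
    encoded = fuse(one_hot(seq), pairing_matrix(seq), uniform_attention)
    for row in encoded:
        print(row)
```

In this form each output row summarizes the base composition of the positions that the corresponding base could pair with, scaled by the attention weights.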
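
As a non-limiting illustration of the mask probability recited in claim 7, the following minimal Python sketch sums two Gaussian curves centered on the start codon and the stop codon and uses the result to mask positions at random. The parameters `sigma` and `peak`, the summation and capping of the two curves, and the mask symbol `?` are assumptions for illustration; the claim states only that the probability is based on two Gaussian distributions centered on the two codons.

```python
import math
import random


def mask_probabilities(length, start_center, stop_center, sigma=10.0, peak=0.3):
    """Per-position mask probability from two Gaussians.

    sigma: assumed width of each Gaussian, in nucleotides.
    peak:  assumed probability at a center when the other Gaussian is negligible.
    """
    probs = []
    for i in range(length):
        g_start = math.exp(-((i - start_center) ** 2) / (2 * sigma ** 2))
        g_stop = math.exp(-((i - stop_center) ** 2) / (2 * sigma ** 2))
        probs.append(min(1.0, peak * (g_start + g_stop)))
    return probs


def apply_mask(seq, probs, rng, mask_token="?"):
    """Replace each base with `mask_token` with its position's probability."""
    return "".join(mask_token if rng.random() < p else base
                   for base, p in zip(seq, probs))


if __name__ == "__main__":
    probs = mask_probabilities(100, start_center=20, stop_center=80, sigma=5.0)
    masked = apply_mask("A" * 100, probs, random.Random(0))
    print(masked)
```

Positions near either codon are masked most often, so the mask prediction task concentrates on the regions from which the key base sequence is later drawn.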

Description

mRNA sequence scoring method and related equipment

Technical Field

The invention relates to the technical field of gene sequence analysis, and in particular to an mRNA sequence scoring method and related equipment.

Background

mRNA vaccines have great potential in the treatment of respiratory diseases, tumors, rare diseases, and other conditions. Their effect depends on the translation efficiency of the mRNA sequence and the protein expression level. Current methods for screening high-quality mRNA sequences rely on traditional experimental techniques, which are time-consuming and labor-intensive, have low throughput, and have difficulty considering multiple factors together, so they cannot meet the efficiency requirements of practical applications. The prior art is deficient in evaluating the effectiveness of mRNA sequences and has difficulty screening comprehensively and accurately for sequences with higher translation efficiency and protein expression, which to some extent limits progress in fields such as gene therapy and protein drug development.

Specifically, the limitations of experimental methods are as follows:

(1) Low throughput. Traditional experimental methods, such as the fluorescent reporter gene method, generally measure the translation efficiency of mRNA sequences one at a time and cannot readily detect and screen large numbers of mRNA sequences in a high-throughput manner, which makes it hard to find sequences with high translation efficiency quickly in a large-scale mRNA sequence library.

(2) Large differences in experimental conditions. Different experiments differ in operating conditions, cell types, detection instruments, and so on, so their results are poorly comparable. This makes it difficult to evaluate the translation efficiency of different mRNA sequences accurately under a unified standard and hinders comprehensive, objective comparison and analysis of mRNA translation efficiency.

The limitations of bioinformatics methods are as follows:

(1) Simple models. Models based on early bioinformatics methods are relatively simple and are trained on limited data, so they cannot fully capture the influence of complex structures and regulatory elements in mRNA sequences on translation efficiency, and their prediction accuracy for translation efficiency is low.

(2) Inadequate consideration of sequence context information. Some methods focus only on local sequence features when analyzing mRNA sequences, which makes the assessment less accurate and less comprehensive.

(3) Difficulty integrating multiple indexes. Traditional bioinformatics methods can only analyze a single experimental index or a small number of them, whereas in practice many factors influence translation efficiency and many kinds of experimental data are involved, so these methods have difficulty integrating multiple experimental indexes effectively to evaluate mRNA translation efficiency comprehensively.

Disclosure of Invention

The main purpose of the embodiments of the invention is to provide an mRNA sequence scoring method, an mRNA sequence scoring device, electronic equipment, a storage medium, and a program product, which aim to solve at least one problem in the prior art.
To achieve the above object, an aspect of an embodiment of the present invention provides a method for scoring mRNA sequences, including: obtaining an mRNA sequence to be scored, wherein the mRNA sequence comprises a non-coding region and a coding region, and the non-coding region comprises a 5' end non-coding region positioned upstream of the starting end of the coding region and a 3' end non-coding region positioned downstream of the ending end of the coding region; extracting a key base sequence from the mRNA sequence, the key base sequence comprising a start codon context and a stop codon context, the start codon context characterizing a codon sequence of a preset window length at the start codon, the stop codon context characterizing a codon sequence of the preset window length at the stop codon; and inputting the key base sequence into a target scoring model, and processing to obtain the score of the mRNA sequence on a target index. The target scoring model comprises a multi-modal RNA sequence coding module and a large language model, and is obtained based on unsupervised pre-training on transcription sample data and supervised fine-tuning on a score label data set corresponding to the target index. In some embodiments, extracting the key base sequence from the mRNA sequence comprises the steps of: taking the first nucleotide position of the start codon and the last nucleotide position of the stop codon as double-center reference points, and performing standardized alignment on the mRNA sequence to obtain an alignment sequence; based on the alignment sequence, intercepting a sequence with a preset length at the u