CN-121999857-A - Gene splice site identification method and system based on small molecule regulation

Abstract

The embodiments disclose a gene splice site identification method and system based on small molecule regulation. The method comprises: obtaining the gene expression quantity of each candidate gene sequence in order to select a plurality of reliable gene sequences; obtaining candidate pseudo-exon insertion events of each reliable gene sequence under treatment with each small molecule compound; selecting reliable pseudo-exon insertion events from the plurality of candidate pseudo-exon insertion events; training a pre-constructed deep learning model on the reliable gene sequences and the reliable pseudo-exon insertion events to obtain a splice site identification model; obtaining a target gene sequence and the base position characteristic sequence corresponding to it; inputting the base position characteristic sequence into the splice site identification model, which predicts the probability that each base is a splice site; and obtaining the splice site coordinates corresponding to the target gene sequence according to the probability, thereby improving the accuracy of splice site identification.

Inventors

LIU YANG
LI JUNLIN
ZENG YING
ZHANG JIACHENG
LI JINXING
LI YANG

Assignees

溪砾科技(深圳)有限公司

Dates

Publication Date: 20260508
Application Date: 20251231

Claims (10)

1. A method for identifying splice sites of genes based on small molecule regulation, comprising: Obtaining the gene expression quantity of each candidate gene sequence, so as to select a plurality of reliable gene sequences from a plurality of candidate gene sequences according to the gene expression quantity; obtaining candidate pseudo-exon insertion events of each reliable gene sequence under the treatment of each small molecule compound, and obtaining splicing inclusion rate and sequencing data of pseudo-exons in each candidate pseudo-exon insertion event; Selecting a reliable pseudo-exon insertion event corresponding to each of the reliable gene sequences from a plurality of the candidate pseudo-exon insertion events according to the splicing inclusion rate and the sequencing data; Training a pre-constructed deep learning model according to the reliable gene sequence and the reliable pseudo-exon insertion event to obtain a splice site recognition model; Obtaining a target gene sequence, and coding the position of each base in the target gene sequence to obtain a base position characteristic sequence corresponding to the target gene sequence; And inputting the base position characteristic sequence into the splice site recognition model so as to predict the probability of each base as a splice site through the splice site recognition model, and obtaining the splice site coordinates corresponding to the target gene sequence according to the probability.
2. The method for identifying a splice site of a gene based on small molecule regulation according to claim 1, wherein said obtaining the gene expression quantity of each candidate gene sequence, so as to select a plurality of reliable gene sequences from a plurality of said candidate gene sequences according to the gene expression quantity, comprises: selecting a plurality of candidate gene sequences from a preset gene library, and performing transcriptome sequencing on each candidate gene sequence to obtain a plurality of sequencing reads corresponding to each candidate gene sequence and the read coordinates of each sequencing read in the candidate gene sequence; acquiring a plurality of exon region coordinates corresponding to each candidate gene sequence from a preset gene annotation file; comparing the read coordinates with the exon region coordinates to count the number of sequencing reads aligned within the exon region coordinates, thereby obtaining the raw read count corresponding to each candidate gene sequence; obtaining the transcript corresponding to each candidate gene sequence from the gene annotation file, so as to obtain the gene length corresponding to each candidate gene sequence according to the length of the transcript; determining a per-kilobase value corresponding to each candidate gene sequence according to the ratio of the raw read count to the gene length; obtaining the cumulative sum of the per-kilobase values over the plurality of candidate gene sequences, and determining the gene expression quantity corresponding to each candidate gene sequence according to the cumulative sum, a preset scaling factor and the per-kilobase value; and, for any one of the candidate gene sequences, determining the candidate gene sequence to be a reliable gene sequence when its gene expression quantity is greater than or equal to a preset expression quantity threshold.
3. The method of claim 2, wherein said obtaining candidate pseudo-exon insertion events of each of said reliable gene sequences under the treatment of each small molecule compound, and obtaining splicing inclusion rate and sequencing data of pseudo-exons in each of said candidate pseudo-exon insertion events, comprises: for any one reliable gene sequence, treating the reliable gene sequence with each of the different small molecule compounds to obtain a treated gene sequence corresponding to each small molecule compound; performing transcriptome sequencing on the treated gene sequence to obtain a plurality of treated sequencing reads corresponding to the treated gene sequence; acquiring the treated read coordinates of each treated sequencing read in the treated gene sequence, so as to obtain, from a comparison of the treated read coordinates with the read coordinates, the pseudo-donor site coordinates, pseudo-acceptor site coordinates, upstream donor site coordinates and downstream acceptor site coordinates of a pseudo-exon in the treated gene sequence; selecting a plurality of first splice junction reads from the plurality of treated sequencing reads according to the pseudo-acceptor site coordinates and the upstream donor site coordinates; selecting a plurality of second splice junction reads from the plurality of treated sequencing reads according to the pseudo-donor site coordinates and the downstream acceptor site coordinates; selecting a plurality of exclusion junction reads from the plurality of treated sequencing reads according to the upstream donor site coordinates and the downstream acceptor site coordinates; acquiring a first read count of the first splice junction reads, a second read count of the second splice junction reads and a third read count of the exclusion junction reads, and acquiring the average of the first read count and the second read count; and obtaining the ratio of the average to the sum of the average and the third read count, wherein the ratio is taken as the splicing inclusion rate of each pseudo-exon in the treated gene sequence and the average is taken as the sequencing data of each pseudo-exon.
4. The method of claim 3, wherein said selecting a reliable pseudo-exon insertion event corresponding to each of the reliable gene sequences from a plurality of the candidate pseudo-exon insertion events according to the splicing inclusion rate and the sequencing data comprises: determining the pseudo-exon to be a candidate reliable pseudo-exon when the splicing inclusion rate is greater than or equal to a preset inclusion rate threshold and the sequencing data is greater than or equal to a preset read count threshold; acquiring the number of treated gene sequences in which the candidate reliable pseudo-exon is present, and determining the pseudo-exon to be a reliable pseudo-exon when that number is greater than or equal to a preset sequence number threshold; and generating a reliable pseudo-exon insertion event corresponding to each reliable gene sequence according to the pseudo-donor site coordinates and pseudo-acceptor site coordinates of the reliable pseudo-exon and the exon region coordinates corresponding to the reliable gene sequence.
5. The method for identifying a splice site in a gene based on small molecule regulation according to claim 4, wherein said training the pre-constructed deep learning model according to the reliable gene sequence and the reliable pseudo-exon insertion event to obtain the splice site identification model comprises: labeling each base in the reliable gene sequence according to the pseudo-donor site coordinates, the pseudo-acceptor site coordinates and the exon region coordinates to obtain a labeled data set containing canonical splice site labels and pseudo-exon splice site labels; training the pre-constructed deep learning model on the labeled data set, and acquiring the loss value of the deep learning model in each training iteration; and optimizing the model parameters of the deep learning model according to the loss value and a preset model parameter optimizer until the loss value is less than or equal to a preset loss threshold, thereby obtaining the splice site identification model.
6. The method for identifying a splice site of a gene based on small molecule regulation according to claim 1, wherein said obtaining a target gene sequence, and coding the position of each base in the target gene sequence to obtain a base position characteristic sequence corresponding to the target gene sequence, comprises: mapping each base in the target gene sequence to a four-dimensional one-hot encoding vector according to the type of the base, to obtain an initial coding sequence; acquiring the sequence length of the initial coding sequence, and, when the sequence length is smaller than a preset window length, expanding the initial coding sequence at both ends to obtain a processed coding sequence; when the sequence length is greater than or equal to the window length, dividing the initial coding sequence in a sliding window manner into a plurality of subsequences whose length equals the window length; adding context regions of preset length at both ends of each subsequence to obtain the model input sequence corresponding to each subsequence; and obtaining the processed coding sequence or the model input sequence corresponding to each base, so as to obtain the base position characteristic sequence corresponding to the target gene sequence.
7. The method for identifying a splice site in a gene based on small molecule regulation according to claim 6, wherein said inputting the base position feature sequence into the splice site identification model, so as to predict the probability of each base being a splice site through the splice site identification model and obtain the splice site coordinates corresponding to the target gene sequence according to the probability, comprises: performing one-dimensional convolution on the base position feature sequence through the splice site identification model, so as to map the base position feature sequence into a high-channel feature tensor; extracting local context features and global context features of the high-channel feature tensor through a multi-cascade residual network preset in the splice site identification model, and performing weighted fusion of the local context features and the global context features to obtain base features corresponding to the target gene sequence; outputting the splice site type and type probability corresponding to each base in the target gene sequence according to the base features and a classification function preset in the splice site identification model; and obtaining the splice site coordinates corresponding to the target gene sequence according to the splice site type, the type probability and the base coordinates of the base in the target gene sequence.
8. The method for identifying gene splice sites based on small molecule regulation of claim 7, wherein the multi-cascade residual network comprises a plurality of cascaded residual block sub-networks with progressively increasing convolution kernel sizes; and said extracting local context features and global context features of the high-channel feature tensor through a multi-cascade residual network preset in the splice site identification model, and performing weighted fusion of the local context features and the global context features to obtain base features of the target gene sequence, comprises: for any residual block sub-network, performing convolution and normalization on the input features through the residual block sub-network to obtain intermediate features; performing self-attention weighting on the intermediate features along the sequence dimension of the target gene sequence through the residual block sub-network to obtain weighted features; adding the input features, the intermediate features and the weighted features to obtain the output features of the residual block sub-network, and taking the output features of the residual block sub-network as the input features of the next residual block sub-network; and taking the output features of the last residual block sub-network in the multi-cascade residual network as the base features of the target gene sequence.
9. The gene splice site recognition system based on small molecule regulation is characterized by comprising a gene sequence screening module, a pseudo-exon recognition module, a pseudo-exon selection module, a model training module, a base coding module and a gene splice recognition module; the gene sequence screening module is used for acquiring the gene expression quantity of each candidate gene sequence so as to select a plurality of reliable gene sequences from a plurality of candidate gene sequences according to the gene expression quantity; The pseudo-exon identification module is used for acquiring candidate pseudo-exon insertion events of each reliable gene sequence under the treatment of each small molecule compound, and acquiring splicing inclusion rate and sequencing data of pseudo-exons in each candidate pseudo-exon insertion event; the pseudo-exon selection module is used for selecting a reliable pseudo-exon insertion event corresponding to each reliable gene sequence from a plurality of candidate pseudo-exon insertion events according to the splicing inclusion rate and the sequencing data; The model training module is used for training a pre-constructed deep learning model according to the reliable gene sequence and the reliable pseudo-exon insertion event to obtain a splice site recognition model; The base coding module is used for obtaining a target gene sequence, and coding the position of each base in the target gene sequence to obtain a base position characteristic sequence corresponding to the target gene sequence; The gene splicing recognition module is used for inputting the base position characteristic sequence into the splice site recognition model so as to predict the probability of each base as a splice site through the splice site recognition model, and the splice site coordinates corresponding to the target gene sequence are obtained according to the probability.
10. The gene splice site recognition system based on small molecule regulation of claim 9, wherein the gene sequence screening module comprises a reference recognition unit and a sequence screening unit; the reference recognition unit is used for selecting a plurality of candidate gene sequences from a preset gene library, and performing transcriptome sequencing on each candidate gene sequence to obtain a plurality of sequencing reads corresponding to each candidate gene sequence and the read coordinates of each sequencing read in the candidate gene sequence, and for obtaining a plurality of exon region coordinates corresponding to each candidate gene sequence from a preset gene annotation file and comparing the read coordinates with the exon region coordinates to count the number of sequencing reads aligned within the exon region coordinates, so as to obtain the raw read count corresponding to each candidate gene sequence; the sequence screening unit is used for obtaining the transcript corresponding to each candidate gene sequence from the gene annotation file to obtain the gene length corresponding to each candidate gene sequence according to the length of the transcript, determining a per-kilobase value corresponding to each candidate gene sequence according to the ratio of the raw read count to the gene length, obtaining the cumulative sum of the per-kilobase values over the plurality of candidate gene sequences to determine the gene expression quantity corresponding to each candidate gene sequence according to the cumulative sum, a preset scaling factor and the per-kilobase value, and, for any one candidate gene sequence, determining that the candidate gene sequence is a reliable gene sequence when its gene expression quantity is greater than or equal to a preset expression quantity threshold.
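The expression-based screening recited in claims 2 and 10 can be sketched as follows. This is only a minimal illustration, assuming that the per-kilobase value is the read count per kilobase of transcript and that the preset scaling factor is 10^6 (which would make the result the familiar TPM measure); the claims fix neither. The function names, the toy counts and the threshold are hypothetical.

```python
from typing import Dict, List


def expression_levels(
    read_counts: Dict[str, int],
    gene_lengths_bp: Dict[str, int],
    scaling_factor: float = 1e6,  # assumed value; the claims only say "preset"
) -> Dict[str, float]:
    """Return a length-normalised expression level for each candidate gene."""
    # Per-kilobase value: raw exonic read count divided by gene length in kb.
    per_kb = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000.0)
        for gene in read_counts
    }
    total = sum(per_kb.values())  # cumulative sum over all candidate genes
    if total == 0:
        return {gene: 0.0 for gene in per_kb}
    return {gene: value / total * scaling_factor for gene, value in per_kb.items()}


def select_reliable(levels: Dict[str, float], threshold: float) -> List[str]:
    """Keep genes whose expression level is at or above the threshold."""
    return sorted(gene for gene, level in levels.items() if level >= threshold)


# Hypothetical toy data: three candidate genes.
counts = {"geneA": 500, "geneB": 300, "geneC": 2}
lengths = {"geneA": 2000, "geneB": 1000, "geneC": 4000}
levels = expression_levels(counts, lengths)
reliable = select_reliable(levels, threshold=1000.0)  # threshold is illustrative
```

With these toy numbers the weakly expressed geneC falls below the illustrative threshold and is dropped, which is the noise-filtering effect the screening step is meant to achieve.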

Description

Gene splice site identification method and system based on small molecule regulation

Technical Field

The invention relates to the technical field of intelligent medicine, and in particular to a gene splice site identification method and system based on small molecule regulation.

Background

As the traditional protein target space gradually becomes saturated, small molecule drugs that target RNA (ribonucleic acid) are becoming a new direction for innovative drug development. RNA plays a key role in transcriptional regulation, splicing, stability, translation and the like, and is regarded as an important field for next-generation targeted therapies. Research shows that alternative splicing is one of the key steps in the regulation of eukaryotic gene expression, so RNA splicing regulation has become an important emerging field of drug development.

In the prior art, splice site identification based on RNA splicing regulation usually identifies only the natural splice sites in a gene sequence, for example by using a deep learning network model. In practice, however, small molecule compounds can influence the regulation of splicing, and these identification methods do not take that influence into account: they identify only canonical donors, acceptors or other natural sites and cannot distinguish drug-regulated splice targets. As a result, new splice targets that arise under the action of small molecules cannot be found efficiently, and the accuracy of gene splice site identification is low.

Disclosure of Invention

In order to solve the above technical problems, the invention discloses a gene splice site identification method and system based on small molecule regulation, which are used to improve the accuracy of gene splice site identification.
In order to achieve the above object, the present invention discloses a method for identifying splice sites of genes based on small molecule regulation, comprising: Obtaining the gene expression quantity of each candidate gene sequence, so as to select a plurality of reliable gene sequences from a plurality of candidate gene sequences according to the gene expression quantity; obtaining candidate pseudo-exon insertion events of each reliable gene sequence under the treatment of each small molecule compound, and obtaining splicing inclusion rate and sequencing data of pseudo-exons in each candidate pseudo-exon insertion event; Selecting a reliable pseudo-exon insertion event corresponding to each of the reliable gene sequences from a plurality of the candidate pseudo-exon insertion events according to the splicing inclusion rate and the sequencing data; Training a pre-constructed deep learning model according to the reliable gene sequence and the reliable pseudo-exon insertion event to obtain a splice site recognition model; Obtaining a target gene sequence, and coding the position of each base in the target gene sequence to obtain a base position characteristic sequence corresponding to the target gene sequence; And inputting the base position characteristic sequence into the splice site recognition model so as to predict the probability of each base as a splice site through the splice site recognition model, and obtaining the splice site coordinates corresponding to the target gene sequence according to the probability. According to the gene splice site identification method based on small molecule regulation, the problem of accurately identifying the splice site is solved by integrating small molecule regulation factors to train a deep learning model. 
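The inclusion-rate calculation and the reliability screen summarised above (and spelled out in claims 3 and 4) can be sketched as below. The sketch assumes the junction read counts have already been extracted for each small molecule compound; the class and function names, the example counts and all three threshold values are illustrative assumptions, not values given in the source.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class JunctionCounts:
    upstream_inclusion: int    # reads joining upstream donor to pseudo-acceptor
    downstream_inclusion: int  # reads joining pseudo-donor to downstream acceptor
    exclusion: int             # reads joining upstream donor to downstream acceptor


def inclusion_rate(c: JunctionCounts) -> Tuple[float, float]:
    """Return (splicing inclusion rate, mean inclusion read count)."""
    mean_inclusion = (c.upstream_inclusion + c.downstream_inclusion) / 2.0
    denominator = mean_inclusion + c.exclusion
    rate = mean_inclusion / denominator if denominator > 0 else 0.0
    return rate, mean_inclusion


def is_reliable(
    counts_by_compound: Dict[str, JunctionCounts],
    min_rate: float = 0.1,   # assumed inclusion rate threshold
    min_reads: float = 5.0,  # assumed read count threshold
    min_compounds: int = 2,  # assumed sequence number threshold
) -> bool:
    """A pseudo-exon is reliable if enough treated samples support it."""
    supporting = 0
    for counts in counts_by_compound.values():
        rate, mean_reads = inclusion_rate(counts)
        if rate >= min_rate and mean_reads >= min_reads:
            supporting += 1
    return supporting >= min_compounds


# Hypothetical counts for one pseudo-exon under three compounds.
observed = {
    "compound_1": JunctionCounts(30, 34, 96),
    "compound_2": JunctionCounts(12, 8, 30),
    "compound_3": JunctionCounts(1, 0, 120),
}
rate_1, reads_1 = inclusion_rate(observed["compound_1"])
reliable = is_reliable(observed)
```

Averaging the two inclusion junctions before forming the ratio keeps a pseudo-exon, which contributes two junctions, from being counted twice against the single exclusion junction.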
The method comprises the steps of: obtaining the gene expression quantity of each candidate gene sequence and selecting reliable gene sequences according to the expression quantity, which guarantees the quality and reliability of the data source and avoids the noise introduced by low-expression sequences; obtaining the candidate pseudo-exon insertion events of each reliable gene sequence under treatment with the small molecule compounds, together with their splicing inclusion rate and sequencing data, so that the small molecule treatment directly captures drug-regulated splicing events and provides the model with real data obtained under regulation; selecting the reliable pseudo-exon insertion events according to the splicing inclusion rate and the sequencing data, which retains high-confidence events and guarantees the accuracy and representativeness of the training data; training a deep learning model according to the reliable gene sequences and the reliable pseudo-exon insertion events, which enables the model to learn splicing patterns under small molecule regulation and to identify new targets; obtaining the target gene sequence and encoding its base position characteristics, which converts the sequence information into structured input that the model can readily process; inputting base