CN-121215034-B - Functional nucleic acid de novo design framework and system based on nucleic acid language model

Abstract

The invention discloses a method for constructing a nucleic acid de novo design model based on a nucleic acid language model. The method comprises encoding input data with a nucleic acid language model fine-tuned on SELEX data, thereby mapping the input data into a high-dimensional, low-bias hidden space; converting that hidden space into low-dimensional, low-noise hidden space vectors with a linear layer; screening the hidden space vectors with the E-HEBO algorithm and affinity prior data to obtain affinity-optimal hidden space vectors; and decoding the affinity-optimal hidden space vectors with a linear layer and a self-attention layer to generate candidate sequences. The method further comprises iterative encoding and decoding, together with iterative generation of affinity prior data, to obtain high-quality, high-accuracy and high-activity candidate sequences. The invention provides an efficient artificial intelligence method for nucleic acid de novo design, which facilitates the design and screening of functional nucleic acids and their application in biomedicine.
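
The data flow in the abstract (high-dimensional vectors, linear reduction to a small hidden space, decoding to per-token probabilities, sampling candidates) can be illustrated with a minimal NumPy sketch. It is an illustration only, not the patented implementation: the random matrices stand in for trained weights, the embedding width of 768, the sequence length of 12 and the single-nucleotide vocabulary are assumptions, and the decoder here is a single linear layer with no self-attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)

TOKENS = ["A", "C", "G", "T"]  # hypothetical single-nucleotide vocabulary
HIGH_DIM = 768                 # assumed language-model embedding width
N = 8                          # hidden-space size (the claims allow 7 to 9)
SEQ_LEN = 12                   # hypothetical candidate length

# Stand-in for language-model output: one high-dimensional vector per sequence.
high_dim_vectors = rng.normal(size=(5, HIGH_DIM))

# Encoder linear layer (random weights here): HIGH_DIM -> N.
W_enc = rng.normal(scale=HIGH_DIM ** -0.5, size=(HIGH_DIM, N))
hidden = high_dim_vectors @ W_enc  # shape (5, N)

# Decoder stand-in: one linear layer from N to SEQ_LEN * len(TOKENS) logits.
W_dec = rng.normal(size=(N, SEQ_LEN * len(TOKENS)))


def decode_probs(z):
    """Map one hidden vector to a per-position distribution over tokens."""
    logits = (z @ W_dec).reshape(SEQ_LEN, len(TOKENS))
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def sample_candidates(probs, n_samples):
    """Sample every position independently, n_samples times."""
    seqs = []
    for _ in range(n_samples):
        idx = [rng.choice(len(TOKENS), p=row) for row in probs]
        seqs.append("".join(TOKENS[i] for i in idx))
    return seqs


probs = decode_probs(hidden[0])
candidates = sample_candidates(probs, n_samples=4)
```

In the method as claimed, the sampled candidates would then be scored by the nucleic acid language model and the highest-scoring one kept; that scoring step is omitted here because it requires the trained model.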

Inventors

HAN DA
GUO PEI
ZHANG ZHIMING
QIU JIEZHONG
CHEN GUANGYONG

Assignees

臻智达生物技术（上海）有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-09-10

Claims (10)

1. A method of constructing a nucleic acid de novo design model based on a nucleic acid language model, the method comprising the steps of:
(S1) providing a nucleic acid dataset comprising a SELEX dataset, the SELEX dataset comprising affinity nucleic acid sequences and/or non-affinity nucleic acid sequences;
(S2) in an encoder,
  (S2.1) converting the nucleic acid sequences in the nucleic acid dataset into a plurality of tokens using a tokenizer, thereby obtaining preprocessed nucleic acid sequences;
  (S2.2) converting the preprocessed nucleic acid sequences into high-dimensional hidden space vectors using a nucleic acid language model, and reducing the high-dimensional hidden space vectors to N-dimensional hidden space vectors using a linear layer, wherein N is an integer from 7 to 9;
(S3) screening the N-dimensional hidden space vectors by:
  (S3.1) clustering all the N-dimensional hidden space vectors obtained in step (S2) using a clustering algorithm, and selecting the central hidden space vector of each cluster;
  (S3.2) taking the hidden space vectors of the affinity nucleic acid sequences as prior data, and searching within a preset range, using a Bayesian optimization algorithm, for hidden space vectors whose affinity is superior to that of the prior data; and/or
  (S3.3) taking the hidden space vectors of the affinity nucleic acid sequences as prior data, and selecting hidden space vectors that are close in distance to them;
  and inputting the hidden space vectors generated in steps (S3.1), (S3.2) and/or (S3.3) into a decoder;
(S4) in the decoder, mapping the hidden space vectors into a probability distribution over the tokens using a network model, so as to generate a candidate sequence and a predicted probability for each of its tokens;
(S5) performing iterative optimization by the following steps to form the nucleic acid de novo design model based on the nucleic acid language model:
  (S5.1) sequence optimization iteration: sampling each token output in step (S4) multiple times based on the probability distribution to obtain a plurality of candidate sequences, scoring the plurality of candidate sequences using the nucleic acid language model, and selecting the highest-scoring sequence as the final sequence; and/or
  (S5.2) prior data iteration:
    (S5.2a) detecting the affinity of the candidate sequences, and separating them into affinity candidate sequences and non-affinity candidate sequences;
    (S5.2b) iterating steps (S2)-(S4) and (S5.2a) a number of times using the result of step (S5.2a) as the nucleic acid dataset, and adding the hidden space vectors of the affinity candidate sequences to the prior data in step (S3.2) and/or (S3.3) at each iteration, thereby obtaining a final sequence;
wherein the nucleic acid language model comprises a DNA language model and an RNA language model, the DNA language model being DNABERT and the RNA language model being RNAErnie.
2. The method of claim 1, wherein DNABERT is selected from the group consisting of DNABERT-3mers, DNABERT-4mers, DNABERT-5mers, and DNABERT-6mers, and wherein N = 8.
3. The method of claim 1 or 2, wherein the Bayesian optimization algorithm is selected from the group consisting of the E-HEBO algorithm, the HEBO algorithm, and other Bayesian optimization-based algorithms, wherein the E-HEBO algorithm, in each round of Gaussian process modeling, uses only the best-binding sequence among all known affinity nucleic acid sequences, and its affinity, as the optimal function value, and, when defining the search space, centers it on the embedding corresponding to the optimal function value and sets a progressively smaller search radius so as to continuously increase affinity.
4. The method of claim 1, wherein the network model comprises a self-attention layer, a multi-layer perceptron (MLP), a linear layer, a convolutional neural network (CNN), a recurrent neural network (RNN), Mamba, Hyena, a GAN, or a combination thereof.
5. A method of producing a highly active nucleic acid aptamer, the method comprising the steps of:
(X1) providing a nucleic acid dataset comprising a SELEX dataset, the SELEX dataset comprising affinity nucleic acid sequences and/or non-affinity nucleic acid sequences;
(X2) in an encoder, converting the nucleic acid sequences in the nucleic acid dataset into a plurality of tokens using a tokenizer to obtain preprocessed nucleic acid sequences, converting the preprocessed nucleic acid sequences into high-dimensional hidden space vectors using a continually pre-trained nucleic acid language model, and reducing the high-dimensional hidden space vectors to N-dimensional hidden space vectors using a linear layer, wherein N is an integer from 7 to 9, and wherein the nucleic acid language model comprises a DNA language model and an RNA language model, the DNA language model being DNABERT and the RNA language model being RNAErnie;
(X3) clustering all the N-dimensional hidden space vectors obtained in step (X2) using a clustering algorithm, selecting the central hidden space vector of each cluster, and inputting it into a decoder so as to generate candidate sequences;
(X4) taking the hidden space vectors of nucleic acid sequences with known affinity as prior data, determining the hidden space vectors nearest to the prior data according to spatial distance, and inputting them into the decoder so as to determine the corresponding nucleic acid sequences;
(X5) detecting the affinity of the top 10-100 nucleic acid sequences in the SELEX dataset, the optimized sequences obtained in step (X3), and the preferred sequences obtained in step (X4), and dividing them into affinity sequences and non-affinity sequences;
(X6) inputting the nucleic acid sequences obtained in step (X5) into the encoder to obtain the hidden space vector of each sequence, taking the hidden space vectors of the nucleic acid sequences with known affinity and of the affinity sequences as prior data, searching within a preset range for hidden space vectors close to the prior data using a Bayesian algorithm, inputting those hidden space vectors into the decoder to generate candidate sequences, inputting the candidate sequences into the encoder for iterative optimization to obtain optimized sequences, screening according to the properties of the optimized sequences, and selecting the optimized sequences whose properties reach a preset level as sequences to be tested;
(X7) detecting the affinity of the sequences to be tested, and dividing them into affinity test sequences and non-affinity test sequences;
(X8) iterating steps (X6)-(X7) a plurality of times, and adding the hidden space vectors of the affinity test sequences to the prior data in step (X6) at each iteration, thereby obtaining the highly active nucleic acid aptamer.
6. The method of claim 5, further comprising, prior to step (X1), a step (X0) of training the nucleic acid language model in the encoder and the decoder, comprising the steps of: (X0.1) providing SELEX data for model training, the SELEX data being sequencing data from the last round of SELEX; (X0.2) masking the SELEX data at a preset masking rate, and performing continual pre-training of the nucleic acid language model by masked prediction on the masked SELEX data, thereby obtaining a continually pre-trained nucleic acid language model, wherein the preset masking rate is 10%-20%; (X0.3) training the decoder on the SELEX data while keeping the training result obtained in step (X0.2) unchanged, thereby obtaining a trained decoder.
7. A nucleic acid de novo design system based on a nucleic acid language model, the system comprising: an input unit configured to input data, the data being a SELEX dataset comprising affinity nucleic acid sequences and/or non-affinity nucleic acid sequences; a design unit configured as a nucleic acid de novo design model based on a nucleic acid language model, which generates highly active nucleic acid aptamers from the SELEX dataset, wherein the model is constructed by the method of claim 1; and an output unit configured to output a result of the design unit.
8. An electronic device comprising a processor and a memory, wherein the memory has a plurality of executable instructions, the processor being configured to read the instructions and perform the steps in the method of claim 5.
9. A computer readable storage medium having stored therein computer executable instructions which when read and executed by a processor perform the steps in the method of claim 5.
10. A computer program product comprising computer-executable instructions which, when executed by a processor, implement the steps in the method of claim 5.
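
The search-space rule recited for E-HEBO in claim 3 (center the search on the embedding of the best known binder and shrink the radius round by round) can be illustrated with a minimal sketch. It is not the E-HEBO or HEBO algorithm: the Gaussian process surrogate and acquisition function are replaced by uniform random sampling, `toy_affinity` is a hypothetical stand-in for a measured affinity, and the radius, shrink factor and round counts are arbitrary assumptions.

```python
import numpy as np


def toy_affinity(z):
    """Hypothetical stand-in for a measured affinity; peaks at z = 0.5."""
    return -float(np.sum((z - 0.5) ** 2))


def shrinking_radius_search(prior_z, prior_scores, rounds=6, per_round=32,
                            radius=1.0, shrink=0.7, seed=0):
    """Search around the best-known hidden vector with a shrinking radius."""
    rng = np.random.default_rng(seed)
    best_i = int(np.argmax(prior_scores))
    best_z = prior_z[best_i].copy()
    best_score = float(prior_scores[best_i])
    history = [best_score]
    for _ in range(rounds):
        # Search box centred on the current best hidden vector.
        offsets = rng.uniform(-radius, radius, size=(per_round, best_z.size))
        candidates = best_z + offsets
        scores = np.array([toy_affinity(c) for c in candidates])
        j = int(np.argmax(scores))
        if scores[j] > best_score:  # only the single best point is carried on
            best_z, best_score = candidates[j], float(scores[j])
        history.append(best_score)
        radius *= shrink  # progressively smaller search radius
    return best_z, best_score, history


# Prior data: 8-dimensional hidden vectors with (toy) known affinities.
rng = np.random.default_rng(1)
prior_z = rng.uniform(-1.0, 1.0, size=(10, 8))
prior_scores = np.array([toy_affinity(z) for z in prior_z])

best_z, best_score, history = shrinking_radius_search(prior_z, prior_scores)
```

Because only improvements replace the incumbent, `history` never decreases; in the claimed method each round's affinities would come from experimental detection rather than from a closed-form function.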

Description

Functional nucleic acid de novo design framework and system based on nucleic acid language model

Technical Field

The invention belongs to the fields of artificial intelligence, nucleic acid screening and biomedicine. In particular, it relates to technology for designing, optimizing and screening nucleic acids using artificial intelligence, especially technology based on a nucleic acid language model, and further relates to methods for the research, development and design of nucleic acid drugs.

Background

Nucleic acid aptamers are a class of functional nucleic acid molecules with high specificity and affinity that can precisely recognize and bind target molecules such as cell surface proteins, cytokines and small molecules. By binding stably to its target, an aptamer can regulate biological functions, and therefore has broad application potential. By virtue of their unique molecular recognition capabilities, nucleic acid aptamers have been widely used in disease diagnosis, targeted drug delivery, molecular imaging and biosensors. Aptamers play an important role in precision medicine: they enable targeted therapy against individual biomarkers, thereby promoting personalized medicine and bringing new breakthroughs and innovation to drug development and disease detection.

SELEX (Systematic Evolution of Ligands by EXponential enrichment) is a versatile and powerful aptamer screening method that, through iterative rounds of selection and amplification, can identify aptamers that bind specifically to a particular target. The main steps of SELEX are: preparing a random nucleic acid library, binding the library to the target molecule, separating bound from unbound sequences, amplifying the bound sequences by PCR, repeating the screening, and selecting the optimal sequences.
However, the frequency enrichment produced by SELEX screening is not necessarily strong, owing to factors such as the nature of the target protein, PCR amplification bias and incomplete libraries, so SELEX experiments sometimes fail. Even a failed SELEX experiment generates a large amount of sequencing data in each round, and these data contain rich information about the relationship between nucleic acid sequences and the target. Early studies of SELEX sequencing data focused on scoring the sequenced sequences with bioinformatics algorithms that combine sequence features with each sequence's frequency of occurrence in each round, in order to find nucleic acid sequences that are low in frequency but likely to bind strongly; RaptRanker is one example. Another approach is to generate entirely new sequences based on the distribution of the sequencing data, known as de novo aptamer design; RaptGen is one example.

RaptGen was the first to propose using SELEX sequencing data to generate high-binding sequences that are not present in the nucleic acid library. It uses deep learning to learn the latent distribution of the sequencing data, adopting a VAE architecture and applying Bayesian optimization in the hidden space to generate high-binding nucleic acid sequences. However, the RaptGen encoder is based on a CNN, which is weaker than the self-attention mechanism at sequence modeling. The RaptGen hidden space also has a low dimension: each sequence is represented only by a two-dimensional vector, which cannot fully capture the semantics of the sequence and in turn degrades generation performance. In addition, the RaptGen decoder is a hidden Markov model, so inference has high time complexity, which limits the speed at which the model generates sequences.
Therefore, there is an urgent need in the art for a method and system for constructing a low-noise, high-search-efficiency de novo design model for SELEX sequences, so as to generate a large number of high-quality, high-accuracy sequences and to screen further on that basis.

Disclosure of Invention

The invention aims to provide a method and a system for constructing a high-efficiency, high-accuracy model for the de novo design and screening of nucleic acid SELEX sequences.

In a first aspect of the present invention, there is provided a method of constructing a nucleic acid de novo design model based on a nucleic acid language model, the method comprising the steps of: (S1) providing a nucleic acid dataset comprising a SELEX dataset, the SELEX dataset comprising affinity nucleic acid sequences and/or non-affinity nucleic acid sequences; (S2) in an encoder, (S2.1) converting the nucleic acid sequences in the nucleic acid dataset into a plurality of tokens using a tokenizer, thereby obtaining preprocessed nucleic acid sequences; (S2.2) converting the preprocessed nucleic acid sequences into high-dimensional hidden space vectors using a nucleic acid language model, and redu