CN-116711015-B - Vector model training method, negative sample generating method, medium and device


Abstract

The disclosure provides a vector model training method, a negative sample generation method, a storage medium and an electronic device, and relates to the technical field of artificial intelligence. The method comprises: obtaining a plurality of RNA sequences and a plurality of protein sequences; vectorizing the plurality of RNA sequences to obtain a plurality of RNA first vectors; vectorizing the plurality of protein sequences to obtain a plurality of protein first vectors; determining the interaction between the RNA sequences and the protein sequences according to the RNA first vectors and the protein first vectors; calculating the distance between any two RNA sequences to obtain the similarity of a plurality of RNA-RNA pairs; calculating the distance between any two protein sequences to obtain the similarity of a plurality of protein-protein pairs; training a vector model according to the interaction between the RNA sequences and the protein sequences, the similarity of the RNA-RNA pairs and the similarity of the protein-protein pairs; and generating a target negative sample by using the trained vector model.

Inventors

ZHANG ZHENZHONG

Assignees

京东方科技集团股份有限公司

Dates

Publication Date: 20260512
Application Date: 20220104

Claims (20)

1. A vector model training method, comprising: obtaining a plurality of RNA sequences and a plurality of protein sequences; vectorizing the plurality of RNA sequences to obtain a plurality of RNA first vectors; vectorizing the plurality of protein sequences to obtain a plurality of protein first vectors; determining an interaction between the RNA sequence and the protein sequence based on the RNA first vector and the protein first vector; calculating the distance between any two RNA sequences to obtain the similarity of a plurality of RNA-RNA pairs; calculating the distance between any two protein sequences to obtain the similarity of a plurality of protein-protein pairs; and training a vector model based on the interactions between the RNA sequences and the protein sequences, the similarities of the RNA-RNA pairs, and the similarities of the protein-protein pairs; wherein determining the interaction between the RNA sequence and the protein sequence based on the RNA first vector and the protein first vector comprises: calculating, according to a formula [formula omitted in source], a probability value of interaction between the RNA sequence and the protein sequence, and determining the interaction between the RNA sequence and the protein sequence according to the probability value, where θ is a model parameter, v_R is the RNA first vector, and v_P is the protein first vector; wherein calculating the distance between any two RNA sequences to obtain the similarity of a plurality of RNA-RNA pairs comprises: calculating the edit distance between any two RNA sequences, and obtaining the sequence distance of the two RNA sequences according to the edit distance; and obtaining the similarity of the plurality of RNA-RNA pairs according to the sequence distance of any two RNA sequences; wherein calculating the distance between any two protein sequences to obtain the similarity of a plurality of protein-protein pairs comprises: mapping the plurality of protein sequences into a vector space to obtain a plurality of protein vectors; and calculating the distance between any two protein vectors to obtain the similarity of the protein-protein pairs; wherein training the vector model based on the interactions between the RNA sequences and the protein sequences, the similarities of the RNA-RNA pairs, and the similarities of the protein-protein pairs comprises: constructing an objective function based on the interactions between the RNA sequences and the protein sequences, the similarity of the RNA-RNA pairs, and the similarity of the protein-protein pairs; and, based on the objective function, iteratively updating model parameters of the vector model by using a stochastic gradient descent algorithm, and completing training of the vector model when an iteration termination condition is met; wherein the objective function [formula omitted in source] is expressed in terms of the i-th RNA sequence, the j-th RNA sequence, the i-th protein sequence, the j-th protein sequence, the first vector of the i-th RNA, and the first vector of the i-th protein; α, β and γ are model hyperparameters; and K is the number of RNA sequences and protein sequences.
2. The vector model training method of claim 1, wherein vectorizing the plurality of RNA sequences to obtain a plurality of RNA first vectors comprises: converting each RNA sequence into N base k-mer subsequences; and vectorizing each base k-mer subsequence to obtain the RNA first vector.
3. The vector model training method of claim 2, wherein vectorizing each base k-mer subsequence to obtain the RNA first vector comprises: encoding each base k-mer subsequence to obtain first vectors of the N base k-mer subsequences; inputting the first vectors of the N base k-mer subsequences into a recurrent neural network, and outputting N base k-mer vectors; and obtaining the RNA first vector according to the N base k-mer vectors.
4. The vector model training method of claim 1, wherein vectorizing the plurality of protein sequences to obtain a plurality of protein first vectors comprises: converting each protein sequence into M amino acid k-mer subsequences; and vectorizing each amino acid k-mer subsequence to obtain the protein first vector.
5. The vector model training method of claim 4, wherein vectorizing each amino acid k-mer subsequence to obtain the protein first vector comprises: encoding each amino acid k-mer subsequence to obtain first vectors of the M amino acid k-mer subsequences; inputting the first vectors of the M amino acid k-mer subsequences into a recurrent neural network, and outputting M amino acid k-mer vectors; and obtaining the protein first vector according to the M amino acid k-mer vectors.
6. The vector model training method of claim 1, wherein calculating the edit distance between any two RNA sequences and obtaining the sequence distance of the two RNA sequences according to the edit distance comprises: obtaining the sequence distance of any two RNA sequences according to a formula [formula omitted in source] whose terms are the edit distance between the two RNA sequences, the length of the first RNA sequence, and the length of the second RNA sequence.
7. The vector model training method of claim 6, wherein obtaining the similarity of the plurality of RNA-RNA pairs according to the sequence distance of any two RNA sequences comprises: obtaining the similarity of the plurality of RNA-RNA pairs according to a formula [formula omitted in source] whose term is the sequence distance of the two RNA sequences.
8. A negative sample generation method, comprising: obtaining a positive example RNA-protein pair; vectorizing a target RNA sequence and a target protein sequence in the positive example RNA-protein pair through a trained vector model to obtain a corresponding RNA second vector and protein second vector, wherein the trained vector model is trained by the vector model training method according to any one of claims 1-7; and obtaining a target negative example RNA-protein pair corresponding to the positive example RNA-protein pair based on the RNA second vector and the protein second vector, wherein the target negative example RNA-protein pair is used for training an RNA-protein interaction prediction model.
9. The negative sample generation method of claim 8, wherein obtaining the target negative example RNA-protein pair corresponding to the positive example RNA-protein pair based on the RNA second vector and the protein second vector comprises: calculating the similarity between the target RNA sequence and any RNA sequence other than the target RNA sequence; screening the RNA sequences other than the target RNA sequence according to the similarity to obtain candidate RNA sequences; calculating a relationship score between the candidate RNA sequence and the target protein sequence according to the RNA second vector of the candidate RNA sequence and the protein second vector of the target protein sequence; and determining the target negative example RNA-protein pair according to the relationship score between the candidate RNA sequence and the target protein sequence.
10. The negative sample generation method of claim 9, wherein calculating the relationship score between the candidate RNA sequence and the target protein sequence according to the RNA second vector of the candidate RNA sequence and the protein second vector of the target protein sequence comprises: calculating, based on the model parameters of the vector model, the relationship score between the candidate RNA sequence and the target protein sequence according to the RNA second vector of the candidate RNA sequence and the protein second vector of the target protein sequence, wherein the model parameters are obtained by training the vector model.
11. The negative sample generation method of claim 10, wherein calculating, based on the model parameters of the vector model, the relationship score between the candidate RNA sequence and the target protein sequence according to the RNA second vector of the candidate RNA sequence and the protein second vector of the target protein sequence comprises: calculating the relationship score between the candidate RNA sequence and the target protein sequence according to a formula [formula omitted in source] whose terms are the RNA second vector of the candidate RNA sequence, the protein second vector of the target protein sequence, and the model parameters of the trained vector model, and in which an operator denotes the dot product of its two operands.
12. The negative sample generation method of claim 9, wherein determining the target negative example RNA-protein pair according to the relationship score between the candidate RNA sequence and the target protein sequence comprises: when the relationship score between the candidate RNA sequence and the target protein sequence meets a preset condition, obtaining a first negative sample set from the candidate RNA sequence; and determining the target negative example RNA-protein pair from the first negative sample set.
13. The negative sample generation method of claim 8, wherein obtaining the target negative example RNA-protein pair corresponding to the positive example RNA-protein pair based on the RNA second vector and the protein second vector comprises: calculating the similarity between the target protein sequence and any protein sequence other than the target protein sequence; screening the protein sequences other than the target protein sequence according to the similarity to obtain candidate protein sequences; calculating a relationship score between the candidate protein sequence and the target RNA sequence according to the protein second vector of the candidate protein sequence and the RNA second vector of the target RNA sequence; and determining the target negative example RNA-protein pair according to the relationship score between the candidate protein sequence and the target RNA sequence.
14. The negative sample generation method of claim 13, wherein determining the target negative example RNA-protein pair according to the relationship score between the candidate protein sequence and the target RNA sequence comprises: when the relationship score between the candidate protein sequence and the target RNA sequence meets a preset condition, placing the candidate protein sequence into a second negative sample set; and determining the target negative example RNA-protein pair from the second negative sample set.
15. The negative sample generation method of claim 8, wherein the method further comprises: acquiring a training data set, wherein the training data set consists of a plurality of RNA-protein pairs; determining, by the vector model, the interactions between RNA sequences and protein sequences, the similarities of RNA-RNA pairs, and the similarities of protein-protein pairs in the training data set; constructing an objective function based on the interactions between the RNA sequences and the protein sequences, the similarity of the RNA-RNA pairs, and the similarity of the protein-protein pairs; and, based on the objective function, iteratively updating model parameters of the vector model by using a stochastic gradient descent algorithm, and completing training of the vector model when the iteration termination condition is met.
16. A negative sample generation method, comprising: obtaining a positive sample, wherein the positive sample consists of two biomolecule sequences; vectorizing a first biomolecule sequence in the positive sample through a trained network model to obtain a biomolecule vector of the first biomolecule sequence; calculating a similarity between the biomolecule vector of the first biomolecule sequence and a biomolecule vector of any homologous biomolecule sequence other than the first biomolecule sequence; determining, based on the similarity, a homologous target biomolecule sequence that is similar to the first biomolecule sequence; and obtaining a negative sample corresponding to the positive sample from the homologous target biomolecule sequence and a second biomolecule sequence in the positive sample; wherein the trained network model is trained by: obtaining a plurality of homologous biomolecule sequences; vectorizing the plurality of homologous biomolecule sequences to obtain a plurality of biomolecule vectors; calculating the distance between any two homologous biomolecule sequences to obtain the similarity of a plurality of homologous biomolecule sequence pairs; and training the network model according to the biomolecule vectors and the similarity of the homologous biomolecule sequence pairs; wherein, when the homologous biomolecule sequences are RNA sequences, calculating the distance between any two homologous biomolecule sequences to obtain the similarity of a plurality of homologous biomolecule sequence pairs comprises: calculating the edit distance between any two RNA sequences, and obtaining the sequence distance of the two RNA sequences according to the edit distance; and obtaining the similarity of a plurality of RNA-RNA pairs according to the sequence distance of any two RNA sequences; and wherein, when the homologous biomolecule sequences are protein sequences, calculating the distance between any two homologous biomolecule sequences to obtain the similarity of a plurality of homologous biomolecule sequence pairs comprises: mapping the plurality of protein sequences into a vector space to obtain a plurality of protein vectors; and calculating the distance between any two protein vectors to obtain the similarity of a plurality of protein-protein pairs.
17. The negative sample generation method of claim 16, wherein the first biomolecule sequence is an RNA sequence and the biomolecule vector is an RNA third vector, and wherein vectorizing the first biomolecule sequence in the positive sample through the trained network model to obtain the biomolecule vector of the first biomolecule sequence comprises: converting each RNA sequence into N base k-mer subsequences; and vectorizing each base k-mer subsequence through the trained network model to obtain the RNA third vector.
18. The negative sample generation method of claim 17, wherein vectorizing each base k-mer subsequence through the trained network model to obtain the RNA third vector comprises: encoding each base k-mer subsequence to obtain first vectors of the N base k-mer subsequences; inputting the first vectors of the N base k-mer subsequences into a pre-trained model, and outputting N base k-mer vectors; and obtaining the RNA third vector according to the N base k-mer vectors.
19. The negative sample generation method of claim 17, wherein calculating the similarity between the biomolecule vector of the first biomolecule sequence and the biomolecule vector of any homologous biomolecule sequence other than the first biomolecule sequence comprises: calculating, according to a formula [formula omitted in source], the similarity between two RNA third vectors, where the two vectors are the RNA third vectors of the two respective RNA sequences.
20. The negative sample generation method of claim 16, wherein the first biomolecule sequence is a protein sequence and the biomolecule vector is a protein third vector, and wherein vectorizing the first biomolecule sequence in the positive sample through the trained network model to obtain the biomolecule vector of the first biomolecule sequence comprises: converting each protein sequence into M amino acid k-mer subsequences; and vectorizing each amino acid k-mer subsequence through the trained network model to obtain the protein third vector.
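
As an illustration of the edit-distance step recited in claims 1, 6 and 16, the following Python sketch computes the Levenshtein edit distance between two RNA sequences and derives a length-normalized sequence distance from it. The claims state only that the sequence distance depends on the edit distance and on the lengths of the two sequences; the specific normalization used here (division by the longer length) and the function names `edit_distance` and `normalized_distance` are assumptions made for this example, not the claimed formula.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """ASSUMED normalization: edit distance divided by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    print(edit_distance("AUGC", "AUGG"))
    print(normalized_distance("AUGC", "AUGG"))
```

With this assumed normalization the distance lies between 0 and 1, so any monotonically decreasing mapping of it could serve as a similarity; the mapping actually claimed is not reproduced in this text.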

Description

Vector model training method, negative sample generating method, medium and device

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular, to a vector model training method, a negative sample generation method, a computer-readable storage medium, and an electronic device.

Background

In modern biological research, as functional genomics research deepens, the study of the biological functions of biomolecules has become very important, and interaction analysis of biomolecules has become an indispensable means in current biological function research.

Taking RNA as an example, non-coding RNA (ncRNA) is involved in many complex cellular processes, plays an important role in life processes such as alternative splicing, chromatin modification and epigenetic inheritance, and is closely related to many diseases. Studies have shown that most non-coding RNAs fulfill their regulatory functions by interacting with proteins. Therefore, research on the interaction of non-coding RNAs with proteins is of great significance in revealing the molecular mechanisms of action of non-coding RNAs in human diseases and vital activities, and has become one of the important approaches for analyzing the functions of non-coding RNAs and proteins.

It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure provides a vector model training method, a negative sample generation method, a computer-readable storage medium, and an electronic device.
The present disclosure provides a vector model training method, comprising: obtaining a plurality of RNA sequences and a plurality of protein sequences; vectorizing the plurality of RNA sequences to obtain a plurality of RNA first vectors; vectorizing the plurality of protein sequences to obtain a plurality of protein first vectors; determining an interaction between the RNA sequence and the protein sequence based on the RNA first vector and the protein first vector; calculating the distance between any two RNA sequences to obtain the similarity of a plurality of RNA-RNA pairs; calculating the distance between any two protein sequences to obtain the similarity of a plurality of protein-protein pairs; and training the vector model based on the interactions between the RNA sequences and the protein sequences, the similarities of the RNA-RNA pairs, and the similarities of the protein-protein pairs.

In an exemplary embodiment of the present disclosure, vectorizing the plurality of RNA sequences to obtain a plurality of RNA first vectors comprises: converting each RNA sequence into N base k-mer subsequences; and vectorizing each base k-mer subsequence to obtain the RNA first vector.

In an exemplary embodiment of the present disclosure, vectorizing each base k-mer subsequence to obtain the RNA first vector comprises: encoding each base k-mer subsequence to obtain first vectors of the N base k-mer subsequences; inputting the first vectors of the N base k-mer subsequences into a recurrent neural network, and outputting N base k-mer vectors; and obtaining the RNA first vector according to the N base k-mer vectors.

In an exemplary embodiment of the present disclosure, vectorizing the plurality of protein sequences to obtain a plurality of protein first vectors comprises: converting each protein sequence into M amino acid k-mer subsequences; and vectorizing each amino acid k-mer subsequence to obtain the protein first vector.
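
As an illustration of the k-mer conversion and encoding steps described above, the following Python sketch splits a sequence into overlapping k-mer subsequences and assigns each one an integer code that could then be looked up in an embedding table. The function names `to_kmers` and `encode_kmers`, the sliding window with stride 1, the choice k = 3, and the positional integer encoding are assumptions made for this example only; the text above does not specify how the k-mers are extracted or encoded.

```python
def to_kmers(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a sequence into k-mer subsequences using a sliding window (assumed scheme)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def encode_kmers(kmers: list[str], alphabet: str) -> list[int]:
    """Map each k-mer to an integer by reading it as a number in base len(alphabet)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    for kmer in kmers:
        code = 0
        for ch in kmer:
            code = code * len(alphabet) + index[ch]
        codes.append(code)
    return codes


if __name__ == "__main__":
    rna_kmers = to_kmers("AUGGCUA", k=3)             # N = 5 base 3-mers
    rna_codes = encode_kmers(rna_kmers, alphabet="ACGU")
    print(rna_kmers)
    print(rna_codes)
```

The same two functions apply to protein sequences by passing an amino acid alphabet; the resulting codes would be the inputs that a recurrent neural network turns into k-mer vectors.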
In an exemplary embodiment of the present disclosure, vectorizing each amino acid k-mer subsequence to obtain the protein first vector comprises: encoding each amino acid k-mer subsequence to obtain first vectors of the M amino acid k-mer subsequences; inputting the first vectors of the M amino acid k-mer subsequences into the recurrent neural network, and outputting M amino acid k-mer vectors; and obtaining the protein first vector according to the M amino acid k-mer vectors.

In an exemplary embodiment of the present disclosure, determining the interaction between the RNA sequence and the protein sequence based on the RNA first vector and the protein first vector comprises: calculating, according to a formula [formula omitted in source], a probability value of interaction between the RNA sequence and the protein sequence, and determining the interaction between the RNA sequence and the protein sequence according to the probability value, wherein θ is a model parameter, v_R is the RNA first vector, and v_P is the protein first vector.

In an exemplary embodiment of the present disclosure, calculating the distance between any two RNA sequences to obtain the similarity of a plurality of RNA-RNA pairs comprises: calculating the edit distance between any two