CN-122024843-A - Collagen melting temperature prediction method based on deep learning

Abstract

The invention discloses a collagen melting temperature prediction method based on deep learning. The method comprises: performing multidimensional encoding, in combination with amino acid types, on a collagen sequence processed by a dynamic mask, and performing sequence processing according to the dynamic mask to obtain a multidimensional feature tensor; performing feature mapping on the multidimensional feature tensor through two convolution layers in sequence; constraining, according to the dynamic mask, the features at the corresponding sequence positions in the feature map output by the convolution layers; standardizing the sequence length of the feature map output by the convolution layers through a first pooling layer; and aggregating through a second pooling layer to obtain a global feature vector. By processing variable-length sequences, the method overcomes the limitation of fixed-length input; by using the peptide-aware convolution of the first convolution layer together with amino acid type encoding, it accurately extracts the key structural features of collagen; and by combining multi-level feature learning and aggregation, it achieves high-precision, widely applicable prediction of the collagen melting temperature.
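
The mask constraint and sequence length standardization steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the mask is taken to be already aligned with the positions of the feature map, masked positions are zeroed, and the first pooling layer is modeled as adaptive average pooling. The function names `constrain` and `standardize_length` are hypothetical and do not come from the patent.

```python
from typing import List


def constrain(features: List[List[float]], mask: List[int]) -> List[List[float]]:
    """Zero the feature rows whose sequence position is padding (mask == 0).

    Assumption: `mask` has one 0/1 entry per row of `features`.
    """
    return [[v * m for v in row] for row, m in zip(features, mask)]


def standardize_length(features: List[List[float]], target: int) -> List[List[float]]:
    """Average-pool the rows into `target` bins (adaptive average pooling, assumed).

    Input is [length][channels]; output is [target][channels] for any length >= 1.
    """
    n = len(features)
    pooled = []
    for i in range(target):
        start = (i * n) // target                      # floor(i * n / target)
        end = ((i + 1) * n + target - 1) // target     # ceil((i + 1) * n / target)
        window = features[start:end]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


# Usage: four positions, two channels, last position is padding.
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fixed = standardize_length(constrain(feats, [1, 1, 1, 0]), target=2)
```

Whatever the input length, the output always has `target` rows, which is what allows later layers to assume a fixed size.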

Inventors

ZHANG JUNLI
WANG RUIYAN
WANG XINGLONG
ZENG XUAN

Assignees

北京华熙荣熙生物技术研究有限公司

Dates

Publication Date: 20260512
Application Date: 20251229

Claims (10)

1. A collagen melting temperature prediction method based on deep learning, characterized by comprising the following steps: performing multidimensional encoding, in combination with amino acid types, on a collagen sequence processed by a dynamic mask, and performing sequence processing according to the dynamic mask to obtain a multidimensional feature tensor; performing feature mapping on the multidimensional feature tensor through two convolution layers in sequence; and constraining, according to the dynamic mask, the features at the corresponding sequence positions in the feature map output by the convolution layers, standardizing the sequence length of the feature map output by the convolution layers through a first pooling layer, and aggregating through a second pooling layer to obtain a global feature vector, so as to predict the melting temperature.
2. The method according to claim 1, characterized in that the dynamic mask processing is specifically: determining the length L of the collagen amino acid sequence; if L > 1024, truncating the sequence and retaining at most 1024 amino acid symbols; if L < 64, performing end padding with padding symbols so that the length reaches 64; and generating the dynamic mask according to the symbol types in the collagen amino acid sequence, wherein the symbol types comprise amino acid symbols and padding symbols.
3. The method according to claim 1, characterized in that the amino acid types include standard amino acids and modified amino acids, wherein the modified amino acids include at least any one of hydroxyproline, pyrrolysine and hydroxylysine.
4. The method according to claim 3, characterized in that performing multidimensional encoding in combination with amino acid types and performing sequence processing according to the dynamic mask to obtain the multidimensional feature tensor is specifically: identifying the amino acid symbols according to the dynamic mask so as to process the amino acid symbols; and performing processing according to the amino acid types and performing multimodal feature fusion to obtain the multidimensional feature tensor, wherein the fused features comprise at least any one of physicochemical properties and chemical features, and the dimension of the resulting feature tensor is the same as the number of amino acid types.
5. The method according to claim 1, characterized in that the stride of the first convolution layer is determined specifically by: determining a collagen repeating pattern according to the number of repetitions; and determining the stride of the first convolution layer based on the number of peptides in the collagen repeating pattern.
6. The method according to claim 5, characterized in that the stride of the first convolution layer is greater than the stride of the second convolution layer.
7. The method according to claim 1, characterized in that constraining, according to the dynamic mask, the features at the corresponding sequence positions in the feature map output by the convolution layers, and standardizing the sequence length of the feature map output by the convolution layers through the first pooling layer, is specifically: identifying the amino acid symbols according to the dynamic mask, and constraining the features at the corresponding sequence positions in the feature map output by the convolution layers; and standardizing the feature map output by the convolution layers through the first pooling layer to obtain a standardized feature space.
8. The method according to claim 7, characterized by further comprising, after standardizing the sequence length: passing the standardized sequence data through a plurality of residual enhancement blocks in sequence to obtain a plurality of residual enhancement vectors; obtaining a progressive convolution vector from any two of the residual enhancement vectors through one layer of a progressive convolution network; and obtaining a further progressive convolution vector from any two of the progressive convolution vector and the residual enhancement vectors through the next layer of the progressive convolution network, this step being repeated in sequence until processing by the multilayer progressive convolution network is complete; preferably, the number of output channels of the progressive convolution network increases with the layer number.
9. The method according to claim 8, characterized in that aggregating through the second pooling layer to obtain the global feature vector preferably further comprises: obtaining an attention weight map through spatial attention from the progressive convolution vector obtained after the loop has been executed; and weighting and correcting the progressive convolution vector according to the attention weight map, performing spatial information aggregation and dimensionality reduction through the second pooling layer, and, after flattening, obtaining the global feature vector through a fully connected layer.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 9.
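
As an illustration of the dynamic mask processing in claim 2, the following minimal Python sketch truncates or end-pads a sequence and derives a 0/1 mask from the symbol types. The padding symbol `-`, the function name `dynamic_mask`, and the choice to keep the first 1024 symbols when truncating are assumptions made for the example; the claims do not specify them.

```python
from typing import List, Tuple

MAX_LEN = 1024  # upper length bound stated in claim 2
MIN_LEN = 64    # lower length bound stated in claim 2
PAD = "-"       # hypothetical padding symbol; the claims do not name one


def dynamic_mask(seq: str) -> Tuple[str, List[int]]:
    """Return the length-adjusted sequence and its mask (1 = amino acid, 0 = padding)."""
    if len(seq) > MAX_LEN:
        # Assumption: truncation keeps the first 1024 symbols.
        seq = seq[:MAX_LEN]
    elif len(seq) < MIN_LEN:
        # End padding so that the length reaches 64.
        seq = seq + PAD * (MIN_LEN - len(seq))
    mask = [0 if symbol == PAD else 1 for symbol in seq]
    return seq, mask


# Usage: a 30-residue peptide is padded to 64; a 300-residue one is left unchanged.
short_seq, short_mask = dynamic_mask("GPP" * 10)
mid_seq, mid_mask = dynamic_mask("GPP" * 100)
```

Sequences between 64 and 1024 symbols pass through at their native length, so the mask, rather than a fixed input size, tells later layers which positions hold real residues.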

Description

Collagen melting temperature prediction method based on deep learning

Technical Field

The invention belongs to the technical field of collagen melting temperature prediction, and particularly relates to a collagen melting temperature prediction method based on deep learning.

Background

Collagen, one of the most abundant protein families in living organisms, is widely used in biomaterials, drug delivery systems, tissue engineering scaffolds, and the food and cosmetics industries owing to its good biocompatibility and degradability. The thermal stability of collagen, generally characterized by its melting temperature (Tm), is critical to maintaining its native structure and normal function. Accurate prediction of the Tm value of collagen is therefore important in the development and quality control of related products.

Traditional experimental methods, such as differential scanning calorimetry (DSC) and circular dichroism (CD), give reliable results but are inherently time-consuming (days) and costly (thousands of yuan per sample), which severely restricts research and development efficiency and large-scale application. Computational prediction methods have been developed for this reason. Schemes based on molecular dynamics simulation estimate the Tm value by simulating the folding state of the protein at high temperature. These methods rely on accurate force field parameters and a known three-dimensional structure, so they consume enormous computing resources, and they lack targeted optimization for the specific repeating pattern of collagen, which limits their prediction accuracy.

With the development of deep learning, schemes based on general protein language models (such as the Transformer) have emerged. These schemes treat protein sequences as natural language and capture long-range dependencies through self-attention mechanisms.
However, such schemes have significant limitations. First, the tokenization they typically employ breaks up the repeating units characteristic of collagen. Second, to meet the model's fixed input size, sequences must be strictly truncated or padded to a fixed length, so native collagen sequences of varying lengths (often exceeding 1000 amino acids) cannot be handled and a significant amount of structural information is lost.

To address collagen prediction more specifically, schemes based on a dedicated convolutional neural network (CNN), such as ColNet, have emerged. These adopt a CNN architecture better suited to local pattern extraction and begin to support encoding of hydroxyproline, which is characteristic of collagen. Even so, the network architecture remains inadequate. On the one hand, the input sequence length is still strictly fixed (e.g., 64 amino acids), so the bottleneck of processing variable-length sequences is not overcome. On the other hand, the convolution layer design cannot be accurately aligned with the core repeating structural unit of collagen, so feature extraction lacks biological plausibility and further improvement in prediction accuracy is limited.

In summary, the prior art, and deep-learning-based prediction schemes in particular, commonly suffers from rigid sequence length, misaligned feature extraction, and incomplete encoding and characterization.
Disclosure of Invention

The invention provides a collagen melting temperature prediction method based on deep learning. It aims to solve the problems in existing prediction techniques of rigid sequence length caused by fixed input length, misaligned feature extraction caused by generic encoding and network design, and incomplete encoding and characterization of key modified amino acids such as hydroxyproline, thereby achieving high-precision, high-efficiency melting temperature prediction for collagen sequences.

The technical scheme adopted by the invention is as follows: a collagen melting temperature prediction method based on deep learning comprises the following steps: performing multidimensional encoding, in combination with amino acid types, on a collagen sequence processed by a dynamic mask, and performing sequence processing according to the dynamic mask to obtain a multidimensional feature tensor; performing feature mapping on the multidimensional feature tensor through two convolution layers in sequence; and constraining, according to the dynamic mask, the features at the corresponding sequence positions in the feature map output by the convolution layers, standardizing the sequence length of the feature map output by the convolution layers through the first pooling layer, and aggregating through the second poolin