CN-122024845-A - Metagenome-assembled genome-oriented gram attribute identification method and terminal
Abstract
The invention discloses a gram attribute identification method for metagenome-assembled genomes, together with a computer-readable storage medium and a terminal. The identification method comprises: constructing and screening assembled genomes; extracting composition features normalized by the total genome length and functional features normalized by the total number of genes; mapping the two types of features into a space of the same dimension through parallel mapping modules to obtain a composition embedding and a functional embedding; fusing the two embeddings with a bidirectional cross-attention module to generate fusion features; and extracting features with a Transformer encoder and outputting the gram attribute identification result. By extracting and fusing the composition features and the functional features in parallel, the invention eliminates the heterogeneity between the two feature spaces, captures the complex associations between the features through the bidirectional cross-attention mechanism, realizes adaptive feature-interaction learning, makes full use of the complementary multidimensional information in the genome, and significantly improves the accuracy of gram attribute identification.
Inventors
- WANG JINGYUAN
- LIU FU
- LIU YUN
- HOU TAO
- KANG BING
- DUAN JILU
- LI ZHIHUI
Assignees
- Jilin University (吉林大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260331
Claims (10)
- 1. A gram attribute identification method for metagenome-assembled genomes, comprising the steps of: constructing an assembled genome of a microorganism to be identified, screening the assembled genome with quality evaluation software, and retaining the assembled genomes that meet preset completeness and contamination requirements; extracting composition features and functional features of the assembled genome, wherein the composition features are normalized by the total length of the genome and the functional features are normalized by the total number of genes; mapping the composition features and the functional features, through a first mapping module and a second mapping module respectively, into a representation space of the same dimension to obtain a composition embedding and a functional embedding; fusing the composition embedding and the functional embedding through a pair of cross-attention modules to generate a fusion feature T, wherein the first cross-attention module takes one of the two embeddings as Query and the second cross-attention module takes the other embedding as Query; and inputting the fusion feature T into a Transformer encoder for feature extraction, and outputting the gram attribute identification result through mean pooling and a fully connected layer.
- 2. The method of claim 1, wherein constructing the assembled genome of the microorganism to be identified comprises: assembling the raw DNA sequence fragments obtained by metagenomic sequencing with assembly software to obtain DNA contig sequences; clustering the DNA contig sequences with binning software, performing quality evaluation on the clustering results, and retaining the results with a completeness of at least 50% and a contamination of at most 10% to obtain the assembled genome.
- 3. The method of claim 1, wherein extracting the composition features specifically comprises: extracting the 4-mer frequencies of each DNA contig sequence in the assembled genome, summing all the 4-mer frequencies, and dividing by the total length of all DNA contig sequences to obtain the normalized 4-mer frequency features.
- 4. The method of claim 1, wherein extracting the functional features specifically comprises: identifying the gene coding regions in the assembled genome sequence to obtain the total number N of genes; annotating the gene coding regions with functional databases, and counting the numbers of genes annotated to the different databases and classification levels; and dividing each gene count by the total number N of genes for normalization to obtain the functional feature vector; wherein the functional feature vector consists of the following 27 statistical features: the numbers of genes annotated to each of the 3 databases GO, COG and KEGG; the numbers of genes annotated to the 3 subcategories of the COG database; the numbers of genes annotated to the 3 GO domains; and the numbers of genes annotated to the 18 COG secondary classifications.
- 5. The method of claim 1, wherein the first mapping module and the second mapping module each comprise two linear (perceptron) layers with GELU activation functions, and the composition embedding and the functional embedding have the same dimension.
- 6. The method of claim 1, wherein the fusion logic of the cross-attention modules is: the first cross-attention module takes one of the composition embedding and the functional embedding as Query and the other as Key and Value to obtain a first embedded representation; the second cross-attention module takes the two embeddings in the exchanged roles, as Query and as Key and Value respectively, to obtain a second embedded representation; and the first and second embedded representations are concatenated to obtain the fused representation T, whose shape is two-dimensional.
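- 7. The method of claim 6, wherein the embedded representation of the first cross-attention module is calculated as $\mathrm{softmax}\left(\frac{Q_1 K_1^{\top}}{\sqrt{d}}\right) V_1$, wherein $Q_1$ is a linear transformation of the embedding taken as Query, and $K_1$ and $V_1$ are linear transformations of the embedding taken as Key and Value, each transformation being given by one of 3 learnable matrices with dimensions of 128 × 128; the second cross-attention module, with the other embedding as Query and the remaining embedding as Key and Value, obtains its embedded representation as $\mathrm{softmax}\left(\frac{Q_2 K_2^{\top}}{\sqrt{d}}\right) V_2$, wherein $Q_2$, $K_2$ and $V_2$ are likewise obtained with 3 learnable matrices with dimensions of 128 × 128; $K_1^{\top}$ and $K_2^{\top}$ are the transposes of $K_1$ and $K_2$, and $d$ is the feature dimension.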
- 8. The method of claim 1, wherein the Transformer encoder comprises a multi-head self-attention mechanism and a feed-forward neural network, and each encoder layer is provided with a residual connection.
- 9. A computer-readable storage medium storing one or more programs executable by one or more processors to perform the steps of the gram attribute identification method for metagenome-assembled genomes of any of claims 1-8.
- 10. A terminal, comprising a processor and a memory, wherein the memory stores a computer-readable program executable by the processor, and the processor, when executing the computer-readable program, implements the steps of the gram attribute identification method for metagenome-assembled genomes according to any one of claims 1 to 8.
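The screening and feature-extraction steps recited in claims 2 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes that "4-mer frequency" means raw 4-mer occurrence counts, that windows containing ambiguous bases are skipped, and that reverse-complement 4-mers are not merged; none of this is stated in the text. The function names `passes_quality`, `composition_features` and `functional_features` are hypothetical.

```python
from itertools import product

K = 4
# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def passes_quality(completeness, contamination):
    """Thresholds from claim 2: completeness >= 50 %, contamination <= 10 %."""
    return completeness >= 50.0 and contamination <= 10.0


def composition_features(contigs):
    """Sum 4-mer counts over all contigs, then divide by the total contig length."""
    counts = dict.fromkeys(KMERS, 0)
    total_length = 0
    for seq in contigs:
        seq = seq.upper()
        total_length += len(seq)
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            if kmer in counts:  # assumption: skip windows with N or other codes
                counts[kmer] += 1
    if total_length == 0:
        raise ValueError("the assembled genome contains no sequence")
    return [counts[kmer] / total_length for kmer in KMERS]


def functional_features(annotation_counts, total_genes):
    """Divide each annotation count by the total number N of predicted genes."""
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return [count / total_genes for count in annotation_counts]


# Toy usage with made-up inputs (not data from the patent).
toy_contigs = ["ACGTACGT", "AAAA"]
composition = composition_features(toy_contigs)   # 256 values
functional = functional_features([30, 45, 15], 60)
```

In the claimed method the functional vector has 27 entries; the three-entry list above only shows the normalization by N.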
Description
Metagenome-assembled genome-oriented gram attribute identification method and terminal
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a gram attribute identification method for metagenome-assembled genomes, a computer-readable storage medium and a terminal.
Background
Traditional methods for identifying the gram attribute of bacteria rely primarily on gram staining experiments. The method classifies bacteria as gram-positive or gram-negative according to the staining result and has long been regarded as a gold standard for bacterial classification. However, this experiment-based method has obvious limitations: the experimental procedure is cumbersome and time-consuming, and it requires trained personnel and dedicated equipment, so its cost in labor and materials is high. More importantly, the traditional experimental method can hardly meet the demand for efficient batch processing posed by the analysis of today's large-scale next-generation sequencing data from pathogen metagenomes.
To address these problems, researchers have developed a variety of bioinformatics approaches for predicting the gram staining of bacteria. However, the existing bioinformatics methods are mainly directed at microbial data with complete taxonomic information, whereas for metagenome-assembled genomes (MAGs) obtained by assembly and binning in metagenomic studies, the prior art does not yet provide an effective gram attribute identification scheme.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a gram attribute identification method for metagenome-assembled genomes, a computer-readable storage medium and a terminal, which aim to solve the problems of high cost and low efficiency in the existing identification of bacterial gram attributes.
Specifically, in a first aspect, a gram attribute identification method for metagenome-assembled genomes comprises the steps of: constructing an assembled genome of a microorganism to be identified, screening the assembled genome with quality evaluation software, and retaining the assembled genomes that meet preset completeness and contamination requirements; extracting composition features and functional features of the assembled genome, wherein the composition features are normalized by the total length of the genome and the functional features are normalized by the total number of genes; mapping the composition features and the functional features, through a first mapping module and a second mapping module respectively, into a representation space of the same dimension to obtain a composition embedding and a functional embedding; fusing the composition embedding and the functional embedding through a pair of cross-attention modules to generate a fusion feature T, wherein the first cross-attention module takes one of the two embeddings as Query and the second cross-attention module takes the other embedding as Query; and inputting the fusion feature T into a Transformer encoder for feature extraction, and outputting the gram attribute identification result through mean pooling and a fully connected layer. The following are preferred technical solutions of the invention; they do not limit the technical solution provided by the invention, but they better achieve its objects and advantages.
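The bidirectional fusion step described above can be sketched in plain Python. The sketch is illustrative only and rests on several assumptions not confirmed by this text: attention takes the standard scaled dot-product form softmax(QKᵀ/√d)V, the first module uses the composition embedding as Query, each embedding is a single 128-dimensional token, and the two outputs are stacked along the token axis to form T. The helper names `matmul`, `cross_attention`, `fuse` and `random_matrix` are hypothetical, and the random weights stand in for learned parameters.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_src, kv_src, w_q, w_k, w_v):
    """Scaled dot-product attention with Query from one embedding, Key/Value from the other."""
    q = matmul(query_src, w_q)
    k = matmul(kv_src, w_k)
    v = matmul(kv_src, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def fuse(e_comp, e_func, params_1, params_2):
    """Run both directions and stack the results (assumed token-axis concatenation)."""
    first = cross_attention(e_comp, e_func, *params_1)   # assumption: composition as Query
    second = cross_attention(e_func, e_comp, *params_2)  # roles exchanged
    return first + second


def random_matrix(rng, rows, cols):
    """Placeholder for learned weights or embeddings (illustration only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D = 128  # matches the 128 x 128 matrices named in claim 7
e_comp = random_matrix(rng, 1, D)
e_func = random_matrix(rng, 1, D)
params_1 = [random_matrix(rng, D, D) for _ in range(3)]  # W_Q, W_K, W_V of module 1
params_2 = [random_matrix(rng, D, D) for _ in range(3)]  # W_Q, W_K, W_V of module 2
T = fuse(e_comp, e_func, params_1, params_2)
print(len(T), len(T[0]))  # 2 128
```

Under the single-token assumption each softmax runs over one score, so every attention weight is 1 and each module returns the Value projection of the other embedding. A real implementation would more likely use a tensor library and could treat the embeddings as multi-token sequences.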
As a preferred embodiment of the identification method, constructing the assembled genome of the microorganism to be identified specifically comprises: assembling the raw DNA sequence fragments obtained by metagenomic sequencing with assembly software to obtain DNA contig sequences; clustering the DNA contig sequences with binning software, performing quality evaluation on the clustering results, and retaining the results with a completeness of at least 50% and a contamination of at most 10% to obtain the assembled genome.
As a preferred embodiment of the identification method, extracting the composition features specifically comprises: extracting the 4-mer frequencies of each DNA contig sequence in the assembled genome, summing all the 4-mer frequencies, and dividing by the total length of all DNA contig sequences to obtain the normalized 4-mer frequency features.
As a preferred embodiment of the identification method, extracting the functional features specifically comprises: identifying the gene coding regions in the assembled genome sequence to obtain the total number N of genes; annotating the gene coding regions with functional databases, and counting the numbers of genes annotated to the different databases and classification levels; dividing each gene count by the total number N of genes for normalization to obtain functi