CN-121983121-A - Traceability analysis and multi-habitat homology evaluation method for antibiotic drug resistance genes in environmental samples

Abstract

The invention discloses a method for source-tracking analysis and multi-habitat homology evaluation of antibiotic resistance genes in environmental samples, used to characterize the cross-habitat propagation of resistance genes and to identify their potential sources. The method comprises: 1) establishing an abundance matrix of the diverse antibiotic resistance genes in all environmental samples, constructing a resistance gene source-tracking model with a machine learning algorithm, and predicting the potential contribution ratio of each habitat to the resistome of a target habitat; 2) for the habitats so identified, reconstructing bacterial genomes by genome assembly and binning, identifying homologous resistance genes in the 'source-sink' habitats by multiple sequence alignment and phylogenetic analysis, constructing a multi-medium propagation map of the resistance genes and a propagation network of their bacterial hosts, and clarifying the connectivity and propagation mechanisms between habitats. By integrating machine-learning source tracking with gene homology analysis, the invention enables a systematic evaluation of cross-habitat propagation of resistance genes and provides technical support for identifying pollution sources and for the risk prevention and control of bacterial drug resistance.
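The contribution-ratio estimate described in the abstract can be illustrated with a minimal Python sketch of an expectation-maximization (EM) update for mixing proportions. This is a simplified illustration and not the FEAST implementation: the source profiles are held fixed, no unknown source is modeled, and the function name `estimate_contributions` and the toy habitat names are hypothetical.

```python
def normalize(counts):
    """Convert raw gene counts into a relative-abundance profile."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile has no counts")
    return [c / total for c in counts]


def estimate_contributions(sink_counts, source_counts, n_iter=500, tol=1e-10):
    """Estimate the share of each source habitat in the sink (target) habitat.

    sink_counts   -- list of ARG counts in the sink, one entry per gene
    source_counts -- dict mapping source name to a list of ARG counts
                     (same gene order as sink_counts)
    Simplifying assumption: source profiles are fixed and no unknown
    source is modeled, unlike the full FEAST model.
    """
    profiles = {name: normalize(c) for name, c in source_counts.items()}
    names = list(profiles)
    if any(len(p) != len(sink_counts) for p in profiles.values()):
        raise ValueError("all profiles must cover the same genes")

    # Start from equal contributions.
    alpha = {name: 1.0 / len(names) for name in names}
    for _ in range(n_iter):
        new_alpha = dict.fromkeys(names, 0.0)
        for j, x in enumerate(sink_counts):
            if x == 0:
                continue
            denom = sum(alpha[n] * profiles[n][j] for n in names)
            if denom == 0:
                continue  # gene j is absent from every source
            # E-step: split the x observations of gene j among the sources.
            for n in names:
                new_alpha[n] += x * alpha[n] * profiles[n][j] / denom
        norm = sum(new_alpha.values())
        if norm == 0:
            raise ValueError("no sink gene is present in any source")
        # M-step: renormalize the expected counts into proportions.
        new_alpha = {n: v / norm for n, v in new_alpha.items()}
        shift = max(abs(new_alpha[n] - alpha[n]) for n in names)
        alpha = new_alpha
        if shift < tol:
            break
    return alpha


# Toy data: the sink was built as 70% "soil" profile plus 30% "sewage" profile.
sources = {"soil": [80, 20, 0], "sewage": [0, 20, 80]}
sink = [560, 200, 240]
contributions = estimate_contributions(sink, sources)
```

In this toy case the estimate for "soil" approaches 0.7, since the sink was constructed as an exact mixture. Real data would also need an unknown-source component, which this sketch omits.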

Inventors

  • MA LIPING
  • FANG YUXIANG
  • JIAO PENGBO

Assignees

  • East China Normal University (华东师范大学)

Dates

Publication Date
20260505
Application Date
20260126

Claims (5)

  1. A method for source-tracking analysis of antibiotic resistance genes in environmental samples and evaluation of their multi-habitat homology, characterized by comprising the following steps:
     Step 1, acquiring environmental microbial metagenome data from different habitats, comprising: 1-1, collecting environmental samples from different habitats; 1-2, extracting microbial genomic DNA from the environmental samples and performing metagenomic sequencing to obtain raw metagenome data for the different habitats; 1-3, performing quality control on the raw metagenome data, filtering out low-quality sequences with a base quality lower than 30 and a sequence length smaller than 150 bp, to obtain quality-controlled, high-quality metagenome datasets for all environmental samples;
     Step 2, constructing an abundance matrix of the diverse antibiotic resistance genes in different habitats, namely: based on the high-quality metagenome datasets obtained in step 1-3, using the ARGs-OAP tool to align against the Structured Antibiotic Resistance Gene database SARG v3.0.0, and identifying and quantifying the resistance genes in the different habitats to obtain abundance matrices of the diverse resistance genes in the different habitats;
     Step 3, constructing the source-tracking model FEAST for microbial source tracking using a machine learning algorithm, and identifying and quantifying the contribution rate of potential source habitats to the resistance gene composition of the target habitat, namely: based on the abundance matrices obtained in step 2, defining one habitat as the target habitat, screening the potential source habitats, constructing the source-tracking model FEAST using a machine-learning expectation-maximization algorithm based on the abundance matrices of the resistance genes in the potential source habitats, verifying and optimizing the prediction accuracy and generalization capability of the model using an iterative method and a simulated-dataset method, and calculating the contribution rate of each source habitat to the target habitat from the inter-habitat relationships predicted by the model;
     Step 4, selecting habitats for medium connectivity and gene homology analysis, namely: according to the source-sink habitat relationships predicted in step 3 and their contribution rates, selecting the habitats whose contribution rate to the target habitat is greater than 1% for connectivity and/or gene homology analysis, and defining the selected habitats as potential connectivity habitats;
     Step 5, extracting the high-quality metagenome datasets of the potential connectivity habitats, namely: screening and extracting, from the high-quality metagenome datasets obtained in step 1-3, those of the potential connectivity habitats selected in step 4;
     Step 6, reconstructing high-quality bacterial genomes in the potential connectivity habitats, namely: calling the genome binning analysis platform MetaWRAP at a Linux terminal through conda, merging the paired-end sequences of the high-quality metagenome datasets of all samples into one file using the cat tool, assembling the merged short reads using the genome assembly tool MEGAHIT, binning the assembled contig file using the MetaBAT, CONCOCT and MaxBin algorithms, refining and filtering the binning results using BIN_REFINEMENT, and finally retaining high-quality bacterial genomes with a completeness of at least 50% and a contamination of no more than 10%;
     Step 7, performing antibiotic resistance gene annotation and species annotation on the high-quality bacterial genomes reconstructed in the potential connectivity habitats, namely: for the high-quality bacterial genomes reconstructed in step 6, predicting the protein sequences of the open reading frames (ORFs) using Prodigal, aligning the ORF sequences against the Structured Antibiotic Resistance Gene database SARG v3.0.0 using the BLASTP tool, and identifying resistance gene sequences by the criteria of similarity of at least 80%, coverage of at least 75% and e-value of no more than 10^-7, to obtain the high-quality bacterial genomes carrying resistance genes, defined as ARG-MAGs; and simultaneously performing species annotation on the ARG-MAGs using the GTDB-Tk tool to obtain the microbial host information and classification results of the ARG-MAGs in the potential connectivity habitats;
     Step 8, constructing a phylogenetic tree of the antibiotic resistance genes in the potential connectivity habitats, namely: based on the ARG-MAGs obtained in step 7, extracting the nucleotide sequences of the same class of antibiotic resistance genes from the reconstructed high-quality bacterial genomes of the microbial hosts, aligning the nucleotide sequences using the ClustalW tool, and constructing the phylogenetic tree of the antibiotic resistance genes in the potential connectivity habitats from the alignment using the FastTree tool;
     Step 9, elucidating the cross-habitat propagation characteristics of the antibiotic resistance genes, namely: according to the phylogenetic tree obtained in step 8, comparing and analyzing the resistance gene homology characteristics of the potential source habitats and the target habitat based on clustering relationships, branch structure and differences in evolutionary branch length, constructing a multi-medium propagation map of the resistance genes and a propagation network of their bacterial hosts, and comprehensively evaluating the cross-habitat propagation characteristics of the antibiotic resistance genes in the potential connectivity habitats.
  2. The method of claim 1, wherein the different habitats of step 1-1 include soil, water, sediment, sewage treatment systems, biological environments, and human activity environments.
  3. The method of claim 1, wherein identifying and quantifying the antibiotic resistance genes in different habitats in step 2 specifically comprises: importing the quality-controlled sample metagenome dataset files into a Linux terminal, calling the ARGs-OAP tool through conda, annotating, classifying and quantifying the antibiotic resistance genes in two stages, rapid screening followed by accurate classification, and outputting quantification results at three levels (type, subtype and gene), wherein the gene level is selected as the abundance file for quantifying the antibiotic resistance genes across habitats.
  4. The method of claim 1, wherein constructing the source-tracking model FEAST using the machine-learning expectation-maximization algorithm in step 3 specifically comprises: preparing the input files for source-tracking analysis, including an abundance file of the antibiotic resistance genes and a classification information file, wherein the abundance file is formatted with each column as a sample and each row as an antibiotic resistance gene, and the classification information file defines the target habitat as the sink and the other habitats as sources; importing the abundance file and the classification information file into R; installing the software packages required by the FEAST model; and constructing the source-tracking model FEAST based on the expectation-maximization algorithm.
  5. The method of claim 1, wherein extracting the nucleotide sequences of the same class of antibiotic resistance genes in step 8 comprises: screening the classes of resistance genes carried by the ARG-MAGs, and calling the seqkit tool to extract the nucleotide sequences of the same class of antibiotic resistance genes from the high-quality genomes of the potential connectivity habitats according to the ORF number and the genome number in the BLASTP alignment results.
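The numeric criteria recited in steps 4, 6 and 7 of claim 1 can be summarized in a short Python sketch. It only restates the thresholds from the claims; the function names, the argument layout, and the example habitat names are hypothetical, and the claims do not specify whether coverage refers to the query or the reference sequence, so the value is taken as given.

```python
# Thresholds as recited in claim 1.
MIN_CONTRIBUTION = 0.01    # step 4: contribution rate greater than 1%
MIN_COMPLETENESS = 50.0    # step 6: completeness of at least 50%
MAX_CONTAMINATION = 10.0   # step 6: contamination of no more than 10%
MIN_SIMILARITY = 80.0      # step 7: similarity of at least 80%
MIN_COVERAGE = 75.0        # step 7: coverage of at least 75%
MAX_EVALUE = 1e-7          # step 7: e-value of no more than 10^-7


def select_connectivity_habitats(contributions):
    """Step 4: keep source habitats contributing more than 1% to the sink.

    contributions -- dict mapping habitat name to a fraction in [0, 1]
    """
    return [h for h, c in contributions.items() if c > MIN_CONTRIBUTION]


def is_high_quality_bin(completeness, contamination):
    """Step 6: quality filter for a reconstructed genome (values in percent)."""
    return completeness >= MIN_COMPLETENESS and contamination <= MAX_CONTAMINATION


def is_arg_hit(similarity, coverage, evalue):
    """Step 7: decide whether a BLASTP hit against SARG counts as an ARG."""
    return (
        similarity >= MIN_SIMILARITY
        and coverage >= MIN_COVERAGE
        and evalue <= MAX_EVALUE
    )


# Hypothetical example values, for illustration only.
selected = select_connectivity_habitats(
    {"soil": 0.62, "sewage": 0.30, "sediment": 0.005}
)
```

Here `selected` keeps "soil" and "sewage" and drops "sediment", whose contribution of 0.5% is below the 1% cutoff.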

Description

Traceability analysis and multi-habitat homology evaluation method for antibiotic drug resistance genes in environmental samples

Technical Field

The invention belongs to the technical field of metagenomic bioinformatics analysis and environmental antibiotic resistance monitoring. It relates in particular to a method for source-tracking analysis and gene homology evaluation of antibiotic resistance genes across different ecological environments, based on large metagenomic sequencing datasets and a machine learning algorithm. The method resolves the cross-habitat propagation relationships of antibiotic resistance genes among different ecological environments, identifies their potential sources, and provides technical support for source control, propagation risk evaluation and environmental management of antibiotic resistance pollution.

Background

Heavy use and long-term release of antibiotics have led to the extensive proliferation and spread of antibiotic-resistant bacteria (ARB) and antibiotic resistance genes (ARGs) in the environment, making antibiotic resistance a serious global public health problem. ARGs have been widely detected in a variety of ecological environments such as soil, water, sediment, sewage treatment systems, and human-related environments. The persistent accumulation and diffusion of ARGs in the environment can change microbial community structure and ecological function, and ARGs carried by bacteria can also enter human activity environments through food chains, the biosphere and other routes, posing potential threats to public health and ecological safety.
Therefore, systematically identifying the cross-habitat transmission characteristics of antibiotic resistance genes in different ecological environments, the connectivity of resistance genes among habitats, and potential pollution sources is an important open problem in current environmental resistance research and pollution prevention and control.

With the development of high-throughput sequencing and metagenomics, detecting and annotating antibiotic resistance genes in the environment from large metagenomic datasets has become a common research approach. Existing studies align reads against resistance gene databases and statistically analyze the types and abundances of antibiotic resistance genes in different samples or habitats, revealing differences in composition and trends in abundance across environments. However, such analyses focus mainly on abundance differences between individual habitats or samples, and cannot readily answer whether a potential transmission association exists between different ecological environments, or what role each habitat plays in the formation of resistance contamination.

To explore resistance gene transmission between environments, some studies have introduced co-occurrence or network analysis to correlate the potential hosts of resistance genes across samples or environments. These methods reveal, to some extent, the correlations between resistance gene hosts and environmental factors, but their results usually rest on statistical correlation, so genuine transmission relationships between environments are difficult to distinguish from chance co-occurrence.
In addition, correlation analysis often cannot directly reflect the potential source relationships of resistance genes in different ecological environments, and so provides little direct evidence for identifying the sources of resistance pollution.

On the other hand, for source analysis of microbial communities or functional genes, studies have proposed using source-tracking models to estimate the source composition of microorganisms or functional genes in a target environment. Such methods can, to some extent, quantify the contribution of different sources to the target environment, and offer a new technical approach to understanding material or information inputs in complex environmental systems. However, existing source-tracking analyses mostly focus on the microbial community level, or estimate source contributions from abundance information alone. They lack an analytical framework that incorporates the specific sequence features of resistance genes and their host information, and they have difficulty revealing the intrinsic biological mechanisms by which resistance genes propagate between different ecological environments. Thus, there is a need for a tool that can accurately and rapidly identify contamination transmission characteristics of antibiotic resistance gen