CN-121983139-A - Method for evaluating bacterial antibiotic drug resistance risk in environmental sample based on machine learning
Abstract
The invention discloses a machine learning-based method for evaluating bacterial antibiotic resistance risk in environmental samples, applicable to multiple environmental media such as air, water, and soil. The method comprises: 1) identifying and quantifying high-risk bacterial antibiotic resistance genes in environmental samples; 2) identifying the key environmental impact factors of the high-risk antibiotic resistance genes using a machine learning model and SHAP interpretability analysis; and 3) establishing a comprehensive antibiotic resistance risk assessment model and calculating the bacterial resistance risk value of each environmental sample. The method provides an interpretable, quantifiable comprehensive assessment of bacterial resistance risk that is comparable across samples, and offers important technical support for the risk prevention and control of the spread of bacterial resistance pollution in the environment.
Inventors
- MA LIPING
- REN QINGPING
- LIU HUAFENG
Assignees
- East China Normal University (华东师范大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-29
Claims (4)
- 1. A machine learning-based method for evaluating bacterial antibiotic resistance risk in an environmental sample, characterized by comprising the following steps:
  Step 1, acquiring second-generation sequencing metagenomic data of the environmental sample, and performing quality control to obtain a metagenomic dataset;
  Step 2, identifying high-risk antibiotic resistance genes in the environmental sample, comprising:
  2-1, assembling the metagenomic dataset obtained in step 1 with the assembly module of the metaWRAP software to obtain the contig sequences of each sample, and predicting the open reading frames of the contig sequences with the Prodigal software to obtain open reading frame sequences;
  2-2, aligning the open reading frame sequences obtained in step 2-1 against the SARG antibiotic resistance gene database to obtain the contig sequences carrying antibiotic resistance genes;
  2-3, aligning the open reading frame sequences obtained in step 2-1 against the mobileOG-db mobile genetic element database to obtain the contig sequences carrying mobile genetic elements, defining an antibiotic resistance gene as mobile when it and a mobile genetic element are located on the same contig at a distance of less than 5 kb, and screening according to this definition to obtain a list of mobile antibiotic resistance genes;
  2-4, performing species annotation of the antibiotic-resistance-gene-carrying contig sequences obtained in step 2-2 by aligning them against the GTDB database, defining the host of an antibiotic resistance gene as a pathogen when its species annotation appears in a public pathogen list containing 1,005 clinically relevant species, and screening according to this definition to obtain the contig sequences and the list of antibiotic resistance genes whose hosts are pathogens;
  2-5, defining an antibiotic resistance gene that is mobile and whose host is a pathogen as a high-risk antibiotic resistance gene, and, according to this definition, taking the intersection of the list of mobile antibiotic resistance genes obtained in step 2-3 and the list of pathogen-hosted antibiotic resistance genes obtained in step 2-4 to obtain the types of high-risk antibiotic resistance genes in the environmental sample;
  Step 3, identifying and quantifying antibiotic resistance genes in the metagenomic dataset obtained in step 1 using the ARGs-OAP pipeline with the parameters e-value ≤ 10^-7, similarity ≥ 80%, coverage ≥ 75%, and minimum alignment length ≥ 25 amino acids, and screening according to the types of high-risk antibiotic resistance genes obtained in step 2 to obtain the types and relative abundances of the high-risk antibiotic resistance genes in the environmental sample;
  Step 4, acquiring environmental parameter index data corresponding to the environmental sample of step 1;
  Step 5, constructing an input dataset for building machine learning models from the types and relative abundances of the high-risk antibiotic resistance genes obtained in step 3 and the environmental parameter index data obtained in step 4, and applying missing value processing and normalization to obtain a processed dataset;
  Step 6, dividing the processed dataset obtained in step 5 into a training set (80% of the processed dataset) and a test set (20% of the processed dataset), training at least two machine learning models on the training set, and obtaining a trained optimal candidate model through cross-validation, hyperparameter optimization, and feature selection;
  Step 7, performing SHAP interpretability analysis on the optimal model obtained in step 6, outputting a ranking of the environmental impact factors of the high-risk antibiotic resistance genes, and selecting the top three environmental impact factors in the ranking;
  Step 8, building a resistome risk assessment model of the following formula from three factors, namely antibiotic resistance, mobility, and host pathogenicity, counting the respective numbers of contig sequences from step 2 with a custom Python script, and calculating the bacterial antibiotic resistance risk value of the environmental sample according to the formula: [formula missing from the source text]; wherein Q_Struct represents the bacterial antibiotic resistance risk value; N_Contig represents the total number of contig sequences in the sample; N_ARG represents the number of contig sequences carrying only an antibiotic resistance gene; N_ARG,MGE represents the number of contig sequences carrying both an antibiotic resistance gene and a mobile genetic element, i.e., with the antibiotic resistance gene and the mobile genetic element physically co-located on the same contig at a distance of less than 5 kb; N_ARG,MGE,PAT represents the number of contig sequences whose host is a pathogen and which carry both an antibiotic resistance gene and a mobile genetic element; and N_ARG,PAT represents the number of contig sequences whose host is a pathogen and which carry an antibiotic resistance gene;
  Step 9, obtaining a quantifiable and interpretable comprehensive antibiotic resistance risk assessment result for the environmental sample from the bacterial antibiotic resistance risk value obtained in step 8 and the top three environmental impact factors obtained in step 7.
- 2. The method of claim 1, wherein the environmental sample of step 1 is one of an air, water, soil, or sludge medium.
- 3. The method of claim 1, wherein the environmental parameter index data corresponding to the environmental sample in step 4 include physicochemical indexes of the sample, meteorological and hydrological parameter indexes of the sampling location, and socioeconomic and public service indexes of the sampling location, wherein the physicochemical indexes include one or more of temperature, pH, salinity, dissolved oxygen, total nitrogen, total phosphorus, chemical oxygen demand, and biochemical oxygen demand; the meteorological and hydrological parameter indexes include one or more of rainfall, barometric pressure, wind speed, particulate matter concentration, runoff, water depth, and hydraulic retention time; and the socioeconomic and public service indexes include one or more of gross domestic product, population density, labor force, immigration rate, employment-to-population ratio, trade rate, aquaculture production capacity, fertilizer consumption, current health expenditure per capita, number of hospital beds, and antibiotic usage.
- 4. The method of claim 1, wherein the machine learning models of step 6 are at least two selected from a random forest model, an extremely randomized trees (ExtraTrees) model, an extreme gradient boosting (XGBoost) model, and a light gradient boosting machine (LightGBM) model.
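Steps 2-3 to 2-5 of claim 1 (mobility screening, pathogen-host screening, and their intersection) can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the inventors' script: the `Feature` record, the function names, and the gene, contig, and species names are all hypothetical, the two-species pathogen set is a toy stand-in for the public list of 1,005 clinically relevant species, and measuring distance as the gap between the nearest ends of the two features is an assumption, since the claims only state "less than 5 kb".

```python
from dataclasses import dataclass

MAX_DISTANCE_BP = 5_000  # "less than 5 kb" criterion from step 2-3


@dataclass(frozen=True)
class Feature:
    """A gene hit (ARG or MGE) located on a contig (hypothetical record layout)."""
    contig: str
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def gap(a: Feature, b: Feature) -> int:
    """Base pairs between two features; 0 if they overlap (assumed distance measure)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def mobile_args(args, mges):
    """Step 2-3: ARGs with an MGE on the same contig closer than 5 kb."""
    return {
        arg.name
        for arg in args
        if any(arg.contig == mge.contig and gap(arg, mge) < MAX_DISTANCE_BP
               for mge in mges)
    }


def pathogen_hosted_args(args, contig_species, pathogens):
    """Step 2-4: ARGs whose contig is annotated as a listed pathogen species."""
    return {arg.name for arg in args if contig_species.get(arg.contig) in pathogens}


def high_risk_args(args, mges, contig_species, pathogens):
    """Step 2-5: intersection of the mobile and pathogen-hosted ARG lists."""
    return mobile_args(args, mges) & pathogen_hosted_args(args, contig_species, pathogens)


# Toy input (all values invented for illustration).
args = [
    Feature("c1", "tetM", 100, 2000),
    Feature("c2", "sul1", 500, 1300),
    Feature("c3", "blaTEM", 100, 900),
]
mges = [
    Feature("c1", "tnpA", 4000, 5000),    # 2,000 bp from tetM: mobile
    Feature("c2", "intI1", 9000, 10000),  # 7,700 bp from sul1: too far
]
contig_species = {
    "c1": "Escherichia coli",
    "c2": "Klebsiella pneumoniae",
    "c3": "Bacillus subtilis",
}
pathogens = {"Escherichia coli", "Klebsiella pneumoniae"}

high_risk = high_risk_args(args, mges, contig_species, pathogens)
print(sorted(high_risk))  # ['tetM']
```

In this toy input only `tetM` satisfies both criteria: `sul1` sits on a pathogen-annotated contig but has no mobile genetic element within 5 kb, and `blaTEM` has neither. Taking the intersection at the gene-type level follows the wording of step 2-5; a per-contig intersection would be a different design choice.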
Description
Method for evaluating bacterial antibiotic drug resistance risk in environmental sample based on machine learning
Technical Field
The invention relates to the technical fields of environmental engineering, bioinformatics, and environmental health risk assessment, and in particular to a method for comprehensively assessing bacterial antibiotic resistance risk in an environmental sample by combining metagenomic analysis with a machine learning model. The method is suitable for identifying antibiotic resistance risk, analyzing its driving factors, and quantitatively characterizing risk in samples of environmental media such as air, water, or soil.
Background
Antimicrobial resistance (AMR) has evolved from a purely clinical treatment problem into a typical emerging environmental contaminant and environmental health risk problem. Antibiotics and their metabolites, disinfectants, heavy metals, and other environmental stressors can exert continuous selective pressure in sewage treatment plants, livestock and aquaculture discharges, hospital and pharmaceutical industry discharges, urban runoff, and polluted soil or sediment, promoting the enrichment and spread of antibiotic resistance genes (ARGs) in environmental microbial communities. Environmental ARGs can be transmitted between different microorganisms through horizontal gene transfer mediated by mobile genetic elements and, under specific conditions, can become associated with pathogenic hosts, thereby posing a potential risk to population health. Meanwhile, host bacteria carrying antibiotic resistance genes in environmental media can enter human exposure pathways via aerosols, water, food chains, soil contact, and other routes, creating serious risks to human health.
Therefore, antibiotic resistance risk assessment of environmental samples must attend not only to the occurrence and abundance of resistance genes but also to mobility, host pathogenicity, and environmental impact factors, so that risks can be identified accurately and then prevented and controlled. Existing practice in environmental antibiotic resistance risk assessment commonly suffers from three problems: (1) a single risk identification dimension: some methods characterize risk mainly by ARG detection and abundance, and struggle to account for key dimensions more directly related to human infection and health risk, such as mobility and mediation by pathogenic hosts; (2) insufficient capability for driver analysis: environmental physicochemical indexes, meteorological and hydrological parameters, socioeconomic parameters, and similar factors may strongly influence resistance risk, but traditional statistics or empirical judgment have difficulty handling multi-source heterogeneous factors and nonlinear relationships, and rarely yield a stable ranking of impact factors; and (3) the lack of an interpretable and comparable comprehensive quantification framework, risk standardization method, and technical workflow that incorporate these factors. Thus, there is a need for a comprehensive risk assessment technique, suited to large metagenomic datasets and machine learning models, that identifies high-risk antibiotic resistance genes, explains the key environmental drivers, and yields results that are interpretable and comparable across environmental samples.
Disclosure of Invention
The invention aims to provide a machine learning-based method for evaluating bacterial antibiotic resistance risk in an environmental sample, so as to overcome the limitation of the prior art, in which environmental resistance risk is assessed one-sidedly from the occurrence of bacterial resistance genes alone. The invention integrates multidimensional risk attributes such as the mobility of resistance genes and host pathogenicity, and, based on machine learning models, analyzes the nonlinear influence of multiple environmental impact factors and produces a comprehensive quantitative assessment of the antibiotic resistance risk of a sample. The specific technical scheme for realizing the aim of the invention is as follows: a machine learning-based method for evaluating bacterial antibiotic resistance risk in an environmental sample, comprising the following steps:
Step 1, acquiring second-generation sequencing metagenomic data of the environmental sample, and performing quality control to obtain a metagenomic dataset;
Step 2, identifying high-risk antibiotic resistance genes in the environmental sample, comprising:
2-1, assembling the metagenomic dataset obtained in step 1 with the assembly module of the metaWRAP software to obtain the contig sequences of each sample, and predicting the open reading frames of the contig sequences with the Prodigal software to obtain an open reading frame sequenc