CN-121765490-B - Sewage high-risk pollutant screening and identification method based on fragmentation tree pre-training
Abstract
The invention discloses a method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training. The method comprises: constructing a fragmentation tree pre-training data set from secondary mass spectrum data of known high-risk pollutants; performing self-supervised pre-training on a graph neural network encoder to obtain a fragmentation tree encoder; acquiring secondary mass spectrum data of compounds suspected to be related to high-risk pollutants in a sewage sample to be tested, and constructing a fragmentation tree set; constructing a high-risk pollutant screening model based on the fragmentation tree encoder, and screening the fragmentation tree set to obtain a high-risk candidate fragmentation tree set; generating a candidate molecule set for each high-risk candidate fragmentation tree in the high-risk candidate fragmentation tree set; and identifying, from the candidate molecule set, the target molecular structure and matching score of each high-risk candidate fragmentation tree. The invention enables rapid screening and prioritization of high-risk pollutants in sewage and, on the basis of the screening, further provides candidate outputs for structure identification.
Inventors
- WU GANG
- FENG SHUYANG
- Lou Jingya
- ZHANG XUXIANG
- SHI XIAOPING
- Ji Minxin
- YING MING
- YAN XU
Assignees
- Nanjing University (南京大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260303
Claims (10)
- 1. A method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training, characterized by comprising the following steps: S1, constructing a fragmentation tree pre-training data set from secondary mass spectrum data of known high-risk pollutants; S2, performing self-supervised pre-training on a graph neural network encoder using the fragmentation tree pre-training data set to obtain a fragmentation tree encoder; S3, acquiring secondary mass spectrum data of compounds suspected to be related to high-risk pollutants in a sewage sample to be tested, and constructing a fragmentation tree set; S4, constructing a high-risk pollutant screening model based on the fragmentation tree encoder, so as to screen the fragmentation tree set for high-risk pollutant signals and obtain a high-risk candidate fragmentation tree set; S5, generating a candidate molecule set for each high-risk candidate fragmentation tree in the high-risk candidate fragmentation tree set; and S6, identifying, from the candidate molecule set, the target molecular structure and matching score of each high-risk candidate fragmentation tree.
- 2. The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training according to claim 1, characterized in that the secondary mass spectrum data comprise parent ion information of the high-risk pollutants and fragment ion spectrogram data corresponding to the parent ions; constructing the fragmentation tree pre-training data set further comprises spectrogram preprocessing of the secondary mass spectrum data; and the spectrogram preprocessing comprises: identifying discrete peaks in the fragment ion spectrogram data of the secondary mass spectrum data to obtain a fragment ion peak list; removing noise peaks and low-reliability peaks from the fragment ion peak list; labeling, and merging or rejecting, isotope peaks in the fragment ion peak list and/or peaks near the parent ion of the secondary mass spectrum data; and performing adduct identification based on the parent ion information.
- 3. The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training according to claim 2, wherein constructing the fragmentation tree pre-training data set comprises: for each secondary mass spectrum, taking the parent ion corresponding to the parent ion peak as a root node, and taking each fragment ion peak in the fragment ion peak list as a candidate node to form a candidate node set; calculating candidate neutral losses between the root node and the candidate nodes, and between pairs of candidate nodes, from the mass difference between the ion mass of the parent node and the ion mass of the child node, and associating each parent node, child node and neutral loss to generate a candidate edge set, wherein each candidate edge carries a neutral loss mass value Δm; subject to a mass conservation constraint, a parent ion and fragment relation constraint and an elemental composition plausibility constraint, scoring the credibility of the candidate nodes and candidate edges with a scoring function, and searching for or optimizing a tree structure by selecting a group of edges from the candidate edges so as to form a directed acyclic tree rooted at the root node that optimizes the total score or meets a preset optimality criterion, thereby obtaining the fragmentation tree corresponding to the secondary mass spectrum data; generating a graph structure representation of the fragmentation tree, and collecting the graph structure representations generated from all secondary mass spectrum data to form the fragmentation tree pre-training data set; wherein the graph structure representation comprises a node set with the parent ion and fragment ions as nodes, an edge set with neutral losses as directed edges, node features and edge features, the node features comprising fragment ion mass-to-charge ratio m/z, peak intensity, mass deviation and/or fragmentation score, and the edge features comprising neutral loss mass, loss composition information and loss score.
- 4. The method of claim 1, wherein the fragmentation tree encoder is configured to map the graph structure of a fragmentation tree to a fragmentation-tree-level representation vector and is obtained by self-supervised pre-training of the graph neural network encoder; during the self-supervised pre-training, the graph neural network encoder is provided with one or more task heads that serve as the output layer structures of the self-supervised pre-training tasks; the self-supervised pre-training tasks include a node mask reconstruction task, an edge mask reconstruction task, a contrastive learning task, and a spectrogram and fragmentation tree consistency constraint task; and the parameters of the graph neural network encoder are updated by minimizing a total pre-training loss function.
- 5. The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training according to claim 4, wherein the total pre-training loss function is a weighted sum of the losses of the self-supervised pre-training tasks, expressed as: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{node} + \lambda_2 \mathcal{L}_{edge} + \lambda_3 \mathcal{L}_{con} + \lambda_4 \mathcal{L}_{cons}$, wherein $\mathcal{L}_{total}$ is the total pre-training loss function, $\mathcal{L}_{node}$ is the node mask reconstruction loss, $\mathcal{L}_{edge}$ is the edge mask reconstruction loss, $\mathcal{L}_{con}$ is the contrastive learning loss, and $\mathcal{L}_{cons}$ is the spectrogram and fragmentation tree consistency loss; and $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the weight coefficients of the node mask reconstruction loss, the edge mask reconstruction loss, the contrastive learning loss, and the spectrogram and fragmentation tree consistency loss, respectively.
- 6. The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training according to claim 1, wherein the screening model includes the fragmentation tree encoder and a screening head connected to the output of the fragmentation tree encoder, the screening head being a classification head that maps the graph-level representation vector of a fragmentation tree output by the fragmentation tree encoder to a screening probability score.
- 7. The method of claim 6, wherein threshold screening or Top-N screening is performed on the screening probability scores to obtain the high-risk candidate fragmentation tree set, the Top-N screening consisting of sorting all fragmentation trees within the same batch, the same sampling point or the same chromatographic time window by screening probability score from high to low and keeping the top N fragmentation trees.
- 8. The method of claim 6, wherein the screening model is obtained by supervised fine-tuning, in which the fragmentation tree encoder and the screening head are trained on a labeled fragmentation tree data set while the network structure of the fragmentation tree encoder is kept unchanged, so that the screening probability score distinguishes between "related to high-risk pollutants" and "not related to high-risk pollutants".
- 9. The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training, characterized in that generating the candidate molecule set of a high-risk candidate fragmentation tree comprises: converting the parent ion mass of the high-risk candidate fragmentation tree to obtain the neutral mass M of the compound corresponding to that fragmentation tree; searching a structure candidate library within an error window around the neutral mass M; and retaining the candidate molecules that satisfy the error window constraint as the candidate molecule set, so that one fragmentation tree corresponds to a plurality of candidate molecules.
- 10. The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training according to claim 1, wherein identifying, from the candidate molecule set, the target molecular structure and matching score corresponding to a high-risk candidate fragmentation tree comprises: constructing a structure identification model comprising a fragmentation tree encoding branch and a molecular structure encoding branch, the fragmentation tree encoding branch mapping the fragmentation tree graph structure to a fragmentation tree embedding vector, and the molecular structure encoding branch mapping a candidate molecular graph to a molecular embedding vector; and calculating, according to a matching function, the matching score between a given candidate fragmentation tree and each of its corresponding candidate molecules, sorting the candidate molecules by matching score from high to low, and selecting a preset number of top-ranked candidate molecular structures as the target molecular structures of the high-risk candidate fragmentation tree.
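The arithmetic behind claims 7, 9 and 10 can be illustrated with a minimal Python sketch. Several choices here are assumptions that the claims do not fix: the parent ion is taken to be a singly protonated [M+H]+ adduct, the error window is expressed in ppm, and cosine similarity stands in for the unspecified matching function. The names `Candidate`, `threshold_screen`, `top_n_screen`, `neutral_mass`, `search_library` and `rank_candidates` are hypothetical, and the embeddings are plain lists standing in for the outputs of the two encoding branches.

```python
import math
from dataclasses import dataclass

# Assumption: singly protonated [M+H]+ parent ions; other adducts need other shifts.
PROTON_MASS = 1.007276  # Da


@dataclass
class Candidate:
    name: str               # hypothetical library identifier
    neutral_mass: float     # monoisotopic neutral mass in Da
    embedding: list[float]  # stand-in for the molecular-branch embedding


def threshold_screen(scores: dict[str, float], threshold: float) -> list[str]:
    """Claim 7, option 1: keep every tree whose score reaches the threshold."""
    return [tree_id for tree_id, score in scores.items() if score >= threshold]


def top_n_screen(scores: dict[str, float], n: int) -> list[str]:
    """Claim 7, option 2: sort one batch or time window by score, keep the top n."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def neutral_mass(parent_mz: float) -> float:
    """Claim 9: convert the parent ion m/z to the neutral mass M (assumes [M+H]+)."""
    return parent_mz - PROTON_MASS


def search_library(library: list[Candidate], mass: float, ppm: float = 5.0) -> list[Candidate]:
    """Claim 9: keep library entries whose mass lies inside the ppm error window."""
    tolerance = mass * ppm * 1e-6
    return [c for c in library if abs(c.neutral_mass - mass) <= tolerance]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, used here as an assumed matching function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(tree_embedding: list[float], candidates: list[Candidate], k: int):
    """Claim 10: score each candidate against the tree, sort high to low, keep k."""
    scored = [(c.name, cosine(tree_embedding, c.embedding)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In this sketch, a tree that survives `top_n_screen` would have its parent ion m/z passed through `neutral_mass`, the result used in `search_library`, and the hits ordered by `rank_candidates`. A learned matching function or a Da-based window could replace the assumed ones without changing the overall flow.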
Description
Sewage high-risk pollutant screening and identification method based on fragmentation tree pre-training
Technical Field
The invention relates to the technical field of environmental monitoring and analytical chemistry, and in particular to a method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training.
Background
Sewage often contains a variety of high-risk pollutants or substances of high concern, such as industrial auxiliaries and intermediates, pharmaceuticals and personal care products and their related substances, pesticides and their related substances, disinfection byproducts, surfactants and their derivatives, functional compounds containing halogen, sulfur, phosphorus and the like, and chemicals with risk characteristics such as biotoxicity, carcinogenicity, mutagenicity, teratogenicity and endocrine disruption. In real sewage samples the pollutants come from complex sources, are diverse in composition, span a wide concentration range and are strongly affected by matrix effects, so the secondary mass spectrum data acquired by LC-HRMS contain a large number of suspected high-risk-related signals that need to be distinguished and prioritized.
Current methods for identifying high-risk pollutants in sewage include target analysis based on reference standards and suspect screening based on standard spectral libraries. Target analysis usually takes a reference standard as the benchmark and confirms a signal by comparing its retention time and secondary spectrum with those of the standard. It offers a high degree of confirmation, but it depends on a preset target list and on the availability of standards, so it struggles to meet the needs of sewage applications, where high-risk pollutants are numerous and change quickly and where rapid investigation and dynamic expansion of the target list are required.
Suspect screening is usually based on a candidate list and a reference spectral library: candidates are screened by accurate-mass matching of the parent ion, comparison of characteristic fragments, and spectral similarity calculation between the signal to be tested and the candidate substances in the library. This approach depends strongly on the coverage and update speed of the reference spectral library. When the library is incomplete, when spectra differ across platforms or conditions, or when the sample matrix has a pronounced influence, it is prone to missed detections and misjudgments, or cannot stably give a reliable screening result. In addition, complex matrix interference, co-elution, ion suppression, multi-source aliasing and similar conditions are common in real sewage samples, so screening strategies that rely only on local fragment matching or spectral similarity are not robust enough. High-sensitivity screening and reliable prioritization of signals related to high-risk pollutants are therefore difficult to achieve, which in turn affects subsequent manual review, targeted confirmation and risk management decisions.
Disclosure of Invention
The invention aims to provide a method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training, which enables rapid screening and prioritization of high-risk pollutants in sewage and, on the basis of the screening, further provides candidate outputs for structure identification.
The method for screening and identifying high-risk pollutants in sewage based on fragmentation tree pre-training comprises the following steps:
S1, constructing a fragmentation tree pre-training data set from secondary mass spectrum data of known high-risk pollutants;
S2, performing self-supervised pre-training on a graph neural network encoder using the fragmentation tree pre-training data set to obtain a fragmentation tree encoder;
S3, acquiring secondary mass spectrum data of compounds suspected to be related to high-risk pollutants in a sewage sample to be tested, and constructing a fragmentation tree set;
S4, constructing a high-risk pollutant screening model based on the fragmentation tree encoder, so as to screen the fragmentation tree set for high-risk pollutant signals and obtain a high-risk candidate fragmentation tree set;
S5, generating a candidate molecule set for each high-risk candidate fragmentation tree in the high-risk candidate fragmentation tree set;
S6, identifying, from the candidate molecule set, the target molecular structure and matching score of each high-risk candidate fragmentation tree.
The method further comprises obtaining secondary mass spectrum data, obtaining a fragment ion peak list, removing noise peaks and low-reliability peaks from the fragment ion peak list, and labeling and merging or rejecting isotope peaks of the fragment ion peak list and/or nearby peak