CN-118398077-B - Method for identifying DNA virus in organism based on high-throughput sequencing data and application thereof
Abstract
The invention discloses a method for identifying DNA viruses in an organism based on high-throughput sequencing data. A biological sample derived from the organism is used as the sample to be tested; high-throughput sequencing is performed on the sample, sequencing data in FastQ file format are collected, and these data are analyzed to identify the DNA viruses in the sample. The method accurately identifies DNA viruses from the off-machine data of second-generation (high-throughput) sequencing without relying on metagenomic sequencing, avoids false positives, and greatly improves the utilization of sample sequencing data.
Inventors
- CHEN SHIFU
- HE JIANLI
- ZHOU YANQING
Assignees
- 深圳海普洛斯医学检验实验室
Dates
- Publication Date
- 20260512
- Application Date
- 20240401
Claims (10)
- 1. A method for identifying DNA viruses in an organism based on high-throughput sequencing data, characterized in that a biological sample derived from said organism is used as a sample to be tested, the method comprising the steps of:
  S1, performing high-throughput sequencing on the sample to be tested, collecting the sequencing data in FastQ file format from the off-machine sequencing data, and performing quality control on the sequencing data to obtain clean fq data;
  S2, acquiring sequence information of a reference genome of the organism from the NCBI database as an organism database; and acquiring genome sequence information of DNA viruses whose host is the organism from the NCBI database as a DNA virus database;
  S3, aligning the clean fq data obtained in step S1 to the organism database of step S2 to obtain the sequences that cannot be aligned to the organism database; aligning the clean fq data obtained in step S1 to the DNA virus database of step S2 to obtain the sequences aligned to the DNA virus database; removing, from the sequences that cannot be aligned to the organism database, those that are also present among the sequences aligned to the DNA virus database, and marking the remaining sequences as unmap data; the sequences aligned to the DNA virus database being marked as virus_map data;
  S4, removing sequences with a clip ratio of more than 10% from the virus_map data obtained in step S3 to obtain virus_map_rc data;
  S5, selecting 9000-11000 sequences from each of the unmap data obtained in step S3 and the virus_map_rc data obtained in step S4, and marking them as unmap_name data and virus_map_rc_name data, respectively; if the number of sequences in the unmap data obtained in step S3 or in the virus_map_rc data obtained in step S4 is less than 10000, all sequences are selected;
  S6, collecting the whole-genome sequence information of DNA viruses in the NCBI database as an alignment database; aligning the unmap_name data obtained in step S5 to the alignment database to obtain the number of high-quality aligned sequences of the DNA virus in the unmap_name data; aligning the virus_map_rc_name data obtained in step S5 to the alignment database to obtain the number of high-quality aligned sequences of the DNA virus in the virus_map_rc_name data; a high-quality alignment being one in which the fragment of a sequence in the unmap_name data and/or virus_map_rc_name data that aligns to the alignment database is at least 90% of the length of that sequence, and the sequence identity of the aligned fragment to the alignment database is greater than or equal to 90%;
  S7, if the number of high-quality aligned sequences of the DNA virus in the unmap_name data and/or in the virus_map_rc_name data obtained in step S6 meets the judgment condition, the sample to be tested contains the DNA virus; if it does not meet the judgment condition, the sample to be tested does not contain the DNA virus; the judgment condition being that the number of high-quality aligned sequences of the DNA virus in the unmap_name data obtained in step S6 is greater than or equal to 1 and/or the number of high-quality aligned sequences of the DNA virus in the virus_map_rc_name data is greater than or equal to 1.
- 2. The method of claim 1, wherein the number of reads of DNA virus in the test sample is no less than 2% of the total number of reads in the test sample.
- 3. The method of claim 1, wherein the high-throughput sequencing in step S1 has a sequencing depth greater than 30×.
- 4. The method according to claim 1, wherein the obtaining of the genomic sequence information of the DNA virus whose host is the organism in step S2 is specifically obtaining the genomic sequence information of the DNA virus having complete nucleotide information, whose host is the organism and whose genomic molecular type is unknown DNA in NCBI database.
- 5. The method of claim 1, wherein the alignment in step S3 is performed using the bwa mem software with default parameters.
- 6. The method according to claim 1, wherein the judgment condition in step S7 is that: the number of high-quality aligned sequences of the DNA virus in the unmap_name data obtained in step S6 is greater than or equal to 100 and accounts for at least 5% of the unmap_name data obtained in step S5, and/or the number of high-quality aligned sequences of the DNA virus in the virus_map_rc_name data is greater than or equal to 100 and accounts for at least 5% of the virus_map_rc_name data obtained in step S5; or the number of high-quality aligned sequences of the DNA virus in the unmap_name data obtained in step S6 is greater than or equal to 10 and accounts for at least 0.5% of the unmap_name data obtained in step S5, and/or the number of high-quality aligned sequences of the DNA virus in the virus_map_rc_name data is greater than or equal to 10 and accounts for at least 0.5% of the virus_map_rc_name data obtained in step S5.
- 7. A device for identifying DNA viruses in an organism based on high-throughput sequencing data, the device comprising a data acquisition component, an identification component, and a result output component; the data acquisition component is used for acquiring the sequencing data in FastQ file format from the off-machine high-throughput sequencing data of a sample to be tested, wherein the sample to be tested is a biological sample derived from the organism; the identification component takes the sequencing data acquired by the data acquisition component as input data and executes the method according to any one of claims 1-6 to obtain an identification result for DNA viruses in the organism; the result output component is used for outputting the identification result obtained by the identification component.
- 8. The device of claim 7, wherein the sample to be tested is a biological sample derived from a tumor patient.
- 9. Use of the method according to any one of claims 1 to 6 and/or the device according to any one of claims 7 to 8 for identifying DNA viruses in organisms.
- 10. The use according to claim 9, wherein the DNA virus is HPV virus, HBV virus, EBV virus, betapolyomavirus hominis BK virus and/or HHV-8 virus.
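For illustration only, the filtering, sampling, and judgment logic of steps S4 to S7 can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not define "clip ratio" (taken here as clipped bases divided by read length, read from a SAM CIGAR string), do not say how the 9000-11000 sequences are chosen (random sampling of 10000 is assumed), and all function names and input layouts are hypothetical.

```python
import random
import re

# CIGAR strings (SAM format) are runs of <length><operation>.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_ratio(cigar: str) -> float:
    """Fraction of a read's bases that are soft- (S) or hard- (H) clipped.

    ASSUMPTION: the claims do not define "clip ratio"; this sketch uses
    clipped bases divided by the full read length.
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    clipped = sum(length for length, op in ops if op in "SH")
    # M, I, S, = and X consume read bases; H bases are absent from the
    # record but still belong to the original read.
    read_len = sum(length for length, op in ops if op in "MIS=XH")
    return clipped / read_len if read_len else 0.0


def drop_clipped(reads, max_clip=0.10):
    """Step S4: remove reads whose clip ratio is more than max_clip.

    `reads` is an iterable of (read_name, cigar) pairs (hypothetical layout).
    """
    return [(name, cigar) for name, cigar in reads
            if clip_ratio(cigar) <= max_clip]


def subsample(names, n=10_000, seed=0):
    """Step S5: take n read names, or all of them if there are fewer than n.

    ASSUMPTION: random selection; the claims only give the 9000-11000 range.
    """
    names = list(names)
    if len(names) < n:
        return names
    return random.Random(seed).sample(names, n)


def is_high_quality(aligned_len, read_len, identity_pct):
    """Step S6: aligned fragment is >= 90% of the read, identity >= 90%."""
    return aligned_len * 10 >= read_len * 9 and identity_pct >= 90.0


def contains_virus(hq_count, sampled_total, min_count=1, min_fraction=0.0):
    """Step S7: apply the judgment condition to one sampled set.

    Defaults follow claim 1 (count >= 1). Claim 6 corresponds to
    (min_count=100, min_fraction=0.05) or (min_count=10, min_fraction=0.005).
    """
    if sampled_total == 0:
        return False
    return hq_count >= min_count and hq_count / sampled_total >= min_fraction
```

Because the claims join the unmap_name and virus_map_rc_name conditions with "and/or", a caller would evaluate `contains_virus` once per sampled set and combine the two results with `or`.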
Description
Method for identifying DNA virus in organism based on high-throughput sequencing data and application thereof
Technical Field
The invention relates to the technical field of biomedicine, in particular to a method for identifying DNA viruses in organisms based on high-throughput sequencing data and application thereof.
Background
DNA viruses are a group of viruses whose genetic material is DNA; they are widely distributed in humans, vertebrates, insects, plants, and microorganisms. According to the latest ICTV international virus classification standards, DNA viruses are classified by host type into vertebrate DNA viruses, plant DNA viruses, invertebrate DNA viruses, DNA viruses of prokaryotic microorganisms (bacteria and archaea), and DNA viruses of eukaryotic microorganisms (algae, fungi, and protozoa); by nucleic acid type, they are classified into double-stranded DNA viruses (dsDNA) and single-stranded DNA viruses (ssDNA). Some DNA viruses are pathogenic and tumorigenic in humans, are closely related to human health, and can be transmitted from mother to infant. Identifying the specific DNA virus types and subtypes carried by the human body therefore helps to effectively prevent the diseases caused by DNA viruses.
DNA viruses are one kind of microorganism. When the specific virus types contained in a sample are unknown, metagenomic second-generation sequencing (mNGS) is often used to detect and identify the viruses in the sample. mNGS does not depend on traditional microbial culture: it extracts the nucleic acids of all microorganisms in the sample without bias for high-throughput sequencing, takes the microbial community genome in the sample as the research object, and, combined with bioinformatic analysis, removes host information and then aligns the remaining data to a pathogen database to obtain the species information of the pathogenic microorganisms. However, interpretation of mNGS reports is currently often subjective, standards for interpreting mNGS are limited, unified clinical standards for sequence thresholds and for evaluating sensitivity and specificity are lacking, and metagenomic sequencing is expensive and time-consuming. Second-generation sequencing (high-throughput sequencing) is generally used for deep sequencing of the genome of a single organism in a sample and is widely applied to somatic mutation detection and differential gene expression detection; the data are generally put to no use other than genetic variation analysis, so sequencing data are wasted. Thus, there is a great need for a method that enables DNA virus identification using the off-machine data of second-generation sequencing (high-throughput sequencing).
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a method for identifying DNA viruses in organisms based on high-throughput sequencing data and application thereof.
It is a first object of the present invention to provide a method for identifying DNA viruses in organisms based on high-throughput sequencing data.
A second object of the present invention is to provide a device for identifying DNA viruses in organisms based on high-throughput sequencing data.
It is a third object of the present invention to provide the use of the above method and/or device for identifying DNA viruses in organisms.
In order to achieve the above objects, the present invention is realized by the following technical solutions:
A method for identifying DNA viruses in an organism based on high-throughput sequencing data, using a biological sample derived from said organism as the sample to be tested, comprising the steps of:
S1, performing high-throughput sequencing on the sample to be tested, collecting the sequencing data in FastQ file format from the off-machine sequencing data, and performing quality control on the sequencing data to obtain clean fq data;
S2, acquiring sequence information of a reference genome of the organism from the NCBI database as an organism database; acquiring genome sequence information of DNA viruses whose host is the organism from the NCBI database as a DNA virus database;
S3, comparing the clean fq data obtained in the step S1 with the biological database shown in the step S2 to obtain a sequence which cannot be compared with the biological database, comparing the clean fq data obtained in the step S1 with the DNA virus database shown in the step S2 to obtain a sequence of the DNA virus database shown in the comparison, removing the sequence which is simultaneously present in the sequence which cannot be compared with the biological database and the sequence which is simultaneously present in the comparison, and marking the sequence which is simultaneously present in the comparison and the biologi