CN-120977392-B - AI-based virus-host RNA sequence classification method and device
Abstract
The invention discloses an AI-based virus-host RNA sequence classification method and device, relating to the field of biological detection. The method comprises: mapping preprocessed short-read RNA sequences to a host genome twice; assembling the short-read RNA sequences that are not mapped to the host genome into continuous RNA sequences, and screening out the continuous RNA sequences longer than 1000 bp; and performing AI classification on the RNA sequences longer than 1000 bp to obtain the viral RNA sequences. By combining host filtering, rapid assembly and AI classification, the invention significantly reduces the amount of computation, eases the load on hardware, achieves efficient and accurate virus sequence classification, and can rapidly distinguish unknown viruses.
Inventors
- JIA LEI
- XIAO FUGUI
- JIA NAN
- GUO YU
Assignees
- 北京领为科技发展有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20250807
Claims (4)
- 1. An AI-based virus-host RNA sequence classification method, comprising: mapping all short-read RNA sequences to a host genome using a first sequence comparison tool, and taking the short-read RNA sequences that fail to map as first filtered sequences; re-mapping the first filtered sequences to the host genome using a second sequence comparison tool, and taking the first filtered sequences that fail to map as second filtered sequences, wherein the second sequence comparison tool has higher comparison precision than the first sequence comparison tool, and the first sequence comparison tool has higher comparison speed than the second sequence comparison tool; assembling the fragmented second filtered sequences into continuous RNA sequences, and screening out the continuous RNA sequences longer than 1000 bp; and performing AI classification on the screened continuous RNA sequences to obtain viral RNA sequences; wherein the method further comprises performing adapter removal and quality control filtering on the short-read RNA sequences; the quality control filtering comprises: calculating the Phred score Q of each base in a short-read RNA sequence as $Q = -10 \log_{10} P$, where P is the probability that the base was called incorrectly; calculating the average Phred score of the whole short-read RNA sequence from the Phred scores of its bases; and filtering out and discarding short-read RNA sequences whose average Phred score is lower than a first preset threshold and short-read RNA sequences whose number of unknown bases is higher than a second preset threshold; and wherein performing AI classification on the screened continuous RNA sequences to obtain viral RNA sequences comprises: one-hot encoding the input continuous RNA sequence to convert the character string data into a numerical matrix; inputting the one-hot encoded continuous RNA sequence into an LSTM layer for feature extraction and outputting a feature vector; inputting the feature vector into a fully connected layer and extracting high-level sequence features in combination with an activation function; and calculating, based on the high-level sequence features, a virus probability value of the current continuous RNA sequence through a softmax function, and determining that the current continuous RNA sequence is a viral RNA sequence if the virus probability value is greater than a third preset threshold.
- 2. An AI-based virus-host RNA sequence classification device, comprising: a first mapping module, configured to map all short-read RNA sequences to a host genome using a first sequence comparison tool, and to take the short-read RNA sequences that fail to map as first filtered sequences; a second mapping module, configured to re-map the first filtered sequences to the host genome using a second sequence comparison tool, and to take the first filtered sequences that fail to map as second filtered sequences, wherein the second sequence comparison tool has higher comparison precision than the first sequence comparison tool, and the first sequence comparison tool has higher comparison speed than the second sequence comparison tool; an assembly module, configured to assemble the fragmented second filtered sequences into continuous RNA sequences, and to screen out the continuous RNA sequences longer than 1000 bp; and a classification module, configured to perform AI classification on the screened continuous RNA sequences to obtain viral RNA sequences; wherein the device further comprises an adapter-removal and quality control module, a detection module and a control module, the adapter-removal and quality control module being configured to perform adapter removal and quality control filtering on the short-read RNA sequences; the quality control filtering comprises: calculating the Phred score Q of each base in a short-read RNA sequence as $Q = -10 \log_{10} P$, where P is the probability that the base was called incorrectly; calculating the average Phred score of the whole short-read RNA sequence from the Phred scores of its bases; and filtering out and discarding short-read RNA sequences whose average Phred score is lower than a first preset threshold and short-read RNA sequences whose number of unknown bases is higher than a second preset threshold; and wherein the classification module comprises: a one-hot encoding module, configured to one-hot encode the input continuous RNA sequence and convert the character string data into a numerical matrix; an LSTM module, configured to input the one-hot encoded continuous RNA sequence into an LSTM layer for feature extraction and to output a feature vector; a fully connected and activation module, configured to input the feature vector into a fully connected layer and to extract high-level sequence features in combination with an activation function; and a classification output module, configured to calculate, based on the high-level sequence features, a virus probability value of the current continuous RNA sequence through a softmax function, and to determine that the current continuous RNA sequence is a viral RNA sequence if the virus probability value is greater than a third preset threshold.
- 3. An electronic device, comprising: one or more processors; and one or more storage devices for storing a computer program; wherein the computer program, when executed by the one or more processors, causes the one or more processors to implement the AI-based virus-host RNA sequence classification method as claimed in claim 1.
- 4. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements an AI-based virus-host RNA sequence classification method according to claim 1.
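As an illustration of the classification step recited in claim 1, the following minimal Python sketch shows only its two ends: one-hot encoding of a sequence into a numerical matrix, and the softmax decision against the third preset threshold. The LSTM and fully connected layers are deliberately omitted, so the raw scores passed to `is_viral` stand in for their output. The column order `ACGU`, the all-zero row for unknown bases, the two-class `[host, virus]` score layout, and the default threshold of 0.5 are assumptions made for this example and are not specified in the claims.

```python
import math

# Column order of the one-hot matrix; this ordering is an assumption.
ALPHABET = "ACGU"


def one_hot_encode(seq):
    """Convert an RNA string into a len(seq) x 4 matrix of 0.0/1.0 values."""
    matrix = []
    # Accept lower-case input and DNA-style T by normalising to upper-case U.
    for base in seq.upper().replace("T", "U"):
        # Unknown bases such as N become an all-zero row (an assumption).
        matrix.append([1.0 if base == symbol else 0.0 for symbol in ALPHABET])
    return matrix


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_viral(scores, threshold=0.5, viral_index=1):
    """Return True if the softmax virus probability exceeds the threshold.

    `scores` are hypothetical [host, virus] outputs of the omitted
    LSTM and fully connected layers.
    """
    return softmax(scores)[viral_index] > threshold
```

In a full implementation the matrix returned by `one_hot_encode` would be fed to a trained LSTM layer, and the threshold would be tuned on labelled data rather than fixed in advance.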
Description
AI-based virus-host RNA sequence classification method and device
Technical Field
The invention relates to the field of biological detection, and in particular to an AI-based virus-host RNA sequence classification method and device, an electronic device, and a storage medium.
Background
RNA sequencing (RNA-seq) is an indispensable technique for studying gene expression and regulation at the transcriptome level. Short-read RNA sequencing is the most common way to detect and quantify transcriptome-wide gene expression, where a short read generally means a read shorter than 500 bp. RNA-seq is also an important means of studying antiviral mechanisms. In virus-infected cells, viral genes and host genes are expressed simultaneously, so the RNA sequences of the host cannot be distinguished from those of the virus. The conventional approach first assembles the short-read RNA sequences and then compares the assembled sequences against a database. Because it relies on known databases, it has difficulty screening for unknown viruses; the workflow is computationally heavy and time-consuming, and the assembly step depends on high-performance hardware, which makes it costly.
Disclosure of Invention
In view of the above drawbacks or shortcomings of the prior art, the present invention provides an AI-based virus-host RNA sequence classification method, device, electronic device and storage medium, which greatly reduce the amount of computation by combining two-step filtering, assembly and AI classification, and achieve efficient and accurate virus classification.
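The combination of two-step filtering, assembly and AI classification summarized above can be outlined as a short Python sketch of the control flow only. Every callable passed to `classify_reads` is a hypothetical placeholder: the text does not name the comparison tools, the assembler, or the model interface. The 1000 bp default follows the length given in the abstract and claims, while the probability threshold of 0.5 is an assumption.

```python
def classify_reads(reads, fast_unmapped, precise_unmapped, assemble,
                   viral_probability, min_length=1000, threshold=0.5):
    """Return the assembled sequences that are judged to be viral.

    All callables are hypothetical stand-ins supplied by the caller:
      fast_unmapped(reads)     -> reads the faster tool could not map to the host
      precise_unmapped(reads)  -> reads the more precise tool could not map
      assemble(reads)          -> list of continuous sequences (contigs)
      viral_probability(seq)   -> model output in [0, 1]
    """
    # Step 1: a fast, coarse pass removes most host reads cheaply.
    first_filtered = fast_unmapped(reads)
    # Step 2: a slower, more precise pass runs only on what survived step 1.
    second_filtered = precise_unmapped(first_filtered)
    # Step 3: assemble the remaining fragments and keep only long contigs.
    contigs = assemble(second_filtered)
    long_contigs = [c for c in contigs if len(c) > min_length]
    # Step 4: classify each long contig and keep those above the threshold.
    return [c for c in long_contigs if viral_probability(c) > threshold]
```

Running the expensive precise pass and the assembly only on reads that survive the fast pass is what keeps the overall amount of computation low in this arrangement.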
In a first aspect, the invention provides an AI-based virus-host RNA sequence classification method, comprising: mapping all short-read RNA sequences to a host genome using a first sequence comparison tool, and taking the short-read RNA sequences that fail to map as first filtered sequences; re-mapping the first filtered sequences to the host genome using a second sequence comparison tool, and taking the first filtered sequences that fail to map as second filtered sequences, wherein the second sequence comparison tool has higher comparison precision than the first sequence comparison tool, and the first sequence comparison tool has higher comparison speed than the second sequence comparison tool; assembling the fragmented second filtered sequences into continuous RNA sequences, and screening out the continuous RNA sequences whose length is greater than a preset value; and performing AI classification on the screened continuous RNA sequences to obtain the viral RNA sequences.
Further, the method also comprises performing adapter removal and quality control filtering on the short-read RNA sequences.
Further, the quality control filtering comprises: calculating the Phred score Q of each base in a short-read RNA sequence as $Q = -10 \log_{10} P$, where P is the probability that the base was called incorrectly; calculating the average Phred score of the whole short-read RNA sequence from the Phred scores of its bases; and filtering out and discarding short-read RNA sequences whose average Phred score is below a first preset threshold and short-read RNA sequences whose number of unknown bases is above a second preset threshold.
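The quality control filtering described above can be sketched in a few lines of Python. The sketch assumes that per-base qualities arrive as a FASTQ-style quality string with an ASCII offset of 33, that unknown bases are written as `N`, and that the first and second preset thresholds are 20 and 5; none of these values is given in the text, and the function names are chosen for this example only.

```python
import math


def phred_score(error_probability):
    """Phred score Q = -10 * log10(P) for a base-calling error probability P."""
    return -10.0 * math.log10(error_probability)


def mean_phred(quality_string, offset=33):
    """Average Phred score of a read from its quality string.

    Assumes each character encodes Q as chr(Q + offset), with offset 33.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_quality_control(sequence, quality_string,
                           min_mean_q=20.0, max_unknown=5):
    """Keep a read unless its mean Q is too low or it has too many N bases.

    min_mean_q and max_unknown play the roles of the first and second
    preset thresholds; their default values here are assumptions.
    """
    if not quality_string:
        return False  # a read without quality values cannot be assessed
    unknown = sequence.upper().count("N")
    return mean_phred(quality_string) >= min_mean_q and unknown <= max_unknown
```

For orientation, an error probability of 0.001 corresponds to Q = 30, so a mean-quality threshold of 20 tolerates an average error rate of about 1 in 100 bases.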
Further, performing AI classification on the screened continuous RNA sequences to obtain viral RNA sequences comprises: one-hot encoding the input continuous RNA sequence to convert the character string data into a numerical matrix; inputting the one-hot encoded continuous RNA sequence into an LSTM layer for feature extraction and outputting a feature vector; inputting the feature vector into a fully connected layer and extracting high-level sequence features in combination with an activation function; and calculating, based on the high-level sequence features, a virus probability value of the current continuous RNA sequence through a softmax function, and determining that the current continuous RNA sequence is a viral RNA sequence if the virus probability value is greater than a third preset threshold.
In a second aspect, the present invention provides an AI-based virus-host RNA sequence classification device, comprising: a first mapping module, configured to map all short-read RNA sequences to a host genome using a first sequence comparison tool, and to take the short-read RNA sequences that fail to map as first filtered sequences; a second mapping module, configured to re-map the first filtered sequences to the host genome using a second sequence comparison tool, and to take the first filtered sequences that fail to map as second filtered sequences, wherein the second sequence comparison tool has higher comparison precision