CN-122024861-A - Multiplex PCR amplicon sequencing data detection method based on deep learning

Abstract

The invention relates to the technical field of bioinformatics and computational biology, and in particular to a multiplex PCR amplicon sequencing data detection method based on deep learning. It addresses the technical problem in the prior art that the overall detection accuracy for complex sequencing data is low. The method comprises: obtaining a sequencing data set to be detected; obtaining a reference sequence database; constructing a feature extraction model for extracting amplicon features based on historical amplification condition information and multiple groups of historical sequencing data with target labels; and processing the sequencing data set to be detected based on the feature extraction model to determine the target class corresponding to each sequencing sequence in the sequencing data set. The invention is applicable to multiplex PCR amplicon sequencing scenarios.

Inventors

YU JING
ZHOU CHUNHUA
ZHAO YUANYUAN
LIU YAN
PU HAO
HUANG WENPAN
BAI RU
LIU ZHANG

Assignees

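河北医科大学第一医院 (The First Hospital of Hebei Medical University)
河北冰缘圣康医疗科技有限公司 (Hebei Bingyuan Shengkang Medical Technology Co., Ltd.)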

Dates

Publication Date: 20260512
Application Date: 20260123

Claims (10)

1. A multiplex PCR amplicon sequencing data detection method based on deep learning, the method comprising: acquiring a sequencing data set to be detected, wherein the sequencing data set to be detected comprises sequencing sequences obtained through multiplex PCR amplification; obtaining a reference sequence database, wherein the reference sequence database comprises historical amplification condition information and a plurality of groups of historical sequencing data with target labels, and the historical amplification condition information is used for evaluating the correlation between the historical sequencing data and the sequencing data set to be detected; constructing a feature extraction model for extracting amplicon features based on the historical amplification condition information and the plurality of groups of historical sequencing data with target labels; and processing the sequencing data set to be detected based on the feature extraction model, and determining the target class corresponding to each sequencing sequence in the sequencing data set.
2. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 1, wherein the constructing a feature extraction model for extracting amplicon features based on the historical amplification condition information and the plurality of groups of historical sequencing data with target labels comprises: acquiring current amplification condition information of the sequencing data set to be detected; determining, from the reference sequence database and based on the current amplification condition information, a reference data set similar to the sequencing data set to be detected; classifying the target historical sequencing data in the reference data set based on sequence features of the primers of the target historical sequencing data, and determining at least one sequence class of the historical sequencing data; determining a quality distribution statistic of each sequence class in the at least one sequence class, wherein the quality distribution statistic represents the central tendency of the reliability of the sequencing results in the sequence class; determining at least one high-quality sequencing sequence based on the quality distribution statistic to form a high-quality sequencing data subset; and training a preset deep learning network based on the high-quality sequencing data subset and the target labels of the high-quality sequencing data subset, and determining the feature extraction model.
3. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 2, wherein the determining a reference data set similar to the sequencing data set to be detected from the reference sequence database based on the current amplification condition information comprises: determining a reaction condition similarity between the current amplification condition information and the historical amplification condition information, wherein the reaction condition similarity represents the degree of consistency of the two amplification reaction environments; determining a primer sequence similarity between the current amplification condition information and the historical amplification condition information, wherein the primer sequence similarity indicates the degree of specific matching of the primer binding target regions; determining a comprehensive association degree based on the reaction condition similarity and the primer sequence similarity; and screening the reference data set from the reference sequence database based on the comprehensive association degree.
4. The method of claim 3, wherein the current amplification condition information comprises a current temperature parameter sequence of a current amplification reaction process, the historical amplification condition information comprises a historical temperature parameter sequence of a historical amplification reaction process, and the determining a reaction condition similarity between the current amplification condition information and the historical amplification condition information comprises: performing a dynamic time warping calculation on the current temperature parameter sequence and the historical temperature parameter sequence; and determining the result of the dynamic time warping calculation as the reaction condition similarity.
5. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 2, wherein the classifying the target historical sequencing data based on sequence features of the primers of the target historical sequencing data in the reference data set, and determining at least one sequence class of the historical sequencing data, comprises: performing cluster analysis on the historical sequencing data based on the similarity between the sequence features of the primers associated with each item of historical sequencing data in the reference data set, and determining each cluster formed by the clustering as a sequence class.
6. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 2, further comprising: determining the quality mean and the quality standard deviation, at each preset base position, of the plurality of historical sequencing sequences of each sequence class; determining a quality deviation score of each historical sequencing sequence at each position based on the quality value of that historical sequencing sequence at each preset base position and on the quality mean and the quality standard deviation, at the corresponding position, of the sequence class to which it belongs; determining a local quality deviation score for each historical sequencing sequence based on the quality deviation scores of the positions within a sliding window of a preset length; and determining a confidence score of each historical sequencing sequence based on the quality deviation score and the local quality deviation score, wherein the confidence score represents the degree to which the quality values of the historical sequencing sequence deviate from the central tendency of the quality distribution of the sequence class to which it belongs.
7. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 6, wherein the determining at least one high-quality sequencing sequence based on the quality distribution statistic comprises: screening target historical sequencing sequences from the at least one sequence class based on the confidence score of each historical sequencing sequence to form the high-quality sequencing data subset.
8. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 7, wherein the training a preset deep learning network based on the high-quality sequencing data subset and the target labels thereof, and determining the feature extraction model, comprises: inputting the target historical sequencing sequences in the high-quality sequencing data subset and the target labels thereof into a convolutional neural network; and introducing an attention mechanism into the training process of the convolutional neural network to construct the feature extraction model, wherein the weight adjustment of the attention mechanism is based on the confidence scores of the target historical sequencing sequences.
9. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 3, wherein the determining a comprehensive association degree based on the reaction condition similarity and the primer sequence similarity comprises: performing a weighted summation of the reaction condition similarity and the primer sequence similarity, and determining the comprehensive association degree.
10. The deep learning-based multiplex PCR amplicon sequencing data detection method of claim 1, wherein the acquiring a sequencing data set to be detected comprises: performing multiplex PCR amplification and sequencing on a target sample to obtain original sequencing data; and performing quality filtering and sequence calibration processing on the original sequencing data to obtain the sequencing data set to be detected.
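
Illustrative note: the reference-selection steps of claims 3, 4 and 9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The claims take the dynamic time warping result itself as the reaction condition similarity; the sketch instead maps the DTW distance d into (0, 1] with 1 / (1 + d) so that larger values mean more similar conditions. The function names, the equal default weights, the primer similarity value and the temperature values are all hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def reaction_condition_similarity(current_temps, historical_temps):
    # Assumption: convert the DTW distance to a similarity in (0, 1].
    # The claims do not prescribe this mapping.
    return 1.0 / (1.0 + dtw_distance(current_temps, historical_temps))


def comprehensive_association(reaction_sim, primer_sim,
                              w_reaction=0.5, w_primer=0.5):
    # Weighted summation in the spirit of claim 9; the weights are hypothetical.
    return w_reaction * reaction_sim + w_primer * primer_sim


if __name__ == "__main__":
    # Hypothetical temperature programs (degrees Celsius) with different
    # hold lengths; DTW tolerates the differing durations.
    current = [95.0, 58.0, 72.0, 95.0, 58.0, 72.0]
    historical = [95.0, 95.0, 58.0, 72.0, 72.0, 95.0, 58.0, 72.0]
    sim = reaction_condition_similarity(current, historical)
    print(comprehensive_association(sim, primer_sim=0.8))
```

Dynamic time warping is a natural fit here because two temperature programs with the same cycle shape but different step durations still align with low cost, whereas a position-by-position comparison would penalize the timing difference.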

Description

Multiplex PCR amplicon sequencing data detection method based on deep learning

Technical Field

The invention relates to the technical field of bioinformatics and computational biology, and in particular to a multiplex PCR amplicon sequencing data detection method based on deep learning.

Background

Multiplex polymerase chain reaction (PCR) amplicon sequencing amplifies multiple target regions simultaneously in a single reaction system and combines this with high-throughput sequencing, which markedly improves detection throughput and sensitivity. It is particularly suitable for screening and identifying specific deoxyribonucleic acid (DNA) sequences in low-abundance samples, and has important application value in clinical diagnosis and basic research. Currently, analysis methods for such sequencing data mostly rely on quality assessment of the sequencing reads themselves, for example filtering and classifying based on simple indicators such as a single-base quality threshold or global coverage, and they often treat data from different sources and different amplification backgrounds equally. However, because of the inherent complexity of a multiplex PCR reaction system, including primer competition, differences in amplification efficiency, and non-specific amplification, such simple rule-based methods struggle to distinguish and identify the true target signal against background noise, so the overall detection accuracy for complex sequencing data is low.
Disclosure of Invention

To solve the technical problem that prior-art schemes struggle to distinguish and identify the true target signal against background noise, which leads to low overall detection accuracy for complex sequencing data, the invention aims to provide a multiplex PCR amplicon sequencing data detection method based on deep learning. The technical scheme adopted is as follows:

The invention provides a deep learning-based multiplex PCR amplicon sequencing data detection method, comprising: obtaining a sequencing data set to be detected, wherein the sequencing data set to be detected comprises sequencing sequences obtained through multiplex PCR amplification; obtaining a reference sequence database, wherein the reference sequence database comprises historical amplification condition information and multiple groups of historical sequencing data with target labels, and the historical amplification condition information is used for evaluating the correlation between the historical sequencing data and the sequencing data set to be detected; constructing a feature extraction model for extracting amplicon features based on the historical amplification condition information and the multiple groups of historical sequencing data with target labels; and processing the sequencing data set to be detected based on the feature extraction model to determine the target class corresponding to each sequencing sequence in the sequencing data set.
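
Illustrative note: the quality screening that feeds the feature extraction model (the per-position quality deviation scores, sliding-window local scores and confidence scores of claims 6 and 7) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The source does not give formulas, so the sketch assumes a z-score as the quality deviation score, the largest windowed mean of absolute z-scores as the local quality deviation score, and 1 / (1 + overall + local) as the confidence score, so that reads closer to their class's typical quality profile score higher. The function names, the window length, the threshold and the quality values are hypothetical.

```python
from statistics import mean, pstdev


def position_deviation_scores(class_quals):
    """Per-read, per-position z-scores within one sequence class.

    class_quals is a list of equal-length, non-empty lists of quality values.
    """
    columns = list(zip(*class_quals))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A position with zero spread contributes no deviation.
        [(q - mu) / sd if sd > 0 else 0.0
         for q, mu, sd in zip(read, means, stds)]
        for read in class_quals
    ]


def local_deviation_score(z_row, window=3):
    # Assumption: the worst sliding window, measured by mean absolute z-score.
    window = min(window, len(z_row))
    return max(
        mean([abs(z) for z in z_row[i:i + window]])
        for i in range(len(z_row) - window + 1)
    )


def confidence_scores(class_quals, window=3):
    scores = []
    for z_row in position_deviation_scores(class_quals):
        overall = mean([abs(z) for z in z_row])
        local = local_deviation_score(z_row, window)
        # Assumption: a smaller deviation maps to a higher confidence in (0, 1].
        scores.append(1.0 / (1.0 + overall + local))
    return scores


def select_high_quality(class_quals, threshold=0.5, window=3):
    # Indices of reads kept for the high-quality subset (threshold hypothetical).
    scores = confidence_scores(class_quals, window)
    return [i for i, s in enumerate(scores) if s >= threshold]


if __name__ == "__main__":
    # Hypothetical quality values for three reads of one sequence class;
    # the third read has a low-quality stretch in the middle.
    quals = [[30, 30, 30, 30], [30, 30, 30, 30], [30, 10, 10, 30]]
    print(confidence_scores(quals))
    print(select_high_quality(quals))
```

Scoring each read against its own sequence class, rather than against a global threshold, is what lets data from different amplification backgrounds be judged on their own terms.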
With reference to the first aspect, in one possible implementation manner, the method specifically includes: obtaining current amplification condition information of the sequencing data set to be detected; determining, from the reference sequence database and based on the current amplification condition information, a reference data set similar to the sequencing data set to be detected; classifying the target historical sequencing data in the reference data set based on sequence features of the primers of the target historical sequencing data, and determining at least one sequence class of the historical sequencing data; determining a quality distribution statistic of each sequence class in the at least one sequence class, wherein the quality distribution statistic represents the central tendency of the reliability of the sequencing results in the sequence class; determining at least one high-quality sequencing sequence based on the quality distribution statistic to form a high-quality sequencing data subset; and training a preset deep learning network based on the high-quality sequencing data subset and the target labels thereof to determine the feature extraction model.

With reference to the first aspect, in one possible implementation manner, the method specifically includes: determining a reaction condition similarity between the current amplification condition information and the historical amplification condition information, wherein the reaction condition similarity represents the degree of consistency of the two amplification reaction environments; determining a primer sequence similarity between the current amplification condition information and the historical amplification condition information, wherein the primer sequence similarity represents the degree of specific matching of the primer binding target regions; determining a comprehensive association degree based on the reaction condition similarity and the primer sequen