CN-121983117-A - Sequencing data error correction method, device, electronic equipment and storage medium


Abstract

The present disclosure provides a sequencing data error correction method, apparatus, electronic device, and storage medium. The method comprises: obtaining long-read sequencing data, wherein the long-read sequencing data comprises a tandem repeat region; encoding the tandem repeat region and the sequencing data upstream and downstream of the tandem repeat region using a first encoding mode to obtain base continuity features; encoding the sequencing data upstream and downstream of the tandem repeat region using a second encoding mode to obtain upstream and downstream features; and inputting the base continuity features and the upstream and downstream features into a pre-trained correction model to perform error correction on the tandem repeat region in the long-read sequencing data. By combining, in a deep learning model, the tandem repeat region with its upstream and downstream sequencing data to correct the long-read sequence, the method achieves efficient and general-purpose sequencing data error correction, improves the accuracy of the sequencing data, and improves subsequent assembly performance.

Inventors

LI RUOJUN
DONG YULIANG
ZHANG JIAYUAN
SUN YUHUI
QI YANWEI
LI YUXIANG
ZENG TAO
XU XUN

Assignees

杭州华大序风科技有限公司

Dates

Publication Date: 20260505
Application Date: 20241025

Claims (10)

1. A method for error correction of sequencing data, the method comprising: acquiring long-read sequencing data, wherein the long-read sequencing data comprises a tandem repeat region; encoding the tandem repeat region and the sequencing data upstream and downstream of the tandem repeat region using a first encoding mode to obtain base continuity features; encoding the sequencing data upstream and downstream of the tandem repeat region using a second encoding mode to obtain upstream and downstream features; and inputting the base continuity features and the upstream and downstream features into a pre-trained correction model to error correct the tandem repeat region in the long-read sequencing data.
2. The method of claim 1, wherein the first encoding mode comprises a base recognition algorithm and the second encoding mode comprises a language model, the method comprising: encoding the tandem repeat region and the sequencing data upstream and downstream of the tandem repeat region using the base recognition algorithm to obtain the base continuity features; and encoding the sequencing data upstream and downstream of the tandem repeat region using the language model to obtain the upstream and downstream features.
3. The method of claim 2, wherein encoding the sequencing data upstream and downstream of the tandem repeat region using the language model to obtain the upstream and downstream features comprises: performing motif segmentation on the long-read sequencing data according to a first preset length to obtain a motif table corresponding to the long-read sequencing data, wherein the motif segmentation splits the long-read sequencing data into a plurality of motifs based on the language model; and encoding each motif in the motif table to obtain the upstream and downstream features.
4. The method of claim 1, wherein, before inputting the base continuity features and the upstream and downstream features into the pre-trained correction model to error correct the tandem repeat region in the long-read sequencing data, the method comprises: acquiring long-read sequencing data of a species closely related to the species corresponding to the long-read sequencing data, and training the correction model using the long-read sequencing data of the closely related species to obtain the pre-trained correction model.
5. The method of claim 1, wherein, after inputting the base continuity features and the upstream and downstream features into the pre-trained correction model to error correct the tandem repeat region in the long-read sequencing data, the method comprises: determining the number of corrected sequencing data items after error correction, and storing the number, the sequence names corresponding to the corrected sequencing data, and the corrected sequencing data in an output file.
6. The method of claim 1, wherein the correction model employs a three-layer structure comprising a convolutional layer, an activation layer, and a pooling layer.
7. The method of claim 1, wherein, after inputting the base continuity features and the upstream and downstream features into the pre-trained correction model to error correct the tandem repeat region in the long-read sequencing data, the method comprises: determining corrected long-read sequencing data corresponding to the corrected sequencing data after error correction, and determining an error evaluation parameter of the corrected long-read sequencing data; and comparing a reference evaluation parameter with the error evaluation parameter to obtain abnormal distribution data corresponding to the corrected long-read sequencing data.
8. A sequencing data error correction device, wherein the device is applied to a deep learning model, the device comprising: an acquisition unit for acquiring long-read sequencing data, wherein the long-read sequencing data comprises a tandem repeat region; a first encoding unit for encoding the tandem repeat region and the sequencing data upstream and downstream of the tandem repeat region using a first encoding mode to obtain base continuity features; a second encoding unit for encoding the sequencing data upstream and downstream of the tandem repeat region using a second encoding mode to obtain upstream and downstream features; and a correction unit for inputting the base continuity features and the upstream and downstream features into a pre-trained correction model to perform error correction on the tandem repeat region in the long-read sequencing data.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.

Description

Sequencing data error correction method, device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of gene sequencing, and in particular to a sequencing data error correction method, a sequencing data error correction device, electronic equipment, and a storage medium.

Background

Nanopore sequencing is a third-generation sequencing technology. It measures the change in electrical signal as a captured DNA molecule passes through a nanopore and uses a base recognition algorithm (such as a hidden Markov model or a neural network model) to convert the electrical signal into a DNA sequence. Because sequencing relies on the current change in the nanopore, the method offers advantages such as real-time operation, high throughput, and low cost.
Despite these advantages, third-generation nanopore sequencing still has drawbacks in practical applications. Sequencing errors occur, and they are usually concentrated in single-base tandem repeat regions (homopolymers) rather than being uniformly distributed across the whole sequence. Such errors are therefore difficult to correct by the traditional approach of increasing sequencing depth. Homopolymer regions introduce a large number of systematic errors, so existing error correction based on short-read data performs poorly and the accuracy of the sequencing data is limited. In addition, the related art (such as neural network coding methods at the base recognition level) is not based on the upstream base recognition technique used during sequencing, and its correction efficiency is low.

Disclosure of Invention

The present disclosure provides a sequencing data error correction method, apparatus, electronic device, and storage medium.
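As an illustration of the single-base tandem repeat (homopolymer) regions described in the Background, the following minimal Python sketch locates such runs in a read and collects the bases immediately upstream and downstream of each run. It is not the disclosed encoding: the function name `find_homopolymers`, the minimum run length `min_len`, and the flank width `flank` are hypothetical choices made only for this example.

```python
from typing import List, NamedTuple


class HomopolymerRun(NamedTuple):
    base: str        # the repeated base, e.g. "A"
    start: int       # 0-based start index of the run in the read
    length: int      # number of consecutive copies of the base
    upstream: str    # up to `flank` bases before the run
    downstream: str  # up to `flank` bases after the run


def find_homopolymers(read: str, min_len: int = 3, flank: int = 5) -> List[HomopolymerRun]:
    """Return every run of at least `min_len` identical bases with its flanks.

    `min_len` and `flank` are illustrative assumptions, not values
    taken from the disclosure.
    """
    runs: List[HomopolymerRun] = []
    i, n = 0, len(read)
    while i < n:
        # Advance j to the first position whose base differs from read[i].
        j = i
        while j < n and read[j] == read[i]:
            j += 1
        if j - i >= min_len:
            runs.append(HomopolymerRun(
                base=read[i],
                start=i,
                length=j - i,
                upstream=read[max(0, i - flank):i],
                downstream=read[j:j + flank],
            ))
        i = j
    return runs
```

For the read `ACGTTTTTGCAAAC`, this sketch reports a run of five `T` bases starting at index 3 and a run of three `A` bases starting at index 10, each paired with its neighbouring bases. The run length alone is a simple stand-in for a continuity feature; a real base recognition algorithm would work from signal-level information.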
According to a first aspect of the disclosure, a sequencing data error correction method is provided. The method comprises: obtaining long-read sequencing data, wherein the long-read sequencing data comprises a tandem repeat region; encoding the tandem repeat region and the sequencing data upstream and downstream of the tandem repeat region using a first encoding mode to obtain base continuity features; encoding the sequencing data upstream and downstream of the tandem repeat region using a second encoding mode to obtain upstream and downstream features; and inputting the base continuity features and the upstream and downstream features into a pre-trained correction model to perform error correction on the tandem repeat region in the long-read sequencing data.
In some embodiments of the present disclosure, the first encoding mode comprises a base recognition algorithm and the second encoding mode comprises a language model. The method comprises: encoding the tandem repeat region and the sequencing data upstream and downstream of the tandem repeat region using the base recognition algorithm to obtain the base continuity features; and encoding the sequencing data upstream and downstream of the tandem repeat region using the language model to obtain the upstream and downstream features.
In some embodiments of the disclosure, encoding the sequencing data upstream and downstream of the tandem repeat region using the language model to obtain the upstream and downstream features includes: performing motif segmentation on the long-read sequencing data at a first preset length to obtain a motif table corresponding to the long-read sequencing data, wherein the motif segmentation splits the long-read sequencing data into a plurality of motifs based on the language model; and encoding each motif in the motif table to obtain the upstream and downstream features.
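The segmentation of a read into fixed-length pieces and the construction of a lookup table can be sketched as follows. This is a minimal illustration under stated assumptions: the disclosure does not say whether the pieces overlap or how the language model encodes them, so the sketch uses overlapping k-mers and plain integer ids as a stand-in for learned embeddings. The function name `build_motif_table` and the default `k = 3` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_motif_table(read: str, k: int = 3) -> Tuple[Dict[str, int], List[int]]:
    """Split `read` into overlapping k-mers and assign each distinct k-mer an id.

    Returns the table (k-mer -> integer id, in order of first appearance)
    and the read re-expressed as a list of those ids. Overlapping windows
    and integer ids are assumptions made for this example only.
    """
    # Slide a window of width k across the read, one base at a time.
    motifs = [read[i:i + k] for i in range(len(read) - k + 1)]

    table: Dict[str, int] = {}
    for motif in motifs:
        if motif not in table:
            table[motif] = len(table)

    encoded = [table[motif] for motif in motifs]
    return table, encoded
```

Calling `build_motif_table("ACGACG", k=3)` yields the table `{"ACG": 0, "CGA": 1, "GAC": 2}` and the encoded read `[0, 1, 2, 0]`. In a language-model setting, such ids would typically index an embedding matrix whose rows serve as the features.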
In some embodiments of the present disclosure, before the base continuity features and the upstream and downstream features are input into the pre-trained correction model to error correct the tandem repeat region in the long-read sequencing data, the method includes obtaining long-read sequencing data of a species closely related to the species corresponding to the long-read sequencing data, and training the correction model using the long-read sequencing data of the closely related species to obtain the pre-trained correction model.
In some embodiments of the present disclosure, after the base continuity features and the upstream and downstream features are input into the pre-trained correction model to error correct the tandem repeat region in the long-read sequencing data, the method includes determining the number of corrected sequencing data items after error correction, and storing the number, the sequence names corresponding to the corrected sequencing data, and the corrected sequencing data in an output file.
In some embodiments of the present disclosure, the correction model employs a three-layer structure that includes a convolut