CN-121999836-A - DNA encoding method, decoding method, device, equipment and medium

Abstract

The invention provides a DNA encoding method. The method comprises: preprocessing a plurality of initial subsequences in an initial sequence set based on screening conditions and the De Bruijn graph principle to obtain a candidate directed graph, wherein each node in the candidate directed graph comprises a plurality of bases and the initial sequence set is obtained by permuting and combining the bases A, C, G and T; constraining current signals between adjacent nodes in the candidate directed graph and deleting nodes that do not meet preset conditions to obtain a target directed graph; and traversing data to be encoded and converting it into a target DNA sequence through out-degree judgment according to the out-degree information of the nodes in the target directed graph, wherein the data to be encoded is obtained by LDPC encoding of original data. By screening bases and constraining current signals in the pre-encoding stage, the encoding method of the present disclosure reduces complexity in the encoding process and improves the accuracy, in single-molecule sequencing, of the DNA sequence obtained by encoding.

Inventors

  • WEI YANJIE
  • ZHANG HAOLING
  • PING ZHI
  • ZHANG WENWEI
  • SHEN YUE

Assignees

  • BGI Research, Shenzhen (深圳华大生命科学研究院)

Dates

Publication Date
20260508
Application Date
20241104

Claims (15)

  1. A DNA encoding method, the method comprising: preprocessing a plurality of initial subsequences in an initial sequence set based on screening conditions and the De Bruijn graph principle to obtain a candidate directed graph, wherein each node in the candidate directed graph comprises a plurality of bases, and the initial sequence set is obtained by permuting and combining the bases A, C, G and T; constraining current signals between adjacent nodes in the candidate directed graph, and deleting nodes that do not meet preset conditions to obtain a target directed graph; and traversing data to be encoded, and converting the data to be encoded into a target DNA sequence through out-degree judgment according to out-degree information of nodes in the target directed graph, wherein the data to be encoded is obtained by performing LDPC encoding on original data.
  2. The method of claim 1, wherein constraining current signals between adjacent nodes in the candidate directed graph and deleting nodes that do not meet preset conditions comprises: determining current statistical information of each node in the candidate directed graph, wherein the current statistical information comprises at least a mean value and a standard deviation; and determining a current signal fluctuation range of each node based on the current statistical information, and deleting nodes for which the current signal fluctuation ranges between adjacent nodes meet a first preset condition, to obtain a first directed graph.
  3. The method according to claim 2, wherein the method further comprises: determining out-degree information of each node in the first directed graph, and deleting nodes whose out-degree information meets a second preset condition, to obtain the target directed graph.
  4. The method of claim 3, wherein traversing the data to be encoded and converting the data to be encoded into the target DNA sequence through out-degree judgment according to the out-degree information of the nodes in the target directed graph comprises: determining a current node; determining a mapping rule based on the out-degree information of the current node; based on the mapping rule, slicing the data to be encoded, and writing the base corresponding to the next node of the current node into the target DNA sequence; and writing the current node into a path list, and updating the current node to be the next node of the current node.
  5. The method of claim 4, wherein, before determining a mapping rule based on the out-degree information of the current node, the method further comprises: taking the current node as a root node and the subsequent nodes that can be connected as child nodes, and constructing a tree structure, wherein the number of layers of the tree structure is smaller than the number of bases in the initial subsequence; deleting the palindromic complementary nodes in the tree structure, and determining the number of child nodes of the current node after deletion; if the number of child nodes is zero, re-determining the current node in the target directed graph; and if the number of child nodes is not zero, determining the out-degree information of the current node.
  6. A DNA decoding method, the method comprising: obtaining a DNA sequence and a target directed graph, wherein the DNA sequence is obtained by synthesizing and single-molecule sequencing a target DNA sequence obtained by the DNA encoding method according to claims 1-7, the target directed graph is obtained by screening a plurality of initial subsequences in an initial sequence set and constraining current signals, and the initial subsequences comprise a plurality of bases; performing path backtracking according to a path list, and performing error correction processing on the DNA sequence to obtain a DNA sequence to be decoded; traversing the DNA sequence to be decoded, and converting the DNA sequence to be decoded into a binary sequence through out-degree judgment based on the target directed graph; and performing LDPC decoding on the binary sequence to obtain target decoded data.
  7. The method of claim 6, wherein performing path backtracking according to the path list and performing error correction processing on the DNA sequence comprises: determining an initial base subsequence of the DNA sequence; traversing the DNA sequence with the initial base subsequence as a starting point, and judging whether the bases in the current base subsequence appear in a subsequent base subsequence that can be connected to the current base subsequence; and if not, backtracking along the path, taking the precursor base subsequence that can be connected to the current base subsequence as a backtracking starting point, and taking the connectable subsequent base subsequences of the backtracking starting point as comparison base subsequences, so as to correct the bases in the current base subsequence.
  8. The method of claim 7, wherein traversing the DNA sequence to be decoded and converting the DNA sequence to be decoded into a binary sequence through out-degree judgment based on the target directed graph comprises: determining a starting node as a current node; determining the out-degree information of the current node in the target directed graph; determining a mapping rule according to the out-degree information; and based on the mapping rule, slicing the DNA sequence to be decoded, writing the binary information corresponding to the next node of the current node into the binary sequence, and updating the current node to be the next node of the current node.
  9. The method of claim 8, wherein, before determining the out-degree information of the current node in the target directed graph, the method further comprises: based on the target directed graph, taking the current node as a root node and the subsequent nodes that can be connected as child nodes, and constructing a tree structure, wherein the number of layers of the tree structure is smaller than the number of bases in the initial subsequence; deleting the palindromic complementary nodes contained in the tree structure, and determining the number of child nodes of the current node after deletion; if the number of child nodes is zero, performing the error correction processing on the current node; and if the number of child nodes is not zero, determining the out-degree information of the current node.
  10. A DNA encoding apparatus comprising: a screening module for preprocessing a plurality of initial subsequences in an initial sequence set based on screening conditions and the De Bruijn graph principle to obtain a candidate directed graph, wherein each node in the candidate directed graph comprises a plurality of bases, and the initial sequence set is obtained by permuting and combining the bases A, C, G and T; a constraint module for constraining current signals between adjacent nodes in the candidate directed graph and deleting nodes that do not meet preset conditions to obtain a target directed graph; and an encoding module for traversing data to be encoded and converting the data to be encoded into a target DNA sequence through out-degree judgment according to the out-degree information of the nodes in the target directed graph, wherein the data to be encoded is obtained by performing LDPC encoding on original data.
  11. A DNA decoding apparatus comprising: an acquisition module for acquiring a DNA sequence and a target directed graph, wherein the DNA sequence is obtained by synthesizing and single-molecule sequencing a target DNA sequence obtained by the DNA encoding apparatus in claim 12, the target directed graph is obtained by screening a plurality of initial subsequences in an initial sequence set and constraining current signals, and the initial subsequences comprise a plurality of bases; an error correction module for performing path backtracking according to a path list and performing error correction processing on the DNA sequence to obtain a DNA sequence to be decoded; a conversion module for traversing the DNA sequence to be decoded and converting the DNA sequence to be decoded into a binary sequence through out-degree judgment based on the target directed graph; and a decoding module for performing LDPC decoding on the binary sequence to obtain target decoded data.
  12. An electronic device, comprising: one or more processors; and a storage device communicatively coupled to the one or more processors and having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5 or 6-9.
  13. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5 or 6-9.
  14. A computer program product comprising a computer program which, when executed, implements the method of any of claims 1-5 or 6-9.
  15. A chip comprising at least one processor and a communication interface for receiving signals input to or output from the chip, the processor being in communication with the communication interface and implementing the method of any of claims 1-5 or 6-9 by logic circuitry or execution of code instructions.
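
The out-degree traversal recited in claims 4 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The mapping rule used here (at a node with out-degree d, consume floor(log2(d)) bits and move to the successor with that index) is an assumption, as are the names `bits_per_step`, `encode`, `decode` and `DEMO_GRAPH`, the zero padding of the final slice, and the choice to write only the last base of the next node. LDPC coding, palindromic-complement pruning and error correction are left out.

```python
BASES = "ACGT"

# Hypothetical target graph: all 2-base nodes without a repeated base,
# linked by the De Bruijn overlap rule (every node has out-degree 3).
DEMO_GRAPH = {
    a + b: [b + c for c in BASES if c != b]
    for a in BASES for b in BASES if a != b
}


def bits_per_step(out_degree):
    """Assumed mapping rule: floor(log2(out_degree)) bits per move."""
    return out_degree.bit_length() - 1


def encode(bits, graph, start):
    """Walk the graph, choosing each successor from the next slice of bits."""
    node, pos, dna, path = start, 0, start, []
    while pos < len(bits):
        successors = graph[node]
        width = bits_per_step(len(successors))
        if width < 1:
            # The claims re-determine the current node here; the sketch stops.
            raise ValueError(f"node {node} has out-degree below 2")
        chunk = bits[pos:pos + width].ljust(width, "0")  # pad the last slice
        nxt = successors[int(chunk, 2)]
        dna += nxt[-1]      # write the base contributed by the next node
        path.append(node)   # record the visited node in the path list
        node, pos = nxt, pos + width
    return dna, path


def decode(dna, graph, k, n_bits):
    """Invert encode() for an error-free read; n_bits strips the padding."""
    node, out = dna[:k], []
    for base in dna[k:]:
        successors = graph[node]
        width = bits_per_step(len(successors))
        nxt = node[1:] + base
        index = successors.index(nxt)  # ValueError if the read left the graph
        if width < 1 or index >= 2 ** width:
            raise ValueError(f"step {node}->{nxt} carries no valid symbol")
        out.append(format(index, f"0{width}b"))
        node = nxt
    return "".join(out)[:n_bits]
```

With `DEMO_GRAPH` and the start node `"AC"`, `encode("1011001", DEMO_GRAPH, "AC")` returns the sequence `ACGAGCACG`, and `decode("ACGAGCACG", DEMO_GRAPH, 2, 7)` recovers the original bits. Because every node in the demo graph has out-degree 3, each step carries one bit and the third successor is never used; a graph with out-degree 4 would carry two bits per step.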

Description

DNA encoding method, decoding method, device, equipment and medium

Technical Field

The present disclosure relates to the field of data storage technology, and in particular to a DNA encoding method, a decoding method, a device, equipment and a medium.

Background

DNA storage offers high density, strong stability and low energy consumption, and has great application prospects as a novel information storage mode. It comprises the steps of information encoding, information writing, information reading and information decoding. At present, in single-molecule sequencing (such as nanopore sequencing), factors such as the insufficient stability with which the motor protein controls the translocation (pore-passing) speed and the presence of system noise leave the current signal with poor resolution and a high error rate, which makes subsequent decoding more complex and difficult.

Disclosure of Invention

In order to solve the problems in the related art, the present disclosure proposes a DNA encoding method and a DNA decoding method. In the method, bases are screened in the pre-encoding stage and a current constraint is applied between adjacent nodes in the directed graph obtained through directed splicing, which improves the accuracy, in single-molecule sequencing, of the DNA sequence obtained by encoding. DNA encoding and decoding are performed based on an out-degree judgment rule, which reduces complexity in the encoding and decoding process.
An embodiment of a first aspect of the present disclosure provides a DNA encoding method. The method includes: preprocessing a plurality of initial subsequences in an initial sequence set based on a screening condition and the De Bruijn graph principle to obtain a candidate directed graph, where each node in the candidate directed graph includes a plurality of bases and the initial sequence set is obtained by permuting and combining the bases A, C, G and T; constraining current signals between adjacent nodes in the candidate directed graph and deleting nodes that do not meet a preset condition to obtain a target directed graph; and traversing data to be encoded and converting it into a target DNA sequence through out-degree judgment according to the out-degree information of nodes in the target directed graph, where the data to be encoded is obtained by LDPC encoding of original data.

In some embodiments of the present disclosure, constraining current signals between adjacent nodes in the candidate directed graph and deleting nodes that do not meet preset conditions includes: determining current statistical information of each node in the candidate directed graph, where the current statistical information includes at least a mean value and a standard deviation; and determining a current signal fluctuation range of each node based on the current statistical information, and deleting nodes for which the current signal fluctuation ranges between adjacent nodes meet a first preset condition, to obtain a first directed graph.

In some embodiments of the present disclosure, the method further includes determining out-degree information of each node in the first directed graph, and deleting nodes whose out-degree information meets a second preset condition to obtain the target directed graph.
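
The graph construction and the two deletion passes described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the disclosed implementation. The text does not specify the screening condition or the two preset conditions, so the sketch assumes a homopolymer-run limit as the screening condition, overlapping fluctuation ranges (mean ± `width` × standard deviation) between a node and one of its successors as the first preset condition, and an out-degree below `min_out` as the second. All function names and the `stats` table of per-node current statistics are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def passes_screen(kmer, max_run):
    """Assumed screening condition: no run of identical bases above max_run."""
    run = 1
    for prev, cur in zip(kmer, kmer[1:]):
        run = run + 1 if prev == cur else 1
        if run > max_run:
            return False
    return True


def link(nodes):
    """De Bruijn rule: u -> v when v equals u shifted left by one base."""
    keep = set(nodes)
    return {
        u: [u[1:] + b for b in BASES if u[1:] + b in keep]
        for u in sorted(keep)
    }


def build_candidate_graph(k, max_run=2):
    """Enumerate all k-base subsequences, screen them, and link survivors."""
    kmers = ("".join(p) for p in product(BASES, repeat=k))
    return link(kmer for kmer in kmers if passes_screen(kmer, max_run))


def filter_by_current(graph, stats, width=1.0):
    """Delete nodes whose current range overlaps that of a successor.

    stats maps node -> (mean, standard deviation); values are placeholders.
    """
    def overlaps(u, v):
        (mean_u, std_u), (mean_v, std_v) = stats[u], stats[v]
        lo_u, hi_u = mean_u - width * std_u, mean_u + width * std_u
        lo_v, hi_v = mean_v - width * std_v, mean_v + width * std_v
        return lo_u <= hi_v and lo_v <= hi_u

    doomed = {
        u for u, successors in graph.items()
        if any(v != u and overlaps(u, v) for v in successors)
    }
    return link(set(graph) - doomed)


def prune_by_out_degree(graph, min_out=2):
    """Repeatedly delete nodes with out-degree below min_out (assumed rule)."""
    while True:
        weak = {u for u, successors in graph.items() if len(successors) < min_out}
        if not weak:
            return graph
        graph = link(set(graph) - weak)
```

A typical call chain would be `prune_by_out_degree(filter_by_current(build_candidate_graph(k), stats))`, with `stats` taken from a measured current model for each k-base subsequence. The out-degree pass is iterated here because removing one node can lower the out-degree of its predecessors; whether the disclosure iterates is not stated.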
In some embodiments of the present disclosure, traversing the data to be encoded and converting it into a target DNA sequence through out-degree judgment according to the out-degree information of nodes in the target directed graph includes: determining a current node; determining a mapping rule based on the out-degree information of the current node; performing a slicing operation on the data to be encoded based on the mapping rule, and writing the base corresponding to the next node of the current node into the target DNA sequence; and writing the current node into a path list and updating the current node to be the next node of the current node.

In some embodiments of the present disclosure, before determining the mapping rule based on the out-degree information of the current node, the method further includes: taking the current node as a root node and the subsequent nodes that can be connected as child nodes, and building a tree structure, the number of layers of the tree structure being smaller than the number of bases in the initial subsequence; deleting the palindromic complementary nodes in the tree structure and determining the number of child nodes of the current node after deletion; if the number of child nodes is zero, re-determining the current node in the target directed graph; and if the number of child nodes is not zero, determining the out-degree information of the current node.

An embodiment of a second aspect of the present disclosure provides a DNA decoding method, where the method includes obtaining a DNA sequence and a target directed graph, where the DNA sequence is obtained by synthesizing and single-molecule sequencing a target DNA sequence obtained