CN-115440302-B - Genome array, genome architecture, genome sequence assembling method and system
Abstract
The invention provides a genome array stacking assembly method and system, a genome architecture assembly method and system and a genome sequence assembly method and system, and belongs to the technical field of biological information. The invention optimizes a global loss function through a robust regression technology to determine the overlapping relation between sequencing sequences, eliminates interference of false positive comparison, thereby obtaining a more accurate assembled overlapping set, avoiding the deletion of copy numbers on an assembled genome or error splicing, carries out estimation of overlapping arrangement based on regression matrix calculation, has higher tolerance to abnormal values unless the number of ambiguous sample information is greater than that of real samples, does not generate a situation of breakdown, independently samples, assembles, improves and evaluates the sequencing data for a plurality of times through a resampling technology, and then integrates the sequencing data into a final assembled genome, thereby reducing uncertainty brought by noise in the sequencing data to an assembly result and reducing sensitivity of the assembly result to assembly parameter selection.
Inventors
- CAO CHENGHAO
- LI MENGTIAN
- LI LEI
Assignees
- 中国科学院数学与系统科学研究院
Dates
- Publication Date
- 20260505
- Application Date
- 20210603
Claims (19)
- 1. A method of assembling a genome array, the method comprising: S01, extracting a first sequencing sequence set with a certain depth; S02, pairwise comparison is carried out on the sequencing sequences in the first sequencing sequence set through a sequence comparison algorithm, so that pairwise comparison information of the sequencing sequences is obtained; S03, according to pairwise comparison information of the sequencing sequences, all the comparison meeting the stacking condition are jointly expressed as a first linear regression model; s04, solving the first linear regression model through an iterative re-weighted least square algorithm to obtain a solution with the minimum global loss function and robust estimation of coordinates of the sequencing sequences in the first sequencing sequence set on a genome to be tested, wherein the method comprises the following steps: S041, solving a sparse linear equation set related to beta by using an acceleration solving algorithm of a large sparse linear equation set: the resulting solution is taken as Is set to an initial value of (1); s042, in each iteration, according to the current Calculating residual vectors A kind of electronic device Maximum value of absolute value of each component ; S043 if Then calculate the weight matrix ; Wherein, the Is a preset convergence threshold value, and the weight function is ; S044, according to the calculated W, taking the smaller value of the weight values of two samples which are dual to each other, and giving the two samples; S045, solving a sparse linear equation set related to beta by using an acceleration solving algorithm of a large sparse linear equation set: ; s046 updating with the solution obtained Deleting the rows with the corresponding weights of 0 in the matrixes X and Y, and returning to the next iteration; S047 up to Stopping iteration and recording a matrix X of the last iteration; s048 for last round Is rounded and output Namely, the method is a robust estimation of the coordinates of the sequencing sequences in the first sequencing sequence set on a genome to be tested; S05, dividing the sequencing sequence into a plurality of connected components according to an undirected graph constructed by a solution with the minimum global loss function, and arranging the sequencing sequence according to the stable estimation of the coordinates of the sequencing sequence in each connected component to generate a stacked array; s06, merging the stacked arrays according to pairwise comparison information of the sequencing sequences to obtain a preliminary assembled stacked array set; S07, obtaining the consistent sequence of each array in the preliminary assembly array, comparing and mapping the sequencing sequence to the consistent sequence to obtain a mapping result of the sequencing sequence to the consistent sequence, wherein the mapping result is a new assembly array set.
- 2. The method of genome array assembly according to claim 1, wherein step S03 comprises: s031 assume a sequence in the first set of sequence to be sequenced Alignment of the rightmost base to the sequenced sequence Inside the container, the inner part of the container, The right overhang portion contains A base, thereby obtaining an observation sample: ; Wherein, the Is that The coordinates on the single strand of DNA on which it is located, Is that The coordinates on the single strand of DNA on which it is located, Representing an observation error of the observation sample; S032 assuming a second sequencing sequence The leftmost base is aligned to the first sequencing sequence Inside the container, the inner part of the container, The left overhang portion contains A single base, thereby obtaining a dual observation sample: ; Wherein, the Is that Coordinates of the reverse complement of (c) on the complementary single strand, Is that Coordinates of the reverse complement of (c) on the complementary single strand, Representing the observation error of the dual observation sample; s033, integrating all aligned observation samples and dual observation samples meeting the stacking condition into a matrix equation according to pairwise alignment information of the sequencing sequences: ; wherein each behavior is an observation sample, vector For the parameters to be estimated, X is a large sparse matrix, and each row of X has only two non-zero elements 1 and-1, and X is stored in a sparse matrix format.
- 3. The method of genome array assembly according to claim 1, wherein the step S05 comprises: s051, using a sequencing sequence in a first sequencing sequence set and a reverse complementary sequence corresponding to the sequencing sequence as vertexes, determining two sequencing sequences according to two non-zero elements of each row in a matrix X of the last iteration, establishing an edge between vertexes corresponding to the two sequencing sequences, and constructing to obtain an undirected graph; s052, traversing the undirected graph, dividing the undirected graph into a plurality of connected components to obtain 2k connected components, and marking the connected components as Wherein And (3) with The contained sequencing sequences are in one-to-one correspondence and are mutually reverse complementary sequences; s053, sequencing sequences in each connected component are arranged according to the stable estimation of the coordinates of the sequencing sequences on the genome to be tested, so as to obtain an array stack: Stacking array And Respectively correspond to two subsequences which are complementary to each other in reverse direction on the DNA double strand.
- 4. The method of genome array assembly according to claim 3, wherein the step S06 comprises: s061, taking the stacked arrays in the stacked arrays as vertexes; s062, for any two stacked arrays, if 3' End s sequencing sequences and The 5' end s sequencing sequences have more than t alignments meeting the stacking condition, and then a strip is established Corresponding vertex pointing Constructing a directed graph corresponding to directed edges of the vertexes, wherein the weight of the edges is set to satisfy the comparison of the stacking condition, and the values of s and t are positively related to the sampling depth; S063A stacked array comprising the most sequenced sequences from the directed graph Starting, determining the slave stack Stacked array pointed by edge with greatest departure weight ; S064 from Is selected from a sequencing sequence at the 3' end of (2) And from Selecting a sequencing sequence at the 5' end of (2) , And The comparison between the two meets the stacking condition; S065 assume that The rightmost base is aligned to Inside the container, the inner part of the container, The right overhang portion contains One base, then Robust estimates of the coordinates of all sequenced sequences in (a) are added ; S067 will And The sequencing sequences in (a) are arranged and combined into a stacked array according to the size of the robust estimation of the coordinates of the sequencing sequences, and meanwhile And Removing corresponding vertexes in the directed graph; S068 from Repeating the steps S063-S067 until there is no array which can be combined, repeating the steps S063-S067 from the array which contains the most sequencing sequence in the rest vertexes in the directed graph until all vertexes in the directed graph are traversed, and marking the obtained combined array as a preliminary assembled array set which is recorded as 。
- 5. The method of genome array assembly according to claim 4, wherein the step S07 comprises: s071 for any one of the preliminary assembled stacks The base with the largest frequency in each column is taken to generate a stacked array Corresponding consensus sequences, noted as ; S072 for stacked arrays Any one of the sequencing sequences The leftmost base is at Is marked as , The rightmost base is at Is marked as Will be Middle (f) From base to base Sub-sequence extraction between bases Performing local sequence comparison; s073 taking the part with the consistent local sequence alignment as To the direction of Is used to determine the mapping result of (a), All sequencing sequences in (a) The mapping result of (2) constitutes novel stacked array ; And (5) assembling the stacked array set after updating.
- 6. A method of genome architecture assembly, the method comprising: s11, extracting a second sequencing sequence set with a certain depth; S12, assembling the genome array according to the second sequencing sequence set by adopting the genome array assembly method of any one of claims 1-5 to obtain an assembled array set ; S13, assembling the second sequencing sequence set into an array stack Screening single mapped sequencing sequences and overlapping arrays on the mapping according to the comparison result, and screening the single mapped sequencing sequences and the overlapping arrays on the mapping according to the set conditions to obtain unique mapped sequencing sequence information and a screened consistent sequence set; S14, constructing an undirected weighted overlap graph based on the sequencing sequence information of the unique map and the screened consistent sequence set, and dividing the undirected weighted overlap graph into a plurality of connected components; s15, determining the direction of the array corresponding to each vertex on each connected component by adopting a combined optimization model; S16, establishing a regression model, estimating arrangement coordinates of the stacked arrays on a genome to be detected, sequencing all the stacked arrays in the same connected component according to the direction of the stacked arrays corresponding to each vertex and the arrangement coordinates to obtain a quasi-architecture of the connected component, and obtaining an architecture set according to the quasi-architectures of a plurality of connected components; and S17, carding the array stacking arrangement coordinates of each architecture in the architecture set, and eliminating contradiction situations of overlapping coordinates or redundant array stacking to obtain an assembled genome architecture.
- 7. The method of genome architecture assembly of claim 6 wherein the screening out single map sequencing sequence stacks according to alignment results comprises: s1301, for the double-end sequencing sequence, selecting a mapping result of mapping two ends of the double-end sequence onto the same stacked array consistent sequence from the comparison result to estimate the library length of the double-end sequencing fragment library, wherein the mapping result comprises starting point coordinates of the sequence of each end aligned onto the stacked array consistent sequence, which are respectively recorded as S 1 ,t 1 and S 2 ,t 2 , and the library length observation value is recorded as: Taking the predetermined number of observations from each segment library, taking the median M of the observations as the observation of the library length, and calculating the standard deviation ; S1302, mapping coordinates of the double-end sequencing sequences respectively mapped to two different stacked array consistent sequences are still respectively recorded as S 1 ,t 1 and S 2 ,t 2 ; S1303, judging whether the directions of the double-end sequencing sequence and the overlapping array consistent sequence mapped by the double-end sequencing sequence are consistent according to the mapping coordinates, judging whether the directions of two different overlapping array consistent sequences are consistent according to the direction judging results of the double-end sequencing sequence and the overlapping array consistent sequence mapped by the double-end sequencing sequence, calculating according to the direction judging results of the two different overlapping array consistent sequences and the library length to obtain the distance between the two overlapping array starting points, and screening out the connection that the position of one end of the double-end sequencing sequence mapped to the overlapping array exceeds the preset insertion length range to obtain the connection information from the sequencing sequence to the two overlapping array consistent sequences.
- 8. The genome architecture assembly method of claim 7, wherein the step S1303 comprises: Defining that the stacking direction on the left end sequence mapping of the double-end sequencing sequence is positive, and the positions are in front, and assuming that the stacking starting points on the left end mapping and the right end mapping are respectively , Then: a) If it is And is also provided with The two overlapping sequences are respectively consistent with the double-end sequencing sequences in the direction, namely, the two overlapping sequences are opposite in direction, and the distance is as follows: Sieving out Or alternatively Is connected with the connecting part of the connecting part; b) If it is And is also provided with The two overlapping sequences are opposite to the double-ended sequencing sequence, i.e. the two overlapping sequences are identical in direction, and the distance is: Sieving out Or alternatively Is connected with the connecting part of the connecting part; c) If it is And is also provided with The two overlapping sequences are respectively consistent with the double-end sequencing sequences in the direction, namely, the two overlapping sequences are opposite in direction, and the distance is as follows: Sieving out Or alternatively Is connected with the connecting part of the connecting part; d) If it is And is also provided with The two overlapping sequences are opposite to the double-ended sequencing sequence, i.e. the two overlapping sequences are identical in direction, and the distance is: Sieving out Or alternatively Is connected to the connection of (a).
- 9. The method of genome architecture assembly of claim 6 wherein the screening out single mapped sequencing sequences and mapped stacks thereof based on alignment results comprises: s1311, for single-molecule long-reading long-sequencing sequences, the coordinates of the alignment intervals aligned to a plurality of stacked consistent sequences on the sequencing sequences are assumed to be sequentially ordered according to the starting point The coordinates on the corresponding stacked array consistent sequence are sequentially as follows Obtaining the direction relation and the starting point distance of any two overlapping array consistent sequences which are compared to the same sequencing sequence according to the coordinates, and taking the direction relation and the starting point distance as the connection information of every two overlapping array consistent sequences; Wherein, if the array is consistent with the direction of the sequencing sequence, then: if the direction of the stacking array is opposite to that of the sequencing sequence, then: The distance is as follows: ; S1312, screening out the mutually overlapped connection on the same sequencing sequence.
- 10. The genome architecture assembly method according to claim 7 or 9, characterized in that the step S14 comprises: S1401, defining a threshold value parameter m of the number of connections forming an edge between stacked arrays; S1402, forming an edge by taking a consistent sequence set of the assembled stacked array set as a vertex and enabling the number of connections between single mapping sequencing sequence stacked arrays to be larger than a threshold parameter m, and constructing an undirected weighted stacked array diagram; s1403, respectively marking the same-direction connection number and different-direction connection number between the vertex i and the vertex j in the undirected weighted stacked array diagram as And Corresponding distance information is respectively input into an array stacking distance list in the same direction and in different directions; And S1404, dividing the undirected weighted array overlay into a plurality of connected components.
- 11. The genome architecture assembly method of claim 10, wherein the step S15 comprises: s1501 to And To weight each edge on connected components and mark the direction of the edge if Then weight , Otherwise, weight is given , Traversing all vertexes to obtain the initial direction of each vertex of the connected component; And S1502, iteratively optimizing the connection total number of contradictory directions on the current connected component, respectively calculating and changing single-point variation of an objective function caused by each vertex in each iteration, if the single-point variation of a certain vertex is negative, selecting the vertex with the largest ratio of the single-point variation to the total connection number to change the direction, and continuing the next iteration after updating the direction until the total contradiction number cannot be smaller by changing any vertex, stopping the iteration, and obtaining the direction of the stacked array corresponding to each vertex.
- 12. The genome architecture assembly method of claim 11, wherein the step S16 comprises: s1601, establishing a regression model; s1602, determining the type of each edge according to the connection information of the consistent sequence of the pairwise overlapping arrays and the direction of the overlapping array corresponding to each vertex; S1603, according to the type of each side, taking out the median of the stacking distance list corresponding to each side as the observation value of the regression model; S1604 sets a residual error allowance threshold of the weighted truncated least squares estimation ; S1605, taking the product of the inverse of the variance estimation of the library length and the number of connections conforming to the current vertex direction contained in the current edge as the initial weight, namely Weighted least square estimation for calculating start point coordinates of overlapped array consistent sequence Wherein the coefficient matrix Except for the first row (1, 0,.,. 0), the remaining rows represent the connecting edges between the corresponding vertices i and j, so item i is-1 and item j is 1; S1606, calculating all sample residuals according to the previous round of estimation, when the maximum value of the residuals is less than the allowable residual threshold Stopping iteration to obtain a feasible estimation Otherwise, selecting residual error in the sample to be smaller than Quantile samples, wherein, Representing the current iteration number; Judging whether the coefficient matrix corresponding to the selected sample is listed as full rank or not, if not, dividing the current subgraph into a plurality of connected components according to the coefficient matrix, restarting execution from step S1501 on each connected component of the subgraph, and if so, again carrying out weighted least square estimation on the selected sample subset to obtain Step S1606 is repeated until a viable estimate is obtained ; S1607, sorting all the overlapped arrays in the current connected components according to the corresponding feasible estimation according to the direction of the overlapped array corresponding to each vertex, and arranging the sorted overlapped arrays into the same coordinate system to obtain a quasi-architecture, and obtaining an architecture set according to the quasi-architectures of a plurality of connected components.
- 13. The genome architecture assembly method of claim 12, wherein the step S17 comprises: S1701, adjusting all the overlapping array consistent sequences in the framework according to the overlapping array direction corresponding to each vertex, and exchanging the starting point coordinates and the end point coordinates of the opposite overlapping arrays, so that all the overlapping array starting point coordinates in the framework arrangement are in front of the end point coordinates, and sequencing the overlapping arrays from the small to the large according to the starting point coordinates; S1702, starting from the first stacked array after sequencing, sequentially and backwards comparing local sequences of the stacked arrays contained in or intersected with the last stacked array currently selected in coordinates, and removing the contained stacked array or merging the intersected stacked array into the current stacked array if the comparison is successful; selecting an array which is close to the current array end point coordinate and has connection information, adding the array into the current architecture, and reserving the array which is close to the current array end point coordinate and has no connection information, 1) If the coordinates of the two overlapping consistent sequences i and j satisfy: And (2) and The two overlapped arrays are closely adjacent; 2) If the coordinates of the two overlapping consistent sequences i and j satisfy: If the i contains j; 3) If the coordinates of the two overlapping consistent sequences i and j satisfy: Two stacked arrays are crossed; Wherein, the Respectively representing the start point and end point coordinates of the stacked array consistent sequence in a quasi-architecture; s1703, updating the current array stack to be the array stack newly added into the current architecture, repeating the step S1702 until the last array stack on the architecture is reached, ending the current architecture, and outputting the final assembled architecture file; S1704, repeating steps S1702-S1703 in the reserved stacked arrays, determining the arrangement of the next architecture and outputting the arrangement to a final assembly architecture file, and repeating the steps until all the stacked arrays are arranged into one final architecture to obtain an assembly genome architecture, position and direction information of the stacked arrays in the architecture, estimation and estimation variance of gap lengths between adjacent stacked arrays, and a new stacked array consistent sequence which is obtained by merging through coordinate intersection and successful comparison.
- 14. The method of genome architecture assembly of claim 12 further comprising replacing the stacked consistent sequence set with a quasi-architecture set of the quasi-architecture, repeating steps S14-S16 to obtain an architecture set.
- 15. A method of genome sequence assembly, the method comprising: S21, extracting a third sequencing sequence set of a certain depth for generating an assembled stack according to the genome stack assembly method of any one of claims 1-5, the assembled stack being used for generating an assembled genome architecture according to the genome architecture assembly method of any one of claims 6-14; S22, extracting a fourth sequencing sequence set with another depth to update and improve the assembled genome architecture, and carrying out quality assessment on the improved assembled genome, if the quality reaches the standard, storing the improved assembled genome as a candidate assembled genome; s23, repeating the steps S21-S22 until a preset number of candidate assembled genomes are obtained, performing multiple sequence comparison on all the candidate assembled genomes, and connecting bases with the greatest frequency in each row of comparison to obtain a final assembled genome.
- 16. The method of genome sequence assembly according to claim 15, wherein the step S22 comprises: S2201 fourth sequencing sequence Mapping to an assembled genome, removing sequencing sequences mapped to multiple positions simultaneously therein to form a single mapped sequencing sequence stack, for each column in the stack if it is more than Covering the single mapping sequence, and updating the base at the corresponding position of the assembled genome by using the base with the greatest frequency in the sequence to obtain an improved assembled genome; S2202 fourth sequencing sequence Mapping to the improved assembled genome, evaluating the quality according to the mapping result, and storing the quality as a candidate assembled genome if the quality reaches the standard.
- 17. A genome array assembly system, characterized in that the system performs genome array assembly using the genome array assembly method of any one of claims 1-5.
- 18. A genome architecture assembly system, characterized in that the system performs genome architecture assembly using the genome architecture assembly method of any one of claims 6-14.
- 19. A genome sequence assembly system, characterized in that the system performs genome sequence assembly using the genome sequence assembly method of claim 15 or 16.
Description
Genome array, genome architecture, genome sequence assembling method and system Technical Field The invention relates to the technical field of biological information, in particular to a genome array assembly method, a genome architecture assembly method, a genome sequence assembly method, a genome array assembly system, a genome architecture assembly system and a genome sequence assembly system. Background Genomic sequencing is an important technique for developing molecular biology research. Researchers can obtain a genome base sequence of a living body through genome sequencing, and the genome base sequence is used as a genetic sequence template of the living body, so that important references are provided for research on the aspects of transcription, regulation, modification and the like of genes, and the explanation of molecular mechanisms behind life phenomena is facilitated. By comparing the genomes of the tested species or individuals, researchers can find the difference of the detected species or individuals at the genome level, thereby providing guidance for revealing the evolution history of the species, improving and cultivating crops, diagnosing genetic diseases, optimizing drug treatment and the like. The relatively widely used sequencing techniques are currently second and third generation sequencing techniques. The second generation sequencing technology has high base recognition quality and low cost, and the length (i.e. the number of base pairs) of the detected sequence is usually between 100 and 300 bp. An important feature of the method is that a long fragment can be sequenced from both ends to obtain the base sequences at both ends of the long fragment, namely the double-end sequencing sequence. Third generation sequencing techniques are capable of measuring a substantial increase in sequence length (typically between 10-100 kbp), but are accompanied by an increase in sequencing cost and base recognition error rate. Assembling sequenced sequences into a genome is a fundamental problem in the field of computing biology. Because the length of the sequences that can be measured by a sequencer is much smaller than the length of the genome, the sequences need to be assembled after sequencing to infer their relative positions and thus restore the genome being measured. One key challenge faced by genome assembly is that there are many highly similar, or repeatedly occurring, segments of the genome whose presence adds great uncertainty to the relative positions of the putative sequencing sequences. Existing genome assembly methods are mainly divided into two categories in principle. One type is a De Bruijn map-based method which mainly operates by, for each sequencing sequence, cutting every other base a subsequence of a specific length (commonly referred to as k-mer, k representing the length of the subsequence), constructing a De Bruijn map using all the cut subsequences, performing certain error correction operations, finding paths on the map, each path being inferred to be a fragment on the genome being tested. Another type of method is based on the overlapping relation of sequencing sequences, the method is used for comparing every two sequences, and then the overlapping relation of the sequences is deduced according to the comparison result, so that an assembled overlapping array set is obtained, wherein each overlapping array corresponds to a fragment on a genome to be tested. After a number of stacks (or segments) are obtained, they need to be oriented and arranged to be assembled into a framework. The general method is to compare the sequence with all the array-overlapping sequences, and determine the sequence, direction relation and distance range between different arrays by using the mapping of the double-end sequence and library length information or the coordinate information of the single-molecule long-reading long sequence mapped to a plurality of array-overlapping sequences, thereby obtaining the assembled genome architecture. The De Bruijn graph-based method has a better effect on genome assembly with low repetition, but is not ideal for genomes with high repetition, and because the length of the cut subsequence is obviously shorter than that of the sequencing sequence, the specificity is reduced, so that errors occur when searching paths on the graph. The sequencing sequence stacking relation-based method takes the sequencing sequence as a unit instead of k-mer, so that more position information on the genome is reserved, and the repeated sequence reduction is facilitated. However, the existing method is mainly based on greedy strategy, and the overlapping relation between sequences is determined by sequentially selecting the locally optimal alignment, so that the method is easy to be interfered by false positive alignment, and the incorrect overlapping relation is obtained, thereby causing the deletion of the copy number on the assembled genome or incorrect splicing. However,