Flye and metaFlye: algorithmsa for long-read de novo assembly using repeat graphs
Mikhail Kolmogorov from the University of California, San Diego, and one of the team behind popular assembly tools Flye and metaFlye, began his talk by reiterating assemblies, particularly of relatively large genomes, are typically fragmented and incomplete due to the presence of repeat stretches that make precise determination of overlap difficult.
Mikhail presented a single, compelling example of the significance of algorithms when it comes to assembly performance. Recalling 2018, Mikhail displayed a digital karyotype representative of the first human genome assembly from nanopore data using ultra-long sequencing reads, where each coloured band represented a different section of contiguous sequence. Mikhail and team developed their assembly algorithm Flye some time later, and using the same dataset across different assembler versions, managed to achieve an impressive 3-fold improvement in contiguity – without making any changes to the original data.
Mikhail moved on to introduce assembly graphs, the data structures commonly used in assembly. Different algorithms build different graphs with different properties, but one of the most popular is a de Bruijn graph. In a de Bruijn graph built from short sequencing reads, for each short k-mer, there is a node on the graph, but if k-mers are sampled from a repetitive region they will correspond to the same node, collapsing repeats onto the same path. Moving from short to long sequencing reads circumvents some of this problem, but de Bruijn graphs rely on exact overlaps, meaning errors in long sequencing reads can also cause assembly failures and tangled graphs.
In order to overcome this in Flye, Mikhail explained, the team makes use of repeat graphs instead of de Bruijn graphs. Repeat graphs can be viewed as a generalisation of de Bruijn graphs, using approximate matches instead of exact matches to tolerate errors, and aiming to reveal genomic repeat structure identified by long reads.
The genome traverses this graph somehow, and reads represent sections of paths, so aligning reads rapidly to the graph produces a “walk” along it, termed by the team a “disjointig”. An individual disjointig may not necessarily correspond accurately to the genome, Mikhail explained, but a complete set of disjointigs can be used to assemble a repeat graph, as any misassemblies in the disjointigs will be collapsed in the final graph. The very last step in the process is simplification; looping repeat graphs can make the directionality of repetitive sections unclear, so long reads spanning junctions between units are mapped to the graph to establish the direction in which the genome traverses the repeats. Mikhail showed the simplification process in Klebsiella pneumoniae, with a knotty graph transforming into a straightforward one with just one or two loops for which there were no reads long enough to span the full repeat section.
But, Mikhail questioned, how can this be applied to human genomes? The goal in this setting would be to complete assemblies of both parental haplotypes. Using 120X nanopore data, of which 60X was ultra-long, the team binned reads from HG002 into haplotypes using TrioCanu, before assembling with Flye and polishing with Medaka. This produced very strong results, yielding an NG50 of 38 Mb for the paternal haplotype and 45 Mb for the maternal.
Mikhail noted, though, that quality is still very important alongside contiguity. By updating the basecalling software used through each iteration, Mikhail was able to improve the QV scores from an initial QV28/29 for paternal/maternal haplotypes, up to QV38/39. With the R10.3 pore, this improved even further to QV40/42 when estimated using Merqury. These results, Mikhail described, are very exciting, and still don’t represent the most recent in basecalling improvements, so further gains are easily expected. In addition, near perfect phasing accuracy can also be achieved.
Mikhail moved briefly over to the problem of metagenomic assembly. metaFlye produces very different and highly tangled assembly graphs. Nevertheless, when zooming into the paths present in these graphs, an interesting “bubbling” phenomenon can be observed, where the path splits into two before rejoining itself. Mikhail indicated that what they propose to be happening here is the graphing of significant heterogeneity within a mixed sample, and perhaps instead of incorrect assembly, true variation is being captured.
When benchmarking on simulated data, metaFlye showed better performance on a wide range of bacteria, particularly those that were more complex, and when applied to real data a single run generated lots of complete and near-complete bacteria with a high degree of novelty.
Overall, Flye and metaFlye make use of repeat graphs to generate high quality and highly contiguous assemblies, from single bacteria, to human haplotypes, to mixed metagenomic communities.
Видео Flye and metaFlye: algorithmsa for long-read de novo assembly using repeat graphs автора Магия кулинарии
Видео Flye and metaFlye: algorithmsa for long-read de novo assembly using repeat graphs автора Магия кулинарии
Информация
1 мая 2024 г. 15:41:31
00:22:16
Похожие видео