marbl/TTT


TTT stands for Trivial Tangle Traverser. This tool generates "not terrible" traversals through repetitive genomic tangles that approximately match the coverage and the read alignments.

For help run ./TTT.py --help

Requires Python ≥ 3.7 and the pulp, ahocorasick, and networkx Python libraries. The dataclasses, statistics, and logging modules are also used; they ship with the Python standard library.

Slides explaining algorithmic details

UNDER CONSTRUCTION!

Usage

./TTT.py --gfa <gfa_file> --alignment <alignment_file> --output <output_directory> [options]

Will TTT help with this gap in my scaffold?

Generally there are three main reasons for gaps in a scaffold:

  • Lack of coverage

    TTT searches for the "best" path in the assembly graph that traverses the gap. If there is no path because of a coverage gap, nothing can be done.

    gap

    Scaffold <utig4-1497[N100000N:scaffold]<utig4-340: nothing can be done.

  • Long homozygous nodes

    Such gaps happen because the reads are shorter than the homozygous nodes. The typical structure is a sequence of "bubbles" of similar length, interlaced with long homozygous nodes. TTT can be run on such tangles, but if these structures are left unresolved in the assembly graph (especially if the homozygous nodes are longer than ~100 kbp, homopolymer-compressed), there is usually no information in the read alignments to help traverse the region, so the result will essentially be a random guess.

    diploid_simple_tangle

    Scaffolds <utig4-1225<utig4-1224[N5000N:ambig_bubble]>utig4-1511<utig4-1513 and <utig4-1226<utig4-1224[N5000N:ambig_bubble]>utig4-1511<utig4-1512. Because of the long homozygous nodes utig4-1224 and utig4-1511, there are no long reads connecting utig4-1228/utig4-1227 with utig4-1225/utig4-1226 or utig4-1512/utig4-1513.

  • Complex repeats

    TTT was designed for such cases. However, there are still limitations: there can be no more than 2 haplotypes in the tangle (so rDNA tangles connecting multiple chromosomes are usually unresolvable), and for two-haplotype tangles you should provide pairs of in- and out-nodes for each of the haplotypes.

    haploid tangle

    Gap caused by repeat array

    diploid tangle

    Gap caused by a large duplication of a homozygous region, present in only one of the haplotypes
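To judge whether a tangle falls into the "long homozygous nodes" category, it can help to list the longest nodes in the graph. The sketch below is not part of TTT; it only assumes standard GFA segment lines (`S`, name, sequence, optional `LN:i:` tag), and the function name and the 100 kbp default are illustrative. It reports length only, so deciding which of the long nodes are actually homozygous is still up to you.

```python
def long_nodes(gfa_lines, min_len=100_000):
    """Return {node_name: length} for GFA segments of at least min_len bases.

    Hypothetical helper, not part of TTT. Lengths come from the LN:i: tag
    when it is present, otherwise from the sequence field. Segments with
    neither a sequence nor an LN tag are skipped.
    """
    result = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # not a segment line
        name, seq = fields[1], fields[2]
        length = None if seq == "*" else len(seq)
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                length = int(tag[len("LN:i:"):])  # LN tag takes precedence
                break
        if length is not None and length >= min_len:
            result[name] = length
    return result
```

For a real graph you could call it as `long_nodes(open("assembly.gfa"))`; if the graph is homopolymer-compressed, the reported lengths are compressed lengths as well.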

Required Arguments:

  • --gfa: Path to the GFA file with the graph structure
  • --alignment: Path to a file with GraphAligner alignment

OR

  • --verkko-output: the HiFi graph, coverage (ONT), and ONT alignments from Verkko will be used.

  • The tangle should be specified with either one internal node (--tangle-node utig4-267) or a file with the complete list of internal tangle nodes, one per line (--tangle-file nodes.list)

  • --output: Output directory for all result files (will be created if it doesn't exist)

Make sure that all inputs (GFA, alignment, coverage, tangle nodes, and boundary nodes) refer to the same graph. The HiFi graph (or --verkko-output) will not work with tangle nodes named with respect to the final ONT-resolved graph (utig4- nodes in the Verkko case).
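A quick way to catch a graph mismatch before running the tool is to check that every tangle or boundary node name actually appears as a segment in the GFA you pass. The following is a minimal sketch, not part of TTT; `missing_nodes` is a hypothetical helper that only assumes standard GFA `S` lines.

```python
def missing_nodes(gfa_lines, node_names):
    """Return the sorted node names that have no segment (S) line in the GFA.

    Hypothetical helper, not part of TTT. An empty result means every
    requested node exists in this graph.
    """
    segments = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "S":
            segments.add(fields[1])
    return sorted(set(node_names) - segments)
```

For example, reading names from a tangle file with `[l.strip() for l in open("nodes.list") if l.strip()]` and getting back a non-empty list suggests the node names come from a different graph than the GFA.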

Tangle traversal does no scaffolding, so when running on tangles with two genomic paths you should provide incoming and outgoing boundary node pairs with --boundary-nodes boundary_file.tsv

The file should be tab-separated, with one haplotype per line, in the format:

incoming_hap1_node outgoing_hap1_node

incoming_hap2_node outgoing_hap2_node
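Since a stray space instead of a tab is an easy mistake in a hand-written file, the boundary file can also be generated. This is a minimal sketch under the format described above; `format_boundary_nodes` is a hypothetical helper, not part of TTT, and the node names in the usage note are placeholders.

```python
def format_boundary_nodes(pairs):
    """Render (incoming, outgoing) node pairs as tab-separated lines.

    Hypothetical helper, not part of TTT. One pair per haplotype; at most
    two pairs are accepted because more than two traversing paths are
    not supported.
    """
    if not 1 <= len(pairs) <= 2:
        raise ValueError("expected one or two (incoming, outgoing) pairs")
    lines = []
    for incoming, outgoing in pairs:
        for name in (incoming, outgoing):
            if not name or any(ch.isspace() for ch in name):
                raise ValueError(f"invalid node name: {name!r}")
        lines.append(f"{incoming}\t{outgoing}")
    return "\n".join(lines) + "\n"
```

Usage, with placeholder node names: `open("boundary_file.tsv", "w").write(format_boundary_nodes([("in_hap1", "out_hap1"), ("in_hap2", "out_hap2")]))`.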

Tangle_traverser does not support tangles with more than 2 traversing paths (e.g. most rDNA tangles).

Example:

./tangle_traverser.py --gfa assembly.gfa --alignment reads.gaf --output results_dir --tangle-node utig4-267 --quality-threshold 20

Verkko's final graph coverage fix

In Verkko up to and including v2.2.1, the coverage of short nodes in tangles in the final graph (assembly.homopolymer-compressed.gfa) is deeply flawed. To get an updated coverage file, we suggest running two additional scripts:

./verkko_coverage_fix/utig4_to_utig1.py <assembly_folder> > utig42utig1.gaf

./verkko_coverage_fix/utig4_coverage_updater.py utig42utig1.gaf <assembly_folder>/assembly.homopolymer-compressed.noseq.gfa <assembly_folder>/2-processGraph/unitig-unrolled-hifi-resolved.ont-coverage.csv > utig4_upt.ont-coverage.csv

and then pass utig4_upt.ont-coverage.csv as --coverage-file to the main script.
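If you run the fix for several assemblies, the two steps can be wrapped in a small driver. The sketch below only reproduces the two commands above with their output redirections; the function names and the `ttt_root` parameter (the directory containing verkko_coverage_fix) are assumptions for illustration, and it assumes the scripts are executable.

```python
import subprocess
from pathlib import Path


def coverage_fix_steps(assembly_folder, ttt_root="."):
    """Return [(argv, stdout_file), ...] for the two coverage-fix commands.

    Hypothetical wrapper, not part of TTT. Script names, arguments, and
    output file names are the ones given in the commands above.
    """
    asm = Path(assembly_folder)
    scripts = Path(ttt_root) / "verkko_coverage_fix"
    mapping = "utig42utig1.gaf"
    step1 = ([str(scripts / "utig4_to_utig1.py"), str(asm)], mapping)
    step2 = (
        [
            str(scripts / "utig4_coverage_updater.py"),
            mapping,
            str(asm / "assembly.homopolymer-compressed.noseq.gfa"),
            str(asm / "2-processGraph" / "unitig-unrolled-hifi-resolved.ont-coverage.csv"),
        ],
        "utig4_upt.ont-coverage.csv",
    )
    return [step1, step2]


def run_steps(steps):
    """Run each command in order, redirecting its stdout to the paired file."""
    for argv, out_name in steps:
        with open(out_name, "w") as out:
            subprocess.run(argv, stdout=out, check=True)
```

Calling `run_steps(coverage_fix_steps("<assembly_folder>"))` from the TTT directory should leave utig4_upt.ont-coverage.csv in the current directory; `check=True` stops at the first failing step instead of writing a coverage file from a bad mapping.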

Alternatively, you can look up how utig4- nodes map to the utig1- graph in utig42utig1.gaf and run tangle_traverser.py on the same tangle in the HiFi-only graph.
