
What is GToTree?

Mike Lee edited this page Mar 13, 2019 · 58 revisions

GToTree is a more structured implementation of a workflow I would put together every time I wanted to make a large-scale phylogenomic tree. What do I mean by large-scale? Anything from a full-blown Tree of Life with all 3 domains, down to, for example, all available genomes of Staphylococcus. At its heart it just takes in genomes and outputs an alignment and phylogenomic tree based on the specified HMM profiles. But I think its value comes from: 1) its flexibility with regard to input format (taking fasta files, GenBank files, and/or NCBI accessions); 2) its automation of required between-tool tasks such as filtering hits by gene-length, filtering out genomes with too few hits to the target genes, and swapping genome labels for something more useful and/or appending identifiers of characteristics you care about; and 3) its scalability – GToTree can turn 1,700 input genomes into a tree in ~60 minutes on a standard laptop.

GToTree also comes packaged with 13 newly generated single-copy gene sets suitable for phylogenomic analysis of different major taxa.

The Bioinformatics publication is available here.


See the conda quickstart installation page to have things up and running in just a couple steps!


Overview

Presented below is an overview of the processing GToTree does. For practical ways GToTree can be helpful, check out the Example-usage page. And for detailed information on using GToTree, see the User guide.

Input files - any combination of fasta files, GenBank files, and/or NCBI assembly accessions

  • fasta files - will identify coding sequences (CDSs) with Prodigal
  • GenBank files - will extract CDSs if they are annotated in the GenBank file; if not, will identify them with Prodigal
  • NCBI assembly accessions - downloads the NCBI assembly summary files and builds ftp links to the appropriate assemblies. If an assembly is annotated, only the amino acid (AA) sequences of its CDSs are downloaded; if not, the assembly is downloaded in fasta format and CDSs are identified with Prodigal. Examples of generating the accessions file from both the NCBI website and the command line are shown on the examples page
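To make the link-building step concrete, here is a minimal sketch. It assumes the assembly's directory URL has already been read from the `ftp_path` column of an NCBI assembly summary file, and that the files inside follow NCBI's `<directory name>_protein.faa.gz` and `<directory name>_genomic.fna.gz` naming. The function name `build_download_urls` is hypothetical, and this is not necessarily how GToTree builds its links.

```python
def build_download_urls(ftp_path):
    """Return candidate download URLs for one assembly directory.

    ftp_path: value of the 'ftp_path' column of an NCBI assembly summary file.
    (Hypothetical helper for illustration; not GToTree's own code.)
    """
    base = ftp_path.rstrip("/")
    # Files inside the directory are prefixed with the directory's own name.
    name = base.rsplit("/", 1)[-1]
    return {
        # Preferred: amino acid sequences of annotated CDSs.
        "protein": f"{base}/{name}_protein.faa.gz",
        # Fallback: the assembly in fasta format, to be run through a gene caller.
        "genomic": f"{base}/{name}_genomic.fna.gz",
    }


urls = build_download_urls(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
)
print(urls["protein"])
print(urls["genomic"])
```

A caller would try the protein URL first and fall back to the genomic one only if the protein file does not exist.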

Identify target genes

  • GToTree then uses HMMER3 to search each genome for the target genes specified in the provided HMM file
    • 14 HMM files are provided with the software, listed on the SCG-sets page

Estimate genome completeness/redundancy

  • using the information from the HMM search, reports estimates of % completeness and redundancy for each genome, and outputs a table of hits per target-gene per genome
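As a rough illustration of how such estimates can be derived from the hit table, the sketch below assumes completeness is the share of target genes with at least one hit, and redundancy the share with more than one hit. These definitions and the function name are assumptions for the example; GToTree's exact formula may differ.

```python
def estimate_completeness_redundancy(hit_counts):
    """Estimate % completeness and % redundancy for one genome.

    hit_counts: dict mapping target-gene ID -> number of hits in the genome.
    Assumed definitions (not confirmed by the source):
      completeness = % of target genes with >= 1 hit
      redundancy   = % of target genes with  > 1 hit
    """
    total = len(hit_counts)
    if total == 0:
        raise ValueError("no target genes were searched")
    present = sum(1 for n in hit_counts.values() if n >= 1)
    multiple = sum(1 for n in hit_counts.values() if n > 1)
    return 100.0 * present / total, 100.0 * multiple / total


counts = {"geneA": 1, "geneB": 0, "geneC": 2, "geneD": 1}
completeness, redundancy = estimate_completeness_redundancy(counts)
print(f"{completeness:.1f}% complete, {redundancy:.1f}% redundant")
```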

Filter gene hits and genomes

  • filter out gene hits based on length - for each target gene, take the median length of all hits in that set and remove any hit whose length is not within a certain range of the median (20% by default)
  • filter out genomes if they do not have hits to at least a certain fraction of the total genes searched (50% by default)
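The two filters above can be sketched as follows. This is a simplified illustration, not GToTree's code: the function names are hypothetical, the 20% range is read as plus or minus 20% of the median, and both thresholds are treated as inclusive.

```python
from statistics import median


def filter_hits_by_length(hit_lengths, tolerance=0.2):
    """Keep hits whose length is within `tolerance` of the median length.

    hit_lengths: dict mapping genome ID -> length of its hit for ONE target gene.
    """
    med = median(hit_lengths.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    return {g: n for g, n in hit_lengths.items() if low <= n <= high}


def filter_genomes_by_hits(hits_per_genome, total_genes, min_fraction=0.5):
    """Keep genomes with hits to at least `min_fraction` of the target genes.

    hits_per_genome: dict mapping genome ID -> number of target genes hit.
    """
    return [g for g, n in hits_per_genome.items() if n / total_genes >= min_fraction]


lengths = {"g1": 100, "g2": 110, "g3": 95, "g4": 200}
print(filter_hits_by_length(lengths))            # g4 is far from the median
print(filter_genomes_by_hits({"g1": 10, "g2": 4}, total_genes=10))
```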

Add needed gap-sequences

  • adds the appropriate-sized gap-sequences for target genes that are missing from genomes being retained in the analysis
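A minimal sketch of the padding idea, under the assumption that the sequences for one target gene already share a common length (for example, after alignment). The function name is hypothetical, and the point in the workflow at which GToTree actually inserts the gaps is not assumed here.

```python
def pad_missing_genomes(gene_seqs, retained_genomes, gap_char="-"):
    """Give every retained genome an entry for one target gene.

    gene_seqs: dict mapping genome ID -> sequence, all of equal length.
    Genomes lacking the gene receive an all-gap sequence of that length.
    """
    lengths = {len(seq) for seq in gene_seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all have the same length")
    width = lengths.pop()
    return {g: gene_seqs.get(g, gap_char * width) for g in retained_genomes}


padded = pad_missing_genomes({"g1": "MKV-L", "g2": "MRVAL"}, ["g1", "g2", "g3"])
print(padded["g3"])
```

Padding keeps every genome at the same position in every gene set, which is what lets the per-gene alignments be joined end to end later.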

Align, trim, concatenate

  • align each gene set with MUSCLE
  • perform automated trimming with trimAl
  • concatenate all together into full alignment
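The concatenation step can be pictured with the short sketch below. It assumes each per-gene alignment is held as a dict from genome ID to aligned sequence and that every genome appears in every gene set; the function name is hypothetical and the alignment and trimming tools themselves are not called.

```python
def concatenate_alignments(gene_alignments, genomes):
    """Join per-gene alignments into one full alignment per genome.

    gene_alignments: list of dicts (one per target gene, in a fixed order),
                     each mapping genome ID -> aligned, trimmed sequence.
    genomes: genome IDs to include; each must be present in every dict.
    """
    return {g: "".join(aln[g] for aln in gene_alignments) for g in genomes}


gene1 = {"g1": "MKV-L", "g2": "MRVAL"}
gene2 = {"g1": "AT-", "g2": "ATG"}
full = concatenate_alignments([gene1, gene2], ["g1", "g2"])
print(full)
```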

Optionally add more informative headers - making things easily searchable in the resulting tree and alignment

  • this can be done in two ways (one or the other, or both together)
    • use TaxonKit for those genomes that have taxids associated with them (whether from NCBI accessions or found in the provided GenBank files) to add lineage information to the genome labels
    • a two- or three-column tab-delimited mapping file can be provided with either the NCBI accession or input file name in column 1 (depending on input source), and the desired genome label in column 2, and/or text to append to the label in column 3 (not all input genomes need to be provided)
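To illustrate the mapping-file option, the sketch below parses a tab-delimited file with the three columns described above and applies it to genome IDs. The function names and the underscore used to join the appended text are assumptions for the example, not GToTree's documented behavior.

```python
import csv
import io


def load_label_map(handle):
    """Read a 2- or 3-column tab-delimited mapping file.

    Column 1: NCBI accession or input file name
    Column 2: desired genome label (may be empty)
    Column 3: text to append to the label (optional)
    """
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        new_label = row[1].strip() if len(row) > 1 else ""
        suffix = row[2].strip() if len(row) > 2 else ""
        labels[row[0].strip()] = (new_label, suffix)
    return labels


def relabel(genome_id, labels):
    """Return the label for a genome; unmapped genomes keep their ID."""
    new_label, suffix = labels.get(genome_id, ("", ""))
    label = new_label or genome_id
    # Joining with "_" is an assumption made for this sketch.
    return f"{label}_{suffix}" if suffix else label


mapping = io.StringIO("GCF_000005845.2\tE_coli_K12\tmodel\nmy_genome.fa\t\tisolate\n")
labels = load_label_map(mapping)
for gid in ["GCF_000005845.2", "my_genome.fa", "unlisted.fa"]:
    print(gid, "->", relabel(gid, labels))
```

Genomes missing from the mapping file simply keep their original ID, matching the note that not all input genomes need to be provided.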

Tree

Outputs

Primary outputs include:

  • the tree file and alignment file
  • a genome summary table mapping all modified labels to original genome IDs, estimates of completion/redundancy, and any available taxonomy information
  • a table showing number of hits per target-gene per genome
  • reports on what, if anything, was filtered out at which steps