Running AleRax
AleRax takes as input, for each gene family, a distribution of gene trees (in practice, a list of trees), ideally computed with a Bayesian inference tool (MrBayes, PhyloBayes, RevBayes, etc.) from the corresponding gene multiple sequence alignment. If you provide a species tree, it must be rooted.
If you are a new user, I strongly recommend reading the tutorial with several standard examples.
To parallelize computations, please call AleRax with `mpiexec` (`mpiexec -np NUMBER_OF_CORES build/bin/alerax [arguments]`).
To understand how to map genes to species, please read this page.
General commands:
Command | Comment |
---|---|
`-h`, `--help` | Print the help message. |
`-f`, `--families <value>` | Family file with the paths to the gene tree distributions and mapping files. The format is described below. |
`-s`, `--species-tree {filepath, random, MiniNJ}` | Starting species tree. If the species tree is given as a file path, the corresponding file should contain one rooted bifurcating tree in Newick format. `random` and `MiniNJ` generate a starting tree for species tree inference: `random` generates a random rooted species tree, and `MiniNJ` infers a starting species tree from the gene trees with our distance method MiniNJ (see the SpeciesRax paper). |
`-p`, `--prefix <value>` | Output directory. This directory will be created. |
`-g`, `--gene-tree-samples <value>` | Number of output reconciled gene tree samples per family. Default is `100`. |
`--seed <value>` | Random seed. Default is `123`. |
`--memory-savings` | Set this parameter if you run out of memory. AleRax will save intermediate results to disk instead of keeping them in RAM (this costs ~20% runtime overhead). |
Reconciliation model:
Command | Comment |
---|---|
`-r`, `--rec-model {UndatedDL, UndatedDTL}` | The probabilistic model used to compute the reconciliation likelihood. `UndatedDL` accounts for duplications and losses. `UndatedDTL` also accounts for horizontal gene transfers. Default is `UndatedDTL`. |
`--model-parametrization <filepath> or <mode>` | Defines which model parameters are shared among species or families (per-species rates, per-family rates, etc.). The available modes are described below. |
`--transfer-constraint {NONE, PARENTS, RELDATED}` | Defines horizontal gene transfer constraints for the receiving species. `NONE` allows transfers to any species. `PARENTS` (default value) forbids transfers to parent species. `RELDATED` forbids transfers to the past and requires the species tree to be relatively dated (TODO: document). |
`--prune-species-tree` | Activates pruned species tree mode. Recommended for species tree inference in the presence of missing data that is not due to gene losses. |
`--no-tl` | Disables TL events (only relevant for the `UndatedDTL` model). Makes the computations ~4x faster, but TL (transfer-loss) events won't be considered. |
`--speciation-probability-categories <value>` | Sets the number of gamma categories for the speciation probabilities: each category will have a different multiplier for the speciation probabilities. Default is `1`. The runtime is linear in the number of categories. This is an experimental parameter; there is no evidence that it improves the inference accuracy. |
`--gene-tree-rooting {UNIFORM, ROOTED}` | If set to `UNIFORM` (default value), gene trees are considered as unrooted, even if the input gene trees are rooted. If set to `ROOTED`, the input gene trees must be rooted and AleRax will not consider the alternative root positions. |
`--fraction-missing-file <filepath>` | File with the per-species fraction of missing genes (TODO: document). |
`--origination {UNIFORM, ROOT, LCA, OPTIMIZE}` | Indicates how to set the origination probabilities. `UNIFORM` (default value) sets the same origination probability for each species. `ROOT` only allows origination from the root of the species tree. `LCA` only allows origination from the lowest common ancestor of the species that are covered by the gene family. `OPTIMIZE` allows origination probabilities to be estimated, but only when used in combination with `--model-parametrization`. See the description of origination probabilities below. |
`--rec-opt ALGO` | Sets the algorithm used to optimize model parameters. By default, AleRax uses gradient descent. AleRax provides `GRADIENT`, `SIMPLEX`, `GSL_SIMPLEX`, and `LBFGSB`. `GSL_SIMPLEX` is only available if you have GSL installed. |
`--per-family-rates` | DEPRECATED: see `--model-parametrization`. |
`--per-species-rates` | DEPRECATED: see `--model-parametrization`. |
Search strategy:
Command | Comment |
---|---|
`--species-tree-search {HYBRID, REROOT, EVAL, SKIP}` | Sets the species tree search strategy. `HYBRID` runs the standard search strategy (a mix of transfer-guided and local SPR moves). `REROOT` only re-estimates the root of the starting species tree. `SKIP` (default) skips the species tree search step. |
`--infer-speciation-order` | Estimates the order of speciation events (relative dating) on the species tree. Only makes sense with the `UndatedDTL` model and the `RELDATED` transfer constraint. This step is run right after the species tree inference. |
Trimming:
Command | Comment |
---|---|
`--min-covered-species <value>` | Filter out the families that cover strictly fewer than `value` species. Default is `4`. |
`--max-clade-split-ratio <value>` | Filter out families such that `ccp_size / gene_tree_size < value`. The rationale is that families with too many clades in their CCPs are not that informative and slow down computations. Default is `-1.0` (no filtering). |
`--trim-ratio <proportion>` | Filter out a proportion of the families with the highest CCP sizes. Default is `0.0`, max is `1.0`. This mostly exists for experimental purposes. |
Transfer highways:
Command | Comment |
---|---|
`--highways` | Enables transfer highway inference. |
`--highway-candidate-file <filepath>` | A file with the list of highway candidates to test (see below). |
`--highway-candidate-step1 <NUMBER>` | The number of highways to test in the first step of the heuristic (see below). |
`--highway-candidate-step2 <NUMBER>` | The number of highways to test in the second step of the heuristic (see below). |
- The family file allows you to specify per-family input files.
- In a family file, everything after a `#` will be ignored.
- The file should start with the tag `[FAMILIES]`.
- A family block starts with `-` and the family name. A family block contains:
  - the path to the gene tree distribution (a file with one Newick string per line)
  - the path to the gene-to-species mapping file (see this page)
Please note that AleRax only supports taxon names that raxml-ng supports: in particular, taxon labels with spaces, tabs, newlines, commas, colons, semicolons and parentheses are invalid.
Example:
[FAMILIES] # this is a comment
- family_1
gene_tree = mrbayes_trees_1.newick
mapping = mapping_file_1.link
- family_2
gene_tree = mrbayes_trees_2.newick
mapping = mapping_file_2.link
- family_3
gene_tree = mrbayes_trees_3.newick
mapping = mapping_file_3.link
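When there are many families, writing this file by hand gets tedious. The following is a minimal Python sketch that produces the same layout as the example above. The function name `build_family_file` and the output file name are hypothetical and not part of AleRax.

```python
def build_family_file(families):
    """Return the text of a family file.

    `families` maps a family name to a (gene_tree_path, mapping_path) pair.
    This helper is illustrative only; it is not shipped with AleRax.
    """
    lines = ["[FAMILIES]"]
    for name, (gene_tree, mapping) in sorted(families.items()):
        lines.append(f"- {name}")
        lines.append(f"gene_tree = {gene_tree}")
        lines.append(f"mapping = {mapping}")
    return "\n".join(lines) + "\n"


# Same three families as in the example above.
families = {
    f"family_{i}": (f"mrbayes_trees_{i}.newick", f"mapping_file_{i}.link")
    for i in range(1, 4)
}
family_file_text = build_family_file(families)
print(family_file_text)
```

The resulting text can be written to any file (for instance `families.txt`) and passed to AleRax with `-f`.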
By default, AleRax assumes that all species and all families share the same model parameters (e.g. DTL probabilities). `--model-parametrization` lets you customize this behavior. Note that the modes that allow different rates for different species are not compatible with species tree inference.
- `--model-parametrization GLOBAL`: all species and all families share the same set of model parameters (the default behavior described above).
- `--model-parametrization PER-SPECIES`: each species branch has its own set of model parameters, common to all families.
- `--model-parametrization PER-FAMILY`: each family has its own set of model parameters, common to all species branches.
- `--model-parametrization ORIGINATION-PER-SPECIES`: each species has its own set of origination probabilities, common to all families. The other model parameters (e.g. DTL probabilities) are common to all species and all families.
- `--model-parametrization <parametrization_file>`: specify the sets of species that share the same model parameters in the input file. Each set of model parameters is common to all families.
The parametrization file has the following syntax:
SPECIESNODE1 <list of parameter names>
SPECIESNODE2 <list of parameter names>
SPECIESNODE3 <list of parameter names>
etc.
The parameter names are `D`, `L`, `T`, and `O` for duplication, loss, transfer, and origination probabilities.
For instance, if the species tree is:
(((bacteria1,bacteria2),bacteria3)bacteria,((eukaryota1,eukaryota2),eukaryota3)eukaryota)root;
and the content of the parametrization file is:
bacteria T
eukaryota DL
Then:
- All nodes under eukaryota will share the same DL probabilities
- All nodes under bacteria will share the same T probabilities
- The root and all nodes under eukaryota will share the same T probabilities
- The root and all nodes under bacteria will share the same DL probabilities
- All nodes will share the same origination probabilities (uniform origination distribution)
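The sharing rule in this example can be sketched in Python. This only illustrates the behavior described above and is not AleRax's implementation. It assumes that a node takes each parameter from the closest listed node on its path to the root (itself included), and falls back to the root's group otherwise. The labels `b12` and `e12` are made up for the two unlabeled internal nodes.

```python
# Hypothetical child -> parent map for the example species tree.
# "b12" and "e12" are invented labels for the unlabeled internal nodes.
PARENT = {
    "bacteria": "root", "eukaryota": "root",
    "b12": "bacteria", "bacteria3": "bacteria",
    "bacteria1": "b12", "bacteria2": "b12",
    "e12": "eukaryota", "eukaryota3": "eukaryota",
    "eukaryota1": "e12", "eukaryota2": "e12",
}


def parse_parametrization(text):
    """Map each listed node label to the set of parameter letters on its line."""
    owners = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            owners[fields[0]] = set("".join(fields[1:]))
    return owners


def parameter_group(node, param, owners, parent=PARENT):
    """Return the label of the group whose value of `param` applies to `node`.

    Assumption: walk up from `node` to the closest listed node that names
    `param`; if there is none, the node belongs to the root's group.
    """
    current = node
    while current is not None:
        if param in owners.get(current, set()):
            return current
        current = parent.get(current)
    return "root"


owners = parse_parametrization("bacteria T\neukaryota DL\n")
for node in ("bacteria1", "eukaryota1", "root"):
    groups = {p: parameter_group(node, p, owners) for p in "DLTO"}
    print(node, groups)
```

Under these assumptions, `eukaryota1` resolves `D` and `L` to the `eukaryota` group and `T` and `O` to the root group, which matches the list above.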
You can use AleRax to test different species trees (or different species tree root positions) using statistical tests implemented in consel. Run AleRax on different species trees, for instance using:
alerax -f families.txt -s tree1.newick --species-tree-search SKIP -p runtree1
alerax -f families.txt -s tree2.newick --species-tree-search SKIP -p runtree2
alerax -f families.txt -s tree3.newick --species-tree-search SKIP -p runtree3
AleRax will produce three output directories, which can be analyzed with the Python script provided in the GitHub repository as follows:
python scripts/generate_consel_file.py consel.txt runtree1 runtree2 runtree3
The script will generate a file (`consel.txt`) with the per-family likelihoods of each tree and indicate how to analyze it with consel (just follow the instructions from the logs). Consel must be installed.
Here is an example of the output:
# rank item obs au np | bp pp kh sh wkh wsh |
# 1 3 -47.9 0.913 0.871 | 0.866 1.000 0.873 0.954 0.873 0.954 |
# 2 2 47.9 0.127 0.120 | 0.124 2e-21 0.127 0.209 0.127 0.210 |
# 3 1 90.4 0.015 0.009 | 0.010 5e-40 0.013 0.024 0.013 0.024 |
The item column indicates the position of each run in the list of directories passed to the script. In this example, item 3 corresponds to runtree3, which is ranked first (rank 1, item 3). Please check the consel documentation for a description of the statistical tests and how to interpret them.
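To map the table back to run directories, a small parser can help. The sketch below reads the example output shown above and lists the runs whose AU p-value is at least a chosen threshold. The function names are hypothetical, and the 0.05 threshold is only a conventional choice assumed here, not a recommendation from AleRax or consel.

```python
CONSEL_OUTPUT = """\
# rank item    obs     au     np |     bp     pp     kh     sh    wkh    wsh |
#    1    3  -47.9  0.913  0.871 |  0.866  1.000  0.873  0.954  0.873  0.954 |
#    2    2   47.9  0.127  0.120 |  0.124  2e-21  0.127  0.209  0.127  0.210 |
#    3    1   90.4  0.015  0.009 |  0.010  5e-40  0.013  0.024  0.013  0.024 |
"""


def parse_consel_table(text):
    """Extract rank, item, obs and au from each data row of the table."""
    rows = []
    for line in text.splitlines():
        fields = line.replace("#", " ").replace("|", " ").split()
        if not fields or fields[0] == "rank":
            continue  # skip blank lines and the header row
        rows.append({
            "rank": int(fields[0]),
            "item": int(fields[1]),
            "obs": float(fields[2]),
            "au": float(fields[3]),
        })
    return rows


def runs_not_rejected(rows, run_dirs, alpha=0.05):
    """Return the run directories whose AU p-value is >= alpha.

    Item i is taken to be the i-th directory passed to the script (1-based).
    """
    return [run_dirs[row["item"] - 1] for row in rows if row["au"] >= alpha]


rows = parse_consel_table(CONSEL_OUTPUT)
kept = runs_not_rejected(rows, ["runtree1", "runtree2", "runtree3"])
print(kept)
```

With the example table and this threshold, `runtree3` and `runtree2` are kept, while `runtree1` (AU = 0.015) falls below it.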
If you stop AleRax (or if your cluster does), you can restart it by re-running the exact same command. AleRax will restart from the last checkpoint. We currently save checkpoints:
- each time a better species tree is found
- several times during the parameter optimization step
- after each main step of the pipeline (species tree optimization, model parameter optimization, speciation order estimation, highway inference, etc.)
Checkpoints were introduced very recently. If you encounter any issue with them, please send us a report of the problem!
By default, origination probabilities (the probability that a gene originated at a given species branch) are uniform among the species tree branches. Estimating those probabilities requires two arguments:
- `--origination OPTIMIZE`: adds the origination probabilities to the set of parameters to estimate (but by default, model parameters are global to all species, so this is not enough to actually estimate them).
- `--model-parametrization` with either `PER-SPECIES`, `ORIGINATION-PER-SPECIES`, or a parametrization file, in which you can use the parameter `O`.
The origination probability is always the last probability (the last column) in the output model parameter file.
A transfer highway is a pair of species with a high transfer probability. By default, AleRax assumes that all pairs of species have the same transfer probability. Highways relax this constraint by allowing a few pairs (the highway candidates) to have their transfer probability estimated with maximum likelihood.
To infer transfer highways, AleRax implements a heuristic that runs the following steps:
- Step 1: Selection of the initial candidates. By default, AleRax reconciles the gene trees with the species tree (without highways) and assigns a score to each pair of species, based on their ancestral genome size and on the number of genes transferred from the first to the second. Then, it only keeps the `n` pairs with the highest score for the next step. `n` can be set with `--highway-candidate-step1 <number>`. Alternatively, a highway candidate file can be provided with a list of candidates to test (see below).
- Step 2: Pre-filtering of the candidates. For each highway candidate, AleRax adds the highway to the model with a low highway probability and computes the new likelihood. If the likelihood increases, then the highway is kept for the next step. The candidates that pass the test are sorted by likelihood, and AleRax only keeps the `m` best candidates for the next step. `m` can be set with `--highway-candidate-step2 <number>`.
- Step 3: All remaining highway candidates are simultaneously added to the model, and AleRax estimates their highway probabilities using gradient descent (or any other optimization algorithm used for model parameter optimization).
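The filtering logic of steps 1 and 2 can be sketched as follows. This is a toy illustration under stated assumptions, not AleRax's code. The scores and likelihoods are invented numbers, `likelihood_with_highway` stands in for the real reconciliation likelihood computation, and step 3 (the joint optimization) is not shown.

```python
def select_highway_candidates(scores, likelihood_with_highway, baseline_ll, n, m):
    """Return the candidates that survive steps 1 and 2 of the heuristic.

    scores: dict mapping a (from, to) pair to its step-1 score.
    likelihood_with_highway: callable returning the log-likelihood obtained
        when a single highway is added to the model (hypothetical stand-in).
    baseline_ll: log-likelihood of the model without any highway.
    """
    # Step 1: keep the n pairs with the highest score.
    step1 = sorted(scores, key=scores.get, reverse=True)[:n]

    # Step 2: keep the candidates that increase the likelihood,
    # sort them by likelihood and retain the m best ones.
    improved = []
    for pair in step1:
        ll = likelihood_with_highway(pair)
        if ll > baseline_ll:
            improved.append((ll, pair))
    improved.sort(key=lambda entry: entry[0], reverse=True)
    return [pair for _, pair in improved[:m]]


# Invented toy values, for illustration only.
toy_scores = {("A", "B"): 5.0, ("B", "C"): 3.0, ("C", "A"): 1.0, ("A", "C"): 0.5}
toy_ll = {("A", "B"): -98.0, ("B", "C"): -101.0, ("C", "A"): -95.0, ("A", "C"): -90.0}

kept_candidates = select_highway_candidates(
    toy_scores, toy_ll.get, baseline_ll=-100.0, n=3, m=2
)
print(kept_candidates)
```

In this toy run, the pair `("A", "C")` is dropped in step 1 despite its high likelihood, because its score is the lowest. This is the trade-off that `--highway-candidate-step1` controls.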
In step 1, the highway candidate file has the following syntax, where each entry is a node label in the input species tree:
from_1, to_1
from_2, to_2
from_3, to_3
etc.
Users can also replace a label with the wildcard `*`, for instance:
Bacteria, *
*, Eukaryota
will test the highways from the branch `Bacteria` to all other species, and from all species to the branch `Eukaryota`. Time-inconsistent transfers are skipped.
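To see which pairs a wildcard file stands for, the expansion can be sketched in Python. This is an illustration only: the function name is hypothetical, `Archaea` is an invented third label, and the sketch does not check time consistency, which AleRax handles itself. It also assumes that a branch is never paired with itself.

```python
def expand_highway_candidates(text, labels):
    """Expand `from, to` lines (with optional `*` wildcards) into explicit pairs.

    labels: all node labels of the species tree (assumed known to the caller).
    Assumptions: self-pairs are skipped and duplicates are reported once.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        src, dst = (field.strip() for field in line.split(","))
        sources = labels if src == "*" else [src]
        targets = labels if dst == "*" else [dst]
        for s in sources:
            for t in targets:
                if s != t and (s, t) not in pairs:
                    pairs.append((s, t))
    return pairs


# "Archaea" is a made-up extra label so that the wildcards expand to something.
labels = ["Bacteria", "Eukaryota", "Archaea"]
pairs = expand_highway_candidates("Bacteria, *\n*, Eukaryota\n", labels)
for src, dst in pairs:
    print(f"{src}, {dst}")
```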