Best practices
Here I list personal comments about AleRax that do not really fit in the other pages of this wiki.
AleRax does not compute the gene tree distributions for you. You will have to use another tool for this. AleRax assumes that the gene tree distributions correspond to the posterior distribution under a given model of sequence evolution and for a given multiple sequence alignment. The best tools for this are Bayesian tree inference tools, such as MrBayes, PhyloBayes, etc.
A faster alternative is to generate ultrafast bootstrap trees with IQ-TREE 2, but i) it will most likely fail to sample many plausible trees (due to the nature of the local search used by IQ-TREE) and ii) bootstrap trees are NOT a posterior distribution of trees. They can be used as a very rough approximation (one can expect that the most likely clades will appear more often than the least likely clades), but there is no guarantee that this approximation is good enough for your analysis.
For large analyses, you might run out of memory or the execution might take way too long. You can easily address the memory issue by using the option `--memory-savings`, which will however cost a slight runtime overhead. To reduce runtime, you can:
- reduce the number of species/genomes (this is very unsatisfying but often greatly reduces the overall runtime)
- get rid of the families with too large ccps (there is a `ccpdims.txt` file that sorts the families per size)
- parallelize with `mpiexec -np number_of_cores alerax [args]`, ideally on a cluster
A great feature of AleRax is that it can sample `n` reconciled gene trees per family. If you recover a transfer (or duplication or loss) with only one sample out of 100, then this transfer is very unlikely to be the correct one. But be aware that the fact that a transfer is recovered by all samples does not necessarily mean that it really happened. Here is a non-exhaustive list of the potential causes for spurious results:
- your input species tree is wrong
- the gene families were not correctly clustered or some genes were not sampled
- DTL events are not the only source of gene tree - species tree incongruence (there might also be ILS, but AleRax does not model ILS)
- the gene tree distribution does not cover the "true" gene tree (for instance, you ran MrBayes but the chains didn't converge, the model was not good enough, or the sequences were not correctly aligned)
(this is not specific to AleRax! All other reconciliation tools have the same limitations, as far as I know)
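The sampling argument above can be made concrete as a support value per event: the fraction of sampled reconciliations that contain it. The sketch below is only an illustration. It assumes the events of each sample have already been extracted into a set of labels; the `(kind, donor, recipient)` tuple layout, the species names and the function name `event_support` are hypothetical and do not reflect AleRax's actual output format.

```python
from collections import Counter


def event_support(samples):
    """Return {event: fraction of samples that contain this event}.

    `samples` holds one collection of events per sampled reconciliation.
    The event labels are hypothetical placeholders, not AleRax output.
    """
    if not samples:
        return {}
    counts = Counter()
    for events in samples:
        # Converting to a set counts each event at most once per sample.
        counts.update(set(events))
    return {event: n / len(samples) for event, n in counts.items()}


# Toy data for one family: 4 samples, made up for illustration.
toy_samples = [
    {("T", "speciesA", "speciesB")},
    {("T", "speciesA", "speciesB"), ("D", "speciesC", None)},
    {("T", "speciesA", "speciesB")},
    {("T", "speciesA", "speciesB")},
]
support = event_support(toy_samples)
```

A low support value flags an event that is unlikely to be correct. A support of 1.0 only means that every sample agrees given the inputs; it cannot rule out any of the causes listed above.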
AleRax is parallelized with MPI, which means that you can also run it on distributed-memory systems (e.g. clusters), by calling it with:
mpiexec -np NUMBER_OF_CORES alerax [args]
The optimal number of cores is the minimum of:
- the number of cores that you have on your hardware (don't try to parallelize with 20 cores on your laptop, it will only slow down the execution)
- the maximum number of cores for which AleRax will be parallel efficient. This number depends on the number of gene families and on their sizes. When you run AleRax, it will estimate this number for you and output it in the logs (e.g. `Recommended maximum number of cores: 16`).
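The rule above (take the minimum of the two numbers) can be sketched in a few lines of Python. This is a rough helper, not part of AleRax: the function names are hypothetical, and the regular expression only assumes that the log contains a line shaped like the example quoted above.

```python
import os
import re

# Pattern based on the example log line quoted in the text above.
_RECOMMENDED = re.compile(r"Recommended maximum number of cores:\s*(\d+)")


def recommended_max_cores(log_text):
    """Return the recommended maximum found in the log text, or None."""
    match = _RECOMMENDED.search(log_text)
    return int(match.group(1)) if match else None


def choose_cores(available, recommended):
    """Minimum of the available cores and the recommended maximum, if known."""
    if recommended is None:
        return max(1, available)
    return max(1, min(available, recommended))


log_excerpt = "Recommended maximum number of cores: 16"
recommended = recommended_max_cores(log_excerpt)
# os.cpu_count() reports the logical CPUs of the local machine and may
# return None; on a cluster, use the number of cores you were allocated.
available = os.cpu_count() or 1
print(choose_cores(available, recommended))
```

Note that `os.cpu_count()` counts logical CPUs, which can be higher than the number of physical cores, so treat the result as an upper bound.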
In practice, you can also run AleRax with 10 and 20 cores (I just picked those two numbers arbitrarily; replace them with your own estimates) and check that the run with 20 cores is ~2x faster than the one with 10 cores. If not, then 20 is too many and you're wasting cores.
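This comparison amounts to computing a speedup and a parallel efficiency from two wall-clock times. A minimal sketch follows; the function name `scaling_check` is hypothetical and the timings are made up for illustration, not measured with AleRax.

```python
def scaling_check(time_small, time_large, cores_small, cores_large):
    """Return (speedup, efficiency) of the larger run relative to the smaller.

    speedup    = time_small / time_large
    efficiency = speedup / (cores_large / cores_small), 1.0 being ideal
    """
    if min(time_small, time_large) <= 0:
        raise ValueError("run times must be positive")
    if min(cores_small, cores_large) <= 0:
        raise ValueError("core counts must be positive")
    speedup = time_small / time_large
    ideal_speedup = cores_large / cores_small
    return speedup, speedup / ideal_speedup


# Made-up wall-clock times in seconds for runs with 10 and 20 cores.
speedup, efficiency = scaling_check(3600.0, 2000.0, 10, 20)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

With these toy numbers the 20-core run is 1.8x faster, i.e. 90% efficient. An efficiency well below 1.0 suggests that the extra cores are being wasted.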