Skip to content

Latest commit

 

History

History
executable file
·
114 lines (93 loc) · 9.04 KB

tutorial_5.md

File metadata and controls

executable file
·
114 lines (93 loc) · 9.04 KB

replicate analyses & species-tree VS gene-tree


In the first half of this tutorial we will leverage replicate analyses: codeml models with a high number of parameters may have issues with local optima and it's good practice to run analyses multiple times with varying starting values in order to obtain the global optimum. Through the --replicates flag, BASE allows to carry out a user-specified number of replicate analyses incorporating random - from 0.0 to 3.2 - omega starting values. The replicate analysis which has the best likelihood values will be then selected, substantially strengthening our confidence in the result.

Let's run an analysis using as an alternative mode the parameter-rich model1, wich allows each branch of the phylogeny to have its own omega. Please note that - despite beeing frequently used - this "free ratio" model is discouraged by PAML authors themselves; nonetheless in this context it provides a good example of a parameter-rich model.

sh ../BASE.sh --analyze --input _ubiquitous_genes/ --output _ubiquitous_genes_0VS1_10rep 
--s_tree spp_tree.nwk --model_g m0.ctl --model_a m1.ctl --cores 4 -v --replicates 10

If we take a look to the likelihood_summary.txt we can appreciate that for some genes different replicates have generated different likelihood values.

gene    branch_model_g  site_model_g  model_g_np  model_g_LnL    rep_g  branch_model_a  site_model_a  model_a_np  model_a_LnL    rep_a  nRF   LRT      df  p.value  significance
OG3105  0               0             2           -5241.508221   1      1               0             8           -5234.516739   1      0     13.983   6   0.0298   n/s
OG3126  0               0             2           -4132.017209   1      1               0             12          -4116.792932   1      0     30.4486  10  7e-04    **
OG3158  0               0             2           -3818.030135   1      1               0             12          -3801.882145   1      0.25  32.296   10  4e-04    **
OG3164  0               0             2           -5374.980383   1      1               0             12          -5358.727936   1      0     32.5049  10  3e-04    **
OG3196  0               0             2           -3365.98055    1      1               0             12          -3355.771198   2      0     20.4187  10  0.0255   n/s
OG3197  0               0             2           -3680.783168   1      1               0             12          -3671.481745   1      0     18.6028  10  0.0456   n/s
OG3302  0               0             2           -9054.676043   1      1               0             12          -9028.373456   1      0     52.6052  10  0        ***
OG3342  0               0             2           -3444.803353   1      1               0             12          -3431.378852   4      0     26.849   10  0.0028   *
OG3347  0               0             2           -9597.616313   1      1               0             12          -9586.571177   10     0     22.0903  10  0.0147   n/s
OG3359  0               0             2           -3147.226951   1      1               0             12          -3126.908123   2      0     40.6377  10  0        ***
OG3362  0               0             2           -5722.861684   1      1               0             12          -5706.727937   2      0     32.2675  10  4e-04    **
OG3372  0               0             2           -3710.675722   1      1               0             12          -3695.631697   4      0     30.0881  10  8e-04    **
OG3387  0               0             2           -4067.125785   1      1               0             12          -4057.247711   6      0     19.7561  10  0.0316   n/s
OG3395  0               0             2           -3901.33739    1      1               0             12          -3841.403883   1      0     119.867  10  0        ***
OG3399  0               0             2           -3003.011552   1      1               0             12          -2990.356713   4      0     25.3097  10  0.0048   *
OG3600  0               0             2           -2867.120583   1      1               0             12          -2844.358272   1      0     45.5246  10  0        ***
OG3622  0               0             2           -3407.557493   1      1               0             12          -3388.589881   1      0     37.9352  10  0        ***
OG3640  0               0             2           -3342.883393   1      1               0             12          -3317.158542   4      0     51.4497  10  0        ***
OG3648  0               0             2           -10491.352418  1      1               0             12          -10484.242083  7      0     14.2207  10  0.1632   n/s
OG3682  0               0             2           -3703.868866   1      1               0             12          -3690.236956   3      0     27.2638  10  0.0024   *
OG3683  0               0             2           -2791.481942   1      1               0             12          -2774.797107   9      0     33.3697  10  2e-04    **

The rep_g and rep_a columns report which replicate analysis had the "best" likelihood: in the general model all replicates have the same likelihood - replicate 1 is selected when they are all equal - while in the alternative model different analyses have been chosen for some genes.

While replicates can be unimportant in "simple" analyses, they can play a key role when there are a lot of parameters (omega classes and / or species).


In the second half of this tutorial we will explore how to check for potential biases in omega estimation associated to artefactual substitution rate variation. This phenomenon can occur for example when substitutions that occur on discordant gene trees are analyzed in the context of a fixed species tree. For a primer on the topic you can check this papers.

In the Likelihood summary file which is generated - such as the one generated in the first half of this tutorial - you can spot a "nRF" column which includes normalized Robinson Fould distance between the species-tree and the gene-tree topologies. The user can then decide to exclude analyses where there is a strong conflict between gene-tree and species-tree, as this can lead to biased parameter estimation when they are inferred on a fixed species-tree.

In this instance, BASE allows to carry out analyses on the gene-tree topology inferred for each gene - using the --g_tree flag - such as in:

sh ../BASE.sh --analyze --input _non-ubiquitous_genes/ --output _ubiquitous_genes_0VS1_genetree 
--model_g m0.ctl --model_a m1.ctl --cores 4 -v --g_tree

Note that when using the --g_tree flag no species-tree needs to be specified. The subsequent extract step to retrieve specific branch(es) or clade(s) metrics can be still carried out: terminal branches will be allways present despite possible differences between gene-tree and the species-tree but internal branches / clades may not be present ( i.e. be paraphyletic or polyphyletic) in the gene tree. BASE will check wether the clade/branch of interest can be still identified in the gene tree topology and - only if present - report the associated metrics. For example we can run the extract step with this label file to extract the metrics of the branch leading to a species pair:

sh ../BASE.sh --extract --input _ubiquitous_genes_0VS1_genetree --labels smaller_clade --min_spp x

Then we can check the _non-ubiquitous_genes_0VS1_genetree/extract.smaller_clade.min.spp.2.dNdS.summary file.

branch/clade 	 gene   	 spp_n 	 dNdS 	 t 	 dN 	 dS
smaller_clade 	 OG3105 	 2 	 0.0724 	 0.270 	 0.0219 	 0.3029
smaller_clade 	 OG3126 	 2 	 0.0243 	 0.107 	 0.0035 	 0.1451
smaller_clade 	 OG3158 	 non-monophyletic
smaller_clade 	 OG3164 	 2 	 0.0955 	 0.112 	 0.0134 	 0.1405
smaller_clade 	 OG3196 	 2 	 0.1364 	 0.722 	 0.0931 	 0.6826
smaller_clade 	 OG3197 	 2 	 0.1076 	 0.290 	 0.0312 	 0.2902
smaller_clade 	 OG3302 	 2 	 0.0973 	 0.209 	 0.0233 	 0.2399
smaller_clade 	 OG3342 	 2 	 0.0298 	 0.222 	 0.0091 	 0.3045
smaller_clade 	 OG3347 	 2 	 0.0697 	 0.227 	 0.0196 	 0.2806
smaller_clade 	 OG3359 	 2 	 0.0129 	 0.149 	 0.0025 	 0.1922
smaller_clade 	 OG3362 	 2 	 0.0899 	 0.189 	 0.0184 	 0.2046
smaller_clade 	 OG3372 	 2 	 0.0325 	 0.143 	 0.0066 	 0.2019
smaller_clade 	 OG3387 	 2 	 0.1054 	 0.125 	 0.0127 	 0.1202
smaller_clade 	 OG3395 	 2 	 0.0054 	 0.109 	 0.0008 	 0.1554
smaller_clade 	 OG3399 	 2 	 0.0772 	 0.136 	 0.0121 	 0.1573
smaller_clade 	 OG3600 	 2 	 0.0368 	 0.108 	 0.0052 	 0.1424
smaller_clade 	 OG3622 	 2 	 0.0440 	 0.207 	 0.0117 	 0.2652
smaller_clade 	 OG3640 	 2 	 0.0447 	 0.112 	 0.0064 	 0.1429
smaller_clade 	 OG3648 	 2 	 0.1708 	 0.197 	 0.0304 	 0.1778
smaller_clade 	 OG3682 	 2 	 0.1103 	 0.135 	 0.0156 	 0.1418
smaller_clade 	 OG3683 	 2 	 0.0608 	 0.073 	 0.0055 	 0.0899

As we can see - for some genes - the target clade is non monophyletic in the gene tree and as such no metric will be reported in the output.


Back to the tutorials list.