PMSR model implementation #391
Implementation details:
The implementation of the new Previously, site-specific frequencies were read by and stored in the Now the At the level of the
Problems to solve:
From both the theoretical and the algorithmic perspectives I do not see a reason why PMSF should be incompatible with GHOST. Yet in the current version it is: the In the current version the pairwise sequence ML distance seems not to account for rate mixtures under PMSF. There is some confusion between site-specific options in iqtree and alisim. There is confusing info in docs about using PMSF. We have a recommendation there:
This formulation is highly misleading, since whatever frequency was specified ( |
Thanks a lot for your contributions! I need more time to review, as there are a lot of changes.
The model still lacks proper checkpointing anyway; somehow I forgot to add it. I hope to finish it today or tomorrow.
@StefanFlaumberg Thanks a lot for your contributions. Because there are a lot of changes, I'd like to arrange a chat with you, which would help to review the pull request faster. Another member of the IQ-TREE core team will also attend. I will follow it up via email. What is your email address?
Hi @bqminh, Yes, arranging a chat would be great, and I'd be happy to help with the code review. I am planning one more commit here that introduces site rate normalization (a mean site rate not equal to 1.0 imposes a global rate on the model, so the site rates and the tree length need rescaling). So I suggest that the proper review be started only after that commit. I'll have it ready by the end of the week.
The pull request is now being transferred to the iqtree3 repo and will be closed after merging therein. All further changes will be added to the transferred version only.
This PR introduces the following changes:
Generalized site-specific model description:
The methods used in the original site-specific frequency model now also apply to analyses with site-specific rates, or with both site-specific frequencies and rates. Option descriptions are given in the output of the iqtree2 -h command and below.
Users can run analysis under the PMSF model just as before:
iqtree2 -s <alignment> -m LG+C40+G --mix-opt -ft <guide_tree>
iqtree2 -s <alignment> -m LG+G -fs <file.sitefreq>
New! Users can run analysis under the PMSR model:
iqtree2 -s <alignment> -m LG+FO+R6 -rt <guide_tree>
iqtree2 -s <alignment> -m LG+FO -rs <file.siterate>
NOTE:
The format of a site rate profile file (.siterate) is similar to that of a site frequency profile file (.sitefreq): there is no header, each line describes a single site, and sites are given in ascending order. Each line has the following format:
site_num site_rate
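As an illustration of the file layout described in the note above, here is a minimal Python sketch that reads and writes such a profile. It is not part of the PR: the function names `parse_siterate` and `format_siterate` are hypothetical, and the checks that site numbering starts at 1 without gaps and that rates are non-negative are assumptions of this sketch rather than documented behaviour.

```python
from typing import List


def parse_siterate(text: str) -> List[float]:
    """Parse the contents of a .siterate file into a list of per-site rates.

    Layout taken from the note above: no header, one site per line,
    'site_num site_rate', sites in ascending order.
    Assumption of this sketch: numbering starts at 1 and has no gaps.
    """
    rates: List[float] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines (assumption)
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 'site_num site_rate'")
        site_num, rate = int(fields[0]), float(fields[1])
        expected = len(rates) + 1
        if site_num != expected:
            raise ValueError(
                f"line {line_no}: expected site {expected}, got {site_num}"
            )
        if rate < 0:
            raise ValueError(f"line {line_no}: negative rate {rate}")
        rates.append(rate)
    return rates


def format_siterate(rates: List[float]) -> str:
    """Serialize per-site rates back into the 'site_num site_rate' layout."""
    return "".join(f"{i} {r:.6f}\n" for i, r in enumerate(rates, start=1))
```

A quick way to use it is `parse_siterate(open("file.siterate").read())`, which fails early on a malformed or misordered profile before the file is handed to `-rs`.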
NOTE:
The site-specific rate model is fully compatible with all state frequency modes (+F, +FQ, +FU, +FO) except for state frequency mixtures (like +C20).
New! Users can run analysis under the combined PMSF+PMSR model:
iqtree2 -s <alignment> -m LG+C40+R6 --mix-opt -frt <guide_tree>
iqtree2 -s <alignment> -m LG -fs <file.sitefreq> -rs <file.siterate>
NOTE:
The -frt option (long alias --tree-freq-rate) works in two steps:
First, the specified mixture model (here LG+C40+R6) is fitted to the alignment and the guide tree, then the PMSF profile is computed under the fitted model (just as always);
Second, the model with site-specific frequencies (here LG+SSF+R6) is fitted to the alignment and the same guide tree, then the PMSR profile is computed under the fitted model.
The subsequent tree search is run under the model constructed from the original model matrix (like LG or GTR) and the computed PMSF and PMSR profiles (here under LG+SSF+SSR).
NOTE:
Both site-specific models and their combination are fully compatible with optimizable Q matrices (like GTR and HKY). However, DNA Q matrices that explicitly require equal state frequencies (like JC) cannot be used with +SSF models, since site-specific frequencies would contradict that constraint.
Rationale:
Following the finding of the original article on the PMSF approach (Wang et al. 2018) that PMSF models may surpass frequency mixture models in tree reconstruction accuracy, I expect that PMSR models may similarly be more accurate than rate mixture models, provided the guide tree is of adequate quality. The expectation also rests on a common-sense argument: an approximated site-specific rate is a more realistic assumption for modelling the evolution of a site than a set of weighted rate categories, some of which would not fit the given site at all, and whose weights either result from "averaging" over all alignment sites (as in the +R model) or are simply taken to be equal (as in the +G model).
When it comes to runtime, the PMSR approach, in contrast to the PMSF approach, will not save any time compared to mixture models. Quite the opposite: the guide tree itself is expected to be estimated under a rate mixture model to be accurate enough. So PMSR is all about quality. However, the combined PMSF+PMSR approach (the -frt option) should be even faster than the PMSF approach alone: while both use a guide tree, the combined approach runs the final tree estimation without rate mixtures (+SSF+SSR is clearly faster than +SSF+R4).
The model has been tested on various simple examples and against a +SSF+SSR model implemented in RAxML using per-site partitioning.
See implementation details and some minor compatibility problems in the comments below.
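To make the contrast between weighted rate categories and a single per-site rate concrete, here is a small Python sketch. It assumes, by analogy with PMSF, that a site's rate is the posterior mean over the rate categories of the fitted mixture; the PR text does not spell out the formula, so treat this as an illustration and not as the code in this PR. The function names and the input layout (per-site, per-category likelihoods) are hypothetical. The second helper mirrors the rescaling mentioned in the discussion: if the mean site rate is not 1.0, rates are divided by the mean and branch lengths would be multiplied by it.

```python
from typing import List, Sequence, Tuple


def posterior_mean_site_rates(
    site_cat_lik: Sequence[Sequence[float]],
    cat_weights: Sequence[float],
    cat_rates: Sequence[float],
) -> List[float]:
    """Posterior mean rate per site under a fitted rate mixture (assumed formula).

    site_cat_lik[i][c] is the likelihood of site i under rate category c.
    """
    rates: List[float] = []
    for lik in site_cat_lik:
        # Unnormalized posterior of each category for this site.
        joint = [w * l for w, l in zip(cat_weights, lik)]
        total = sum(joint)
        if total <= 0.0:
            raise ValueError("site has zero likelihood under all categories")
        rates.append(sum(r * j for r, j in zip(cat_rates, joint)) / total)
    return rates


def normalize_rates(rates: Sequence[float]) -> Tuple[List[float], float]:
    """Rescale rates to mean 1.0; return them with the branch-length factor.

    Multiplying all branch lengths by the returned factor keeps every
    rate * branch_length product unchanged.
    """
    mean = sum(rates) / len(rates)
    if mean <= 0.0:
        raise ValueError("mean site rate must be positive")
    return [r / mean for r in rates], mean
```

Under this assumption, a site whose likelihood is concentrated in one category simply receives that category's rate, while an uninformative site receives the weighted mean of all category rates.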