Skip to content

PMSR model implementation #391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

StefanFlaumberg
Copy link

This PR introduces the following changes:

Generalized site-specific model description:

The methods used in the original site-specific frequency model are now also applicable to running analyses with site-specific rates or both site-specific frequencies and rates. Option descriptions are given under the iqtree2 -h command and below:

Users can run analysis under the PMSF model just as before:

  • starting with a guide tree and a frequency mixture model:
    iqtree2 -s <alignment> -m LG+C40+G --mix-opt -ft <guide_tree>
  • starting with a site frequency profile:
    iqtree2 -s <alignment> -m LG+G -fs <file.sitefreq>

New! Users can run analysis under the PMSR model:

  • starting with a guide tree and a rate mixture model:
    iqtree2 -s <alignment> -m LG+FO+R6 -rt <guide_tree>
  • starting with a site rate profile:
    iqtree2 -s <alignment> -m LG+FO -rs <file.siterate>

NOTE:
The format of a site rate profile file .siterate is similar to that of a site frequency profile file .sitefreq: there is no header, each line describes a single site, sites are given in the ascending order. Each line has the following format:
site_num site_rate

NOTE:
The site-specific rate model is fully compatible with all kinds of state frequency modes (+F, +FQ, +FU, +FO) except for state frequency mixtures (like +C20).


New! Users can run analysis under the combined PMSF+PMSR model:

  • starting with a guide tree and a frequency-and-rate mixture model:
    iqtree2 -s <alignment> -m LG+C40+R6 --mix-opt -frt <guide_tree>
  • starting with a site frequency profile and a site rate profile:
    iqtree2 -s <alignment> -m LG -fs <file.sitefreq> -rs <file.siterate>

NOTE:
The -frt option (long alias --tree-freq-rate) works in two steps:

  1. the provided frequency-and-rate mixture model (say LG+C40+R6) is fitted to the alignment and the guide tree, then the PMSF profile is computed under the fitted model (just as always);
  2. the frequency mixture is replaced with the computed PMSF profile, the resultant rate-mixture model (now LG+SSF+R6) is fitted to the alignment and the same guide tree, then the PMSR profile is computed under the fitted model.

The following tree search analysis is run under the model constructed from the original model matrix (like LG or GTR) and the computed PMSF and PMSR profiles (here under LG+SSF+SSR).

NOTE:
Both site-specific models and their combined usage are fully compatible with the optimizable Q matrices (like GTR and HKY). However, some DNA Q matrices explicitly requiring equal state frequencies (like JC) are not allowed to be used with +SSF models for an obvious reason.


Rationale:

Following the findings of the original article on the PMSF approach (Wang et al. 2018) that PMSF models may surpass frequency mixture models in terms of tree reconstruction accuracy, I expect that PMSR models may similarly be more accurate compared to rate mixture models if provided with a guide tree of adequate quality. The expectation is also based on the common sense logic that an approximated site-specific rate is a more realistic assumption for modelling evolution of a site than weighted categories of rates, some of which would not fit the given site at all and which weights are the result of "averaging" over all the alignment sites (as in +R model) or are just taken equal (as in +G model).
When it comes to runtime, the PMSR approach, in contrast to the PMSF approach, is not going to save any time compared to the mixture models, but will do quite the opposite, as the guide tree itself is expected to be estimated under a rate mixture model to be accurate enough. So PMSR is all about quality. However, the combined PMSF+PMSR approach (the -frt option) is going to be faster even than the PMSF approach alone, as while both use a guide tree, the combined one runs the final tree estimation without using rate mixtures (+SSF+SSR is clearly faster than +SSF+R4).


The model has been tested on various simple examples and against a +SSF+SSR model implemented in RaxML using per-site partitioning.

See implementation details and some minor compatibility problems in the comments below.

@StefanFlaumberg
Copy link
Author

Implementation details:

The implementation of the new +SSR and +SSF+SSR features is based mostly on rewriting the Alignment and the ModelSet classes.

Previously, site-specific frequencies were read by and stored in the Alignment class as if to be compatible with both the proper site-specific models and some kind of simplistic partition models (e.g. the see purpose of the site_model member and the ability to read .sitefreq files with multiple site indices per line). However, such dual compatibility makes reading site-specific files hard to code for and is anyways not supported at the level of the phylokernelnew.h lib (which expects the ModelSet object to have a model per each pattern, not per each partition of sites with same state frequencies). Furthermore, it is never actually used for partition models.

Now the Alignment class has only the 3 following members associated with site-specific models: pattern_first_site, ptn_state_freq and ptn_rate_scaler. The latter two store per-pattern model parameters that can be read by the readSiteParamFile function from the .sitefreq and .siterate files, which should be formatted to describe one site per line, site numbers following in the ascending order with a possibility of missing sites. Both the ptn_state_freq and the ptn_rate_scaler vectors can get rearranged by the regroupSitePattern function on reading another site-specific parameter file.

At the level of the ModelSet class, site-specific rates are implemented in the decomposeRateMatrix function by multiplying eigenvalues by the respective components of the phylo_tree->aln->ptn_rate_scaler vector. To be able to proper use PMSR with homogeneous frequencies (like in +FO+SSR) and optimizable Q matrices (like GTR) the interface of the ModelSet class operates through the front submodel of the modelset: it is the front submodel's parameters that are optimized, but the optimized parameters also get copied to all the other submodels. The wrapper MarkovModel of the ModelSet class is always provided with the background state frequencies (+F) to be used for the +I proportion calculation under the PMSF model.

@StefanFlaumberg
Copy link
Author

StefanFlaumberg commented Jan 19, 2025

Problems to solve:

From both the theoretical and the algorithmic perspectives I do not see a reason why PMSF should be incompatible with GHOST. Yet in the current version it is: the computeLikelihoodDervMixlen function of the phylokernelnew.h lib seems to be not finished for the SITE_MODEL case and the computeLikelihoodDervMixlenPointer of the PhyloTreeMixlen class is empty.
I've explicitly banned concomitant usage of various incompatible models with site-specific models (like *G with PMSF), but not the usage of +Hn with PMSF in hope for this drawback to be eliminated in the future.


In the current version the pairwise sequence ML distance seems not to account for rate mixtures under PMSF.
In both the AlignmentPairwise::computeFunction and the AlignmentPairwise::computeFuncDerv functions (both defined in alignment/alignmentpairwise.cpp) there is a separate code block for site-specific models which, in contrast to the general case instructions, ignores rate mixture models. While this may not be too important, it is still a drawback and can be readily fixed.


There is some confusion between site-specific options in iqtree and alisim.
According to the current docs the --site-freq option is equivalent to the -fs option in normal iqtree, but it also has a different meaning for alisim. Actually it worked in the alisim context only.
In the current PR I've put it the following way: if specified after the --alisim option, both --site-freq and --site-rate are interpreted in the alisim context, otherwise --site-freq == -fs and --site-rate == -rs.
Yet it is still confusing. Instead I recommend considering introducing options like --site-freq-type and --site-rate-type to replace --site-freq and --site-rate in the alisim context.


There is confusing info in docs about using PMSF. We have a recommendation there:

This will stop the analysis after the first phase and also write a .sitefreq file. You can now copy this .sitefreq file to another low-memory machine and run with the same alignment:
iqtree -s <alignment> -m LG+C20+F+G -fs <file.sitefreq> -b 100

This formulation is highly misleading, since whatever frequency was specified (+C20 or +C60 or +FO), it would be overridden by the specified site frequency profile and would not have any impact on the analysis.
As stated earlier here, I've explicitly banned concomitant usage of various incompatible models with site-specific models, as it is always better to get an error message, than to have a program running despite a wrongly specified model and unrealistic expectation of how it should work. Frequency mixture models are incompatible with PMSF for an obvious reason. Thus, typing -m LG+C20+F+G -fs <file.sitefreq> is now prohibited and should be changed to -m LG+G -fs <file.sitefreq>.

@bqminh
Copy link
Member

bqminh commented Feb 7, 2025

Thanks a lot for your contributions! I need more time to review, there is a lot of change.

@StefanFlaumberg
Copy link
Author

The model is still lacking proper checkpointing anyway. Somehow I forgot to add it. I hope to finish it today or tomorrow.

@bqminh
Copy link
Member

bqminh commented Mar 19, 2025

@StefanFlaumberg Thanks a lot for your contributions. Because there are a lot of changes, I'd like to arrange a chat with you, which would help to review the pull request faster. Another member of the IQ-TREE core team will also attend. I will follow it up via email. So what's your email address?

@StefanFlaumberg
Copy link
Author

Hi @bqminh,

Yes, arranging a chat would be great. I'd be happy to help with the code review.
My email: [email protected]

I've been going to make yet one more commit here introducing site rate normalization (the mean site rate not being equal 1.0 imposes a global rate on the model, therefore site rate and tree length rescaling is needed). So I suggest the proper review procedure be started only after the commit. I'll be ready with it by the end of the week.

@StefanFlaumberg
Copy link
Author

The pull request is now being transferred to the iqtree3 repo and will be closed after merging therein. All further changes will be added to the transferred version only.

@StefanFlaumberg StefanFlaumberg marked this pull request as draft May 7, 2025 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add site-specific rate profile support
2 participants