[Feature Request] Add site-specific rate profile support #244

Open
StefanFlaumberg opened this issue Jun 25, 2024 · 3 comments · May be fixed by #391
Labels
enhancement New feature or request

Comments

@StefanFlaumberg

Dear IQ-Tree team,

Currently IQ-Tree2 can infer posterior mean site rate (PMSR) profiles, given an alignment and the corresponding tree with branch lengths. So I wonder why one cannot use these inferred site-specific rates (after normalization by their mean) for further tree refinement, somewhat like it was done in the Mayrose et al. 2004 paper for branch length estimation.
It would be great to have a -rs option for passing a precomputed rate profile to tree inference (analogous to the -fs option for PMSF profiles). Could you please implement such an option?
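The normalization mentioned above can be sketched in a few lines of Python. This is only an illustration: it assumes the rate profile is a tab-separated table with a header row containing a `Rate` column and optional `#` comment lines; the function names and the example table are made up for this sketch and are not part of IQ-Tree.

```python
from statistics import fmean

def read_site_rates(text):
    """Parse a tab-separated rate table and return the rates in site order.

    Assumed layout (hypothetical): optional '#' comment lines, then a header
    row that contains a 'Rate' column, then one row per alignment site.
    """
    rates = []
    rate_col = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if rate_col is None:
            rate_col = fields.index("Rate")  # first non-comment row is the header
            continue
        rates.append(float(fields[rate_col]))
    return rates

def normalize_by_mean(rates):
    """Rescale the rates so that their mean over all sites equals 1."""
    mean = fmean(rates)
    if mean <= 0:
        raise ValueError("mean site rate must be positive")
    return [r / mean for r in rates]

# Made-up example table, not real program output.
example = (
    "# illustrative rate table\n"
    "Site\tRate\tCat\tC_Rate\n"
    "1\t0.5\t1\t0.4\n"
    "2\t1.0\t2\t0.9\n"
    "3\t2.5\t4\t2.6\n"
)
normalized_rates = normalize_by_mean(read_site_rates(example))
```

With the mean fixed at 1, branch lengths keep their usual interpretation as the expected number of substitutions per site averaged over the alignment.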

I understand that tree inference under a PMSR profile model hasn't been extensively tested yet, but given the success of the PMSF profile approach, the PMSR model can reasonably be expected to be similarly useful.

Best regards,
Stefan

@bqminh bqminh added the enhancement New feature or request label Jun 26, 2024
@bqminh
Member

bqminh commented Jun 27, 2024

Hi Stefan, this is a very good suggestion, thanks for bringing it up. As you noted, the "PMSR" is already reported in the .rate file if you run with the -wsr option. In principle, one can apply this PMSR to do tree inference in the same way as the PMSF, which is indeed something I thought about and discussed with Ed Susko a few years ago. Moreover, it'd be nice for users to test other ways of obtaining site-specific rates for tree inference. However, there are some caveats:

  • "LG+PMSR" may not deliver the 4-fold saving over LG+G4 that a plain LG run would, because of the overhead in likelihood computation and in keeping track of the per-site rates. For example, you can see (http://www.iqtree.org/doc/Complex-Models#site-specific-frequency-models) that "LG+PMSF+G" is up to twice as slow as "LG+G". So let's say we expect PMSR to be twice as slow as the homogeneous model, meaning that it would only be twice as fast as the LG+G4 model.
  • PMSF might have a potential bias due to the use of a guide tree, so one would like to alternate PMSF estimation and tree search (i.e., use the best tree to obtain a new PMSF, which is then used to obtain a new best tree, until the topology no longer changes). This is acceptable in the context of the +C60 model, where the saving is huge, but might no longer be in the context of the +G4 model.
  • The complexity of implementation: while it's easy to talk about the model (mathematically), IQ-TREE literally has hundreds of different models (incl. partition and mixture models), and making a single new model work "peacefully" with existing models is no small task. And in the context of maximum likelihood, we normally have to think a lot about the optimisation strategy.
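The alternation between profile estimation and tree search described in the second caveat can be sketched as a simple loop. This is a schematic only: `estimate_profile`, `search_tree` and `same_topology` are hypothetical callables supplied by the caller and do not correspond to any existing IQ-TREE API, and the toy demo at the end uses integers in place of trees.

```python
def iterate_profile_and_tree(start_tree, estimate_profile, search_tree,
                             same_topology, max_rounds=10):
    """Alternate profile estimation and tree search until the topology is stable.

    Returns (tree, profile, rounds_used, converged). All three callables are
    hypothetical placeholders for the real, expensive steps.
    """
    tree = start_tree
    profile = None
    for round_no in range(1, max_rounds + 1):
        profile = estimate_profile(tree)      # site profile from the current tree
        new_tree = search_tree(profile)       # tree search under that fixed profile
        if same_topology(tree, new_tree):
            return new_tree, profile, round_no, True
        tree = new_tree
    return tree, profile, max_rounds, False   # gave up without convergence

# Toy demo: a "tree" is an integer that the search pushes towards 3.
result = iterate_profile_and_tree(
    start_tree=0,
    estimate_profile=lambda tree: tree,
    search_tree=lambda profile: min(profile + 1, 3),
    same_topology=lambda a, b: a == b,
)
```

Each round costs one profile estimation plus one full tree search, which is why the caveat matters more when the saving per search is modest.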

The last point is actually the main thing holding me back from pursuing this idea: we would need to allocate a developer to implement it, which in turn needs some funding or a way of publishing it.

That said, there is already a way of doing this PMSR approach (though not that efficiently). I had some emails about this which I can dig out and reply with later, if you want to try it out.

@roblanf
Collaborator

roblanf commented Jun 27, 2024

I would just throw in here also that PMSF is really just a shortcut way of doing very complex models like C60. Rate models tend to be quite a bit simpler than that for the most part (i.e. not 60 mixture classes). Still, the +R10 and greater models are tricky to optimise, so perhaps a PMSR model might be worth the time saving there.

I think the primary worry for me though is that you have to estimate the rate profile on a tree. I worry as Minh does that this could bias inference on exactly the nodes that matter (i.e. short branches which are hard to resolve). In fact this comment reminds me that we should get to work on testing exactly this potential bias in the PMSF models.

@StefanFlaumberg
Author

StefanFlaumberg commented Jun 30, 2024

Dear Minh and Rob,
Thank you for your comprehensive replies!

You have written a lot about the possible impact of the PMSR model on running time. The model surely is not going to accelerate tree inference in most use cases (unless compared against running with +R10, as you mentioned).

However, from my standpoint the main focus here should be on tree reconstruction accuracy. Judging from the original article on the PMSF model, the model, given enough alignment data, doesn't produce biased results when estimated on a reasonably optimal guide tree (like the +C20 tree). In fact, it was shown to produce more accurate results than the +C20 model. By analogy one could expect the same from the PMSR model, though a paper thoroughly testing PMSR and PMSF for biases, especially as applied to single-protein tree reconstruction, would be very relevant.

It doesn't seem right to consider site-specific models to be just shortcuts for mixture models. In the logic of the mixture approach, in which one sums, at each site, the weighted likelihoods calculated under different site-homogeneous models, every alignment site gets modeled by badly fitting models in some of the categories, but the impact of such non-optimal modelling is mitigated by the weighting scheme. By contrast, the site-specific approach, while possibly facing some risk of overfitting, models each alignment site with a model tailored to closely replicate the process governing the evolution of that site. The site-specific approach thus seems much more natural.
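To make the contrast concrete, here is a minimal Python sketch under the usual formulation of a mixture model, where each site's likelihood is a weighted sum over classes and the logarithm is taken per site. The per-site, per-class likelihoods are passed in as plain numbers (in practice they come from the pruning algorithm on a tree); all function names and the toy values are illustrative assumptions.

```python
import math

def mixture_log_likelihood(site_class_lik, weights):
    """lnL = sum over sites i of log( sum over classes k of w_k * L_ik )."""
    return sum(
        math.log(sum(w * lik for w, lik in zip(weights, row)))
        for row in site_class_lik
    )

def site_specific_log_likelihood(site_lik):
    """lnL = sum over sites i of log(L_i), each L_i computed under the site's own model."""
    return sum(math.log(lik) for lik in site_lik)

def posterior_mean_rates(site_class_lik, weights, class_rates):
    """Posterior mean rate per site: sum_k r_k * P(class k | site i)."""
    means = []
    for row in site_class_lik:
        joint = [w * lik for w, lik in zip(weights, row)]
        total = sum(joint)  # this is the mixture likelihood of the site
        means.append(sum(j * r for j, r in zip(joint, class_rates)) / total)
    return means

# Toy numbers: two sites, two equally weighted rate classes.
weights = [0.5, 0.5]
class_rates = [0.5, 1.5]
site_class_lik = [[0.2, 0.2],   # site 1 fits both classes equally well
                  [0.1, 0.3]]   # site 2 prefers the fast class
pm_rates = posterior_mean_rates(site_class_lik, weights, class_rates)
mix_lnl = mixture_log_likelihood(site_class_lik, weights)
```

In this toy case the first site gets a posterior mean rate of 1.0 and the second 1.25; a site-specific model would then evaluate each site once, under its own rate, instead of averaging over all classes.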

In line with the above, I suppose that using a joint site-specific frequency and rate profile model is the best thing one could do to fully model site-heterogeneity without extensive overfitting or time consumption. Such a PMSFR profile could be roughly obtained by inferring a PMSF profile on a +C20+R tree in the first step and then inferring a PMSR profile on the same tree under the +PMSF+R model in the second step.
I managed to use such a PMSFR profile for tree inference in the RAxML-NG tool by passing the profile as a per-site partition with AA frequencies and branch-length multipliers. However, due to a computational problem with per-site partitions, the approach turned out to be not very efficient (a 20-50-fold slowdown compared to a site-homogeneous model, not counting the profile-estimation step in IQ-Tree).
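The two-step profile estimation described above might be scripted roughly as follows. This is an untested sketch, not a confirmed recipe: the LG base matrix, the output prefixes and the assumption that `-ft`/`-fs` combine with `+R` and `-wsr` in this way are all guesses made for illustration, and the helper only builds the command lines without running anything.

```python
def pmsfr_profile_commands(alignment, guide_tree, prefix="pmsfr", iqtree="iqtree2"):
    """Build (but do not run) two IQ-Tree command lines for a joint profile.

    Step 1: estimate a PMSF profile on the guide tree under a C20 mixture,
            skipping the tree search (-n 0); expected to write <prefix>.step1.sitefreq.
    Step 2: fix the same tree (-te), load the site frequencies (-fs) and write
            per-site rates (-wsr); expected to write <prefix>.step2.rate.
    The model strings and the flag combination are assumptions of this sketch.
    """
    step1 = [iqtree, "-s", alignment, "-m", "LG+C20+R",
             "-ft", guide_tree, "-n", "0", "-pre", f"{prefix}.step1"]
    step2 = [iqtree, "-s", alignment, "-m", "LG+R",
             "-fs", f"{prefix}.step1.sitefreq", "-te", guide_tree,
             "-wsr", "-pre", f"{prefix}.step2"]
    return step1, step2

# Hypothetical file names, for illustration only.
step1_cmd, step2_cmd = pmsfr_profile_commands("protein.fasta", "guide.treefile")
```

The resulting lists could be passed to `subprocess.run` once the flag combination has been checked against the installed IQ-Tree version.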

Dear Minh, if, as you mentioned, there already exists a way to try the PMSR approach in IQ-Tree2 (or even to jointly use it with a PMSF profile), I am very interested to know about it. So I'm looking forward to seeing your reply. Thank you!

Best regards,
Stefan

@StefanFlaumberg StefanFlaumberg linked a pull request Jan 19, 2025 that will close this issue