-
Notifications
You must be signed in to change notification settings - Fork 3
Wrong tree inferred (clearly not parsimonious) despite being in short-branch-length regime #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
At first look, I think the problem might be that when CMAPLE tries SPR moves moving the 1b subclades AGT and GGC into the long branch above 1b (which should result in higher likelihood trees) it finds suboptimal pre-optimised trees since it first tries only SPR moves uppending the subtrees onto the middle of the long branch. These moves decrease the LK due to the length of this branch and due to the fact that the optimal appending point is at the bottom of the branch. For this reason CMAPLE then doesn't follow up with a more in-depth search of optimal branch lengths for the SPR moves, which would have resulted in higher likellihoods. Ultimately, I expect the issue is due to the length of this branch leading to 1b, it seems much longer than typical branches in SARS-CoV-2. Something we could do instead is to perform a deeper branch length search whenever we evaluate the score of an SPR move onto a long branch. As long as the number of long branches (>8mutations?) is small, then this would not impact the runtime and should improve accuracy. |
Thanks for the quick response @NicolaDM! Few thoughts on your response: This is mpox not SC2, the branch lengths are very short - this one has 70 mutations for a ~200k genome so only length 3*10^-4
In general cmample does better than iqtree for mpox, except for this one issue here. Branches of 70 mutations are quite common across pathogens where cmaple could be useful.
I'd be more than happy to accept the 5-fold slowdown if it improves the outcome, you could put this behind a search mode Also, I'm not sure about the 5-fold increase.
Sounds like a really good solution - in the case here of an all-clades mpox build, the branch in question is the 5th largest or so. You could add an extra option to tune the length above which SPR is tried on both ends. |
Why is the long branch not cut and reattached at the GGC node? That GGC branch is short so there shouldn't be that midpoint issue there. It's not sufficient to remove the GGC branch and attach on the long branch - that would result in a suboptimal tree topology. Maybe the optimal move is not tried? I could imagine this issue from arising if you only ever try to graft nodes that were cut off onto the subtree that still has the root. The single move that's required here requires the opposite: regrafting the subtree with the root onto the subtree without root. |
OK, I think we should have a discussion on the best way to address this and then come back to you. This could be approached either at the time of creating the initial tree (so placement of one sample at the time) or during SPR search. It would be best to solve it at both levels.
If I understand the question, the problem is that GGC descends from the long branch, and one cannot append a subtree onto one of its own branches.
I think that's right, but it seems to me the best tree can still be obtained with a combination of 2 successive SPR moves, each improving on the LK. So it seems to me that the optimal tree should be reachable from the current one.
Sorry, I don't think I understand this, but please let me know if my answer above doesn't address this! |
Thanks for the response! I'm curious how you think this could be fixed during stepwise addition as these inversions are likely to happen during greedy addition when there are relatively few sequences basally and more distally. You mentioned, "one cannot append a subtree onto one of its own branches." I think this might depend on how SPR moves are defined in the implementation. A general SPR move is symmetrical: Cut an edge, dividing the tree into two pieces (A and B). It sounds like the current implementation might only be doing "rooted SPR", where the piece without the original root is always the one reattached onto the piece containing the original root. The move I'm suggesting is the other way around: reattaching the piece with the original root onto the piece without it (specifically, onto the short GGC branch). This is a valid SPR move in general, but wouldn't be explored if the search is restricted to rooted SPR moves. Is it possible the SPR search is limited to only these "rooted" moves? |
I think the problem here is that AAT is being added to the tree before GGC. Then, when GGC is finally added, the branch length issue mentioned above means that CMPALE doesn't find the optimal placement of GGC. One could solve this issue similarly to what I mentioned above, by optimizing branch lengths during placement search when testing placement on long branches.
That's indeed the case. |
Uh oh!
There was an error while loading. Please reload this page.
I've noticed a consistent tree building failure when using cmaple for mpox trees.
The failure is pretty obvious but it'll take a few screenshots to show what I mean - an alternative is to run iqtree on the alignment and compare the result.
While in general tree builders might differ, in this case, I'm pretty convinced that the tree builder is wrong and this might point at a bug hiding somewhere in the code (either likelihood calculation or search heuristics). I'm using
EXHAUSTIVE
search already but this doesn't seem to help unfortunately. Maybe there needs to be aSUPEREXHAUSTIVE
mode? The builder runs for only 22 seconds, I'd be happy to wait extra time if it fixed this issue.I found this error as clade Ib has a clade that has 3 mutations reverting back to the root genotype and these are the only mutations, indicating that the clade should really be the basal part of clade Ib.
This is inset 1, showing clade Ib and the sibling clade Ia:
The bug is that the branch leading to Ib should not attach where it currently does, but at the green internal node (which shares genotype with the ancestor branch in the outset
The command used to run
cmaple
was like this:These are input and output files (remove the txt extension and untar): repro.tar.zst.txt
Mutations were inferred by treetime as cmaple doesn't output reconstructed states (yet).
If you struggle to understand my explanation of the error I'm happy to explain in further detail.
I suppose the ultimate proof would be to compare likelihoods of the wrong and correct trees. I'll see whether I can figure this out somehow.
The text was updated successfully, but these errors were encountered: