Skip to content

Wrong tree inferred (clearly not parsimonious) despite being in short-branch-length regime #46

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
corneliusroemer opened this issue Mar 28, 2025 · 6 comments

Comments

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Mar 28, 2025

I've noticed a consistent tree building failure when using cmaple for mpox trees.

The failure is pretty obvious but it'll take a few screenshots to show what I mean - an alternative is to run iqtree on the alignment and compare the result.

While in general tree builders might differ, in this case, I'm pretty convinced that the tree builder is wrong and this might point at a bug hiding somewhere in the code (either likelihood calculation or search heuristics). I'm using EXHAUSTIVE search already but this doesn't seem to help unfortunately. Maybe there needs to be a SUPEREXHAUSTIVE mode? The builder runs for only 22 seconds, I'd be happy to wait extra time if it fixed this issue.

I found this error as clade Ib has a clade that has 3 mutations reverting back to the root genotype and these are the only mutations, indicating that the clade should really be the basal part of clade Ib.

This is inset 1, showing clade Ib and the sibling clade Ia:

Image

The bug is that the branch leading to Ib should not attach where it currently does, but at the green internal node (which shares genotype with the ancestor branch in the outset

Image

The command used to run cmaple was like this:

cmaple \
     -aln results/all-clades/masked.fasta \
     -st DNA \
     --search EXHAUSTIVE \
     --out-mul-tree \
     --make-consistent \
     -nt AUTO \
     --overwrite

These are input and output files (remove the txt extension and untar): repro.tar.zst.txt

Mutations were inferred by treetime as cmaple doesn't output reconstructed states (yet).

If you struggle to understand my explanation of the error I'm happy to explain in further detail.

I suppose the ultimate proof would be to compare likelihoods of the wrong and correct trees. I'll see whether I can figure this out somehow.

@NicolaDM
Copy link
Collaborator

At first look, I think the problem might be that when CMAPLE tries SPR moves moving the 1b subclades AGT and GGC into the long branch above 1b (which should result in higher likelihood trees) it finds suboptimal pre-optimised trees since it first tries only SPR moves uppending the subtrees onto the middle of the long branch. These moves decrease the LK due to the length of this branch and due to the fact that the optimal appending point is at the bottom of the branch. For this reason CMAPLE then doesn't follow up with a more in-depth search of optimal branch lengths for the SPR moves, which would have resulted in higher likellihoods.

Ultimately, I expect the issue is due to the length of this branch leading to 1b, it seems much longer than typical branches in SARS-CoV-2.
Performing a more extensive branch length search in MAPLE/CMAPLE could be possible, but if we do this for every proposed SPR move this would slow down the algorithm by something like 5-fold.

Something we could do instead is to perform a deeper branch length search whenever we evaluate the score of an SPR move onto a long branch. As long as the number of long branches (>8mutations?) is small, then this would not impact the runtime and should improve accuracy.

@corneliusroemer
Copy link
Contributor Author

corneliusroemer commented Mar 28, 2025

Thanks for the quick response @NicolaDM!

Few thoughts on your response: This is mpox not SC2, the branch lengths are very short - this one has 70 mutations for a ~200k genome so only length 3*10^-4

Ultimately, I expect the issue is due to the length of this branch leading to 1b, it seems much longer than typical branches in SARS-CoV-2.

In general cmample does better than iqtree for mpox, except for this one issue here. Branches of 70 mutations are quite common across pathogens where cmaple could be useful.

Performing a more extensive branch length search in MAPLE/CMAPLE could be possible, but if we do this for every proposed SPR move this would slow down the algorithm by something like 5-fold.

I'd be more than happy to accept the 5-fold slowdown if it improves the outcome, you could put this behind a search mode EXTREME flag then people can opt in to it.

Also, I'm not sure about the 5-fold increase. Shouldn't the SPR search be the part that parallelizes well? In my case discussed here, CPU usage never goes above 100%, so the bottleneck seems to be things other than SPR search. In that case, increasing duration of that part 5x would not cause a noticeable change. (multiprocessing only applies to SH-aLRT apparently)

Something we could do instead is to perform a deeper branch length search whenever we evaluate the score of an SPR move onto a long branch. As long as the number of long branches (>8mutations?) is small, then this would not impact the runtime and should improve accuracy.

Sounds like a really good solution - in the case here of an all-clades mpox build, the branch in question is the 5th largest or so. You could add an extra option to tune the length above which SPR is tried on both ends.

@corneliusroemer
Copy link
Contributor Author

corneliusroemer commented Mar 28, 2025

Why is the long branch not cut and reattached at the GGC node? That GGC branch is short so there shouldn't be that midpoint issue there.

It's not sufficient to remove the GGC branch and attach on the long branch - that would result in a suboptimal tree topology.

Maybe the optimal move is not tried? I could imagine this issue from arising if you only ever try to graft nodes that were cut off onto the subtree that still has the root. The single move that's required here requires the opposite: regrafting the subtree with the root onto the subtree without root.

@NicolaDM
Copy link
Collaborator

OK, I think we should have a discussion on the best way to address this and then come back to you.

This could be approached either at the time of creating the initial tree (so placement of one sample at the time) or during SPR search. It would be best to solve it at both levels.

Why is the long branch not cut and reattached at the GGC node? That GGC branch is short so there shouldn't be that midpoint issue there.

If I understand the question, the problem is that GGC descends from the long branch, and one cannot append a subtree onto one of its own branches.

It's not sufficient to remove the GGC branch and attach on the long branch - that would result in a suboptimal tree topology.

I think that's right, but it seems to me the best tree can still be obtained with a combination of 2 successive SPR moves, each improving on the LK. So it seems to me that the optimal tree should be reachable from the current one.
Even better, by implementing the aforementioned improvement at the level of sample placement (initial tree construction) then one is probably not going to fall into such a suboptimal tree in the first place.

Maybe the optimal move is not tried? I could imagine this issue from arising if you only ever try to graft nodes that were cut off onto the subtree that still has the root. The single move that's required here requires the opposite: regrafting the subtree with the root onto the subtree without root.

Sorry, I don't think I understand this, but please let me know if my answer above doesn't address this!

@corneliusroemer
Copy link
Contributor Author

Thanks for the response! I'm curious how you think this could be fixed during stepwise addition as these inversions are likely to happen during greedy addition when there are relatively few sequences basally and more distally.

You mentioned, "one cannot append a subtree onto one of its own branches." I think this might depend on how SPR moves are defined in the implementation.

A general SPR move is symmetrical: Cut an edge, dividing the tree into two pieces (A and B).
Reattach piece A onto an edge within piece B, OR reattach piece B onto an edge within piece A.

It sounds like the current implementation might only be doing "rooted SPR", where the piece without the original root is always the one reattached onto the piece containing the original root.

The move I'm suggesting is the other way around: reattaching the piece with the original root onto the piece without it (specifically, onto the short GGC branch). This is a valid SPR move in general, but wouldn't be explored if the search is restricted to rooted SPR moves.

Is it possible the SPR search is limited to only these "rooted" moves?

@NicolaDM
Copy link
Collaborator

I'm curious how you think this could be fixed during stepwise addition...

I think the problem here is that AAT is being added to the tree before GGC. Then, when GGC is finally added, the branch length issue mentioned above means that CMPALE doesn't find the optimal placement of GGC. One could solve this issue similarly to what I mentioned above, by optimizing branch lengths during placement search when testing placement on long branches.

It sounds like the current implementation might only be doing "rooted SPR"

That's indeed the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants