Replies: 1 comment
-
I suspect this might be a really bad idea! I only suspect that because we did something a little bit similar many years ago in this paper: https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-015-0283-7

In that paper we used k-means to partition data. It looked like it worked well in all of our simulations (and it did work well!). However, a later paper showed some issues: http://www.sciencedirect.com/science/article/pii/S1055790316302780

That paper, along with concerns raised independently by Brian Moore and colleagues, led us to disable the method for nucleotide data in PartitionFinder: brettc/partitionfinder@19d7fe4

The commit has details of the issue, but the core of it is that k-means (and GHOST, I suspect...) tends to group invariant sites together. This may not be much of a problem in a mixture model like GHOST, because the likelihood of every site is calculated under every model, so in that sense it's a bit similar to how we already use gamma rates and free-rate models. However, if you convert the top GHOST class for each site into a partition, you lose that nuance: you are now almost certainly grouping invariant sites into a single partition, and you'll get the same issues we saw with k-means.

I'd be very interested to see what happens if you try it! If you do, be sure to simulate a lot of datasets under a LOT of rate distribution scenarios. Note that Brian Moore et al. identified the issue with k-means on small trees (4-taxon trees, from memory). We missed it in our initial analyses because we used much larger trees, and on larger trees the method seems to work well (even when there are invariant sites).

To be honest, despite a lot of effort we never got to the bottom of it, and could only conclude that k-means in the form that was in PartitionFinder could sometimes be misleading. Since we couldn't say for sure what kinds of dataset it would be good or bad to use it on, it was better to just get rid of it!
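One quick way to check whether a candidate partitioning scheme has fallen into the invariant-site trap described above is to measure what fraction of each partition consists of invariant columns. The sketch below is only an illustration: the function names, the toy alignment, and the choice of which characters count as missing data are all assumptions, not part of PartitionFinder or IQ-TREE.

```python
def is_invariant(column, missing="-?NnXx"):
    """Return True if a column shows at most one state, ignoring missing data.

    Treating '-', '?', 'N' and 'X' as missing is an assumption made here
    for nucleotide data; adjust it for your own alignment.
    """
    states = {c.upper() for c in column if c not in missing}
    return len(states) <= 1


def invariant_fraction(alignment, partitions):
    """Fraction of invariant sites in each partition.

    alignment  : list of equal-length sequence strings (one per taxon)
    partitions : dict mapping a partition name to a list of 1-based site indices
    """
    columns = list(zip(*alignment))
    result = {}
    for name, sites in partitions.items():
        n_inv = sum(is_invariant(columns[s - 1]) for s in sites)
        result[name] = n_inv / len(sites) if sites else float("nan")
    return result


# Toy 4-taxon alignment (hypothetical data, for illustration only).
alignment = [
    "AAAACGTA",
    "AAAACCTA",
    "AAAATGCA",
    "AAAAGGTA",
]
# Hypothetical partitions, e.g. derived from the top class of each site.
partitions = {"class_1": [1, 2, 3, 4, 8], "class_2": [5, 6, 7]}

print(invariant_fraction(alignment, partitions))
# {'class_1': 1.0, 'class_2': 0.0}
```

A partition that is made up entirely (or almost entirely) of invariant sites, like `class_1` here, would be a warning sign that the scheme is reproducing the k-means problem.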
-
Hi,
Could we use GHOST models to inform a partition model, where each site in an MSA is assigned to a partition based on its likelihood of belonging to a GHOST class? The assignment could use either a user-defined threshold or the most likely class (I'm not sure whether all sites are equally likely to be assigned). The user could then describe within-partition substitution rate heterogeneity for partitions that mirror the GHOST classes. This way we would benefit from the unbiased binning of sites into classes by GHOST, while allowing a description of rate heterogeneity within the partitions informed by GHOST (and also reducing the degrees of freedom, if branch lengths are not unlinked).
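To make the proposal concrete, here is a minimal sketch of the assignment step. It assumes you already have, for every site, a probability of belonging to each GHOST class; how those probabilities are obtained and parsed is left out, and the function name, the input layout, and the example numbers are hypothetical.

```python
from collections import defaultdict


def assign_sites_to_classes(site_probs, threshold=None):
    """Assign each site to its most probable class.

    site_probs : list of rows; site_probs[i][k] is the (assumed) probability
                 that site i+1 belongs to class k
    threshold  : if given, sites whose best class has a probability below
                 this value are left unassigned instead of being forced
                 into a partition
    Returns (partitions, unassigned), with 1-based site indices.
    """
    partitions = defaultdict(list)
    unassigned = []
    for i, probs in enumerate(site_probs, start=1):
        # Index of the class with the highest probability for this site.
        best = max(range(len(probs)), key=probs.__getitem__)
        if threshold is not None and probs[best] < threshold:
            unassigned.append(i)
        else:
            partitions[best].append(i)
    return dict(partitions), unassigned


# Hypothetical probabilities for 4 sites and 2 classes.
probs = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.05, 0.95],
]
parts, ambiguous = assign_sites_to_classes(probs, threshold=0.7)
print(parts)      # {0: [1], 1: [3, 4]}
print(ambiguous)  # [2]
```

With `threshold=None` every site goes to its most likely class; with a threshold, weakly supported sites such as site 2 are set aside and would need to be handled separately.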