All clusters merged into a single one when trying to assign new strains #194
Comments
Hi Florent, This can often be caused by a low quality sample (or more than one) which our QC for |
Hi John, |
Hi Florent, Thanks for the detailed investigation! That's really helpful, and I think I have some ideas about how to better deal with this in future. I'm going to add this on my todo list for v2.5.0, so I hope that in the next release this won't be an issue. In more detail, I think there are two possible ways this can happen, which may happen separately or together:
For 1) we need to fix the distance QC as you say, and check both for cluster linking and for too many zero distances. For 2) it is a little more difficult, but I expect that similar measures will be able to prune these isolates from the network (while still allowing assignments) without needing a re-fit.
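To illustrate the "cluster linking" check mentioned above, here is a minimal sketch in plain Python. It is not PopPUNK code: the sample names, the toy edges and the helper `linking_queries` are all hypothetical, and it assumes only that clusters are the connected components of a network whose edges are within-strain links. It shows how a single query with links into several reference clusters collapses them into one component, and how such a query could be flagged before it is added.

```python
from collections import defaultdict

def find(parent, x):
    # Union-find lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(nodes, edges):
    # Clusters are taken to be the connected components of the network.
    parent = {n: n for n in nodes}
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups = defaultdict(set)
    for n in nodes:
        groups[find(parent, n)].add(n)
    return sorted(sorted(g) for g in groups.values())

def linking_queries(query_edges, ref_cluster):
    # Hypothetical check: report queries whose within-strain links
    # reach more than one existing reference cluster.
    flagged = {}
    for query, refs in query_edges.items():
        hit = sorted({ref_cluster[r] for r in refs})
        if len(hit) > 1:
            flagged[query] = hit
    return flagged

# Toy reference network (assumed data): three clusters of two samples each.
ref_nodes = ["r1", "r2", "r3", "r4", "r5", "r6"]
ref_edges = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
ref_cluster = {"r1": 1, "r2": 1, "r3": 2, "r4": 2, "r5": 3, "r6": 3}

# Within-strain links from each query to the references (assumed data).
query_edges = {"q_ok": ["r1"], "q_bad": ["r2", "r3", "r6"]}

before = components(ref_nodes, ref_edges)
all_edges = ref_edges + [(q, r) for q, refs in query_edges.items() for r in refs]
after = components(ref_nodes + list(query_edges), all_edges)
flagged = linking_queries(query_edges, ref_cluster)

print(len(before), len(after), flagged)
```

In this toy case the well-behaved query joins one cluster, while the anomalous one bridges all three, so every sample ends up in a single component once both queries are added.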
Thanks, John, for that rapid response. I agree with your analysis suggesting a combination of two problems, including something to do with the boundary. This echoes some of the findings and questions that I and others in my team (Avril, Astrid) have been gathering by experimenting with PopPUNK to investigate this issue (as we all consistently run into it with different and diverse datasets). As a short summary of our experiments, I can say that:
It seems that the most likely fix for this clumping issue would be to make the distance QC work properly in
We were also thinking: a convenience fix would be to not update the model when assigning new strains, i.e. keep the boundary in the 2-D distance space where it is, and only assign strains to existing or potentially new strain clusters based on their distance profile. This approach would have the benefit of allowing independent users to classify strains consistently relative to a reference database that may be available publicly, without these genomes having to be included in the updated reference database. Do you think it would be possible to implement this option for |
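The fixed-boundary idea suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing PopPUNK option: it assumes a straight-line boundary in the (core, accessory) distance space with hypothetical intercepts `x_max` and `y_max`, and the function names and toy distances are made up. The reference clusters are never modified; each query is simply labelled from its distance profile.

```python
def within_strain(core, acc, x_max, y_max):
    # Assumed boundary: the straight line joining (x_max, 0) and (0, y_max).
    # Points below the line count as within-strain.
    return core / x_max + acc / y_max < 1.0

def assign_fixed(query_dists, ref_cluster, x_max, y_max):
    # query_dists maps reference name -> (core distance, accessory distance).
    # The reference clustering is read-only: no edges are added and
    # no existing clusters are merged.
    hits = sorted({
        ref_cluster[ref]
        for ref, (core, acc) in query_dists.items()
        if within_strain(core, acc, x_max, y_max)
    })
    if not hits:
        return "novel", hits
    if len(hits) == 1:
        return "assigned", hits
    # Links to several clusters are reported rather than used to merge them.
    return "ambiguous", hits

# Toy data (assumed): two reference samples in two different clusters.
ref_cluster = {"r1": 1, "r2": 2}
x_max, y_max = 0.02, 0.2

close_to_one = {"r1": (0.001, 0.01), "r2": (0.03, 0.3)}
far_from_all = {"r1": (0.03, 0.3), "r2": (0.04, 0.4)}
close_to_both = {"r1": (0.001, 0.01), "r2": (0.002, 0.02)}

for name, dists in [("close_to_one", close_to_one),
                    ("far_from_all", far_from_all),
                    ("close_to_both", close_to_both)]:
    print(name, assign_fixed(dists, ref_cluster, x_max, y_max))
```

With this design, an anomalous query that sits close to many clusters is reported as ambiguous instead of merging them, so the published reference clustering stays stable for every user.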
A note on the test presented in #194 (comment): when running
With that concern in mind, I re-ran the tests, but this time bringing a fresh copy of the
I can confirm the results of the test still hold, with the same anomalous genomes causing the same mischief. However, I noticed that the results do change slightly! For instance, the number of clusters that clump due to the most anomalous genome #1051 changed from 86 to 84, and one other merger occurred in a query genome batch where none had occurred before. I don’t know how much of this should be ascribed to stochastic variables that could make each run unique, or to the difference between using a pristine vs. an already queried database. In any case, this reinforces my opinion that it would be nice to have an option in |
This should be fixed in v2.5.0 (and see new docs too). Please feel free to reopen if there are still issues |
Hi Nick and John,
I'm back with other worries about using PopPUNK on something other than Strep pneumo. It's been a while since I ran the commands below, so maybe this has all been addressed in version 2.4.0?
Versions
poppunk_assign 2.3.0
poppunk_sketch 1.6.2
Command used and output returned
see an excerpt of the log:
Describe the bug
Not really a bug, just that I'm puzzled by the output: upon assigning new strains, all 345 clusters previously defined in the reference database got merged into a single one! Not really helpful for strain classification...
Can you advise on what has gone wrong and how to address it?
Note that the reference database was built with the following commands, with options chosen notably to address the wide variation in accessory genome among the input set (see previous posts in #135):
Cheers,
Florent