Low quality/mutated BA.4/BA.5 sequences have disproportionately high likelihood to be classified as BA.2 #713
Comments
Going to close this one after noticing I basically created a duplicate of issue #645.
Thanks @Sinickle, the query makes for a nice-sized example for testing. If you run pangolin in default (usher) mode, with the
Scorpio is still having a tough time distinguishing between BA.2, BA.4 and BA.5 with good sensitivity and specificity, and often overrides the usher or pangoLEARN call with |
Without having looked at the training set, I suspect that it contains more low-quality or varied sequences classified as BA.2 than sequences classified as BA.4/BA.5.
This is an issue for countries whose sequences have more dropout, but it also seems to cause BA.4/BA.5 sequences that carry a few additional mutations to be classified as BA.2.
Let's take Botswana as an example.
There are 29 samples with S:486V and S:452R in the last 3 months.
https://cov-spectrum.org/explore/Botswana/AllSamples/Past3M/variants?aaMutations=s%3A484a%2Cs%3A486v&pangoLineage1=ba.2*&
Only one of them is labeled as either BA.4 or BA.5.
Let's exclude the ones that have a dropout...
Now we are down to 10.
https://cov-spectrum.org/explore/Botswana/AllSamples/Past3M/variants?aaMutations=s%3A452r%2Cs%3A486v%2Corf1a%3A116v%2Cn%3A418q%2Corf1a%3A41e&pangoLineage1=ba.2*&aaMutations2=s%3A452r%2Cs%3A486v&pangoLineage2=ba.2*&
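As a rough illustration of why requiring specific residues filters out dropout, here is a minimal sketch. It assumes each sample is a mapping from (gene, position) to the called amino acid, with "X" standing in for a site that has no call. The sample names, the data layout, and the `has_required_residues` helper are all hypothetical, and this is not how cov-spectrum implements its filters.

```python
def has_required_residues(calls, required):
    """Return True if every required site carries the expected residue.

    A site with no call (missing key, or a placeholder such as "X")
    can never match, so samples with dropout there are excluded.
    """
    return all(calls.get(site) == residue for site, residue in required.items())


# Hypothetical filter, loosely modelled on the query above.
required = {("S", 452): "R", ("S", 486): "V", ("ORF1a", 116): "V"}

# Toy samples, invented purely for illustration.
samples = {
    "sample_a": {("S", 452): "R", ("S", 486): "V", ("ORF1a", 116): "V"},
    "sample_b": {("S", 452): "R", ("S", 486): "V", ("ORF1a", 116): "X"},  # dropout
    "sample_c": {("S", 452): "R", ("S", 486): "V"},  # site not reported at all
}

kept = [name for name, calls in samples.items()
        if has_required_residues(calls, required)]
print(kept)
```

Only the sample with a concrete call at every required site survives, which is the effect the extra filters are relied on for here.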
Now if we add a few more filters specifying that various residues are the wild-type amino acid (as they are in wild type, BA.2, and BA.4/BA.5)...
https://cov-spectrum.org/explore/Botswana/AllSamples/Past3M/variants?aaMutations=s%3A452r%2Cs%3A486v%2Corf1a%3A116v%2Cn%3A418q%2Corf1a%3A41e%2Corf1a%3A1m%2Cn%3A19g%2Cs%3A3v%2Corf1b%3A1156m&pangoLineage1=ba.2*&aaMutations2=s%3A452r%2Cs%3A486v&pangoLineage2=ba.2*&
Now the only one left is the one labeled as BA.4/BA.5!
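For anyone who wants to reproduce these queries for another set of mutations, here is a minimal sketch of building the first filter of such a link. The base URL and the `aaMutations` / `pangoLineage1` parameter names are copied from the links above. The `build_query_url` helper is hypothetical, and it does not cover the second filter (`aaMutations2` / `pangoLineage2`) that appears in the longer links.

```python
from urllib.parse import urlencode

# Base URL taken from the links in this issue.
BASE = "https://cov-spectrum.org/explore/Botswana/AllSamples/Past3M/variants"


def build_query_url(aa_mutations, pango_lineage, base=BASE):
    """Build a query link with a comma-separated amino-acid mutation filter."""
    params = {
        # The links above use lowercase mutation names joined by commas.
        "aaMutations": ",".join(m.lower() for m in aa_mutations),
        "pangoLineage1": pango_lineage,
    }
    # safe="*" leaves the lineage wildcard unescaped, matching the links above;
    # ":" and "," are percent-encoded as %3A and %2C.
    return base + "?" + urlencode(params, safe="*")


url = build_query_url(["S:452R", "S:486V"], "ba.2*")
print(url)
```

Adding more entries to the mutation list narrows the query in the same way as the successive links above.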
Again without having looked into the training set, I suspect this is either because the training set contains more BA.2 than BA.4/BA.5, or because the BA.4/BA.5 samples show lower variance than the BA.2 samples.
Given the current expectation that BA.4/BA.5 will become dominant lineages, I believe it makes sense to push the model to be less conservative in designating them.