Questions about creating a custom RSVA nextclade dataset with NC_038235 reference #237
Comments
Hi @YangJingqii
QC in Nextclade consists of several separate rules. In your results, which rules have scores > 100? One situation I can think of: if you align to a reference that does not match the input sequences well, you will get many mutations, and one of the QC rules could treat this as a failure, e.g. the "Private mutations" (P) rule. If you want to analyze more diverse sequences and still get good QC results, you might need to adjust the parameters of the QC rules or turn them off, depending on your needs. This can be done in pathogen.json; see nextclade_data/data/nextstrain/rsv/a/EPI_ISL_412866/pathogen.json, lines 24 to 60 in 745ffb9.
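One way to see which rules drive a "bad" result is to scan the per-rule score columns of the Nextclade TSV output. The sketch below is only an illustration: the column names (`seqName`, `qc.privateMutations.score`, and so on) are assumed from the usual TSV layout and should be checked against your own file, the helper `failing_rules` is hypothetical, and the example rows are made up.

```python
import csv
import io

# Assumed per-rule score columns in the Nextclade TSV; verify against your output.
RULE_COLUMNS = {
    "missingData": "qc.missingData.score",
    "mixedSites": "qc.mixedSites.score",
    "privateMutations": "qc.privateMutations.score",
    "snpClusters": "qc.snpClusters.score",
    "frameShifts": "qc.frameShifts.score",
    "stopCodons": "qc.stopCodons.score",
}


def failing_rules(tsv_text, threshold=100.0):
    """Map each sequence name to the QC rules whose score is >= threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        bad = []
        for rule, column in RULE_COLUMNS.items():
            value = row.get(column) or ""
            try:
                score = float(value)
            except ValueError:
                continue  # column absent or empty for this row
            if score >= threshold:
                bad.append(rule)
        if bad:
            result[row["seqName"]] = bad
    return result


# Made-up example rows, not real results.
example = (
    "seqName\tqc.overallScore\tqc.privateMutations.score\tqc.missingData.score\n"
    "sample_1\t150.0\t150.0\t0.0\n"
    "sample_2\t12.5\t12.5\t0.0\n"
)
print(failing_rules(example))
```

For a real run, read the TSV file into a string and pass it to `failing_rules`; if most sequences fail on the same rule, that rule's parameters are the first place to look.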
Note that the QC rules in Nextclade are empirical measures and are by no means meant to set a benchmark or a reference point. You can tweak and tune the numbers to your needs.
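As a rough sketch of what such tuning could look like, the snippet below overrides the parameters of one QC rule in a pathogen.json-like dictionary. The key names (`qc`, `privateMutations`, `enabled`, `typical`, `cutoff`) are assumptions about the file's schema and the numbers are placeholders, not the official values; `tune_qc` is a hypothetical helper.

```python
import copy
import json


def tune_qc(pathogen, rule, **overrides):
    """Return a copy of the config with one QC rule's parameters overridden."""
    updated = copy.deepcopy(pathogen)
    rule_cfg = updated.setdefault("qc", {}).setdefault(rule, {})
    rule_cfg.update(overrides)
    return updated


# Hypothetical fragment; key names and numbers are assumptions for illustration.
pathogen = {"qc": {"privateMutations": {"enabled": True, "typical": 5, "cutoff": 15}}}

# Tolerate more private mutations before the rule reports a failure ...
relaxed = tune_qc(pathogen, "privateMutations", typical=20, cutoff=60)
# ... or switch the rule off entirely.
disabled = tune_qc(pathogen, "privateMutations", enabled=False)

print(json.dumps(relaxed, indent=2))
```

In practice you would load the real file with `json.load`, apply the override, and write it back with `json.dump`. Working on a copy keeps the original settings available for comparison.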
I am no expert, so I don't know the answer, but I'll pass the question on to our scientists. RSV nomenclature seems to be a hot topic. There are many different publications and opinions. From your side, do you have a convincing justification why your proposed references should be chosen instead of (or in addition to) the current ones? It could be that the other references are also being considered, but perhaps the team just has not had time yet to build the datasets. In the meantime, you can check the readme file in each dataset. Let us know what you think.
Hi YangJingqii, regarding your quality scores, I suspect this has to do with the alignment you use for the tree or with the duplicated region that is not present in your reference. It is important that the alignment used for the tree is in reference coordinates, that is, all gaps relative to the reference sequence must be stripped away. You can achieve this using
richard
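To illustrate what "reference coordinates" means here, the sketch below takes an alignment in which the reference row itself contains gaps (because other sequences carry insertions) and drops every such column, so that each output sequence has exactly the length of the ungapped reference. It is a toy illustration with made-up sequences and a hypothetical helper name, not the tool referred to above.

```python
def to_reference_coordinates(aligned_ref, aligned_seqs):
    """Drop alignment columns where the reference has a gap.

    aligned_ref: the reference row of the alignment (may contain '-').
    aligned_seqs: dict of name -> aligned sequence, same length as aligned_ref.
    """
    keep = [i for i, base in enumerate(aligned_ref) if base != "-"]
    stripped = {}
    for name, seq in aligned_seqs.items():
        if len(seq) != len(aligned_ref):
            raise ValueError(f"{name} is not aligned to the reference")
        stripped[name] = "".join(seq[i] for i in keep)
    return stripped


# Made-up toy alignment: columns 3-4 are an insertion relative to the reference.
ref = "ACG--TAC"
seqs = {
    "with_insertion": "ACGTTTAC",
    "with_deletion": "AC---TAC",
}
result = to_reference_coordinates(ref, seqs)
print(result)
```

Insertions relative to the reference are removed, while deletions relative to the reference remain as gap characters, so every position index matches the reference.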
When I used the command augur align --sequences data/sequences.fasta --reference-sequence config/rsva.fasta --output results/aligned.fasta --fill-gaps --nthreads 60, I understood that gaps relative to the reference sequence must be removed. However, it seems implausible that none of the resulting sequences contain any gaps. When I used mafft --thread 60 --auto --keeplength --addfragments data/sequences.fasta config/rsva.fasta > results/allocation.fasta, gaps appeared in the alignment. Why are there no gaps at all when using augur align? Additionally, I noticed that the reference sequence you chose for surveillance is from 2017. If I'm conducting an evolutionary analysis that includes sequences from both the 20th century and recent years, should I use NC_038235 as the reference sequence instead? Looking forward to your response!
I think this is because you are running
This option replaces gap characters with
I think even for analysis of all available RSV data, using a recent reference makes sense. Earlier sequences will have a gap when others have a duplication. But you should use a proper multiple sequence alignment rather than a reference alignment like Nextclade.
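The augur align command quoted earlier in this thread passes --fill-gaps, which augur documents as replacing gaps with N after aligning (treating them as missing data rather than true indels). A minimal sketch of that effect, using a made-up sequence and a hypothetical helper name:

```python
def fill_gaps(sequence):
    """Replace every gap character with N, mimicking the documented effect of --fill-gaps."""
    return sequence.replace("-", "N")


# Made-up aligned sequence with a three-base deletion.
aligned = "ACG---TAC"
filled = fill_gaps(aligned)

print(filled)             # ACGNNNTAC
print(filled.count("-"))  # 0
```

After this step a deletion can no longer be told apart from missing data, which would explain an output alignment that contains no gap characters at all.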
I'm trying to create a custom RSV nextclade dataset following the tutorials from https://docs.nextstrain.org/en/latest/tutorials/creating-a-phylogenetic-workflow.html#annotate-the-phylogeny and https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-creation-guide.md.
I have two questions:
1. For the reference tree, I used the same sequences as provided in the Underlying data from https://nextstrain.org/rsv/a/genome/6y. I also used identical parameters in pathogen.json as the official dataset. However, my QC results differ significantly from the official nextclade results: many samples that pass QC in the official dataset are marked as "bad" in my custom dataset. What could be causing this discrepancy?
2. I noticed that the official nextclade datasets use EPI_ISL_412866 (for RSVA) and OP975389 (for RSVB) as references, while many academic publications, such as Nature Communications' "Distinct patterns of within-host virus populations between two subgroups of human respiratory syncytial virus", use NC_038235 (RSVA) and NC_001781 (RSVB) as references. What's the rationale behind these different reference choices?
Would appreciate any insights into these questions. Thank you!