Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions about creating a custom RSVA nextclade dataset with NC_038235 reference #237

Open
YangJingqii opened this issue Oct 29, 2024 · 5 comments

Comments

@YangJingqii
Copy link

I'm trying to create a custom RSV nextclade dataset following the tutorials from https://docs.nextstrain.org/en/latest/tutorials/creating-a-phylogenetic-workflow.html#annotate-the-phylogeny and https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-creation-guide.md.

I have two questions:

For the reference tree, I used the same sequences as provided in the Underlying data from https://nextstrain.org/rsv/a/genome/6y. I also used identical parameters in pathogen.json as the official dataset. However, my QC results differ significantly from the official nextclade results - many samples that pass QC in the official dataset are marked as "bad" in my custom dataset. What could be causing this discrepancy?
I noticed that the official nextclade datasets use EPI_ISL_412866 (for RSVA) and OP975389 (for RSVB) as references, while many academic publications, such as Nature Communications' "Distinct patterns of within-host virus populations between two subgroups of human respiratory syncytial virus", use NC_038235 (RSVA) and NC_001781 (RSVB) as references. What's the rationale behind these different reference choices?
Would appreciate any insights into these questions. Thank you!

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Oct 29, 2024

HI @YangJingqii

many samples that pass QC in the official dataset are marked as "bad" in my custom dataset. What could be causing this discrepancy?

QC in nextclade consists of multiple different rules. In your results, what rules have scores > 100?

One situation I can think of is that if you align to a reference that does not match the input sequences well, you will get many mutations - and one of the QC rules could treat this as a failure, e.g. the "Private mutations" (P) rule.

If you want to analyze more diverse sequences and still have good QC results, then you might need to adjust the parameters of the QC rules or to turn them off, depending on your needs. This can be done in pathogen.json file. Here is a snippet from the official RSV-A dataset:

"qc": {
"privateMutations": {
"enabled": true,
"typical": 50,
"cutoff": 150,
"weightLabeledSubstitutions": 2,
"weightReversionSubstitutions": 1,
"weightUnlabeledSubstitutions": 1
},
"missingData": {
"enabled": false,
"missingDataThreshold": 2000,
"scoreBias": 500
},
"snpClusters": {
"enabled": false,
"windowSize": 100,
"clusterCutOff": 10,
"scoreWeight": 50
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 8
},
"frameShifts": {
"enabled": true
},
"stopCodons": {
"enabled": true,
"ignoredStopCodons": [
{
"codon": 320,
"cdsName": "G"
}
]
}
},

Note that the QC rules in Nextclade are empirical measures and by no means trying to set a benchmark or a reference point. You can tweak and tune the numbers to your needs.

What's the rationale behind these different reference choices?

I am no expert, so I don't know the answer, but I'll transmit the question to our scientists. RSV nomenclature seems to be a hot topic. There are many different publications and opinions. From your side, do you have a convincing justification why your proposed references should be chosen instead? (or additional to)

It could be that the other references are also considered, but perhaps the team just did not have time yet to build the datasets.

In the meantime, you can check the readme file in each datasets
https://github.com/nextstrain/nextclade_data/blob/master/data/nextstrain/rsv/a/EPI_ISL_412866/README.md
as well as the resources linked there, notably the "clade definitions" and "workflow" repos, which have some explanations there, to see if it clarifies it a little or not.

Let us know what you think.

@YangJingqii
Copy link
Author

Thank you for your answer!
I compared the results of the two dataset, could you help me check what threshold is reasonable to choose under such circumstances.?
Snipaste_2024-10-29_14-13-24
Snipaste_2024-10-29_14-14-06
Another point is that when I built the reference tree with augur, augur align --sequences data/sequences.fasta --reference-sequence config/rsva.fasta --output results/aligned.fasta --fill-gaps --nthreads 60 There are no gaps in the files that emerge from this step, which I feel is unreasonable. So I used mafft --thread 60 --auto --keeplength --addfragments data/sequences.fasta config/rsva.fasta > results/ allocation.fasta Is this operation feasible, why is there no gap?

As for the selection of the reference sequence, the sequence you selected is 2017, while NC_038235 is the reference sequence on ncbi. I would like to know why you chose the 2017 sequence as your reference. I can't find the relevant explanation in the link, and what problems will be caused by using this sequence for analysis?
Looking forward to your answer!

@rneher
Copy link
Member

rneher commented Oct 29, 2024

Hi YangJingqii,
the RefSeq sequences you are using as a reference are rather old and don't contain the duplication in the G gene that appeared in both RSV-A and RSV-B. Nextclade is aimed at supporting genomic surveillance and this is why recent diversity is most interesting to NextClade's users. We therefore chose two recent sequences are reference that are also designated as reference sequences in GISAID.

regarding your quality scores, I suspect this has to do with the alignment you use for the tree or the duplicated region that is not present in your reference. It is important that the alignment used for the tree is in reference coordinates, that is all gaps relative to the reference sequence must be stripped away. You can achieve this using augur align or mafft --keeplength as you already tried. The fact that there are no gaps is the desired outcome!

richard

@YangJingqii
Copy link
Author

When I used the command augur align --sequences data/sequences.fasta --reference-sequence config/rsva.fasta --output results/aligned.fasta --fill-gaps --nthreads 60, I understand that gaps relative to the reference sequence must be removed. However, it seems unreasonable that none of these sequences contain any gaps. When I used mafft --thread 60 --auto --keeplength --addfragments data/sequences.fasta config/rsva.fasta > results/allocation.fasta, gaps appeared in the alignment. Why are there no gaps at all when using augur align?

Additionally, I noticed that the reference sequence you chose for surveillance is from 2017. If I'm conducting evolutionary analysis that includes sequences from both the 20th century and recent years, should I use NC_038235 as the reference sequence instead?

Looking forward to your response!

@rneher
Copy link
Member

rneher commented Nov 11, 2024

I think this is because you are running augur align with the option --fill-gaps

augur align --sequences data/sequences.fasta --reference-sequence config/rsva.fasta --output results/aligned.fasta --fill-gaps --nthreads 60

This option replaces gap characters with N.

I think even for analysis of all available RSV data, using a recent reference makes sense. Earlier sequences will have a gap when others have a duplication. But you should use a proper multiple sequence alignment rather than a reference alignment like Nextclade.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants