-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Annotations schema updates #1281
Conversation
The annotations schema is updated to more accurately represent how this data is used by Auspice version 2.46.1. Specifically, the strand of the 'nuc' property is not used by auspice, and start/end coordinates are required properties. Tests are updated because the tests were incorrect. Currently the annotations section of node-data JSONs are produced by `augur ancestral` and `augur translate`. Each adds a 'nuc' property, and it's required for Auspice, so I think it's appropriate to make it a required property for the schema. The changes to the schema here are only reflected when parsing node-data files, the next commit will make this available for our typical auspice JSON validation.
34692f4
to
d763c05
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #1281 +/- ##
==========================================
+ Coverage 69.74% 69.86% +0.12%
==========================================
Files 67 67
Lines 7146 7155 +9
Branches 1742 1744 +2
==========================================
+ Hits 4984 4999 +15
+ Misses 1855 1849 -6
Partials 307 307
☔ View full report in Codecov by Sentry. |
"strand": { | ||
"description": "Positive or negative strand", | ||
"type": "string", | ||
"enum": ["-","+"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Add enum
back with additional values.
Even though Auspice only cares if the value is '-'
or not, Augur can also export values '?'
and None
as suggested and implemented in 31f0b26. Defining the possible values here will improve the effectiveness of validation.
I started this in #1279 but that PR can be closed and the change included in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review -- really appreciated!
We're going to have to think through this. All annotations are interpreted by Auspice as CDSs, so a strand of "?"
or "."
(which we represent as None
) doesn't make sense as a CDS - neither for Auspice to display nor for Augur to translate.
I can allow them in the schema and then have Auspice filter to ["+", "-"]
, which is probably the most technically correct, but I would think that happily translating "?"
/ "."
features is questionable. For augur translate
+ GenBank annotations, only CDS features are read and (I believe) it's not possible to encode a CDS in GenBank that's not +/-. I'm not sure what we do for augur translate
+ GFF annotation.
Update: Auspice PR now ignores any non-nuc annotation which is not explicitly +/- strand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have shifted this conversation to #1279
d763c05
to
35c688c
Compare
35c688c
to
7aea997
Compare
Matches the new schema id (to be) in Augur, see <nextstrain/augur#1281>. Resolves <#704>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple additional small changes.
I've done the .org side of things in nextstrain/nextstrain.org#705.
augur/validate.py
Outdated
if refs: | ||
# Make the validator aware of additional schemas | ||
schema_store = {k: json.loads(resource_string(__package__, os.path.join("data", v))) for k,v in refs.items()} | ||
resolver = jsonschema.RefResolver.from_schema(schema,store=schema_store) | ||
schema_validator = Validator(schema, resolver=resolver) | ||
else: | ||
schema_validator = Validator(schema) | ||
|
||
# By default $ref URLs which we don't define in a schema_store are fetched | ||
# by jsonschema. This often indicates a typo (the $ref doesn't match the key | ||
# of the schema_store) or we forgot to add a local mapping for a new $ref. | ||
# Either way, Augur should not be accessing the network. | ||
def resolve_remote(url): | ||
# The exception type is not important as jsonschema will catch & re-raise as a RefResolutionError | ||
raise Exception(f"The schema used for validation attempted to fetch the remote URL '{url}'. " + | ||
"Augur should resolve schema references to local files, please check the schema used " + | ||
"and update the appropriate schema_store as needed." ) | ||
schema_validator.resolver.resolve_remote = resolve_remote | ||
|
||
return schema_validator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
If we upgraded on dep on jsonschema
to the 4.x series, we could use the newer and nicer APIs for reference handling here. Doesn't have to be now, though.
See added comments for details
7aea997
to
56fb8a8
Compare
See <nextstrain/auspice#1684> for the context for these additions.
56fb8a8
to
c530539
Compare
Schema updated to remove the |
repr() will add appropriate quotes around the value. Related-to: <#1281 (comment)>
"description": "Type of the feature. could be mRNA, CDS, or similar", | ||
"$comment": "Unused by Auspice 2.0", | ||
"type": "string" | ||
"strand": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Late to the party, but shouldn't we in principle allow each cds fragment to have its own strandedness? Here the strand is fixed for all fragments, which will be fine in most cases but who knows, maybe sometimes fragments might come from opposite strands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChatGPT tells me that strandedness never changes within a CDS so then what we have here should work in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With trans-splicing it might work, but doesn't seem to happen in viruses so we should be good for now: https://en.wikipedia.org/wiki/Trans-splicing
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs are not modified in this commit. Supersedes #1333
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333
This PR is ready for review, but note that I won't merge it until the companion Auspice PR has been merged and that there may be some very slight schema changes (e.g. property names changing) before then. Neither of those should hold up review.
See commit messages for details.