-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Entropy panel mk2 #1684
Entropy panel mk2 #1684
Conversation
Remove unused prop, fixup typos
Essentially an empty commit so that we can more easily track changes in subsequent commits.
1269c72
to
a9f55a5
Compare
a9f55a5
to
f102a11
Compare
this is very cool! Looking forward to finally having this in auspice! A few points, some of which you touch upon above.
|
f102a11
to
b047629
Compare
Thanks @rneher -ve strand CDSs are now working, except that zooming within a CDS is backwards (will fix tomorrow). I've abandoned the use of frame for the y-offsets and instead just stack them as efficiently as possible to avoid any CDSs overlapping (in the bottom view, anyways, the top view used too much space that way so they overlap a little bit).
Yeah! I played around with this yesterday, and I think it'll be very helpful. Here's how it looked for Spike (clicking on the local annotation toggled the highlighting). I'm going to leave it out of this PR, or at least get this PR finished before adding it in. I'd have confidence that we should add it to the JSON spec now tho. |
b047629
to
7da7446
Compare
7da7446
to
53078a3
Compare
53078a3
to
75db2b7
Compare
Currently the JSON annotation block is overly simple, and there are many features we want to parse & display in Auspice. This will inevitably involve an update to the JSON schema, but here we introduce a new hierarchical structure `genomeMap` to represent Genome → Chromosome[] → Gene[] → CDS[] → CDS_Segment[] Parsing multiple CDS segments from a single CDS in the current JSON schema is only possible for CDSs which wrap around the origin. We provide a default color per gene, but allow per-CDS overrides from the JSON. Future commits will remove the `entropy.annotations` and `entropy.geneMap` as the intention is that `entropy.genomeMap` is the canonical source of this information. Note - previously auspice treated CDSs & genes as the same thing. They are now distinguished from each other, but this change will take time to propagate through the codebase. Note - Typescript errors in existing code are addressed largely using "any" types, as the plan is to remove this code as part of this PR.
This is in preparation for upcoming changes to how annotations are drawn. The offsets (position calculations) remain the same, but now use named constants to better convey the intention. Different elements are now separated out so that we may move them independently -- e.g. the nav annotations, brush and axis are no longer tied together by the same offset. There still remain some hardcoded pixel values, especially in the brush code, which we will improve when we update that later on. Wireframes are added to the brush elements to help with development. They will be removed in an upcoming commit (on this branch), as the zoom / brush will be significantly changed.
Previously we only had one tooltip (active while hovering on bars), generated by `<InfoPanel>`. Here we abstract out the logic for tooltip positioning & display and then delegate the contents of the tooltip to the triggering event. This is in preparation for future commits which will add tooltips for other events in the entropy panel.
Includes offsetting CDS segments by frame and on-hover tooltips. Negative strand CDSs are not yet considered. The styles aren't quite right but I'll tackle that once all the functionality is in place.
This also includes a (overdue) update to the offset calculations. The rendered CDSs are similar to as before, but now with a frame offset similar to that implemented for the navBar annotations
Previously the entropy panel could toggle between "aa" and "nuc", and we stored this in `state.controls.mutType`. The aim of this PR is to be able to select a CDS (or select "nucleotide view") within the entropy panel rather than a aa-nuc toggle. We could simply add another bit of state to represent this, but here I've chosen to remove mutType and introduce `state.entropy.selectedCds`. Note that `mutType` still appears in some functions, but it is derived from `selectedCds` rather than taken from redux state. This involved a lot of changes to the codebase, but it is significantly improved because of it. The Entropy panel now no subscribes to redux colorBy state, which addresses a series of bugs related to state-change- ordering (see removed comment from `./src/actions/colors.js`). There should be minimal changes to the UI, but there are a few: * The AA-NT toggle has gone from the entropy panel. AA entropy can still be shown by using an appropriate genotype colorBy (via UI or URL). * When switching to nuc view in the entropy panel from an aa-genotype- colorBy, or vice vera, we previously would change the colorBy to the default (often "country"). This is no longer the case, as I don't think it's that user-friendly. But it's easy to add back if people like it! * The way the entropy panel automatically zooms upon genotype changes isn't working quite as before. This will be a big focus of a future commit, and the current implementation is rather ad-hoc, so I'm not paying it too much attention here.
I plan to refactor the entropy/counts calculation code in the next commit so it's helpful to have some sanity tests (as well as inspecting the UI). It turns out no code uses the "x" property for aa-entropy/count data, so we remove it here.
genomeMap better describes the annotation structure we will be using going forward in Auspice and having one source of truth (in redux state) is important in avoiding bugs etc. There are other pieces of state which also contain similar information which i'll replace with genomeMap in forthcoming commits. There are two functional changes: - For nucleotide positions we used to check if it fell inside a gene, which allowed on-hover info boxes to show information such as "Nucleotide x (Codon y in protein z)". This never considered the case where a nucleotide fell in multiple genes (CDSs), so it was somewhat misleading. It was a generally useful idea, so we should bring it back in an improved form. - CDSs which have multiple segments now have their bars (mostly) correctly displayed -- e.g. if a CDS wraps the origin then bars will appear at the start & end of the genome. This will be fully addressed when a subsequent commit moves the UI to representing the CDS by itself, rather than as segments on the genome. Aside: the terminology used around auspice re: "gene" vs "CDS" is now rather inconsistent. I'll hopefully tidy this up in a forthcoming commit.
The redux state for entropy panel zoom bounds was confusing and contained a number of little bugs. The requested bounds were stored as "zoomMin" and "zoomMax" in both controls and entropy state. The controls state is only ever updated by a URL query (i.e. dataset load / narrative page change), it doesn't respond to actions which update the zoom bounds ('CHANGE_ZOOM'), although these actions do change the URL query; I've not changed this here (other than to add a comment), however a future commit will revamp the entropy zooming code & can attend to this. The entropy state, while remaining up-to-date, is never subscribed to, so we remove it. Similarly, `entropy.zoomCoordinates` was never used. The checking of these bounds on page load / 'CHANGE_ZOOM' action previously used `controls.absoluteZoomMin` and `controls.absoluteZoomMax`, but we now use the canonical `genomeMap` state. These can therefore be removed, as can `entropy.lengthSequence` which was used to create the controls state! The checking of zoom bounds is much improved on page load, and previous oddities where the URL query wouldn't update appropriately are no longer (e.g. you can no longer have `?gmax=x` if x is beyond the genome length).
We now make a guarantee at dataset-load time whether genotype colorings are available: if mutations exist on the tree and the genomeMap exists (which implies the `metadata.genome_annotations` exists and at least defines the nuc coordinates) then we will always be able to have a genotype coloring. This knowledge will allow upcoming work as we can now be sure genotype colorings can only exist if they are supported by the data. Observations and bug-fixes from this commit: - It's not possible to get a colorBy for a CDS (or 'nuc') which is not in the genomeMap, or to specify a valid CDS but invalid position. If such a CDS is specified in the URL query, we fallback to a default colorBy at load time. - It is no longer possible to have a partially-valid color-by. Specifically, it was possible to specify multiple (genotype) positions, and as long as one was valid then the color-by would be considered "ok". This is now caught and the color-by changed to only those positions which are valid, and a console.error printed. - It's possible to have a CDS defined in the genomeMap which is never observed on the tree. This will work as expected -- the tree coloring will always be grey or a single color, depending on if the root-sequence JSON is available. No entropy/counts bar will be created, so it will appear to be zero. - It's possible to have a CDS defined as a branch mutation which is not present in the genomeMap, but this cannot be accessed as a coloring. It will be shown in the branch hover/click info-box.
This is unused (and I believe it always was, as `geneMap` was essentially the same data structure).
The information for gene (CDS) lengths is better found within the `genomeMap`, as that covers more complex genome annotations than the `geneLength` did. Furthermore, the `geneLength` calculations included an off-by-one bug¹ which didn't allow us to color-by the final aa in the CDS². The validity of genotype positions is no longer checked within color- scale creation, due to the genomeMap not being available to that function and a lot of prop-drilling necessary to make it so. However a previous commit guaranteed that genotype color-bys must be valid prior to this point, so it's ok to skip this check. ¹ In the JSON structure, both start and end are 1-based (following GFF) so (end-start) will undercount the nuc length by 1, which resulted in aa lengths of `<int>.6667`. ² This maps to the stop codon, so it's arguable whether this was a feature or a bug, but there can be mutations here, and when there are Auspice should be able to display them.
Slightly complicated by the observation that ribosomal frameshifts (e.g. in SARS-CoV-2's ORF1ab) mean that a single nucleotide can appear twice within a CDS.
A python script to post-process a nCoV dataset to create annotations which more closely match the biology. There may be some off-by-one errors there, and it's incomplete, but it's very useful for testing. Of course we would use Nextclade to actually generate the translations P.S. The get-data script is run for Heroku review-apps
Allows CDSs to have both a display name and a description. The general style and layout of the info-boxes is also improved.
GFF allows for strands '?' (features whose strandedness is relevant, but unknown) and '.' (features that are not stranded), which are represented by augur as '?' and null, respectively. (null comes from `None` in python.) In both cases it's not a good idea to make an assumption of strandedness, or to assume it's even a CDS.
The appearance now follows the date slider (sidebar), and the on-hover icon correctly indicates whether bidirectional or unidirectional dragging is possible. Changes requested in PR review <#1684 (comment)>
It's not necessary (or a good idea) to keep a copy of the annotations in redux state as opposed to using the canonical entropy.genomeMap. This is the final piece of ~duplicated redux state identified as part of <#1684>
6c6dc37
to
d080a31
Compare
A number of small improvements, many of which were suggested in review. Added a "help" icon, following the pattern we use extensively in the sidebar. Updated the zoom triangle appearance and their cursor icons. Items 1 and 4 in <#1684 (comment)> Changed the "return to nuc" button to a "RESET LAYOUT" button which will always return you to a fully zoomed out view of the genome. See added comment for why this isn't greyed out when it is a no-op. Item 2 in <#1684 (comment)> Changed the panel title to indicate if we're looking at the genome or a CDS, and add the CDS name to the title in the latter case. Also added a small axis label indicating that the numbering system is AA for selected CDSs (it's a bit cluttered to show 'Nt #' as well) <#1684 (comment)>
d080a31
to
c24d5f7
Compare
A number of small improvements, many of which were suggested in review. Added a "help" icon, following the pattern we use extensively in the sidebar. Updated the zoom triangle appearance and their cursor icons. Items 1 and 4 in <#1684 (comment)> Changed the "return to nuc" button to a "RESET LAYOUT" button which will always return you to a fully zoomed out view of the genome. See added comment for why this isn't greyed out when it is a no-op. Item 2 in <#1684 (comment)> Changed the panel title to indicate if we're looking at the genome or a CDS, and add the CDS name to the title in the latter case. Also added a small axis label indicating that the numbering system is AA for selected CDSs (it's a bit cluttered to show 'Nt #' as well) <#1684 (comment)>
c24d5f7
to
f3f41e5
Compare
A number of small improvements, many of which were suggested in review. Added a "help" icon, following the pattern we use extensively in the sidebar. Updated the zoom triangle appearance and their cursor icons. Items 1 and 4 in <#1684 (comment)> Changed the "return to nuc" button to a "RESET LAYOUT" button which will always return you to a fully zoomed out view of the genome. See added comment for why this isn't greyed out when it is a no-op. Item 2 in <#1684 (comment)> Changed the panel title to indicate if we're looking at the genome or a CDS, and add the CDS name to the title in the latter case. Also added a small axis label indicating that the numbering system is AA for selected CDSs (it's a bit cluttered to show 'Nt #' as well) <#1684 (comment)>
f3f41e5
to
2697a02
Compare
Thanks for all the reviews. I think this PR is at a good point to wrap up and release which I plan to do ~early next week (and we can keep improving this after merge if we notice things).
Good idea! I've added a
Yup! Both done.
(Discussed elsewhere) the best way forward is to continue to annotate these as separate CDSs, rather than a single segmented CDS which I demoed here. This allows people to continue to label mutations in an ORF1b coordinate system which has become quite common.
Both possible -- I tried the arrows approach on the lower track and it looked quite messy (lots of aliasing), but I'm sure if we played around with it enough it'd work. I haven't tried the faint lines but think it's worth pursuing. The zooming approach will need further work to support large (>200kb) genomes well.
Done. It's a bridge too far to make this greyed out at the present moment, but we can tackle that aspect later on. |
See <nextstrain/auspice#1684> for the context for these additions.
The precursor protein is cleaved into ~10 proteins including the precusor membrane protein (prM). prM is subsequently cleaved by the host protease furin resulting in protein M (aka MP). The previous annotation included CDSs PRO and MP, but not the (parent) prM protein. prM is cleaved directly from the polyprotein and is functional, forming heterodimers with protein E (similar behaviour across all flaviviruses I believe). NCBI's genome viewer shows both prM as well as both the sub-proteins after cleavage, pr and M. pr would correspond to the previous PRO annotation. I've not seen protein pr referenced anywhere else so I've not included that here. Note that this change wasn't desirable before Auspice could display overlapping CDSs <nextstrain/auspice#1684>. References: <https://www.microbiologyresearch.org/content/journal/jgv/10.1099/vir.0.18723-0> <https://www.ncbi.nlm.nih.gov/gene/7751225>
See <nextstrain/auspice#1684> for the context for these additions.
See <nextstrain/auspice#1684> for the context for these additions.
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs are not modified in this commit. Supersedes #1333
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333
Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333
For the full story please read through the commit messages. I'll try to highlight the main UI changes here, but perhaps it's better to just jump in and use the review app with fresh eyes.
Main UI changes
Overlapping CDSs are better shown
Click for details
By using opacity and jittering them based on the frame the CDS is in. The aim is not to have a perfect genome viewer (e.g. see how much space a "proper" genome browser needs!) but to expose what's there and allow you to hover over them / click them (clicking a CDS selects it, see next section).
nCoV, current Auspice on top, this PR on bottom; much better communication of the CDSs present (they were always there in the JSON, just visually obscured).
Hovering over the CDSs is now possible too:
Selecting a CDS focuses the entropy panel on the translated CDS
Click for details
When viewing a CDS we now have the main axis represent this and only this CDS, always shown going L→R, with tick labels representing amino acids. This better represents biology as we are able to ignore any artifacts upstream of a translation. The UI to view a CDS is to either click on a CDS or select a genotype color-by for that CDS.
This screenshot (top, bottom) shows the pre-capsid CDS for HepB. In the top view other CDSs are overlapping so most of the bars you see are from a different CDS (pre-capsid CDS is the orange coloured CDS, and then the green CDS is overlapping), and it was not possible to see the actual diversity of pre-capsid¹.
¹ I guess in this example the CDSs are in frame, so the diversity is identical, but barcharts from different CDSs obscuring each other has always been a problem in auspice
Segmented CDSs (nCoV example)
Click for details
Top view is our current genome annotation for nCoV, bottom is an example of what it could be (incomplete, as I've only added in 2 of the 16 proteins cleaved from the polyproteins)
Specifically, we can now view ORF1ab (PP1ab) in its entirety:
(The main axis now has two segments, which is what we have been referring to as ORF1a and ORF1b, separated by a -1 ribosomal frameshift.)
And since overlapping CDSs are better shown we can (finally!) show all the proteins which are produced, e.g. RdRp (a complex one, since it spans the -1 frameshift). The entropy panel conveys the expected lack of diversity here well I think.
CDSs which wrap around the origin (HepB example)
Click for details
Most of HepB's CDSs wrap the origin! Displaying the translation (i.e. the top half of the panel) is no different from any segmented gene, it's the zooming that's tricky here. My solution was to ~invert the brush; perhaps easier to explore than to explain!
Segmented CDSs may have codons which span multiple segments (referred to as the phase of the segment), and we handle this well now:
Entropy panel state is decoupled from the color-by
Click for details
Previously the entropy panel was tightly coupled to the color-by. It is still linked to it, but it can also operate independently if (and only if) you start interacting with it. Any change to the (genotype) color-by will bring it back in sync. For instance, I can now select Spike 452 color-by (which will always jump to that position in the entropy panel + select Spike), and then use the UI to select another CDS in the entropy panel without changing the colorBy. If (via the sidebar) I change to a different Spike position then the entropy panel will jump back to show Spike.
Proposed JSON schema
The current proposal is mostly backwards compatible, except segmented CDSs are ignored by previous Auspice versions. For instance here's how ORF1ab looks:
And you can specify a CDS which wraps around the origin by using an end value beyond the end value of the genome (or create the segments yourself, either way works):
Test datasets
Available via the review app:
Things still to do (either here or in subsequent work):
styling
Negative strand CDSs
misc
state.metadata.genomeAnnotations
(used by the tree info boxes to order the genes).future work?
Issues closed
Closes #1596
Closes #1651
Closes #1145
Closes #299
Closes #1516